Question 1

How much VRAM do I need to run an LLM?

Accepted Answer

Three things add up: the model weights (parameters × bytes per parameter, e.g. 2 bytes at FP16), the KV cache (which grows with context length and the number of concurrent requests), and ~15% overhead for activations and the CUDA context. A 7-8B model at FP16 needs roughly 13-15 GiB (14-16 GB) for the weights alone, and typically around 16-20 GiB in total once a modest KV cache and the overhead are included; quantizing the weights to 4-bit roughly quarters the weight footprint. Use the calculator above to size your exact setup.
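The three-part sum above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual implementation: the function name `estimate_vram_gib` and its parameters are hypothetical, and it assumes the ~15% overhead is applied on top of weights plus KV cache.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_vram_gib(params_billion, bytes_per_param=2.0,
                      kv_cache_gib=0.0, overhead=0.15):
    """Rough serving-side VRAM estimate in GiB.

    Assumption: the fractional overhead (activations, CUDA context)
    is applied on top of weights + KV cache.
    """
    weights_gib = params_billion * 1e9 * bytes_per_param / GIB
    return (weights_gib + kv_cache_gib) * (1.0 + overhead)


if __name__ == "__main__":
    # 8B parameters at FP16 with a hypothetical 2 GiB KV cache budget.
    print(f"{estimate_vram_gib(8, 2.0, kv_cache_gib=2.0):.1f} GiB")
```

With these inputs the estimate lands a little over 19 GiB, of which just under 15 GiB is weights.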

Question 2

What is the KV cache and why does it grow?

Accepted Answer

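The KV cache stores the attention keys and values for every token already in the context, for every layer, so the model does not recompute them each step. Its size scales with context length × batch (concurrent requests) × layers × hidden size, doubled because both keys and values are stored, and multiplied by the bytes per element. With grouped-query attention, the number of KV heads × head dimension replaces the full hidden size, which shrinks the cache. On long contexts or high concurrency it can rival or exceed the weights, which is why a model that "fits" at short context can run out of memory at long context.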
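As a worked example of that scaling, here is a minimal sketch of the standard KV cache size formula. The layer count, head counts, and head dimension are illustrative values for an 8B-class model, not figures taken from the calculator, and the function name is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def kv_cache_bytes(layers, kv_heads, head_dim, context_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token and layer.

    The leading 2 accounts for storing both K and V.
    """
    return (2 * layers * kv_heads * head_dim
            * context_len * batch * bytes_per_elem)


if __name__ == "__main__":
    # Illustrative config: 32 layers, head_dim 128, 8K context, FP16 cache.
    gqa = kv_cache_bytes(32, kv_heads=8, head_dim=128, context_len=8192)
    mha = kv_cache_bytes(32, kv_heads=32, head_dim=128, context_len=8192)
    print(f"GQA (8 KV heads):  {gqa / GIB:.1f} GiB per request")
    print(f"MHA (32 KV heads): {mha / GIB:.1f} GiB per request")
```

Under these assumptions a single 8K-token request costs 1 GiB with 8 KV heads and 4 GiB with 32, and both figures multiply linearly with the number of concurrent requests.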

Question 3

Does quantization reduce GPU requirements?

Accepted Answer

Yes, a lot. FP16 uses 2 bytes per weight; INT8 uses 1; INT4 (GPTQ/AWQ) uses about 0.5. Going from FP16 to INT4 cuts weight memory roughly 4×, often the difference between needing two GPUs and one, with modest quality loss for most workloads. It does not shrink the KV cache, which you control with context length and KV precision.
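To make the "two GPUs versus one" point concrete, here is a small sketch using the bytes-per-weight figures above. It counts weights only and ignores KV cache and overhead; the 70B model size and the 80 GB GPU capacity are hypothetical inputs, and the helper names are made up for this example.

```python
import math

# Approximate bytes per weight, as quoted in the answer above.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params_billion, precision):
    """Weight memory in decimal GB for a given precision."""
    return params_billion * BYTES_PER_WEIGHT[precision]


def min_gpus_for_weights(params_billion, precision, gpu_gb):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_gb(params_billion, precision) / gpu_gb)


if __name__ == "__main__":
    for precision in BYTES_PER_WEIGHT:
        gb = weight_gb(70, precision)
        gpus = min_gpus_for_weights(70, precision, gpu_gb=80)
        print(f"70B @ {precision}: {gb:.0f} GB weights, >= {gpus} x 80 GB GPU")
```

In practice the KV cache and overhead need headroom too, so a result of exactly one GPU with little memory to spare may still not be enough.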

Question 4

Should I self-host or use an API?

Accepted Answer

Self-hosting wins when you have steady, high volume, strict data-residency needs, or want a fine-tuned open model; an API wins for spiky or low traffic, where you would pay for idle GPUs. Compare the monthly GPU cost here against the same workload in the LLM cost calculator. The break-even is usually a question of utilization, not sticker price.
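The utilization argument can be sketched as a simple break-even calculation. Every number and name below is a hypothetical placeholder (GPU hourly price, API price per million tokens, sustained throughput), and the model deliberately ignores engineering time, redundancy, and other real-world costs.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def breakeven_tokens_per_month(gpu_usd_per_hour, api_usd_per_million_tokens,
                               num_gpus=1):
    """Monthly token volume at which a fixed GPU bill equals the API bill."""
    monthly_gpu_cost = gpu_usd_per_hour * HOURS_PER_MONTH * num_gpus
    return monthly_gpu_cost / api_usd_per_million_tokens * 1e6


def breakeven_utilization(gpu_usd_per_hour, api_usd_per_million_tokens,
                          tokens_per_second, num_gpus=1):
    """Fraction of the deployment's token capacity needed to break even."""
    capacity = tokens_per_second * 3600 * HOURS_PER_MONTH
    needed = breakeven_tokens_per_month(
        gpu_usd_per_hour, api_usd_per_million_tokens, num_gpus)
    return needed / capacity


if __name__ == "__main__":
    # Hypothetical: $2/hour GPU, $1 per million API tokens, 1000 tokens/s.
    tokens = breakeven_tokens_per_month(2.0, 1.0)
    util = breakeven_utilization(2.0, 1.0, tokens_per_second=1000)
    print(f"Break-even volume: {tokens / 1e9:.2f}B tokens/month")
    print(f"Break-even utilization: {util:.0%}")
```

A result above 100% means the GPU cannot serve enough tokens to beat the API at those prices; a low figure means self-hosting pays off even with substantial idle time.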

Question 5

Is this calculator accurate?

Accepted Answer

It is a solid serving-side estimate, not a guarantee. It assumes typical architectures per size class and representative GPU prices; real numbers shift with grouped-query attention, the serving stack, and your exact GPU and region. For a precise plan tied to your model and traffic, a short call is the fastest path.

LLM GPU & VRAM calculator
