LLM GPU & VRAM calculator
Work out how much GPU memory it takes to serve an open-source LLM, which GPU fits, and what self-hosting costs per month. Pick the model size, quantization, context length and concurrency, and see VRAM broken down into model weights, KV cache and overhead. Free, no signup, and it runs in your browser.
Breakdown
- Model weights
- KV cache
- Overhead & activations
3 ways to fit a smaller GPU
You’re already at 4-bit weights — the smallest common format. The wins below still apply to the KV cache.
- Quantize the weights: INT4 (GPTQ/AWQ) cuts weight memory ~4× versus FP16 with little quality loss, often the difference between two GPUs and one.
- Cap context & KV cache: the KV cache grows with context length × concurrency. Shorter context windows and an FP8 KV cache shrink it directly.
- Use a serving stack: vLLM or TGI with paged attention pack the KV cache far more tightly and raise throughput, so you need fewer GPUs for the same traffic.
Deciding self-host vs API, or sizing a cluster?
Send me your model, traffic and latency target and you’ll get a right-sized GPU plan, a realistic monthly cost, and an honest read on whether self-hosting beats an API. First call is 30 minutes, no charge.
Email me →
Assumptions. VRAM = model weights + KV cache + ~15% overhead for activations and the CUDA context. Architectures (layers, hidden size) are typical for each size class; real models vary, and techniques like grouped-query attention can shrink the KV cache. GPU prices are representative cloud list rates per hour. Treat this as a serving-side ballpark. Rates last updated 2026-06-26.
Comparing against API pricing? Try the LLM cost calculator
Frequently asked questions
How much VRAM do I need to run an LLM?
Three things add up: the model weights (parameters × bytes per parameter, e.g. 2 bytes at FP16), the KV cache (which grows with context length and the number of concurrent requests), and ~15% overhead for activations and the CUDA context. A 7-8B model at FP16 needs roughly 13-15 GiB (14-16 GB) for the weights alone, or about 16-20 GiB once a modest KV cache and overhead are added; quantizing to 4-bit roughly quarters the weight memory. Use the calculator above to size your exact setup.
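The sum described above can be sketched in a few lines of Python. This is a rough illustration, not the calculator's actual code: the function names are hypothetical, and it assumes the ~15% overhead is applied to weights plus KV cache together.

```python
GIB = 2**30  # bytes in one GiB


def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Weight memory: parameter count times bytes per parameter."""
    return params_billion * 1e9 * bytes_per_param / GIB


def total_vram_gib(
    params_billion: float,
    bytes_per_param: float,
    kv_cache_gib: float,
    overhead_frac: float = 0.15,  # assumption: overhead scales with weights + KV cache
) -> float:
    """Weights + KV cache, plus a fractional overhead for activations and CUDA context."""
    base = weights_gib(params_billion, bytes_per_param) + kv_cache_gib
    return base * (1 + overhead_frac)


if __name__ == "__main__":
    # Illustrative inputs: an 8B model at FP16 (2 bytes/param) with a 2 GiB KV cache.
    print(f"weights: {weights_gib(8, 2):.1f} GiB")
    print(f"total:   {total_vram_gib(8, 2, kv_cache_gib=2.0):.1f} GiB")
```

Swapping `bytes_per_param` to 0.5 shows the effect of 4-bit weights while the KV cache term stays the same.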
What is the KV cache and why does it grow?
The KV cache stores the attention keys and values for every token already in the context, for every layer, so the model does not recompute them each step. Its size scales with context length × batch (concurrent requests) × layers × hidden size. On long contexts or high concurrency it can rival or exceed the weights, which is why a model that "fits" at short context can run out of memory at long context.
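As a back-of-the-envelope sketch of that scaling, the function below multiplies the factors out, adding the factor of 2 for keys plus values and the bytes per stored element. The function name and the example architecture (32 layers, hidden size 4096) are illustrative assumptions, and `kv_dim` is a hypothetical parameter for models whose key/value width is smaller than the hidden size, as with grouped-query attention.

```python
from typing import Optional

GIB = 2**30  # bytes in one GiB


def kv_cache_gib(
    layers: int,
    hidden_size: int,
    context_len: int,
    batch: int,
    bytes_per_elem: float = 2.0,   # 2 for FP16, 1 for an FP8 KV cache
    kv_dim: Optional[int] = None,  # assumption: K/V width; defaults to hidden_size
) -> float:
    """Keys and values for every token, every layer, every concurrent request."""
    width = hidden_size if kv_dim is None else kv_dim
    total_bytes = 2 * layers * context_len * batch * width * bytes_per_elem
    return total_bytes / GIB


if __name__ == "__main__":
    # Illustrative 7B-class shape: 32 layers, hidden size 4096, 4096-token context.
    for batch in (1, 8):
        print(f"batch={batch}: {kv_cache_gib(32, 4096, 4096, batch):.1f} GiB")
```

With these example numbers, moving from one request to eight multiplies the cache by eight, which is how a setup that fits at low concurrency can run out of memory under load.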
Does quantization reduce GPU requirements?
Yes, a lot. FP16 uses 2 bytes per weight; INT8 uses 1; INT4 (GPTQ/AWQ) uses about 0.5. Going from FP16 to INT4 cuts weight memory roughly 4×, often the difference between needing two GPUs and one, with modest quality loss for most workloads. It does not shrink the KV cache, which you control with context length and KV precision.
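To make the ratio concrete, here is a small sketch that applies the bytes-per-weight figures above to a single model size. The 70B size and the helper name are illustrative choices, and the figures are nominal: real INT4 formats also store some per-group metadata, so treat the INT4 row as a lower bound.

```python
GIB = 2**30  # bytes in one GiB

# Nominal bytes per weight, as listed in the text. GPTQ/AWQ checkpoints also
# carry per-group scales, so their effective size is slightly above 0.5.
BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gib(params_billion: float, fmt: str) -> float:
    """Weight memory in GiB for a given parameter count and format."""
    return params_billion * 1e9 * BYTES_PER_WEIGHT[fmt] / GIB


if __name__ == "__main__":
    # Illustrative 70B model; only the weights are counted, not the KV cache.
    for fmt in BYTES_PER_WEIGHT:
        print(f"{fmt}: {weight_memory_gib(70, fmt):.1f} GiB")
```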
Should I self-host or use an API?
Self-hosting wins when you have steady, high volume, strict data-residency needs, or want a fine-tuned open model; an API wins for spiky or low traffic, where you would pay for idle GPUs. Compare the monthly GPU cost here against the same workload in the LLM cost calculator. The break-even is usually a question of utilization, not sticker price.
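The utilization point can be illustrated with a rough break-even sketch. Every number here is a placeholder rather than a real price or benchmark: the hourly rate, the API price per million tokens, and the GPU's sustained throughput are all hypothetical inputs you would replace with your own.

```python
HOURS_PER_MONTH = 730  # 365 * 24 / 12


def monthly_gpu_cost(hourly_rate_usd: float, num_gpus: int = 1) -> float:
    """Cost of keeping GPUs running all month, busy or idle."""
    return hourly_rate_usd * HOURS_PER_MONTH * num_gpus


def break_even_million_tokens(gpu_monthly_usd: float, api_usd_per_million: float) -> float:
    """Monthly token volume (in millions) at which API spend equals GPU spend."""
    return gpu_monthly_usd / api_usd_per_million


def break_even_utilization(break_even_m_tokens: float, tokens_per_second: float) -> float:
    """Fraction of the GPU's monthly capacity needed to reach break-even."""
    capacity_m_tokens = tokens_per_second * HOURS_PER_MONTH * 3600 / 1e6
    return break_even_m_tokens / capacity_m_tokens


if __name__ == "__main__":
    # Hypothetical inputs: $2.00/hour GPU, $0.50 per million API tokens,
    # 2000 tokens/second sustained throughput on the self-hosted GPU.
    gpu_cost = monthly_gpu_cost(2.00)
    volume = break_even_million_tokens(gpu_cost, 0.50)
    print(f"GPU cost: ${gpu_cost:.0f}/month")
    print(f"Break-even volume: {volume:.0f}M tokens/month")
    print(f"Break-even utilization: {break_even_utilization(volume, 2000):.0%}")
```

Below the break-even utilization the API is cheaper under these assumptions; above it, the self-hosted GPU is.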
Is this calculator accurate?
It is a solid serving-side estimate, not a guarantee. It assumes typical architectures per size class and representative GPU prices; real numbers shift with grouped-query attention, the serving stack, and your exact GPU and region. For a precise plan tied to your model and traffic, a short call is the fastest path.