Running a large language model on your own machine is mostly a question of memory. If the model's weights (plus a little overhead) fit in your GPU's VRAM, it runs fast. If they don't, you either offload to slower system RAM or can't run it at all. This guide gives you practical numbers, and a free VRAM calculator to check any model instantly.
VRAM needed ≈ (number of parameters) × (bytes per parameter) + a context/overhead buffer. Bytes per parameter depend on the quantization:
| Precision | Bytes / param | Notes |
|---|---|---|
| FP16 / BF16 | 2.0 | Full quality, largest |
| 8-bit (Q8) | ~1.0 | Near-lossless |
| 4-bit (Q4, AWQ, GGUF Q4_K_M) | ~0.5 | Best size/quality trade-off |
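
The rule of thumb above can be sketched in a few lines of Python. This is only a rough estimate: the function name `estimate_vram_gb` and the flat 1.5 GB overhead default are assumptions made for illustration, and real overhead varies with the runtime and context length.

```python
# Approximate bytes per parameter, taken from the table above.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def estimate_vram_gb(params_billion: float, precision: str, overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB: weight size plus a flat overhead buffer.

    overhead_gb is an assumed placeholder, not a measured value.
    """
    # Billions of parameters times bytes per parameter gives GB directly.
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb + overhead_gb


for precision in ("fp16", "q8", "q4"):
    print(f"7B at {precision}: ~{estimate_vram_gb(7, precision):.1f} GB")
```

With these assumptions, a 7B model comes out at about 15.5 GB in FP16 and about 5 GB at 4-bit.
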
| Model | 4-bit VRAM | Fits on |
|---|---|---|
| 7B | ~5–6 GB | RTX 3060 / most laptops |
| 13B | ~9–10 GB | RTX 3080 / 4070 |
| 34B | ~22 GB | RTX 3090 / 4090 |
| 70B | ~42–48 GB | 2× 24 GB or an 80 GB A100 |
| 405B | ~230 GB | Multi-GPU server |
Add headroom for context length: a long 32K-token context can add several GB on top of the weights.
1. Quantize: moving from FP16 to 4-bit roughly quarters the memory, with minimal quality loss for most uses.
2. Use GGUF with CPU offload (llama.cpp / Ollama) to split layers between GPU and RAM.
3. Shorten the context window if you're memory-bound.
4. Pick the right format: AWQ and GGUF Q4_K_M are popular sweet spots.
Can I run a 70B model on a 24 GB card? Not fully in VRAM at 4-bit (it needs ~42 GB), but you can with CPU offload; it'll just be slower.
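
To get a feel for how much of a 70B model a 24 GB card can hold, here is a minimal sketch. It assumes the weights split into equally sized layers, that the model has 80 layers, and that 2 GB of VRAM is kept free for the KV cache and runtime; all three are illustrative assumptions, and `gpu_layers_that_fit` is a hypothetical helper, not part of llama.cpp or Ollama.

```python
import math


def gpu_layers_that_fit(weights_gb: float, n_layers: int, vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after reserving headroom."""
    per_layer_gb = weights_gb / n_layers          # assumes every layer is the same size
    usable_gb = max(vram_gb - reserve_gb, 0.0)    # VRAM left after the assumed reserve
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# ~42 GB of 4-bit weights, an assumed 80 layers, one 24 GB card.
layers = gpu_layers_that_fit(weights_gb=42.0, n_layers=80, vram_gb=24.0)
print(f"Roughly {layers} of 80 layers fit on the GPU; the rest stay in system RAM.")
```

Under these assumptions about half the layers land on the GPU. In llama.cpp, a number like this is what you would pass to the `--n-gpu-layers` option, then adjust up or down based on actual memory use.
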
Does context length matter? Yes. The KV cache grows with context and model size, so long prompts need extra memory.
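
The KV cache cost can be estimated with the standard formula: 2 (keys and values) × layers × KV heads × head dimension × tokens × bytes per value. The sketch below uses assumed shapes (32 layers, 8 KV heads, head dimension 128, FP16 cache) that resemble an 8B-class model with grouped-query attention; check your model's config for the real values.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size in GiB for a single sequence."""
    # Factor 2 covers one key vector and one value vector per layer, head, and token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1024**3


# Assumed shapes: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
for tokens in (4096, 32768):
    print(f"{tokens:>6} tokens: {kv_cache_gib(32, 8, 128, tokens):.2f} GiB")
```

With these shapes the cache takes 0.5 GiB at 4K tokens and 4 GiB at 32K. Models without grouped-query attention, or with more layers, need proportionally more.
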