Hardware Guide & Expected Performance

A reference guide to help you choose the right model size based on your hardware configuration (Apple Silicon and NVIDIA GPUs), including required memory (VRAM) and expected tokens per second (t/s).

1. Understanding Memory and Quantization

Running an LLM locally primarily depends on your available memory (VRAM on NVIDIA, Unified Memory on Mac). Most local setups use 4-bit or 8-bit quantization to reduce memory usage significantly with minimal impact on output quality.

  • Rule of thumb for 4-bit: 1 billion parameters (1B) ≈ 0.7 GB of memory for the weights, plus 1–2 GB for the context window (KV cache).
  • Tokens per second (t/s): roughly 10–15 t/s matches reading speed; 30+ t/s feels fast and conversational.
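The rule of thumb above can be turned into a quick estimator. This is a sketch based only on the guide's own numbers; `estimated_memory_gb` is an illustrative helper, not part of any library, and real quantization formats (e.g. different GGUF variants) vary somewhat around these figures.

```python
def estimated_memory_gb(params_billion: float, bits: int = 4,
                        context_overhead_gb: float = 1.5) -> float:
    """Rough memory needed to run a quantized LLM locally.

    Applies the guide's rule of thumb: at 4-bit, ~0.7 GB per billion
    parameters for the weights, plus 1-2 GB of overhead for the
    context window (KV cache). Scales linearly for other bit widths.
    """
    gb_per_billion_at_4bit = 0.7  # rule of thumb from this guide
    weights_gb = params_billion * gb_per_billion_at_4bit * (bits / 4)
    return weights_gb + context_overhead_gb

# Example: an 8B model at 4-bit needs roughly 0.7 * 8 + 1.5 = 7.1 GB,
# which is why 8 GB of memory is a tight fit and 12-16 GB is comfortable.
print(round(estimated_memory_gb(8), 1))
```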

2. Apple Silicon (macOS)

Macs with M-series chips (M1, M2, M3, M4) use Unified Memory. Because the CPU and GPU share the same RAM, you can run very large models without buying expensive specialized graphics cards, though speed will heavily depend on memory bandwidth (Pro, Max, or Ultra chips are significantly faster).

Unified Memory     | Recommended Model Size | Example Models                        | Expected Speed (M2/M3/M4)
8 GB               | 1.5B – 4B (4-bit)      | Qwen2.5 1.5B, Gemma 3 4B              | 30–60 t/s
16 / 18 GB         | 7B – 14B (4-bit)       | Llama 3.1 8B, Qwen2.5 14B             | 25–50 t/s
32 / 36 GB         | 32B – 35B (4-bit)      | QwQ 32B, Qwen2.5 32B                  | 15–30 t/s (faster on Max chips)
64 GB+ (Max/Ultra) | 70B+ (4-bit)           | Llama 3.3 70B, DeepSeek-R1 (Distill)  | 10–20 t/s

3. NVIDIA GPUs (Windows / Linux)

NVIDIA GPUs offer the fastest inference speeds thanks to massively parallel CUDA cores and fast GDDR6/GDDR6X VRAM. However, you are strictly limited by your dedicated video RAM (VRAM): if a model does not fit entirely in VRAM, part of it spills over to system RAM, which drastically slows down generation (often below 5 t/s).

GPU VRAM          | Recommended Model Size       | Example GPUs          | Expected Speed (4-bit)
8 GB              | 1.5B – 8B                    | RTX 3060 Ti, RTX 4060 | 40–80+ t/s
12 GB             | 8B – 14B                     | RTX 3060 12 GB, RTX 4070 | 50–90+ t/s
16 GB             | 14B – 32B (reduced context)  | RTX 4080 (Super)      | 40–70 t/s
24 GB             | 32B (full context)           | RTX 3090, RTX 4090    | 60–100+ t/s
2× 24 GB (48 GB)  | 70B                          | Dual RTX 3090/4090    | 15–30 t/s
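As a sanity check on the VRAM tiers above, you can invert the 4-bit rule of thumb from section 1 to get the largest model that fits a given amount of VRAM. Again a sketch using only this guide's numbers; `max_model_size_b` is an illustrative name, and real-world fit also depends on the context length you configure.

```python
def max_model_size_b(vram_gb: float, bits: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Largest parameter count (in billions) whose quantized weights,
    plus context-window overhead, fit in the given VRAM.

    Inverts the guide's rule of thumb (~0.7 GB per billion params at
    4-bit, scaled linearly for other bit widths).
    """
    gb_per_billion = 0.7 * (bits / 4)
    return max(0.0, (vram_gb - overhead_gb) / gb_per_billion)

# A 24 GB card works out to about (24 - 1.5) / 0.7 = 32B parameters,
# matching the table's "32B (full context)" row for the RTX 3090/4090.
print(round(max_model_size_b(24), 1))
```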

4. Hardware Buying Advice for Local AI

  • For best value: a Mac mini or MacBook Pro with 16 GB or 32 GB of unified memory is an excellent, cost-effective all-rounder, handling 8B models (16 GB) up to 32B models (32 GB) at decent speeds.
  • For maximum speed: A desktop with an NVIDIA RTX 4090 (24GB VRAM) or even a used RTX 3090 (24GB VRAM) is unbeatable for fast inference and training/fine-tuning.
  • For 70B+ massive models: A Mac Studio with 128GB Unified Memory is often much cheaper and more energy-efficient than buying multiple high-end NVIDIA GPUs to get equivalent VRAM.