Hardware Guide & Expected Performance

A reference guide to help you choose the right model size based on your hardware configuration (Apple Silicon and NVIDIA GPUs), including required memory (VRAM) and expected tokens per second (t/s).

1. Understanding Memory and Quantization

Running an LLM locally primarily depends on your available memory (VRAM on NVIDIA, Unified Memory on Mac). Most local setups use 4-bit or 8-bit quantization to reduce memory usage significantly with minimal impact on output quality.

  • Rule of thumb for 4-bit: 1 billion parameters (1B) ≈ 0.7 GB of memory for the weights, plus 1–2 GB for the context window (KV cache).
  • Tokens per second (t/s): roughly 10–15 t/s matches reading speed; 30+ t/s feels fast and conversational.
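The rule of thumb above can be turned into a quick estimator. This is a sketch based only on the guide's own numbers; `estimated_memory_gb` is an illustrative helper, not part of any library, and real quantization formats (e.g. different GGUF variants) vary somewhat around these figures.

```python
def estimated_memory_gb(params_billion: float, bits: int = 4,
                        context_overhead_gb: float = 1.5) -> float:
    """Rough memory needed to run a quantized LLM locally.

    Applies the guide's rule of thumb: at 4-bit, ~0.7 GB per billion
    parameters for the weights, plus 1-2 GB of overhead for the
    context window (KV cache). Scales linearly for other bit widths.
    """
    gb_per_billion_at_4bit = 0.7  # rule of thumb from this guide
    weights_gb = params_billion * gb_per_billion_at_4bit * (bits / 4)
    return weights_gb + context_overhead_gb

# Example: an 8B model at 4-bit needs roughly 0.7 * 8 + 1.5 = 7.1 GB,
# which is why 8 GB of memory is a tight fit and 12-16 GB is comfortable.
print(round(estimated_memory_gb(8), 1))
```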

2. Apple Silicon (macOS)

Macs with M-series chips (M1, M2, M3, M4) use Unified Memory. Because the CPU and GPU share the same RAM, you can run very large models without buying expensive specialized graphics cards, though speed will heavily depend on memory bandwidth (Pro, Max, or Ultra chips are significantly faster).

Unified Memory     | Recommended Model Size | Example Models                        | Expected Speed (M2/M3/M4)
8 GB               | 1.5B – 4B (4-bit)      | Qwen2.5 1.5B, Gemma 3 4B              | 30–60 t/s
16 / 18 GB         | 7B – 14B (4-bit)       | Llama 3.1 8B, Qwen2.5 14B             | 25–50 t/s
32 / 36 GB         | 32B – 35B (4-bit)      | QwQ 32B, Qwen2.5 32B                  | 15–30 t/s (faster on Max chips)
64 GB+ (Max/Ultra) | 70B+ (4-bit)           | Llama 3.3 70B, DeepSeek-R1 (Distill)  | 10–20 t/s

3. NVIDIA GPUs (Windows / Linux)

NVIDIA GPUs offer the fastest inference speeds thanks to massively parallel CUDA cores and fast GDDR6/GDDR6X VRAM. However, you are strictly limited by your dedicated video RAM (VRAM): if a model does not fit entirely in VRAM, part of it spills over to system RAM, which drastically slows down generation (often below 5 t/s).

GPU VRAM          | Recommended Model Size       | Example GPUs          | Expected Speed (4-bit)
8 GB              | 1.5B – 8B                    | RTX 3060 Ti, RTX 4060 | 40–80+ t/s
12 GB             | 8B – 14B                     | RTX 3060 12 GB, RTX 4070 | 50–90+ t/s
16 GB             | 14B – 32B (reduced context)  | RTX 4080 (Super)      | 40–70 t/s
24 GB             | 32B (full context)           | RTX 3090, RTX 4090    | 60–100+ t/s
2× 24 GB (48 GB)  | 70B                          | Dual RTX 3090/4090    | 15–30 t/s
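As a sanity check on the VRAM tiers above, you can invert the 4-bit rule of thumb from section 1 to get the largest model that fits a given amount of VRAM. Again a sketch using only this guide's numbers; `max_model_size_b` is an illustrative name, and real-world fit also depends on the context length you configure.

```python
def max_model_size_b(vram_gb: float, bits: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Largest parameter count (in billions) whose quantized weights,
    plus context-window overhead, fit in the given VRAM.

    Inverts the guide's rule of thumb (~0.7 GB per billion params at
    4-bit, scaled linearly for other bit widths).
    """
    gb_per_billion = 0.7 * (bits / 4)
    return max(0.0, (vram_gb - overhead_gb) / gb_per_billion)

# A 24 GB card works out to about (24 - 1.5) / 0.7 = 32B parameters,
# matching the table's "32B (full context)" row for the RTX 3090/4090.
print(round(max_model_size_b(24), 1))
```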

4. Hardware Buying Advice for Local AI

  • For best value: a Mac mini or MacBook Pro with 16 GB or 32 GB of unified memory is an excellent, cost-effective all-rounder, handling 8B models (16 GB) up to 32B models (32 GB) at decent speeds.
  • For maximum speed: A desktop with an NVIDIA RTX 4090 (24GB VRAM) or even a used RTX 3090 (24GB VRAM) is unbeatable for fast inference and training/fine-tuning.
  • For 70B+ massive models: A Mac Studio with 128GB Unified Memory is often much cheaper and more energy-efficient than buying multiple high-end NVIDIA GPUs to get equivalent VRAM.