Benchmarks & Insights
Hard numbers, no hype — measured on real hardware
Inference Speed
Tokens/second · Llama 3.1 8B Instruct · batch=1
Perplexity vs Quantization Level
WikiText-2 PPL · Llama 3.1 8B · lower = better (FP16 baseline = 6.14)
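For readers unfamiliar with the metric, here is a minimal sketch of how perplexity is conventionally derived from per-token log-probabilities, and how a quantized score can be expressed relative to the FP16 baseline of 6.14 quoted above. The function names `perplexity` and `ppl_increase_pct` are illustrative assumptions, not part of any benchmark harness used for these numbers.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def ppl_increase_pct(quant_ppl, baseline_ppl=6.14):
    """Percent increase of a quantized model's PPL over the FP16 baseline.

    The default 6.14 is the FP16 WikiText-2 baseline stated in the caption.
    """
    return (quant_ppl - baseline_ppl) / baseline_ppl * 100.0

if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has PPL of about 4.
    print(perplexity([math.log(0.25)] * 10))
```

A lower perplexity means the model assigns higher probability to the reference text, which is why "lower = better" in the chart.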
Hardware × Format Matrix
Tokens/second · Llama 3.1 8B Instruct · batch=1
| Hardware | Framework | Quant | Speed (tok/s) | VRAM Used | Notes |
|---|---|---|---|---|---|
| RTX 4090 24G | ExLlamaV2 | EXL2 4.65bpw | 235 | 5.4 GB | Peak consumer performance |
| RTX 4090 24G | vLLM | AWQ INT4 | 218 | 4.9 GB | Best for batched API serving |
| RTX 4090 24G | llama.cpp | GGUF Q4_K_M | 148 | 5.7 GB | Easiest setup |
| RTX 4060 Ti 16G | ExLlamaV2 | EXL2 4.65bpw | 98 | 5.4 GB | Great budget option |
| RTX 4060 Ti 16G | llama.cpp | GGUF Q4_K_M | 78 | 5.7 GB | Budget-friendly |
| RTX 3090 24G | ExLlamaV2 | EXL2 4.65bpw | 175 | 5.4 GB | Older but capable |
| M3 Max 48G | Ollama | GGUF Q4_K_M | 68 | 5.7 GB | Unified memory advantage |
| M2 Ultra 192G | llama.cpp | GGUF Q4_K_M | 90 | 5.7 GB | Enough memory to run 70B models on one machine |
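To make the table easier to compare, the sketch below computes each framework's speed relative to llama.cpp on the same hardware. The tok/s values are copied from the table; the helper name `speedup_vs` is an illustrative assumption. Each row pairs a framework with a different quantization format, so the ratios reflect the framework and format together, not the framework alone.

```python
# Rows copied from the table above: (hardware, framework, tok/s).
ROWS = [
    ("RTX 4090 24G", "ExLlamaV2", 235),
    ("RTX 4090 24G", "vLLM", 218),
    ("RTX 4090 24G", "llama.cpp", 148),
    ("RTX 4060 Ti 16G", "ExLlamaV2", 98),
    ("RTX 4060 Ti 16G", "llama.cpp", 78),
    ("RTX 3090 24G", "ExLlamaV2", 175),
    ("M3 Max 48G", "Ollama", 68),
    ("M2 Ultra 192G", "llama.cpp", 90),
]

def speedup_vs(rows, baseline_framework="llama.cpp"):
    """Return {hardware: {framework: ratio}} relative to the baseline framework.

    Hardware without a baseline row, or with only the baseline row, is skipped.
    """
    baseline = {hw: speed for hw, fw, speed in rows if fw == baseline_framework}
    result = {}
    for hw, fw, speed in rows:
        if hw in baseline and fw != baseline_framework:
            result.setdefault(hw, {})[fw] = round(speed / baseline[hw], 2)
    return result

if __name__ == "__main__":
    for hw, ratios in speedup_vs(ROWS).items():
        for fw, ratio in ratios.items():
            print(f"{hw}: {fw} is {ratio}x llama.cpp")
```

On the RTX 4090 this works out to about 1.59x for ExLlamaV2 and 1.47x for vLLM; on the RTX 4060 Ti, ExLlamaV2 is about 1.26x. The RTX 3090, M3 Max, and M2 Ultra rows have no same-hardware comparison in the table, so they are skipped.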