Intermediate | Server / VPS | 7 min read
CPU Inference: OpenBLAS Tuning for llama.cpp
Maximize tokens/sec on a CPU-only VPS with thread count and BLAS backend tuning.
CPU, llama.cpp, OpenBLAS, VPS
Thread count
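Set -t to the number of physical cores, not hyperthreads. -tb (--threads-batch) sets the thread count for prompt and batch processing and defaults to the value of -t. The llama-server command in this section uses -tb 1, which keeps prompt processing on a single thread; if long prompts are slow to ingest, raise -tb to match -t.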
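To pick a value for -t, it helps to know how many physical cores the machine reports. The following is a minimal sketch, assuming a Linux VPS with lscpu (util-linux) installed; physical_cores is a hypothetical helper name, not part of llama.cpp. On many VPS plans a vCPU is a hardware thread and the guest may not see the host's real topology, so treat the result as a starting point rather than a guarantee.

```shell
# physical_cores: hypothetical helper that prints the number of
# physical cores visible to this machine.
physical_cores() {
  local n=""
  if command -v lscpu >/dev/null 2>&1; then
    # One line per logical CPU as "core,socket"; unique pairs = physical cores.
    n=$(lscpu -p=CORE,SOCKET 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
  fi
  # Fall back to the logical CPU count if lscpu is missing or reports nothing.
  if [ -z "$n" ] || [ "$n" -lt 1 ]; then
    n=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
  fi
  echo "$n"
}

echo "physical cores: $(physical_cores)"
```

The printed number can then be passed to the server as -t "$(physical_cores)".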
bash
./build/bin/llama-server \
-m ./models/Llama-3.1-8B-Q4_K_M.gguf \
-t 8 -tb 1 -c 4096 \
--host 0.0.0.0 --port 8080
Expected performance
An 8-core VPS with OpenBLAS reaches roughly 8–15 tok/s on an 8B Q4_K_M model. That is usable for a personal API, but not for production throughput.
text
Hetzner CX32 (8 vCPU, 32GB): ~12 tok/s
AWS c7i.2xlarge (8 vCPU): ~15 tok/s
Deployment guides are educational. Each model is subject to its own license; read the official Hugging Face model card before downloading or deploying.
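Throughput on a VPS depends on the host CPU and on how many threads are used, so it can be worth measuring on your own machine. The sketch below uses llama.cpp's llama-bench tool, which accepts a comma-separated list for -t and reports tokens per second separately for prompt processing and text generation. The BENCH and MODEL paths, the CORES default of 8, and the thread_sweep helper are assumptions for illustration; adjust them to your build and model.

```shell
# Assumed locations; override via environment variables if yours differ.
BENCH=${BENCH:-./build/bin/llama-bench}
MODEL=${MODEL:-./models/Llama-3.1-8B-Q4_K_M.gguf}
CORES=${CORES:-8}   # assumed core count; set to your machine's value

# thread_sweep: hypothetical helper that prints "1,2,4,...,MAX",
# doubling the thread count until MAX is reached.
thread_sweep() {
  local max=$1 n=1 list=""
  while [ "$n" -lt "$max" ]; do
    list="${list}${n},"
    n=$((n * 2))
  done
  printf '%s%s\n' "$list" "$max"
}

if [ -x "$BENCH" ] && [ -f "$MODEL" ]; then
  # -p 512 benchmarks prompt processing; -n 128 benchmarks generation.
  "$BENCH" -m "$MODEL" -t "$(thread_sweep "$CORES")" -p 512 -n 128
else
  echo "llama-bench or model not found; set BENCH and MODEL" >&2
fi
```

The thread count with the highest generation rate in the resulting table is a reasonable choice for -t, and the one with the highest prompt-processing rate for -tb.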