Back to Cookbook
BeginnerServer / VPS 8 min read
Run Llama 3.1 8B on a €20/month VPS
A complete guide to running a private LLM API on a budget Linux VPS using llama.cpp server mode.
llama.cppVPSLinuxGGUFAPI
Requirements
You need a Linux VPS with at least 16 GB RAM (32 GB recommended). CPU-only inference is surprisingly usable for personal use.
bash
# Tested on Ubuntu 22.04 LTS
# RAM: 16–32 GB | CPU: 4–8 cores
# Monthly cost: ~€15–25 (Hetzner CX32 / OVH Advance)Install llama.cpp
Build from source for best CPU performance with OpenBLAS acceleration.
bash
sudo apt update && sudo apt install -y build-essential cmake libopenblas-dev
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j$(nproc)Download the model
Use Q4_K_M for the best accuracy/size tradeoff on limited RAM. The 8B model fits easily in 16 GB.
bash
pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir ./modelsStart the server
Run llama.cpp in server mode on port 8080. Add an API key for basic auth.
bash
./build/bin/llama-server \
-m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--host 0.0.0.0 --port 8080 \
-c 8192 \
-t $(nproc) \
--api-key "your-secret-key"Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.