BeginnerServer / VPS 8 min read

Run Llama 3.1 8B on a €20/month VPS

A complete guide to running a private LLM API on a budget Linux VPS using llama.cpp server mode.

llama.cppVPSLinuxGGUFAPI

Requirements

You need a Linux VPS with at least 16 GB RAM (32 GB recommended). CPU-only inference is surprisingly usable for personal use.

bash

# Tested on Ubuntu 22.04 LTS
# RAM: 16–32 GB | CPU: 4–8 cores
# Monthly cost: ~€15–25 (Hetzner CX32 / OVH Advance)

Install llama.cpp

Build from source for best CPU performance with OpenBLAS acceleration.

bash

sudo apt update && sudo apt install -y build-essential cmake libopenblas-dev
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j$(nproc)

Download the model

Use Q4_K_M for the best accuracy/size tradeoff on limited RAM. The 8B model fits easily in 16 GB.

bash

pip install huggingface_hub
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

Start the server

Run llama.cpp in server mode on port 8080. Add an API key for basic auth.

bash

./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 8192 \
  -t $(nproc) \
  --api-key "your-secret-key"

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.