AdvancedServer / VPS 10 min read

vLLM + AWQ in Production: Tuning Guide

gpu-memory-utilization, max-model-len, and batching knobs for stable API serving.

vLLMAWQproductionAPI

Memory tuning

Start at 0.85 gpu-memory-utilization. Lower to 0.75 if you see OOM on long contexts.

bash

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.85 \
  --max-model-len 32768 \
  --max-num-seqs 32

Deployment guides are educational. Each model is subject to its own license — read the official Hugging Face model card before downloading or deploying.