Background
Running LLM inference locally via llama.cpp on a homelab LXC container with a GTX 1080 GPU. The goal: better speed than Ollama with full GPU acceleration. Along the way, hit a broken thinking token bug, a Pascal GPU compatibility issue, and a zombie GPU memory leak — all resolved.
Environment
| Component | Detail |
|---|---|
| Host | Proxmox VE, Debian LXC |
| GPU | NVIDIA GTX 1080 (8GB, Pascal / compute 6.1) |
| CUDA | 12.4 (required — CUDA 13+ dropped Pascal support) |
| llama.cpp | Built from master (Jun 21 2026) |
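The compute capability in the table determines the CMAKE_CUDA_ARCHITECTURES value used in the build: the major and minor digits are joined without the dot (6.1 becomes 61). A minimal sketch of that mapping; the helper name cuda_arch_flag is hypothetical and not part of llama.cpp or CMake.

```python
def cuda_arch_flag(compute_capability: str) -> str:
    """Turn a compute capability such as "6.1" into the CMake value "61"."""
    # Unpacking raises ValueError if there is not exactly one dot.
    major, minor = compute_capability.strip().split(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"not a compute capability: {compute_capability!r}")
    return f"{int(major)}{int(minor)}"


print(cuda_arch_flag("6.1"))  # 61
```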
1. Build llama.cpp from Source
Pre-built binaries are too old. Clone master and build with CUDA support.
# Install build dependencies
apt-get update && apt-get install -y cmake build-essential git
# Clone llama.cpp
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git /tmp/llama-fresh
# Build with CUDA for Pascal GPU (compute capability 6.1)
cd /tmp/llama-fresh
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-12 \
  -DCMAKE_CUDA_ARCHITECTURES=61 \
  '-DCMAKE_CUDA_FLAGS=--allow-unsupported-compiler' \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_LTO=OFF
cmake --build build --target llama-cli llama-server -j$(nproc)
Note: the host compiler is pinned to GCC 12 via CMAKE_CUDA_HOST_COMPILER. The --allow-unsupported-compiler flag is required when mixing CUDA nvcc with newer host compilers.
2. Verify GPU Detection
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli --list-devices
Expected output:
Available devices:
CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)
3. Download the Model
Download the model on your local machine (HuggingFace does not require a login for this download):
- Go to https://huggingface.co/bartowski/Qwen_Qwen3.5-9B-GGUF
- Click the Files tab
- Download Qwen_Qwen3.5-9B-Q4_K_M.gguf (6.17GB)
Then copy it to the LXC:
scp ~/Downloads/Qwen_Qwen3.5-9B-Q4_K_M.gguf [email protected]:/opt/models/
Warning: avoid the HauhauCS variant (Qwen3.5-9B-Uncensored-HauhauCS-Aggressive). Its aggressive quantization destroys the special thinking token IDs, causing an infinite ???? loop in llama.cpp output. The official Qwen_Qwen3.5-9B-GGUF works correctly.
4. Run llama-cli with GPU Acceleration
ssh [email protected]
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  --log-disable
Type your prompt and press Enter. Type /exit to quit.
Key Flags
| Flag | Description |
|---|---|
| -ngl 99 | Offload all layers to GPU (use -ngl 20 for partial offload if VRAM is limited) |
| -c 512 | Context size; adjust based on available VRAM |
| --reasoning off | Disables Qwen's built-in thinking mode |
| --log-disable | Suppresses verbose logging |
Performance on GTX 1080 (8GB)
| Metric | Value |
|---|---|
| Prompt processing | ~140 t/s |
| Generation | ~30 t/s |
| GPU memory usage (full offload) | ~6.3GB |
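The two throughput figures give a rough way to budget response time: prompt tokens are processed at one rate and output tokens are generated at another. A back-of-the-envelope sketch using the numbers from the table; the function name is hypothetical, and real latency also includes model load time and sampling overhead that this ignores.

```python
PROMPT_TPS = 140.0  # prompt processing, tokens/s (from the table above)
GEN_TPS = 30.0      # generation, tokens/s (from the table above)


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate wall-clock time for one request, ignoring fixed overheads."""
    return prompt_tokens / PROMPT_TPS + output_tokens / GEN_TPS


# A 400-token prompt with a 200-token answer:
print(f"{estimate_seconds(400, 200):.1f} s")  # 9.5 s
```

Generation dominates: the 200 output tokens account for roughly 6.7 of the 9.5 seconds.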
5. Direct HuggingFace Download via -hf Flag
llama.cpp can download GGUF files directly without manual download:
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli \
  -hf bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  -p "Your prompt here" \
  -n 200
Format: user/model:quant. The file is downloaded to ~/.cache/huggingface/hub/.
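To make the user/model:quant format concrete, here is a small parser that splits such a string into its parts. It is an illustrative sketch, not llama.cpp's own parsing code; the names HfSpec and parse_hf_spec are hypothetical, and treating a missing quant as None is an assumption of this example.

```python
from typing import NamedTuple, Optional


class HfSpec(NamedTuple):
    user: str
    model: str
    quant: Optional[str]


def parse_hf_spec(spec: str) -> HfSpec:
    """Split "user/model:quant" into its three parts."""
    repo, _, quant = spec.partition(":")
    user, sep, model = repo.partition("/")
    if not sep or not user or not model:
        raise ValueError(f"expected user/model[:quant], got {spec!r}")
    return HfSpec(user, model, quant or None)


print(parse_hf_spec("bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M"))
```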
6. The HauhauCS Thinking Loop Bug
Initial testing with the HauhauCS-Aggressive variant produced infinite ???? output. Root cause: aggressive quantization destroyed the special thinking token IDs (<|thought|> / <|im_end|>). The tokenizer mapped them to U+003F ? characters instead of triggering the thinking block.
The --reasoning off API flag was ignored because llama.cpp upstream deprecated template-parameter-based reasoning control in recent builds. The solution: switch to the official Qwen_Qwen3.5-9B-GGUF model, which has thinking disabled by default.
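When scripting against a model you have not vetted, it can help to detect this failure mode and abort early instead of burning tokens. A minimal sketch that flags a long unbroken run of question marks; the function name and the 16-character threshold are arbitrary choices for illustration, not anything llama.cpp provides.

```python
import re


def looks_like_question_mark_loop(text: str, min_run: int = 16) -> bool:
    """Return True if the text contains min_run or more consecutive '?'."""
    pattern = r"\?{%d,}" % min_run
    return re.search(pattern, text) is not None


print(looks_like_question_mark_loop("?" * 40))         # True
print(looks_like_question_mark_loop("Really? Sure?"))  # False
```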
7. GPU Memory Zombie Issue
If nvidia-smi shows high memory usage but no processes:
nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
# If 6-7GB used with 1-2GB free but "No running processes found" → zombie context
Fix: Reboot the LXC. The zombie GPU context cannot be freed without a reboot.
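Before rebooting, the zombie check can be scripted. The sketch below compares used memory with the list of compute processes, both read via nvidia-smi query options. The function names and the 1024 MiB threshold are assumptions of this example. Inside a container, nvidia-smi may be unable to see processes from other PID namespaces, so treat a positive result as a hint rather than proof.

```python
import subprocess


def looks_like_zombie(used_mib: int, compute_pids: list, threshold_mib: int = 1024) -> bool:
    """High memory use with no compute process attached suggests a leaked context."""
    return used_mib >= threshold_mib and not compute_pids


def _query(option: str) -> list:
    """Run one nvidia-smi query and return its non-empty output lines."""
    out = subprocess.run(
        ["nvidia-smi", option, "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def check_gpu() -> bool:
    """Query the first GPU and apply the heuristic above."""
    used_mib = int(_query("--query-gpu=memory.used")[0])
    pids = [int(pid) for pid in _query("--query-compute-apps=pid")]
    return looks_like_zombie(used_mib, pids)

# Usage on the LXC: print(check_gpu())
```

If the check reports a zombie context, reboot the container: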
# From Proxmox host
pct reboot 100
8. Ollama as Stable Alternative
If llama.cpp proves unreliable, Ollama provides a stable alternative with thinking control at the API level:
# Install
curl -fsSL https://ollama.com/install.sh | sh
# Pull model
ollama pull qwen3.5:9b-q4_K_M
# API call with thinking disabled
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:9b-q4_K_M",
    "messages": [{"role": "user", "content": "Your prompt"}],
    "stream": false,
    "think": false
  }'
Ollama patches the model at load time, handling thinking tokens correctly. Typical generation speed: ~10 t/s on GTX 1080.
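To measure the generation speed on your own hardware, the non-streaming /api/chat response carries eval_count (tokens generated) and eval_duration (nanoseconds). A minimal sketch that turns those two fields into tokens per second; the function name is hypothetical and the sample values are made up to match the ~10 t/s figure.

```python
def generation_tps(response: dict) -> float:
    """Tokens per second from an Ollama /api/chat response (stream=false)."""
    tokens = response["eval_count"]
    nanos = response["eval_duration"]
    if nanos <= 0:
        raise ValueError("eval_duration must be positive")
    return tokens / (nanos / 1e9)


# Illustrative values: 200 tokens generated in 20 seconds.
sample = {"eval_count": 200, "eval_duration": 20_000_000_000}
print(f"{generation_tps(sample):.1f} t/s")  # 10.0 t/s
```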
File Locations Summary
| File | Path |
|---|---|
| llama.cpp source | /tmp/llama-fresh/ |
| llama-cli binary | /tmp/llama-fresh/build/bin/llama-cli |
| llama-server binary | /tmp/llama-fresh/build/bin/llama-server |
| Model (Q4_K_M) | /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf |
| CUDA libraries | /usr/local/cuda-12.4/targets/x86_64-linux/lib/ |
9. Using llama-server as an OpenWebUI and AI Agent Backend
llama.cpp ships with llama-server, an HTTP server that exposes an OpenAI-compatible REST API. This lets you use it as a backend for OpenWebUI, AnythingLLM, LibreChat, or any agent that speaks the OpenAI API.
9.1 Start the Server
ssh [email protected]
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-server \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --reasoning off \
  --host 0.0.0.0 \
  --port 8080 &
The trailing & runs the server in the background; for a persistent setup, run it under systemd (section 9.6). Test it:
curl http://localhost:8080/v1/models
Expected response:
{"object":"list","data":[{"id":"Qwen_Qwen3.5-9B-Q4_K_M.gguf","object":"model","created":1234567890,"owned_by":"local"}]}
9.2 OpenWebUI
OpenWebUI supports llama.cpp as a direct backend. In Settings → Admin Panel → Models, add:
Base URL: http://192.168.2.50:8080/v1
API Key: any-non-empty-string (llama.cpp ignores this by default)
Model name: Qwen_Qwen3.5-9B-Q4_K_M.gguf
Or set via environment variables:
OPENAI_API_BASE_URL=http://192.168.2.50:8080/v1
OPENAI_API_KEY=any-string
OpenWebUI will list the model and you can chat with it directly. The full OpenAI-compatible endpoint (/v1/chat/completions) is supported.
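The model name entered in OpenWebUI has to match the id that llama-server reports on /v1/models. A small sketch that extracts the ids from that response; the function name is hypothetical, and the sample string is the expected response shown in section 9.1.

```python
import json


def model_ids(models_json: str) -> list:
    """Return the model ids from a /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


sample = (
    '{"object":"list","data":[{"id":"Qwen_Qwen3.5-9B-Q4_K_M.gguf",'
    '"object":"model","created":1234567890,"owned_by":"local"}]}'
)
print(model_ids(sample))  # ['Qwen_Qwen3.5-9B-Q4_K_M.gguf']
```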
9.3 AI Agent Integration (OpenAI-Compatible)
Any agent framework that supports OpenAI API works out of the box:
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.2.50:8080/v1",
    api_key="not-needed"
)
response = client.chat.completions.create(
    model="Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
JavaScript / TypeScript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://192.168.2.50:8080/v1',
  apiKey: 'not-needed'
});
const response = await client.chat.completions.create({
  model: 'Qwen_Qwen3.5-9B-Q4_K_M.gguf',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain Python decorators in 2 sentences.' }
  ]
});
console.log(response.choices[0].message.content);
cURL
curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
9.4 Chat Completions vs Completions
llama-server exposes the following endpoints:
| Endpoint | Use Case |
|---|---|
| POST /v1/chat/completions | Chat-based models (Qwen, Llama Instruct). Recommended. |
| POST /v1/completions | Raw text completion for any model. |
| GET /v1/models | List available models. |
| POST /v1/embeddings | Generate embeddings (the server must be started with embeddings enabled). |
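The examples in section 9.3 all use the chat endpoint. For comparison, here is a sketch of a raw completion request using only the Python standard library. It assumes the server address used throughout this guide and an OpenAI-style response with the text under choices[0].text; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://192.168.2.50:8080/v1"  # server address used in this guide
MODEL = "Qwen_Qwen3.5-9B-Q4_K_M.gguf"


def build_completion_request(prompt: str, max_tokens: int = 64,
                             base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64) -> str:
    """Send the request and return the generated text."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage (needs a running server): print(complete("Once upon a time"))
```

Unlike the chat endpoint, no chat template is applied here: the prompt is continued as plain text.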
9.5 Streaming Responses
curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
Returns Server-Sent Events (SSE). Most OpenAI-compatible clients support streaming natively.
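If you consume the stream without an SDK, each event arrives as a line starting with "data:" that holds a JSON chunk, and the stream ends with "data: [DONE]". The sketch below extracts the text deltas from such lines, assuming the OpenAI-style chunk layout (choices[].delta.content). The function name and the sample chunks are illustrative, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"1, 2"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":", 3"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # 1, 2, 3
```

The same function works on a live response if you feed it the decoded lines of the HTTP body.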
9.6 Systemd Service (Auto-Start)
Create /etc/systemd/system/llama-server.service:
[Unit]
Description=llama.cpp server
After=network.target
[Service]
Type=simple
User=root
WorkingDirectory=/opt
ExecStart=/bin/bash -c 'LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH /tmp/llama-fresh/build/bin/llama-server -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf -ngl 99 -c 4096 --reasoning off --host 0.0.0.0 --port 8080'
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Enable and start:
systemctl daemon-reload
systemctl enable llama-server
systemctl start llama-server
systemctl status llama-server
9.7 Performance Tuning
| Flag | Effect | GTX 1080 Recommendation |
|---|---|---|
| -ngl | GPU layers to offload (0 = CPU only, 99 = all layers on GPU) | 99 (6.3GB VRAM used) |
| -c | Context size (tokens held in memory) | 4096 (use -c 2048 if OOM) |
| --parallel | Max concurrent requests | 4 (shared VRAM) |
| --threads | CPU threads for pre/post processing | 4 |
| --mlock | Lock model in RAM (no swap) | Enable if RAM is sufficient |
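The -c and --parallel settings interact. Assuming the server divides the context budget evenly across its request slots (behavior that may differ between llama.cpp builds, so verify it against your server's startup log), each concurrent request gets only a fraction of the configured context. A quick sketch of the arithmetic, with a hypothetical helper name:

```python
def context_per_slot(total_context: int, parallel: int) -> int:
    """Context tokens available to each slot under an even split."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    return total_context // parallel


# Recommended values from the table: -c 4096 with --parallel 4
print(context_per_slot(4096, 4))  # 1024
```

Under that assumption, long prompts may call for a lower --parallel value or a larger -c, within the limits of the 8GB of VRAM.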
9.8 Connecting from Another Machine
If the LXC is on a private network, ensure the port is accessible:
# Check firewall
iptables -L -n | grep 8080
# If blocked, allow access
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
From any machine on the network, the API is available at http://192.168.2.50:8080/v1.
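To confirm reachability from a client machine before pointing an application at the server, a plain TCP connection test is enough. A minimal sketch using the standard library; the function name is hypothetical, and the host and port are the ones used in this guide.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False

# Usage from another machine: print(port_open("192.168.2.50", 8080))
```

A False result points to the firewall, the --host binding, or the server not running; it does not distinguish between them.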