Deploying llama.cpp with CUDA GPU Acceleration on LXC (GTX 1080)

2026-06-21

Background

These notes cover running LLM inference locally with llama.cpp in a homelab LXC container with a GTX 1080 GPU. The goal was faster inference than Ollama with full GPU offload. Along the way I hit a broken thinking-token bug, a Pascal GPU compatibility issue, and a zombie GPU memory leak, all of which are resolved below.

Environment

Component	Detail
Host	Proxmox VE, Debian LXC
GPU	NVIDIA GTX 1080 (8GB, Pascal / compute 6.1)
CUDA	12.4 (required — CUDA 13+ dropped Pascal support)
llama.cpp	Built from master (Jun 21 2026)

1. Build llama.cpp from Source

The pre-built binaries I found were too old, so clone master and build with CUDA support.

# Install build dependencies (gcc-12/g++-12 are the CUDA host compiler used below)
apt-get update && apt-get install -y cmake build-essential git gcc-12 g++-12

# Clone llama.cpp
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git /tmp/llama-fresh

# Build with CUDA for Pascal GPU (compute capability 6.1)
cd /tmp/llama-fresh
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-12 \
  -DCMAKE_CUDA_ARCHITECTURES=61 \
  '-DCMAKE_CUDA_FLAGS=--allow-unsupported-compiler' \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_LTO=OFF

cmake --build build --target llama-cli llama-server -j$(nproc)

Note: GCC 14 causes build failures with CUDA 12.4's nvcc, so point CMAKE_CUDA_HOST_COMPILER at GCC 12. The --allow-unsupported-compiler flag makes nvcc skip its host-compiler version check; it is needed whenever nvcc ends up paired with a host compiler newer than it officially supports.
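The value 61 passed to CMAKE_CUDA_ARCHITECTURES is the GPU's compute capability (6.1) with the dot removed. A minimal sketch of that mapping, assuming the capability string comes from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` on a reasonably recent driver. The helper name `cmake_cuda_arch` is hypothetical:

```python
def cmake_cuda_arch(compute_cap: str) -> str:
    """Convert a compute capability such as '6.1' into the CMake value '61'."""
    major, minor = compute_cap.strip().split(".")
    return f"{int(major)}{int(minor)}"


# GTX 1080 (Pascal) reports compute capability 6.1
print(cmake_cuda_arch("6.1"))
```

For other cards, the same conversion gives the value to substitute in the cmake command.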

2. Verify GPU Detection

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli --list-devices

Expected output:

Available devices:
  CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)
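If you want to check GPU detection from a script rather than by eye, the device lines can be parsed. This is a sketch that assumes the output keeps exactly the layout shown above; the layout may differ between llama.cpp builds, and the function name `parse_devices` is my own:

```python
import re

# Matches lines such as:
#   CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)
DEVICE_RE = re.compile(
    r"^\s*(?P<id>\w+):\s*(?P<name>.+?)\s*"
    r"\((?P<total>\d+) MiB, (?P<free>\d+) MiB free\)\s*$"
)


def parse_devices(output: str) -> list:
    """Return one dict per device line found in `--list-devices` output."""
    devices = []
    for line in output.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            devices.append({
                "id": match.group("id"),
                "name": match.group("name"),
                "total_mib": int(match.group("total")),
                "free_mib": int(match.group("free")),
            })
    return devices


SAMPLE = """Available devices:
  CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)
"""
print(parse_devices(SAMPLE))
```

An empty list means no CUDA device was detected, which usually points at a driver or LD_LIBRARY_PATH problem.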

3. Download the Model

Download the model on your local machine (public Hugging Face repositories do not require a login):

Go to https://huggingface.co/bartowski/Qwen_Qwen3.5-9B-GGUF
Click the Files tab
Download Qwen_Qwen3.5-9B-Q4_K_M.gguf (6.17GB)

scp ~/Downloads/Qwen_Qwen3.5-9B-Q4_K_M.gguf <user>@192.168.2.50:/opt/models/

Important: Do not use the HauhauCS variant (Qwen3.5-9B-Uncensored-HauhauCS-Aggressive). Its aggressive quantization destroys the special thinking token IDs, causing an infinite ???? loop in llama.cpp output (see section 6). The bartowski/Qwen_Qwen3.5-9B-GGUF quantization works correctly.
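After copying a multi-gigabyte file, it can be worth confirming that it really is a GGUF file before loading it. GGUF files start with the 4-byte magic `GGUF`, followed by a little-endian uint32 version and, from version 2 onward, two uint64 counts (tensors and metadata key/value pairs). The sketch below assumes that layout; the function names are my own and this is not part of llama.cpp:

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)


def parse_gguf_header(header: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (version 2 or later assumed)."""
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic or truncated header)")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path: str) -> dict:
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

For example, `read_gguf_header("/opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf")` should return a small dict. A `ValueError` suggests a truncated or mislabeled download. This only checks the header, so it will not catch a broken tokenizer like the one in the HauhauCS variant.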

4. Run llama-cli with GPU Acceleration

ssh <user>@192.168.2.50
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

/tmp/llama-fresh/build/bin/llama-cli \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  --log-disable

Type your prompt and press Enter. Type /exit to quit.

Key Flags

Flag	Description
`-ngl 99`	Offload all layers to GPU (use `-ngl 20` for partial if VRAM is limited)
`-c 512`	Context size — adjust based on available VRAM
`--reasoning off`	Disables Qwen's built-in thinking mode
`--log-disable`	Suppresses verbose logging

Performance on GTX 1080 (8GB)

Metric	Value
Prompt processing	~140 t/s
Generation	~30 t/s
GPU memory usage (full offload)	~6.3GB
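To turn those throughput figures into a rough wall-clock estimate, divide the token counts by the measured rates. This is a back-of-the-envelope sketch using the numbers from the table above; real latency also includes model load time and varies with context length:

```python
PROMPT_TPS = 140.0  # prompt processing rate measured on the GTX 1080
GEN_TPS = 30.0      # generation rate measured on the GTX 1080


def estimate_seconds(prompt_tokens: int, generated_tokens: int,
                     prompt_tps: float = PROMPT_TPS,
                     gen_tps: float = GEN_TPS) -> float:
    """Estimate request time as prompt time plus generation time."""
    return prompt_tokens / prompt_tps + generated_tokens / gen_tps


# A 400-token prompt with a 200-token answer
print(f"{estimate_seconds(400, 200):.1f} s")
```

Generation dominates for typical chat requests, so the ~30 t/s figure matters more than prompt speed.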

5. Direct HuggingFace Download via `-hf` Flag

llama.cpp can fetch GGUF files from Hugging Face itself, which skips the manual download step:

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli \
  -hf bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  -p "Your prompt here" \
  -n 200

Format: user/model:quant. Downloads to ~/.cache/huggingface/hub/.
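The user/model:quant format splits cleanly into three parts. The sketch below shows that split; treating the `:quant` suffix as optional is my assumption, and the names `HfSpec` and `parse_hf_spec` are hypothetical rather than anything llama.cpp exposes:

```python
from typing import NamedTuple, Optional


class HfSpec(NamedTuple):
    user: str
    model: str
    quant: Optional[str]


def parse_hf_spec(spec: str) -> HfSpec:
    """Split 'user/model:quant' into its parts; quant is None if omitted."""
    repo, _, quant = spec.partition(":")
    user, sep, model = repo.partition("/")
    if not sep or not user or not model:
        raise ValueError(f"expected user/model[:quant], got {spec!r}")
    return HfSpec(user, model, quant or None)


print(parse_hf_spec("bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M"))
```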

6. The HauhauCS Thinking Loop Bug

Initial testing with the HauhauCS-Aggressive variant produced infinite ???? output. Root cause: aggressive quantization destroyed the special thinking token IDs (<|thought|> / <|im_end|>). The tokenizer mapped them to U+003F ? characters instead of triggering the thinking block.

With that variant, the --reasoning off command-line flag had no effect, because llama.cpp upstream deprecated template-parameter-based reasoning control in recent builds. The solution: switch to the bartowski/Qwen_Qwen3.5-9B-GGUF model, which has thinking disabled by default.

7. GPU Memory Zombie Issue

If nvidia-smi shows high memory usage but no processes:

nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
# If 6-7GB used with 1-2GB free but "No running processes found" → zombie context

Fix: Reboot the LXC. The zombie GPU context cannot be freed without a reboot.

# From the Proxmox host (100 is the container ID; substitute your own)
pct reboot 100
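The zombie check above can be automated. This sketch assumes you capture two nvidia-smi outputs yourself: one line from `nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv,noheader,nounits`, and the full output of `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`. The 1024 MiB threshold and the function name are arbitrary choices of mine:

```python
def looks_like_zombie(memory_line: str, compute_apps_output: str,
                      threshold_mib: int = 1024) -> bool:
    """Return True if VRAM is in use but no compute process is listed.

    memory_line: e.g. "6500, 1600, 0" (used MiB, free MiB, utilization %)
    compute_apps_output: one "pid, used_memory" line per process, or empty
    """
    used_mib = int(memory_line.split(",")[0].strip())
    has_processes = any(line.strip() for line in compute_apps_output.splitlines())
    return used_mib >= threshold_mib and not has_processes


print(looks_like_zombie("6500, 1600, 0", ""))                  # memory held, no owner
print(looks_like_zombie("6500, 1600, 35", "1234, 6300 MiB\n"))  # normal running server
```

Inside an LXC container, an empty process list can also mean the owning process lives in a different PID namespace (the host or another container). It may be worth running the same check on the Proxmox host before rebooting.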

8. Ollama as Stable Alternative

If llama.cpp proves unreliable, Ollama provides a stable alternative with thinking control at the API level:

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull model
ollama pull qwen3.5:9b-q4_K_M

# API call with thinking disabled
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:9b-q4_K_M",
    "messages": [{"role": "user", "content": "Your prompt"}],
    "stream": false,
    "think": false
  }'

Ollama patches the model at load time, handling thinking tokens correctly. Typical generation speed: ~10 t/s on GTX 1080.

File Locations Summary

File	Path
llama.cpp source	`/tmp/llama-fresh/`
llama-cli binary	`/tmp/llama-fresh/build/bin/llama-cli`
llama-server binary	`/tmp/llama-fresh/build/bin/llama-server`
Model (Q4_K_M)	`/opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf`
CUDA libraries	`/usr/local/cuda-12.4/targets/x86_64-linux/lib/`

9. Using llama-server as an OpenWebUI and AI Agent Backend

llama.cpp ships with llama-server, an HTTP server that exposes an OpenAI-compatible REST API. This lets you use it as a backend for OpenWebUI, AnythingLLM, LibreChat, or any agent that speaks the OpenAI API.

9.1 Start the Server

ssh <user>@192.168.2.50
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

/tmp/llama-fresh/build/bin/llama-server \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --reasoning off \
  --host 0.0.0.0 \
  --port 8080 \
  &

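The trailing & runs the server in the background; for a persistent setup, use the systemd unit in section 9.6. Once the model has finished loading, test it: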

curl http://localhost:8080/v1/models

Expected response (field values such as created will differ):

{"object":"list","data":[{"id":"Qwen_Qwen3.5-9B-Q4_K_M.gguf","object":"model","created":1234567890,"owned_by":"local"}]}
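Clients need the exact model ID from this response, so it can help to extract it programmatically. This is a small sketch that parses a response body of the shape shown above; fetching the body from http://localhost:8080/v1/models is left to whichever HTTP client you prefer:

```python
import json


def model_ids(response_text: str) -> list:
    """Return the 'id' of every entry in a /v1/models response body."""
    payload = json.loads(response_text)
    return [entry["id"] for entry in payload.get("data", [])]


SAMPLE_RESPONSE = (
    '{"object":"list","data":[{"id":"Qwen_Qwen3.5-9B-Q4_K_M.gguf",'
    '"object":"model","created":1234567890,"owned_by":"local"}]}'
)
print(model_ids(SAMPLE_RESPONSE))
```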

9.2 OpenWebUI

OpenWebUI can use llama-server through its OpenAI-compatible connection. In Admin Panel → Settings → Connections, add:

Base URL: http://192.168.2.50:8080/v1
API Key: any-non-empty-string (llama.cpp ignores this by default)
Model name: Qwen_Qwen3.5-9B-Q4_K_M.gguf

Or set via environment variables when starting OpenWebUI:

OPENAI_API_BASE_URL=http://192.168.2.50:8080/v1
OPENAI_API_KEY=any-string

OLLAMA_BASE_URL is only for Ollama servers and should not point at this /v1 endpoint.

OpenWebUI will list the model and you can chat with it directly. The full OpenAI-compatible endpoint (/v1/chat/completions) is supported.

9.3 AI Agent Integration (OpenAI-Compatible)

Any agent framework that supports the OpenAI API should work once its base URL points at llama-server:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.2.50:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

JavaScript / TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://192.168.2.50:8080/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'Qwen_Qwen3.5-9B-Q4_K_M.gguf',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain Python decorators in 2 sentences.' }
  ]
});

console.log(response.choices[0].message.content);

cURL

curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

9.4 Chat Completions vs Completions

llama-server exposes both, alongside model listing and embeddings:

Endpoint	Use Case
`POST /v1/chat/completions`	Chat-based models (Qwen, Llama Instruct). Recommended.
`POST /v1/completions`	Raw text completion for any model.
`GET /v1/models`	List available models.
`POST /v1/embeddings`	Generate embeddings (requires starting the server with `--embeddings`).

9.5 Streaming Responses

curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

Returns Server-Sent Events (SSE). Most OpenAI-compatible clients support streaming natively.
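If you consume the stream without an SDK, each event arrives as a `data: {...}` line carrying a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below pulls the text out of such lines, assuming the standard OpenAI chunk shape (`choices[0].delta.content`). The function name is my own:

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines until [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


SAMPLE_LINES = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"1, 2, "}}]}',
    'data: {"choices":[{"delta":{"content":"3"}}]}',
    'data: [DONE]',
]
print("".join(iter_stream_text(SAMPLE_LINES)))
```

With the OpenAI Python SDK, passing `stream=True` to `client.chat.completions.create` handles this parsing for you.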

9.6 Systemd Service (Auto-Start)

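Create /etc/systemd/system/llama-server.service. Because /tmp is typically cleared on reboot, copy or rebuild the binaries in a persistent location first and adjust the ExecStart path to match: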

[Unit]
Description=llama.cpp server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt
Environment=LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib
ExecStart=/tmp/llama-fresh/build/bin/llama-server -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf -ngl 99 -c 4096 --reasoning off --host 0.0.0.0 --port 8080
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable llama-server
systemctl start llama-server
systemctl status llama-server

9.7 Performance Tuning

Flag	Effect	GTX 1080 Recommendation
`-ngl`	GPU layers to offload (0=all CPU, 99=all GPU)	`99` (6.3GB VRAM used)
`-c`	Context size (tokens held in memory)	`4096` (use `-c 2048` if OOM)
`--parallel`	Max concurrent requests	`4` (shared VRAM)
`--threads`	CPU threads for pre/post processing	`4`
`--mlock`	Lock model in RAM (no swap)	Enable if RAM is sufficient
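The reason `-c` affects VRAM is the KV cache, which grows linearly with context length. This is a back-of-the-envelope sketch for a standard transformer with grouped-query attention and an fp16 cache. The layer and head counts below are illustrative placeholders, not Qwen3.5-9B's actual architecture, so read the real values from the model's GGUF metadata before trusting the numbers:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    # Factor 2: one tensor for keys and one for values in each layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture: 32 layers, 8 KV heads, head dimension 128
for ctx in (2048, 4096):
    mib = kv_cache_bytes(32, 8, 128, ctx) / 2**20
    print(f"-c {ctx}: ~{mib:.0f} MiB")
```

Under these assumed values, halving the context from 4096 to 2048 frees roughly 256 MiB. That is why dropping `-c` is the first thing to try on an out-of-memory error.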

9.8 Connecting from Another Machine

If the LXC is on a private network, ensure the port is accessible:

# Check firewall
iptables -L -n | grep 8080

# If blocked, insert an allow rule ahead of any existing DROP/REJECT rules
iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

From any machine on the network, the API is available at http://192.168.2.50:8080/v1.
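To confirm reachability from a client machine before pointing a UI or agent at the server, a plain TCP connect is enough. This is a minimal sketch using only the standard library; the function name `port_open` is my own:

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_open("192.168.2.50", 8080)` returning False while curl works inside the container suggests a firewall rule, or that the server was started without `--host 0.0.0.0`.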