Deploying llama.cpp with CUDA GPU Acceleration on LXC (GTX 1080)

2026-06-21

Background

This post covers running LLM inference locally with llama.cpp in a homelab LXC container with a GTX 1080 GPU. The goal: better speed than Ollama, with full GPU acceleration. Along the way, three problems came up: a broken thinking token bug, a Pascal GPU compatibility issue, and a zombie GPU memory leak. All three are resolved below.

Environment

Component | Detail
Host | Proxmox VE, Debian LXC
GPU | NVIDIA GTX 1080 (8GB, Pascal / compute 6.1)
CUDA | 12.4 (required; CUDA 13+ dropped Pascal support)
llama.cpp | Built from master (Jun 21 2026)

1. Build llama.cpp from Source

Pre-built binaries are too old. Clone master and build with CUDA support.

# Install build dependencies
apt-get update && apt-get install -y cmake build-essential git

# Clone llama.cpp
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git /tmp/llama-fresh

# Build with CUDA for Pascal GPU (compute capability 6.1)
cd /tmp/llama-fresh
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-12 \
  -DCMAKE_CUDA_ARCHITECTURES=61 \
  '-DCMAKE_CUDA_FLAGS=--allow-unsupported-compiler' \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_LTO=OFF

cmake --build build --target llama-cli llama-server -j$(nproc)

Note: GCC 14 causes build failures with the CUDA 12.4 nvcc, so point CMAKE_CUDA_HOST_COMPILER at GCC 12. The --allow-unsupported-compiler flag disables nvcc's host compiler version check; it matters when the host compiler is newer than the CUDA release supports.
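
The value 61 passed to CMAKE_CUDA_ARCHITECTURES is the GPU's compute capability (6.1 for the GTX 1080) with the dot removed. A minimal Python sketch of that mapping, useful if you adapt the build to another card; the function name cuda_arch is made up for illustration:

```python
def cuda_arch(compute_capability: str) -> str:
    """Turn a compute capability such as "6.1" into the CMake value "61"."""
    major, minor = compute_capability.strip().split(".")
    # int() rejects anything that is not a plain number, e.g. "6.x"
    return f"{int(major)}{int(minor)}"


if __name__ == "__main__":
    # Prints the flag used in the cmake command above for a Pascal GTX 1080
    print(f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch('6.1')}")
```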

2. Verify GPU Detection

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli --list-devices

Expected output:

Available devices:
  CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)
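
If you want to check the device list from a script (for example before starting the server), the output can be parsed with a regular expression. This is a sketch that assumes the output looks exactly like the sample above; the helper name parse_devices is hypothetical and the format may differ between llama.cpp builds.

```python
import re

# Matches lines like: "  CUDA0: NVIDIA GeForce GTX 1080 (8104 MiB, 7249 MiB free)"
DEVICE_LINE = re.compile(
    r"^\s*(?P<id>\w+):\s+(?P<name>.+?)\s+"
    r"\((?P<total>\d+) MiB, (?P<free>\d+) MiB free\)\s*$"
)


def parse_devices(output: str) -> list[dict]:
    """Return one dict per device line found in --list-devices output."""
    devices = []
    for line in output.splitlines():
        match = DEVICE_LINE.match(line)
        if match:
            devices.append({
                "id": match["id"],
                "name": match["name"],
                "total_mib": int(match["total"]),
                "free_mib": int(match["free"]),
            })
    return devices
```

An empty list means no device line was recognized, which usually points at a missing LD_LIBRARY_PATH or a build without CUDA.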

3. Download the Model

Download the model on your local machine (Hugging Face does not require a login for public model downloads):

  1. Go to https://huggingface.co/bartowski/Qwen_Qwen3.5-9B-GGUF
  2. Click Files tab
  3. Download Qwen_Qwen3.5-9B-Q4_K_M.gguf (6.17GB)

Then copy it to the container (substitute your own login for user):

scp ~/Downloads/Qwen_Qwen3.5-9B-Q4_K_M.gguf user@192.168.2.50:/opt/models/

Important: Do not use the HauhauCS variant (Qwen3.5-9B-Uncensored-HauhauCS-Aggressive). Its aggressive quantization destroys the special thinking token IDs, causing an infinite ???? loop in llama.cpp output. The official Qwen_Qwen3.5-9B-GGUF works correctly.

4. Run llama-cli with GPU Acceleration

ssh user@192.168.2.50
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

/tmp/llama-fresh/build/bin/llama-cli \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  --log-disable

Type your prompt and press Enter. Type /exit to quit.

Key Flags

Flag | Description
-ngl 99 | Offload all layers to GPU (use -ngl 20 for partial offload if VRAM is limited)
-c 512 | Context size; adjust based on available VRAM
--reasoning off | Disables Qwen's built-in thinking mode
--log-disable | Suppresses verbose logging

Performance on GTX 1080 (8GB)

Metric | Value
Prompt processing | ~140 t/s
Generation | ~30 t/s
GPU memory usage (full offload) | ~6.3GB
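
To turn these throughput figures into a rough response time, divide the prompt length by the prompt-processing rate and the output length by the generation rate. The sketch below does that arithmetic; it ignores model load time and assumes the rates stay constant, which is only an approximation.

```python
PROMPT_TPS = 140.0  # prompt processing rate measured above, tokens/second
GEN_TPS = 30.0      # generation rate measured above, tokens/second


def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prompt_tps: float = PROMPT_TPS,
                     gen_tps: float = GEN_TPS) -> float:
    """Rough wall-clock estimate: prompt phase plus generation phase."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # 280 prompt tokens take 2 s, 300 generated tokens take 10 s
    print(f"{estimate_seconds(280, 300):.1f} s")
```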

5. Direct HuggingFace Download via -hf Flag

llama.cpp can download GGUF files directly without manual download:

export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/tmp/llama-fresh/build/bin/llama-cli \
  -hf bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M \
  -ngl 99 \
  -c 512 \
  --reasoning off \
  -p "Your prompt here" \
  -n 200

Format: user/model:quant. Downloads to ~/.cache/huggingface/hub/.
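
The user/model:quant format is easy to split apart if you generate these commands from a script. A small sketch, with a hypothetical helper name parse_hf_spec; it only splits the string and says nothing about how llama.cpp itself resolves a missing quant tag.

```python
def parse_hf_spec(spec: str) -> dict:
    """Split "user/model:quant" into its three parts (quant may be absent)."""
    repo, _, quant = spec.partition(":")
    user, sep, model = repo.partition("/")
    if not sep or not user or not model:
        raise ValueError(f"expected user/model[:quant], got {spec!r}")
    return {"user": user, "model": model, "quant": quant or None}


if __name__ == "__main__":
    print(parse_hf_spec("bartowski/Qwen_Qwen3.5-9B-GGUF:Q4_K_M"))
```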

6. The HauhauCS Thinking Loop Bug

Initial testing with the HauhauCS-Aggressive variant produced infinite ???? output. Root cause: aggressive quantization destroyed the special thinking token IDs (<|thought|> / <|im_end|>). The tokenizer mapped them to U+003F ? characters instead of triggering the thinking block.

The --reasoning off API flag was ignored because llama.cpp upstream deprecated template-parameter-based reasoning control in recent builds. The solution: switch to the official Qwen_Qwen3.5-9B-GGUF model, which has thinking disabled by default.

7. GPU Memory Zombie Issue

If nvidia-smi shows high memory usage but no processes:

nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv
# If 6-7GB used with 1-2GB free but "No running processes found" → zombie context

Fix: Reboot the LXC. The zombie GPU context cannot be freed without a reboot.

# From Proxmox host
pct reboot 100
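
The check can be scripted by parsing the CSV that the nvidia-smi query prints. The sketch below assumes the usual layout (a header row, then values such as "6800 MiB, 1304 MiB, 0 %"). The 6000 MiB threshold is an arbitrary stand-in for the "6-7GB used" rule of thumb, and the process count has to be supplied by the caller.

```python
def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv`."""
    lines = [ln.strip() for ln in csv_text.strip().splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        used, free, util = [field.strip() for field in line.split(",")]
        rows.append({
            "used_mib": int(used.split()[0]),   # "6800 MiB" -> 6800
            "free_mib": int(free.split()[0]),
            "util_pct": int(util.split()[0]),   # "0 %" -> 0
        })
    return rows


def looks_like_zombie(row: dict, process_count: int,
                      used_threshold_mib: int = 6000) -> bool:
    """High memory use with no visible process suggests a leaked context."""
    return process_count == 0 and row["used_mib"] >= used_threshold_mib
```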

8. Ollama as Stable Alternative

If llama.cpp proves unreliable, Ollama provides a stable alternative with thinking control at the API level:

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull model
ollama pull qwen3.5:9b-q4_K_M

# API call with thinking disabled
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "qwen3.5:9b-q4_K_M",
    "messages": [{"role": "user", "content": "Your prompt"}],
    "stream": false,
    "think": false
  }'

Ollama patches the model at load time, handling thinking tokens correctly. Typical generation speed: ~10 t/s on GTX 1080.
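
The same request can be made from Python with only the standard library. This is a sketch mirroring the curl call above; the helper names build_chat_request and chat are made up, and the model tag is the one pulled earlier.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt: str, model: str = "qwen3.5:9b-q4_K_M",
                       think: bool = False) -> urllib.request.Request:
    """Build the POST request; think=False mirrors the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "think": think,
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request to a running Ollama instance and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```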

File Locations Summary

File | Path
llama.cpp source | /tmp/llama-fresh/
llama-cli binary | /tmp/llama-fresh/build/bin/llama-cli
llama-server binary | /tmp/llama-fresh/build/bin/llama-server
Model (Q4_K_M) | /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf
CUDA libraries | /usr/local/cuda-12.4/targets/x86_64-linux/lib/

9. Using llama-server as an OpenWebUI and AI Agent Backend

llama.cpp ships with llama-server, an HTTP server that exposes an OpenAI-compatible REST API. This lets you use it as a backend for OpenWebUI, AnythingLLM, LibreChat, or any OpenAI API-compatible agent.

9.1 Start the Server

ssh user@192.168.2.50
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

/tmp/llama-fresh/build/bin/llama-server \
  -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --reasoning off \
  --host 0.0.0.0 \
  --port 8080 \
  &

Run it in the background or under systemd. Test it:

curl http://localhost:8080/v1/models

Expected response:

{"object":"list","data":[{"id":"Qwen_Qwen3.5-9B-Q4_K_M.gguf","object":"model","created":1234567890,"owned_by":"local"}]}
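
Clients need the id field from this response as the model name in later requests. A short sketch that extracts it, assuming the response shape shown above; model_ids is a hypothetical helper name.

```python
import json


def model_ids(body: str) -> list[str]:
    """Return the id of every model listed in a /v1/models response body."""
    payload = json.loads(body)
    return [entry["id"] for entry in payload.get("data", [])]
```

Feed it the text returned by the curl call above; the first id is the value to pass as "model" in chat requests.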

9.2 OpenWebUI

OpenWebUI supports llama.cpp as a direct backend. In Settings → Admin Panel → Models, add:

Base URL: http://192.168.2.50:8080/v1
API Key: any-non-empty-string (llama.cpp ignores this by default)
Model name: Qwen_Qwen3.5-9B-Q4_K_M.gguf

Or set via environment variables (OLLAMA_BASE_URL is not needed here, because llama-server speaks the OpenAI API rather than the Ollama API):

OPENAI_API_BASE_URL=http://192.168.2.50:8080/v1
OPENAI_API_KEY=any-string

OpenWebUI will list the model and you can chat with it directly. The full OpenAI-compatible endpoint (/v1/chat/completions) is supported.

9.3 AI Agent Integration (OpenAI-Compatible)

Any agent framework that supports the OpenAI API works out of the box:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.2.50:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Python decorators in 2 sentences."}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)

JavaScript / TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://192.168.2.50:8080/v1',
  apiKey: 'not-needed'
});

const response = await client.chat.completions.create({
  model: 'Qwen_Qwen3.5-9B-Q4_K_M.gguf',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain Python decorators in 2 sentences.' }
  ]
});

console.log(response.choices[0].message.content);

cURL

curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

9.4 Chat Completions vs Completions

llama-server supports both endpoints:

Endpoint | Use Case
POST /v1/chat/completions | Chat-based models (Qwen, Llama Instruct). Recommended.
POST /v1/completions | Raw text completion for any model.
GET /v1/models | List available models.
POST /v1/embeddings | Generate embeddings (requires starting the server with the --embeddings flag).

9.5 Streaming Responses

curl http://192.168.2.50:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen_Qwen3.5-9B-Q4_K_M.gguf",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'

Returns Server-Sent Events (SSE). Most OpenAI-compatible clients support streaming natively.
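
If you consume the stream without an SDK, each event arrives as a line starting with "data: " that carries a JSON chunk, and the stream ends with "data: [DONE]". The sketch below reassembles the text from such lines. It assumes the OpenAI-style chunk layout (choices[0].delta.content); iter_stream_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

Usage: pass it any iterable of decoded lines, for example the lines of an HTTP response body, and join the fragments with "".join(...).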

9.6 Systemd Service (Auto-Start)

Create /etc/systemd/system/llama-server.service:

[Unit]
Description=llama.cpp server
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt
ExecStart=/bin/bash -c 'LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib:$LD_LIBRARY_PATH /tmp/llama-fresh/build/bin/llama-server -m /opt/models/Qwen_Qwen3.5-9B-Q4_K_M.gguf -ngl 99 -c 4096 --reasoning off --host 0.0.0.0 --port 8080'
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Enable and start:

systemctl daemon-reload
systemctl enable llama-server
systemctl start llama-server
systemctl status llama-server
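
After a restart the model takes a while to load, so scripts that depend on the service may want to wait for it. llama-server exposes a GET /health endpoint that returns HTTP 200 once the model is ready. The sketch below polls it; the timeout values are arbitrary and the probe parameter exists only so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/health"


def http_ok(url: str) -> bool:
    """True if the URL answers with HTTP 200, False on any error status or failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str = HEALTH_URL, timeout_s: float = 120.0,
                     interval_s: float = 2.0, probe=http_ok) -> bool:
    """Poll until probe(url) succeeds or timeout_s seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```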

9.7 Performance Tuning

Flag | Effect | GTX 1080 Recommendation
-ngl | GPU layers to offload (0 = all CPU, 99 = all GPU) | 99 (6.3GB VRAM used)
-c | Context size (tokens held in memory) | 4096 (use -c 2048 if OOM)
--parallel | Max concurrent requests | 4 (shared VRAM)
--threads | CPU threads for pre/post processing | 4
--mlock | Lock model in RAM (no swap) | Enable if RAM is sufficient

9.8 Connecting from Another Machine

If the LXC is on a private network, ensure the port is accessible:

# Check firewall
iptables -L -n | grep 8080

# If blocked, allow access
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT

From any machine on the network, the API is available at http://192.168.2.50:8080/v1.
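
Before debugging a client, it helps to confirm that the port is reachable at the TCP level. A minimal sketch using the standard library; port_open is a hypothetical helper name and the 3 second timeout is arbitrary.

```python
import socket


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    print(port_open("192.168.2.50", 8080))
```

A False result from another machine, combined with a working curl on the container itself, points at the firewall or the --host binding rather than at llama-server.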