Comprehensive LLM Audit Report

Multi-Model Benchmark — Qwen 3.5/3.6 & Ornith on Dual GPU

Host: Windows 10
Date: 2026-07-01
Models Tested: 5
Test Categories: 10

Hardware Configuration

GPU 1: NVIDIA GTX 1080
GPU 2: NVIDIA GTX 970
Server: llama.cpp (llama-server)
Platform: Windows + WSL
Framework: GGUF / llama.cpp

Score Overview

Rank	Model	Grade	Score	Speed
#1	qwen3.6-35b-reap25_iq1m	A+	95.0	34.0 tok/s
#2	ornith-1.0-9b-Q8_0	A+	94.4	15.5 tok/s
#3	Qwen3.6-14B-A3B-FableVibes-Q4_K_M	A+	91.7	39.9 tok/s
#4	Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE	A+	91.7	36.3 tok/s
#5	Qwen3.5-14B-A3B-Claude-Opus-Distilled-MXFP4	B	61.7	22.4 tok/s

Final Rankings

Rank	Model	Params	Quant	Score	Grade	Speed
🥇 1	qwen3.6-35b-reap25_iq1m.gguf	26.6B	IQ1M (1-bit)	95.0	A+	34.0 tok/s
🥈 2	ornith-1.0-9b-Q8_0.gguf	9B	Q8_0 (8-bit)	94.4	A+	15.5 tok/s
🥉 3	Qwen3.6-14B-A3B-FableVibes-Q4_K_M.gguf	13.76B	Q4_K_M (4-bit)	91.7	A+	39.9 tok/s
4	Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE.gguf	13.76B	MXFP4 MoE	91.7	A+	36.3 tok/s
5	Qwen3.5-14B-A3B-Claude-Opus-Distilled-4.6-MXFP4_MOE.gguf	14B	MXFP4 MoE	61.7	B	22.4 tok/s
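
M4 and M5 share a score of 91.7, yet M5 is ranked above M4. The report does not state its tie-break rule; the sketch below assumes ties are broken by inference speed, which reproduces the published order. The `rank` helper is hypothetical and the data is copied from the table above.

```python
# Scores and speeds copied from the Final Rankings table.
results = [
    ("Qwen3.5-14B-A3B-Claude-Opus-Distilled-4.6-MXFP4_MOE", 61.7, 22.4),
    ("qwen3.6-35b-reap25_iq1m", 95.0, 34.0),
    ("ornith-1.0-9b-Q8_0", 94.4, 15.5),
    ("Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE", 91.7, 36.3),
    ("Qwen3.6-14B-A3B-FableVibes-Q4_K_M", 91.7, 39.9),
]


def rank(rows):
    """Sort by score, then by speed, both descending.

    The speed tie-break is an assumption, not something the report states.
    """
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


ranked = rank(results)
for position, (name, score, speed) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score} ({speed} tok/s)")
```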

Detailed Category Comparison

Category (Weight)	M1: Qwen3.5-14B MXFP4_MOE	M2: Qwen3.6-35B IQ1M	M3: Ornith-1.0 9B Q8_0	M4: Qwen3.6-14B VibeForged MXFP4	M5: Qwen3.6-14B FableVibes Q4_K_M
Basic Generation (5%)	100%	100%	100%	100%	100%
Logical Reasoning (20%)	100%	100%	100%	100%	100%
Mathematical Reasoning (15%)	33.3%	66.7%	66.7%	66.7%	66.7%
Code Generation (15%)	100%	100%	100%	100%	100%
Factual Knowledge (10%)	33.3%	100%	100%	66.7%	66.7%
Instruction Following (10%)	20.0%	100%	100%	100%	100%
Creative Writing (10%)	50.0%	100%	100%	100%	100%
Multi-Turn Context (10%)	30.0%	100%	100%	100%	100%
Speed Benchmark (3%)	80.0%	100%	80.0%	100%	100%
Edge Cases (2%)	50.0%	100%	100%	100%	100%
WEIGHTED TOTAL	61.7	95.0	94.4	91.7	91.7
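
The WEIGHTED TOTAL row appears to be the sum of each category percentage multiplied by its weight. A minimal sketch of that calculation, using the values from the table above (the names `WEIGHTS`, `SCORES`, and `weighted_total` are illustrative, and rounding to one decimal is an assumption):

```python
# Category weights in percent, in the same order as the table rows.
WEIGHTS = [5, 20, 15, 15, 10, 10, 10, 10, 3, 2]

# Per-category percentages for each model, copied from the table.
SCORES = {
    "M1": [100, 100, 33.3, 100, 33.3, 20.0, 50.0, 30.0, 80.0, 50.0],
    "M2": [100, 100, 66.7, 100, 100, 100, 100, 100, 100, 100],
    "M3": [100, 100, 66.7, 100, 100, 100, 100, 100, 80.0, 100],
    "M4": [100, 100, 66.7, 100, 66.7, 100, 100, 100, 100, 100],
    "M5": [100, 100, 66.7, 100, 66.7, 100, 100, 100, 100, 100],
}


def weighted_total(scores, weights=WEIGHTS):
    """Return the weighted score on a 0-100 scale."""
    return sum(score * weight for score, weight in zip(scores, weights)) / 100


totals = {model: round(weighted_total(scores), 1) for model, scores in SCORES.items()}
print(totals)
```

Under these assumptions the computed totals match the WEIGHTED TOTAL row for all five models.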

Inference Speed Comparison

Model	Speed
M5 Q4_K_M	39.9 tok/s
M4 MXFP4 v2	36.3 tok/s
M2 IQ1M	34.0 tok/s
M1 MXFP4	22.4 tok/s
M3 Q8_0	15.5 tok/s

Accuracy Scores by Category

Each value is a category's weighted contribution to a model's total score (category score × category weight).

Category	M1	M2	M3	M4	M5
Reasoning (20%)	20.0	20.0	20.0	20.0	20.0
Math (15%)	5.0	10.0	10.0	10.0	10.0
Code (15%)	15.0	15.0	15.0	15.0	15.0
Knowledge (10%)	3.3	10.0	10.0	6.7	6.7
Instructions (10%)	2.0	10.0	10.0	10.0	10.0
Creative (10%)	5.0	10.0	10.0	10.0	10.0
Context (10%)	3.0	10.0	10.0	10.0	10.0
Basic (5%)	5.0	5.0	5.0	5.0	5.0
Speed (3%)	2.4	3.0	2.4	3.0	3.0
Edge (2%)	1.0	2.0	2.0	2.0	2.0
TOTAL	61.7	95.0	94.4	91.7	91.7

Key Findings

✓ Universal Strengths

All five models scored 100% on logical reasoning and code generation. On this test set, syllogistic logic and Python code output held up at every quantization level tested, from IQ1M to Q8_0.

✗ The NaCl Problem

All three 14B MoE variants (M1, M4, M5) fail to identify "NaCl" as the chemical formula of table salt. Only the 35B (M2) and the 9B Q8_0 (M3) answer correctly. One possible explanation is that the 3B active parameters of the A3B MoE models limit factual recall, although a single question cannot confirm this.

! Arithmetic Weakness

No model correctly computes 847 × 23 + 156 (the correct result is 19,637). All describe the correct method but make an error in the multiplication step. Multi-digit arithmetic is a common weakness of language models at these parameter sizes.
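
For reference, the expression can be worked through with the same partial-product method the models described. This is an illustrative check, not part of the benchmark harness:

```python
a, b, c = 847, 23, 156

# Split the multiplier 23 into 20 + 3 and multiply each part separately.
partial_tens = a * 20                   # 16940
partial_ones = a * 3                    # 2541
product = partial_tens + partial_ones   # 19481

result = product + c                    # 19637

# Cross-check the step-by-step value against direct evaluation.
assert product == a * b
assert result == a * b + c
print(result)  # 19637
```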

⚡ Quantization Trade-offs

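IQ1M (1-bit, M2): 34.0 tok/s and the highest score in this run (95.0), despite the most aggressive quantization. Q8_0 (8-bit, M3): highest precision per weight, but slowest (15.5 tok/s). Q4_K_M (4-bit, M5): fastest at 39.9 tok/s with a 91.7 score. Because the models differ in size and architecture, these figures compare model-plus-quantization combinations rather than quantization alone.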

↑ M4 vs M1 Improvement

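VibeForged v2 (M4) scores 91.7 versus 61.7 for Claude-Opus-Distilled (M1). The 30-point jump comes mainly from instruction following (20%→100%, +8.0 weighted points), multi-turn context (30%→100%, +7.0), math (33%→67%, +5.0), and creative writing (50%→100%, +5.0), with smaller gains in knowledge (33%→67%, +3.4), edge cases (+1.0), and the speed benchmark (+0.6). Raw speed also rises from 22.4 to 36.3 tok/s.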
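
The gap between M1 and M4 can be broken down per category from the weighted contributions in the Accuracy Scores by Category table. A short sketch (the variable names are illustrative):

```python
# (M1, M4) weighted contributions, copied from the category table.
WEIGHTED = {
    "Reasoning": (20.0, 20.0),
    "Math": (5.0, 10.0),
    "Code": (15.0, 15.0),
    "Knowledge": (3.3, 6.7),
    "Instructions": (2.0, 10.0),
    "Creative": (5.0, 10.0),
    "Context": (3.0, 10.0),
    "Basic": (5.0, 5.0),
    "Speed": (2.4, 3.0),
    "Edge": (1.0, 2.0),
}

# Per-category gain of M4 over M1, in weighted points.
gains = {cat: round(m4 - m1, 1) for cat, (m1, m4) in WEIGHTED.items()}

# List the categories that improved, largest gain first.
for cat, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    if gain > 0:
        print(f"{cat}: +{gain}")

total_gain = round(sum(gains.values()), 1)
print(f"Total: +{total_gain}")
```

The per-category gains sum to the 30.0-point difference between the two totals, with instruction following and multi-turn context contributing the most.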

🏎 Speed Champion

FableVibes Q4_K_M (M5) achieves 39.9 tok/s, the fastest of all five models. A plausible reason is that its 4-bit Q4_K_M weights keep memory traffic low on the dual-GPU setup, although the benchmark did not isolate the cause.

Recommendations

Best Model by Use Case

Best All-Around

qwen3.6-35b-reap25_iq1m (95.0)

Highest score, 34.0 tok/s; math (66.7%) is the only category below 100%

Maximum Speed

Qwen3.6-14B-A3B-FableVibes-Q4_K_M (91.7)

40 tok/s, A+ score, daily driver

Best Code Quality

ornith-1.0-9b-Q8_0 (94.4)

Optimized prime checker, 8-bit precision

Lowest VRAM

qwen3.6-35b-reap25_iq1m (~8GB)

IQ1M = 1-bit = minimal memory

Avoid

Qwen3.5-14B-Claude-Opus-Distilled (61.7)

Outdated, replaced by M4/M5

Best Speed/Accuracy

Qwen3.6-14B-A3B-VibeForged-v2-MXFP4 (91.7)

36 tok/s, strong upgrade over M1
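
The Lowest VRAM recommendation can be sanity-checked with a rough weights-only estimate: parameter count × bits per weight ÷ 8. The bits-per-weight figures below are approximate nominal values and are assumptions, not measurements from this benchmark. Real GGUF files mix tensor types, and the KV cache and compute buffers add to the total, so actual usage will be higher.

```python
# Approximate nominal bits per weight (assumed values, not measured here).
BITS_PER_WEIGHT = {
    "IQ1_M": 1.75,
    "Q4_K_M": 4.85,
    "MXFP4": 4.25,
    "Q8_0": 8.5,
}

# (model, parameters in billions, quant type), from the Final Rankings table.
MODELS = [
    ("qwen3.6-35b-reap25_iq1m", 26.6, "IQ1_M"),
    ("ornith-1.0-9b-Q8_0", 9.0, "Q8_0"),
    ("Qwen3.6-14B-A3B-FableVibes-Q4_K_M", 13.76, "Q4_K_M"),
    ("Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE", 13.76, "MXFP4"),
]


def weight_size_gb(params_billion, bits_per_weight):
    """Estimate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


estimates = {
    name: weight_size_gb(params, BITS_PER_WEIGHT[quant])
    for name, params, quant in MODELS
}
for name, size in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: ~{size:.1f} GB")
```

Under these assumptions the IQ1M model has the smallest weight footprint despite having the most parameters, which is consistent with the recommendation above.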