Comprehensive LLM Audit Report

Multi-Model Benchmark — Qwen 3.5/3.6 & Ornith on Dual GPU

Host: Windows10 Date: 2026-07-01 Models Tested: 5 Test Categories: 10
Hardware Configuration
GPU 1
NVIDIA GTX 1080
GPU 2
NVIDIA GTX 970
Server
llama.cpp (llama-server)
Platform
Windows + WSL
Framework
GGUF / llama.cpp
Score Overview
Rank #1
qwen3.6-35b-reap25_iq1m
A+
95.0
34.0 tok/s
Rank #2
ornith-1.0-9b-Q8_0
A+
94.4
15.5 tok/s
Rank #3
Qwen3.6-14B-A3B-FableVibes-Q4_K_M
A+
91.7
39.9 tok/s
Rank #4
Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE
A+
91.7
36.3 tok/s
Rank #5
Qwen3.5-14B-A3B-Claude-Opus-Distilled-MXFP4
B
61.7
22.4 tok/s
Final Rankings
RankModelParamsQuantScoreGradeSpeed
🥇 1 qwen3.6-35b-reap25_iq1m.gguf 26.6BIQ1M (1-bit) 95.0A+ 34.0 tok/s
🥈 2 ornith-1.0-9b-Q8_0.gguf 9BQ8_0 (8-bit) 94.4A+ 15.5 tok/s
🥉 3 Qwen3.6-14B-A3B-FableVibes-Q4_K_M.gguf 13.76BQ4_K_M (4-bit) 91.7A+ 39.9 tok/s
4 Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE.gguf 13.76BMXFP4 MoE 91.7A+ 36.3 tok/s
5 Qwen3.5-14B-A3B-Claude-Opus-Distilled-4.6-MXFP4_MOE.gguf 14BMXFP4 MoE 61.7B 22.4 tok/s
Detailed Category Comparison
Category (Weight) M1: Qwen3.5-14B
MXFP4_MOE
M2: Qwen3.6-35B
IQ1M
M3: Ornith-1.0
9B Q8_0
M4: Qwen3.6-14B
VibeForged MXFP4
M5: Qwen3.6-14B
FableVibes Q4_K_M
Basic Generation (5%) 100% 100% 100% 100% 100%
Logical Reasoning (20%) 100% 100% 100% 100% 100%
Mathematical Reasoning (15%) 33.3% 66.7% 66.7% 66.7% 66.7%
Code Generation (15%) 100% 100% 100% 100% 100%
Factual Knowledge (10%) 33.3% 100% 100% 66.7% 66.7%
Instruction Following (10%) 20.0% 100% 100% 100% 100%
Creative Writing (10%) 50.0% 100% 100% 100% 100%
Multi-Turn Context (10%) 30.0% 100% 100% 100% 100%
Speed Benchmark (3%) 80.0% 100% 80.0% 100% 100%
Edge Cases (2%) 50.0% 100% 100% 100% 100%
WEIGHTED TOTAL 61.7 95.0 94.4 91.7 91.7
Inference Speed Comparison
M5 Q4_K_M
39.9 tok/s
M4 MXFP4 v2
36.3 tok/s
M2 IQ1M
34.0 tok/s
M1 MXFP4
22.4 tok/s
M3 Q8_0
15.5 tok/s
Accuracy Scores by Category

Each bar shows the weighted score contribution per category across models.

CategoryM1M2M3M4M5
Reasoning (20%)20.020.020.020.020.0
Math (15%)5.010.010.010.010.0
Code (15%)15.015.015.015.015.0
Knowledge (10%)3.310.010.06.76.7
Instructions (10%)2.010.010.010.010.0
Creative (10%)5.010.010.010.010.0
Context (10%)3.010.010.010.010.0
Basic (5%)5.05.05.05.05.0
Speed (3%)2.43.02.43.03.0
Edge (2%)1.02.02.02.02.0
TOTAL61.795.094.491.791.7
Key Findings

Universal Strengths

All 5 models scored 100% on logical reasoning and code generation. The Qwen architecture handles syllogistic logic and Python code output flawlessly regardless of quantization level.

The NaCl Problem

All three 14B MoE variants (M1, M4, M5) fail to identify "NaCl" as table salt's chemical formula. Only the 35B (M2) and 9B Q8_0 (M3) answer correctly. The 3B active parameter count in MoE appears insufficient for factual recall.

! Arithmetic Weakness

No model correctly computes 847 × 23 + 156. All describe the correct method but fail the final multiplication. This is a known transformer limitation at these parameter sizes — arithmetic is not their forte.

Quantization Trade-offs

IQ1M (1-bit): Maximum speed (34 tok/s), but loses precision. Q8_0 (8-bit): Highest quality per parameter, but slowest (16 tok/s). Q4_K_M (4-bit): Best balance — 40 tok/s with 91.7 score.

M4 vs M1 Improvement

VibeForged v2 (M4) scores 91.7 vs Claude-Distilled v1 (M1) at 61.7. The 30-point jump comes from better instruction following (20%→100%), knowledge (33%→67%), and speed (22→36 tok/s).

🏎 Speed Champion

FableVibes Q4_K_M (M5) achieves 39.9 tok/s — the fastest of all 5 models. Q4_K_M's balanced 4-bit quantization with proper importance weighting minimizes VRAM bottleneck on the dual-GPU setup.

Recommendations

Best Model by Use Case

Best All-Around
qwen3.6-35b-reap25_iq1m (95.0)

Highest score, no weak spots, 34 tok/s

Maximum Speed
Qwen3.6-14B-A3B-FableVibes-Q4_K_M (91.7)

40 tok/s, A+ score, daily driver

Best Code Quality
ornith-1.0-9b-Q8_0 (94.4)

Optimized prime checker, 8-bit precision

Lowest VRAM
qwen3.6-35b-reap25_iq1m (~8GB)

IQ1M = 1-bit = minimal memory

Avoid
Qwen3.5-14B-Claude-Opus-Distilled (61.7)

Outdated, replaced by M4/M5

Best Speed/Accuracy
Qwen3.6-14B-A3B-VibeForged-v2-MXFP4 (91.7)

36 tok/s, strong upgrade over M1