Multi-Model Benchmark — Qwen 3.5/3.6 & Ornith on Dual GPU
| Rank | Model | Params | Quant | Score | Grade | Speed |
|---|---|---|---|---|---|---|
| 🥇 1 | qwen3.6-35b-reap25_iq1m.gguf | 26.6B | IQ1M (1-bit) | 95.0 | A+ | 34.0 tok/s |
| 🥈 2 | ornith-1.0-9b-Q8_0.gguf | 9B | Q8_0 (8-bit) | 94.4 | A+ | 15.5 tok/s |
| 🥉 3 | Qwen3.6-14B-A3B-FableVibes-Q4_K_M.gguf | 13.76B | Q4_K_M (4-bit) | 91.7 | A+ | 39.9 tok/s |
| 4 | Qwen3.6-14B-A3B-VibeForged-v2-MXFP4_MOE.gguf | 13.76B | MXFP4 MoE | 91.7 | A+ | 36.3 tok/s |
| 5 | Qwen3.5-14B-A3B-Claude-Opus-Distilled-4.6-MXFP4_MOE.gguf | 14B | MXFP4 MoE | 61.7 | B | 22.4 tok/s |

| Category (Weight) | M1: Qwen3.5-14B MXFP4_MOE | M2: Qwen3.6-35B IQ1M | M3: Ornith-1.0 9B Q8_0 | M4: Qwen3.6-14B VibeForged MXFP4 | M5: Qwen3.6-14B FableVibes Q4_K_M |
|---|---|---|---|---|---|
| Basic Generation (5%) | 100% | 100% | 100% | 100% | 100% |
| Logical Reasoning (20%) | 100% | 100% | 100% | 100% | 100% |
| Mathematical Reasoning (15%) | 33.3% | 66.7% | 66.7% | 66.7% | 66.7% |
| Code Generation (15%) | 100% | 100% | 100% | 100% | 100% |
| Factual Knowledge (10%) | 33.3% | 100% | 100% | 66.7% | 66.7% |
| Instruction Following (10%) | 20.0% | 100% | 100% | 100% | 100% |
| Creative Writing (10%) | 50.0% | 100% | 100% | 100% | 100% |
| Multi-Turn Context (10%) | 30.0% | 100% | 100% | 100% | 100% |
| Speed Benchmark (3%) | 80.0% | 100% | 80.0% | 100% | 100% |
| Edge Cases (2%) | 50.0% | 100% | 100% | 100% | 100% |
| WEIGHTED TOTAL | 61.7 | 95.0 | 94.4 | 91.7 | 91.7 |
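
The WEIGHTED TOTAL row can be checked by hand. The sketch below assumes the total is a plain weighted sum of the category percentages (the report does not state its formula); the names `WEIGHTS`, `SCORES`, and `weighted_total` are illustrative and not part of the benchmark harness. Under that assumption, rounding to one decimal gives totals that agree with the row above.

```python
# Category weights in percent, in table order (they sum to 100).
WEIGHTS = {
    "Basic Generation": 5,
    "Logical Reasoning": 20,
    "Mathematical Reasoning": 15,
    "Code Generation": 15,
    "Factual Knowledge": 10,
    "Instruction Following": 10,
    "Creative Writing": 10,
    "Multi-Turn Context": 10,
    "Speed Benchmark": 3,
    "Edge Cases": 2,
}

# Per-model category scores in percent, copied from the table (same order as WEIGHTS).
SCORES = {
    "M1": [100, 100, 33.3, 100, 33.3, 20.0, 50.0, 30.0, 80.0, 50.0],
    "M2": [100, 100, 66.7, 100, 100, 100, 100, 100, 100, 100],
    "M3": [100, 100, 66.7, 100, 100, 100, 100, 100, 80.0, 100],
    "M4": [100, 100, 66.7, 100, 66.7, 100, 100, 100, 100, 100],
    "M5": [100, 100, 66.7, 100, 66.7, 100, 100, 100, 100, 100],
}


def weighted_total(scores, weights):
    """Assumed formula: sum of weight * score / 100 over all categories."""
    return sum(w * s / 100 for w, s in zip(weights, scores))


assert sum(WEIGHTS.values()) == 100

totals = {
    model: round(weighted_total(scores, WEIGHTS.values()), 1)
    for model, scores in SCORES.items()
}

for model, total in totals.items():
    print(f"{model}: {total}")
```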

All five models scored 100% on logical reasoning and code generation. On this test set, syllogistic logic and Python code output held up at every quantization level tested, from IQ1M to Q8_0, for both the Qwen variants and Ornith.
All three 14B MoE variants (M1, M4, M5) fail to identify "NaCl" as table salt's chemical formula. Only the 35B (M2) and 9B Q8_0 (M3) answer correctly. The 3B active parameter count in MoE appears insufficient for factual recall.
No model correctly computes 847 × 23 + 156. All describe the correct method but fail the final multiplication. Exact multi-digit arithmetic is a known weak spot for transformer models at these parameter sizes.
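
For reference, the expression can be worked through with the same long-multiplication method the models describe. The sketch below is illustrative only: `evaluate_step_by_step` and `matches_reference` are hypothetical helpers, and the benchmark's actual grading logic is not shown in this report.

```python
import re


def evaluate_step_by_step(a, b, c):
    """Compute a * b + c by splitting b into tens and units."""
    tens, units = divmod(b, 10)
    partial_tens = a * tens * 10   # 847 * 20
    partial_units = a * units      # 847 * 3
    product = partial_tens + partial_units
    return product, product + c


def matches_reference(reply, reference):
    """Hypothetical grader: compare the last integer in a reply to the reference."""
    numbers = re.findall(r"\d[\d,]*", reply)
    if not numbers:
        return False
    return int(numbers[-1].replace(",", "")) == reference


product, result = evaluate_step_by_step(847, 23, 156)
print(f"847 x 23 = {product}")
print(f"847 x 23 + 156 = {result}")
print(matches_reference("The answer is 19,637.", result))
```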
IQ1M (1-bit): the most aggressive compression, running at 34.0 tok/s (third-fastest in this run), at a nominal cost in precision. Q8_0 (8-bit): highest quality per parameter, but slowest (15.5 tok/s). Q4_K_M (4-bit): best balance, with the highest speed (39.9 tok/s) and a 91.7 score.
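VibeForged v2 (M4) scores 91.7 vs Claude-Distilled v1 (M1) at 61.7. The 30-point jump comes mostly from instruction following (20%→100%), multi-turn context (30%→100%), creative writing (50%→100%), and mathematical reasoning (33.3%→66.7%), with smaller gains from factual knowledge (33.3%→66.7%), edge cases (50%→100%), and the speed benchmark (80%→100%). Raw throughput also rises from 22.4 to 36.3 tok/s.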
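
The gap between M1 and M4 can be broken down per category. This sketch assumes each category contributes weight × (M4 score − M1 score) / 100 points, which matches the weighted totals in the table but is not stated in the report; `WEIGHTS`, `M1`, `M4`, and `contributions` are illustrative names.

```python
# Category weights in percent, from the category table.
WEIGHTS = {
    "Basic Generation": 5,
    "Logical Reasoning": 20,
    "Mathematical Reasoning": 15,
    "Code Generation": 15,
    "Factual Knowledge": 10,
    "Instruction Following": 10,
    "Creative Writing": 10,
    "Multi-Turn Context": 10,
    "Speed Benchmark": 3,
    "Edge Cases": 2,
}

# Category scores in percent for M1 (Claude-Distilled) and M4 (VibeForged v2).
M1 = dict(zip(WEIGHTS, [100, 100, 33.3, 100, 33.3, 20.0, 50.0, 30.0, 80.0, 50.0]))
M4 = dict(zip(WEIGHTS, [100, 100, 66.7, 100, 66.7, 100, 100, 100, 100, 100]))

# Points each category adds to the weighted total when moving from M1 to M4.
contributions = {
    cat: WEIGHTS[cat] * (M4[cat] - M1[cat]) / 100 for cat in WEIGHTS
}

# Largest contributors first; categories with no change are skipped when printing.
ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
for category, points in ranked:
    if points > 0:
        print(f"{category:<24} +{points:.2f}")

total_gain = sum(contributions.values())
print(f"{'Total':<24} +{total_gain:.2f}")
```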
FableVibes Q4_K_M (M5) achieves 39.9 tok/s, the fastest of all five models. Q4_K_M's balanced 4-bit quantization, which keeps higher precision for the more important tensors, likely minimizes the VRAM bottleneck on the dual-GPU setup.
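
| Model | Verdict |
|---|---|
| M2: Qwen3.6-35B IQ1M | Highest score, no weak spots, 34 tok/s |
| M5: Qwen3.6-14B FableVibes Q4_K_M | 40 tok/s, A+ score, daily driver |
| M3: Ornith-1.0 9B Q8_0 | Optimized prime checker, 8-bit precision |
| M2: Qwen3.6-35B IQ1M | IQ1M = 1-bit = minimal memory |
| M1: Qwen3.5-14B MXFP4_MOE | Outdated, replaced by M4/M5 |
| M4: Qwen3.6-14B VibeForged MXFP4 | 36 tok/s, strong upgrade over M1 |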