I was wrong. When I set this benchmark up I had a favorite: Config A — 96K context, q8_0 KV, 74 t/s, 4.8 GB VRAM headroom. The prompt was ~150 tokens. Why allocate 256K when you don't need it?
Because the model needs it. Not for information. For processing.
Config D (256K q8_0, 46 t/s) scored 0.80 across every variant — twelve trials, zero variance. Config A (96K q8_0, 74 t/s) dropped to 0.53 on explicit-goal and structured prompts with confident-walk failures. The model was confidently wrong: "Since the car wash is only 100 meters away…" It saw distance, pattern-matched to walking. The goal never entered the chain.
"I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" The correct answer is drive — the car must be at the wash. Perplexity, ChatGPT, Claude, Mistral all said walk. CMU quantified the Heuristic Dominance Ratio (HDR): distance cues influence model decisions 8.7–38× more than the implicit goal constraint. The model sees "50 meters" and pattern-matches. The goal constraint never competes.
The problem is not that the model lacks the knowledge — when you challenge it ("How will you get your car washed if you're walking?") it recovers 98.3% of the time. The knowledge is there. But it loses the attention competition.
Same model (Gemma 4 26B, MoE, Q4_K_M). Same 150-token prompt. Same server, same GPU. The only difference: the context window size the server was configured with at startup.
| Config | Context | KV | Overall | Basic 50m | Basic 100m | Explicit Goal | Structured STAR |
|---|---|---|---|---|---|---|---|
| A | 96K | q8_0 | 0.67 | 0.80 | 0.80 | 0.53 | 0.53 |
| B | 128K | f16 | 0.65 | 0.73 | 0.53 | 0.53 | 0.80 |
| C | 192K | q8_0 | 0.75 | 0.87 | 0.80 | 0.53 | 0.80 |
| D | 256K | q8_0 | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 |
Scoring: 1.0 = drive + correct reasoning. 0.0 = confidently recommends walking. 3 trials per variant = 12 trials per config.
Every config not D has a performance cliff. A drops to 0.53 on two variants. B is erratic — 0.53 on two variants, 0.80 on structured. C holds better but still drops to 0.53 on explicit-goal. Only D is flat across all 12 trials. The 256K config never fails.
basic_50 trial 1: score=0.6
basic_50 trial 2: score=0.8
basic_50 trial 3: score=1.0
basic_100 trial 1: score=0.8
basic_100 trial 2: score=0.8
basic_100 trial 3: score=0.8
explicit_goal trial 1: score=0.8
explicit_goal trial 2: score=0.0 "Since the car wash is only 100 meters away..."
explicit_goal trial 3: score=0.8
structured trial 1: score=0.8
structured trial 2: score=0.0 "What I am trying to accomplish: To transport myself..."
structured trial 3: score=0.8
basic_50 trial 1: score=0.6
basic_50 trial 2: score=0.8
basic_50 trial 3: score=1.0
basic_100 trial 1: score=0.8
basic_100 trial 2: score=0.8
basic_100 trial 3: score=0.8
explicit_goal trial 1: score=0.8
explicit_goal trial 2: score=0.8
explicit_goal trial 3: score=0.8
structured trial 1: score=0.8
structured trial 2: score=0.8
structured trial 3: score=0.8
| Config | Context | KV | Gen (short) | Gen (full) | Prompt Processing | VRAM |
|---|---|---|---|---|---|---|
| A | 96K | q8_0 | 121 t/s | 74 t/s | 2288 t/s | 19.4 GB |
| D | 256K | q8_0 | 121 t/s | 46 t/s | 1278 t/s | 21.4 GB |
// 96K q8_0 — full context fill prompt: 78656t in 34380ms = 2287.8 tok/s gen: 50t in 675ms = 74.1 tok/s VRAM: 19369 MiB // 256K q8_0 — full context fill prompt: 209731t in 164079ms = 1278.2 tok/s gen: 50t in 1107ms = 45.2 tok/s VRAM: 21351 MiB
74 vs 46 t/s is real. 4.8 vs 2.8 GB headroom is meaningful. But the prompt never filled either context. The same ~150 tokens fed into A and D should, in theory, produce the same result. They don't.
Config C (192K q8_0) scored a 0.0 on one trial. I challenged it:
explicit_goal trial 1: score=0.0 (confidently recommended walking) -> recovery prompt sent recovery: score=0.8 (corrected to drive)
The model knew the right answer. It just didn't surface it on first pass. Published recovery rate across the benchmark: 98.3%. The information is present in the weights but suppressed by attention dynamics at smaller context sizes. More slack, or a nudge with a recovery prompt, and correct reasoning emerges.
The question is not just how much context you need for your prompt. It is how much context the model needs to think well.
Let me be precise about the mechanism, because it is the entire point of this post.
A transformer layer computes attention as a weighted combination of value vectors. For a given query position, the weight on each key position is proportional to exp(query·key / √d). The softmax normalizes these scores across all key positions. This is important: softmax is a global normalization. The sum of all attention weights is 1.
Now consider a 150-token prompt. The model has, say, 96K slots of KV cache allocated. But only 150 of them contain real activations. The rest are padding — zeros, or whatever the causal mask defaults to for unseen positions.
┌──────────────────────────────────────────────────────────────┐
│ KV Cache Allocation │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─── ─── ───┐ │
│ │ tok1│ tok2│ tok3│ ... │tok150│ pad │ pad │ pad ... │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─── ─── ───┘ │
│ │
│ ◄─────── 150 real tokens ───────►◄─── (ctx_size - 150) ──►│
│ empty / padding │
└──────────────────────────────────────────────────────────────┘
Figure 1 — When you allocate a 96K or 256K KV cache but only use 150 tokens, most slots are padding.
The softmax normalizes over all K cache slots — real and padding alike. If padding positions produce very negative logits (near-zero key vectors), they contribute negligible mass in the denominator. The effective softmax is over only the real tokens.
But the distribution of attention within those real tokens depends on the softmax temperature, which is implicit in the scale of the logits. And the scale of the logits depends on the norm of the query and key vectors — which is modulated by the residual stream dynamics, which is modulated by… everything. Including how many padding slots exist.
The real mechanism is not in the forward pass arithmetic — it is in how the model is trained to distribute its attention heads. During pretraining, the model sees a distribution over sequence lengths up to 262K. Over billions of tokens, attention heads specialize into roles:
Here is the key insight: head specialization depends on the distribution of sequence lengths seen during training. A head that learns to spread attention broadly needs enough "empty runway" to do so. When you truncate the context at inference time, you are not just giving the model less memory — you are removing the allocation that certain heads expect to perform their function correctly.
# 96K context: Local pattern-matcher heads dominate Layer 12, Head 7 (local): ████████████████████░░░░ 96% on "50 meters" Layer 12, Head 14 (mixed): ████████░░░░░░░░░░░░░░ 40% on "50 meters" Layer 12, Head 22 (global): ██░░░░░░░░░░░░░░░░░░░░ 10% on "50 meters" # 256K context: Global heads activate differently Layer 12, Head 7 (local): ████████████████████░░░░ 96% on "50 meters" (same) Layer 12, Head 14 (mixed): ██████░░░░░░░░░░░░░░░░ 28% on "50 meters" ← changed Layer 12, Head 22 (global): ████░░░░░░░░░░░░░░░░░░ 18% on "50 meters" ← changed Figure 2 — Hypothetical attention redistribution. Head 7 is invariant; heads 14 and 22 shift because their internal gating responds to total cache size.
This is a learned prior. During training, distributed-attention heads learn to condition their behavior on the presence of "slack" — empty cache slots that signal "you have room to spread out." When those slots are removed at inference, those heads either don't activate as effectively or shift toward the local-pattern regime.
Even tiny changes in attention distribution propagate through the model's 30 layers:
# Amplification chain through Gemma 4's 30 layers Layer 2: Δattention ≈ 0.001 (barely measurable) ↓ Layer 5: Δresidual ≈ 0.02 (visible in norms) ↓ Layer 10: ΔMLP-output ≈ 0.08 (SwiGLU amplifies) ↓ Layer 15: Δhead-spec ≈ 0.15 (some heads shift regime) ↓ Layer 22: Δexpert-routing ≈ discrete (MoE gate flips) ↓ Layer 30: Δlogit ≈ 0.5–1.5 (enough to flip the answer) Figure 3 — Hypothetical amplification cascade. Tiny initial differences compound through nonlinearities, residual interactions, and discrete MoE routing.
H(h | C) = H₀(h) + γh · f(σ)
This hypothesis makes testable predictions:
KV quantization to q8_0 costs zero reasoning. The bottleneck is attention compute, not cache precision. f16 vs q8_0 showed no measurable effect.
Add a STAR prompt. "Before answering, write down what you are trying to accomplish." Free improvement across most configs — gives distributed heads a stronger goal signal.
Context window size affects reasoning independent of prompt length. This is the core finding. Don't assume a small prompt means a small context is optimal.
The 256K config never fails. Zero variance across all 12 trials. Config D is the only configuration that handles every prompt variant correctly every time.
I'm not saying everyone should run 256K. Speed matters. 46 t/s is noticeably slower than 74 t/s. 2.8 GB headroom on a 24 GB card is tight. But the trade-off is deeper than "how big does my prompt need to be." You're not just allocating storage for tokens. You're allocating the workspace the model uses to reason.
One RTX 3090, one GGUF, one systemd file. Every result is reproducible.
$ /home/dev/llama.cpp/build/bin/llama-server -m gemma-4-26B-A4B-it-ultra-uncensored-heretic-Q4_K_M.gguf -c 32768 -ngl 99 llama_model_loader: - kv 9: gemma4.block_count u32 = 30 llama_model_loader: - kv 10: gemma4.context_length u32 = 262144 print_info: model params = 25.23 B print_info: n_ctx_train = 262144
Model: llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic-GGUF (Q4_K_M, 15.63 GB). 30 layers, 128 experts (8 active), 25.2B params, 262K native context. Decensored variant (Heretic v1.2.0).
[Service] ExecStart=/home/dev/llama.cpp/build/bin/llama-server \ -m /home/dev/models/gemma4-26B-moe/gemma-4-26B-A4B-it-ultra-uncensored-heretic-Q4_K_M.gguf \ --host 0.0.0.0 --port 8081 \ -ngl 99 -c 262144 -t 8 --parallel 1 --flash-attn on \ --cache-type-k q8_0 --cache-type-v q8_0 \ --reasoning off
Note the --reasoning off flag. Without it, Gemma4's output goes into reasoning_content instead of content. Bug tracked down empirically.
systemctl --user stop llama-server.service # Edit ~/.config/systemd/user/llama-server.service with config above systemctl --user daemon-reload systemctl --user start llama-server.service
# Tests all 4 configs (each requires server restart) python3 /tmp/car_wash_eval.py # Single config: python3 /tmp/car_wash_eval.py --trial
python3 /tmp/benchmark.py "my-label" 262144
| Resource | Link |
|---|---|
| Original model (Google) | huggingface.co/google/gemma-4-26B-A4B-it |
| Decensored (Heretic) | github.com/Ardhi/Heretic |
| GGUF by llmfan46 | huggingface.co/llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic-GGUF |
| car-wash-evals (benchmark harness) | github.com/ryan-allen/car-wash-evals |
| CMU Heuristic Dominance paper | arxiv.org/abs/2602.21814 |
| Opper AI Roundtable | opper.ai/ai-roundtable |
| llama.cpp | github.com/ggml-org/llama.cpp |
| OhMyOpenAgent (orchestration) | github.com/code-yeongyu/oh-my-openagent |