Attention Slack
Why a bigger KV cache improves reasoning even on 150-token prompts

The config I picked for speed (96K, 74 t/s) gives worse reasoning.
The config I dismissed as too slow (256K, 46 t/s) wins every test.
The prompt was ~150 tokens. Context wasn't "used" as storage. It was used as thinking room.

I was wrong. When I set this benchmark up I had a favorite: Config A — 96K context, q8_0 KV, 74 t/s, 4.8 GB VRAM headroom. The prompt was ~150 tokens. Why allocate 256K when you don't need it?

Because the model needs it. Not for information. For processing.

Config D (256K q8_0, 46 t/s) scored 0.80 across every variant — twelve trials, zero variance. Config A (96K q8_0, 74 t/s) dropped to 0.53 on explicit-goal and structured prompts with confident-walk failures. The model was confidently wrong: "Since the car wash is only 100 meters away…" It saw distance, pattern-matched to walking. The goal never entered the chain.

0.80   overall score, 256K q8_0 (best, zero variance)
0.67   overall score, 96K q8_0 (confident failures on two variants)
74 t/s   generation speed, 96K (full context fill)
46 t/s   generation speed, 256K (full context fill)

The Car Wash Problem

"I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" The correct answer is drive — the car must be at the wash. Perplexity, ChatGPT, Claude, Mistral all said walk. CMU quantified the Heuristic Dominance Ratio (HDR): distance cues influence model decisions 8.7–38× more than the implicit goal constraint. The model sees "50 meters" and pattern-matches. The goal constraint never competes.

The problem is not that the model lacks the knowledge — when you challenge it ("How will you get your car washed if you're walking?") it recovers 98.3% of the time. The knowledge is there. But it loses the attention competition.

The Data

Same model (Gemma 4 26B, MoE, Q4_K_M). Same 150-token prompt. Same server, same GPU. The only difference: the context window size the server was configured with at startup.

Config  Context  KV     Overall  Basic 50m  Basic 100m  Explicit Goal  Structured STAR
A       96K      q8_0   0.67     0.80       0.80        0.53           0.53
B       128K     f16    0.65     0.73       0.53        0.53           0.80
C       192K     q8_0   0.75     0.87       0.80        0.53           0.80
D       256K     q8_0   0.80     0.80       0.80        0.80           0.80

Scoring: 1.0 = drive with correct reasoning, 0.0 = confidently recommends walking. 3 trials per variant × 4 variants = 12 trials per config.
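
The table values are consistent with a plain mean over each variant's three trial scores. A minimal sketch using Config A's trial scores from the evidence log below (the averaging is an assumption consistent with the logs, not the harness code):

# Sketch: reproduce Config A's row from its per-trial scores (values taken
# from the evidence log below). Assumes simple averaging; not the harness code.
from statistics import mean

config_a = {
    "basic_50":      [0.6, 0.8, 1.0],
    "basic_100":     [0.8, 0.8, 0.8],
    "explicit_goal": [0.8, 0.0, 0.8],
    "structured":    [0.8, 0.0, 0.8],
}

per_variant = {k: round(mean(v), 2) for k, v in config_a.items()}
overall = round(mean(s for v in config_a.values() for s in v), 2)
print(per_variant)  # {'basic_50': 0.8, 'basic_100': 0.8, 'explicit_goal': 0.53, 'structured': 0.53}
print(overall)      # 0.67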

Every config except D has a performance cliff. A drops to 0.53 on two variants. B is erratic: 0.53 on two variants, 0.80 on structured. C holds up better but still drops to 0.53 on explicit-goal. Only D is flat across all 12 trials. The 256K config never fails.

Config A (96K q8_0) — the "fast" config

EVIDENCE$ grep "A_96K_q8" car_wash_eval.output
    basic_50 trial 1: score=0.6
    basic_50 trial 2: score=0.8
    basic_50 trial 3: score=1.0
    basic_100 trial 1: score=0.8
    basic_100 trial 2: score=0.8
    basic_100 trial 3: score=0.8
    explicit_goal trial 1: score=0.8
    explicit_goal trial 2: score=0.0  "Since the car wash is only 100 meters away..."
    explicit_goal trial 3: score=0.8
    structured trial 1: score=0.8
    structured trial 2: score=0.0  "What I am trying to accomplish: To transport myself..."
    structured trial 3: score=0.8

Config D (256K q8_0) — the "too slow" config

EVIDENCE$ grep "D_256K_q8" car_wash_eval.output
    basic_50 trial 1: score=0.6
    basic_50 trial 2: score=0.8
    basic_50 trial 3: score=1.0
    basic_100 trial 1: score=0.8
    basic_100 trial 2: score=0.8
    basic_100 trial 3: score=0.8
    explicit_goal trial 1: score=0.8
    explicit_goal trial 2: score=0.8
    explicit_goal trial 3: score=0.8
    structured trial 1: score=0.8
    structured trial 2: score=0.8
    structured trial 3: score=0.8

Overall Score by Config

[Bar chart: A 0.67, B 0.65, C 0.75, D 0.80]

Generation Speed (t/s, full context fill)

[Bar chart: A (96K) 74 t/s, B (128K f16) ~52 t/s, C (192K) ~57 t/s, D (256K) 46 t/s; values marked ~ are estimated]

Pass Rate Matrix — 4 Configs × 4 Prompt Variants

[Heatmap of per-variant scores for configs A-D; same values as the table above]

Speed vs Reasoning — The Trade-Off

[Scatter plot, Generation Speed (t/s) vs Score: A (96K, 74 t/s), B (128K, ~52 t/s), C (192K, ~57 t/s), D (256K, 46 t/s, best)]

The Speed Trade-Off

Config  Context  KV     Gen (short)  Gen (full)  Prompt Processing  VRAM
A       96K      q8_0   121 t/s      74 t/s      2288 t/s           19.4 GB
D       256K     q8_0   121 t/s      46 t/s      1278 t/s           21.4 GB

$ python3 /tmp/benchmark.py
// 96K q8_0 — full context fill
  prompt: 78656t in 34380ms = 2287.8 tok/s
  gen:    50t in 675ms = 74.1 tok/s
  VRAM:   19369 MiB

// 256K q8_0 — full context fill
  prompt: 209731t in 164079ms = 1278.2 tok/s
  gen:    50t in 1107ms = 45.2 tok/s
  VRAM:   21351 MiB
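
The t/s figures in the table are just tokens divided by wall-clock seconds from those log lines:

# Sanity check: throughput = tokens / seconds, straight from the log above.
print(78656 / 34.380)    # ~2287.8 tok/s  (96K prompt processing)
print(209731 / 164.079)  # ~1278.2 tok/s  (256K prompt processing)
print(50 / 0.675)        # ~74.1 tok/s    (96K generation)
print(50 / 1.107)        # ~45.2 tok/s    (256K generation)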

74 vs 46 t/s is real. 4.8 vs 2.8 GB headroom is meaningful. But the prompt never filled either context. The same ~150 tokens fed into A and D should, in theory, produce the same result. They don't.

The Surprise: Recovery Effect

Config C (192K q8_0) scored a 0.0 on one trial. I challenged it:

Challenge: "How will you get your car washed if you're walking?"
    explicit_goal trial 1: score=0.0  (confidently recommended walking)
    -> recovery prompt sent
    recovery:                score=0.8  (corrected to drive)
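
Mechanically, the recovery probe is just a second user turn in the same conversation. A minimal sketch against llama-server's OpenAI-compatible chat endpoint (port 8081 as in the systemd unit below; the harness's actual request code may differ):

# Sketch of the challenge/recovery flow. Assumes llama-server's
# OpenAI-compatible /v1/chat/completions endpoint on port 8081
# (see the Reproduction section); not the harness's actual code.
import requests

URL = "http://localhost:8081/v1/chat/completions"

def chat(messages):
    r = requests.post(URL, json={"messages": messages, "temperature": 0.2})
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

history = [{"role": "user", "content":
            "I want to wash my car. The car wash is 50 meters away. "
            "Should I walk or drive?"}]
first = chat(history)

# Crude check: if the first pass recommends walking, send the recovery probe.
if "walk" in first.lower():
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content":
                 "How will you get your car washed if you're walking?"}]
    print(chat(history))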

The model knew the right answer. It just didn't surface it on first pass. Published recovery rate across the benchmark: 98.3%. The information is present in the weights but suppressed by attention dynamics at smaller context sizes. More slack, or a nudge with a recovery prompt, and correct reasoning emerges.

The question is not just how much context you need for your prompt. It is how much context the model needs to think well.

Attention Slack: What's Really Going On

Let me be precise about the mechanism, because it is the entire point of this post.

Attention Competition

A transformer layer computes attention as a weighted combination of value vectors. For a given query position, the weight on each key position is proportional to exp(query·key / √d). The softmax normalizes these scores across all key positions. This is important: softmax is a global normalization. The sum of all attention weights is 1.

Now consider a 150-token prompt. The model has, say, 96K slots of KV cache allocated. But only 150 of them contain real activations. The rest are padding — zeros, or whatever the causal mask defaults to for unseen positions.

┌──────────────────────────────────────────────────────────────┐
│                      KV Cache Allocation                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─── ─── ───┐  │
│  │ tok1│ tok2│ tok3│ ... │tok150│ pad │ pad │ pad  ...  │  │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─── ─── ───┘  │
│                                                              │
│  ◄─────── 150 real tokens ───────►◄─── (ctx_size - 150) ──►│
│                                       empty / padding        │
└──────────────────────────────────────────────────────────────┘
Figure 1 — When you allocate a 96K or 256K KV cache but only use 150 tokens, most slots are padding.

The softmax normalizes over every key slot in the cache, real and padding alike. But padding positions are masked out (or otherwise contribute very negative logits), so they carry negligible mass in the denominator. The effective softmax is over only the real tokens.
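
A tiny numerical sketch of that point, assuming padding slots are masked with a large negative logit before the softmax (illustrative numbers, not measured activations):

# Sketch: global softmax over a 96K cache where only 150 slots are real.
# Assumes padding is masked to a large negative logit; illustrative only.
import numpy as np

real, cache = 150, 96 * 1024
logits = np.full(cache, -1e9)                                # masked padding slots
logits[:real] = np.random.default_rng(0).normal(size=real)   # real key scores

w = np.exp(logits - logits.max())
w /= w.sum()

print(w.sum())         # 1.0  (softmax normalizes globally)
print(w[real:].sum())  # ~0.0 (padding carries negligible mass)
print(w[:real].sum())  # ~1.0 (all mass lands on the real tokens)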

But the distribution of attention within those real tokens depends on the softmax temperature, which is implicit in the scale of the logits. And the scale of the logits depends on the norm of the query and key vectors — which is modulated by the residual stream dynamics, which is modulated by… everything. Including how many padding slots exist.

Head Specialization

The real mechanism is not in the forward-pass arithmetic — it is in how the model is trained to distribute its attention heads. During pretraining, the model sees a distribution over sequence lengths up to 262K. Over billions of tokens, attention heads specialize into roles: local pattern-matchers that lock onto nearby salient tokens, mixed heads, and global heads (distributed integrators) that spread attention broadly across the sequence.

Here is the key insight: head specialization depends on the distribution of sequence lengths seen during training. A head that learns to spread attention broadly needs enough "empty runway" to do so. When you truncate the context at inference time, you are not just giving the model less memory — you are removing the runway those heads expect in order to perform their function correctly.

# 96K context: Local pattern-matcher heads dominate

Layer 12, Head 7 (local):    ████████████████████░░░░  96% on "50 meters"
Layer 12, Head 14 (mixed):   ████████░░░░░░░░░░░░░░  40% on "50 meters"
Layer 12, Head 22 (global):  ██░░░░░░░░░░░░░░░░░░░░  10% on "50 meters"

# 256K context: Global heads activate differently

Layer 12, Head 7 (local):    ████████████████████░░░░  96% on "50 meters"  (same)
Layer 12, Head 14 (mixed):   ██████░░░░░░░░░░░░░░░░  28% on "50 meters"  ← changed
Layer 12, Head 22 (global):  ████░░░░░░░░░░░░░░░░░░  18% on "50 meters"  ← changed
Figure 2 — Hypothetical attention redistribution. Head 7 is invariant; heads 14 and 22 shift because their internal gating responds to total cache size.

This is a learned prior. During training, distributed-attention heads learn to condition their behavior on the presence of "slack" — empty cache slots that signal "you have room to spread out." When those slots are removed at inference, those heads either don't activate as effectively or shift toward the local-pattern regime.

The Amplification Chain

Even tiny changes in attention distribution propagate through the model's 30 layers:

  1. Residual stream superposition: Changes in attention output are added to the residual stream, interacting with every subsequent layer.
  2. MLP nonlinearities: SwiGLU gating amplifies small differences nonlinearly (a toy sketch follows Figure 3).
  3. Head specialization lock-in: Once a few heads shift regime, the change propagates through layer norms and QKV projections and can pull entire head families with it.
  4. MoE routing: Gemma 4 is 8/128 active experts. Small attention shifts can change which experts process the residual stream — a discrete effect.
# Amplification chain through Gemma 4's 30 layers

Layer 2:     Δattention ≈ 0.001  (barely measurable)
     ↓
Layer 5:     Δresidual   ≈ 0.02   (visible in norms)
     ↓
Layer 10:    ΔMLP-output ≈ 0.08   (SwiGLU amplifies)
     ↓
Layer 15:    Δhead-spec  ≈ 0.15   (some heads shift regime)
     ↓
Layer 22:    Δexpert-routing ≈ discrete  (MoE gate flips)
     ↓
Layer 30:    Δlogit ≈ 0.5–1.5    (enough to flip the answer)
Figure 3 — Hypothetical amplification cascade. Tiny initial differences compound through nonlinearities, residual interactions, and discrete MoE routing.
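
To make step 2 of that chain concrete, here is a toy sketch of how a SwiGLU-style MLP block responds to a fixed small perturbation. The weights are random and the magnitudes illustrative; nothing here is measured on Gemma 4:

# Toy sketch: the gain a SwiGLU-style MLP applies to a small perturbation
# depends on the current input, i.e. the block is nonlinear rather than a
# fixed scaling. Random weights; not measured on the actual model.
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 64, 256
W_gate = rng.normal(0, d ** -0.5, (hidden, d))
W_up   = rng.normal(0, d ** -0.5, (hidden, d))
W_down = rng.normal(0, hidden ** -0.5, (d, hidden))

def swiglu_block(x):
    g = W_gate @ x
    silu = g / (1.0 + np.exp(-g))        # SiLU(gate)
    return W_down @ (silu * (W_up @ x))  # gated up-projection, then down-projection

delta = rng.normal(0, 1, d) * 1e-3       # a tiny attention-induced change
for trial in range(3):
    x = rng.normal(0, 1, d)
    gain = np.linalg.norm(swiglu_block(x + delta) - swiglu_block(x)) / np.linalg.norm(delta)
    print(f"gain at input {trial}: {gain:.2f}")  # input-dependent gain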

Formal Hypothesis Statement

Attention Slack Hypothesis:
Let S be the set of attention heads in a transformer LM. For an input sequence of length L and a KV cache of capacity C (where C ≥ L), define the effective slack as σ = C − L. Each head h ∈ S has a slack sensitivity γ_h that modulates its attention entropy:

H(h | C) = H₀(h) + γ_h · f(σ)

where f is an increasing function of slack. Heads with high γ_h (distributed integrators) need slack to reach their trained entropy. Heads with low γ_h (local pattern-matchers) are invariant. At small C, high-γ_h heads collapse toward lower-entropy, local-attention regimes, and heuristic dominance increases.

This hypothesis makes testable predictions:

  1. Entropy should be measurable: Attention head entropy at fixed L should increase with C, with the largest increase concentrated in specific high-γ heads (a measurement sketch follows this list).
  2. Progressive effect: The benefit of larger C should be monotonic but saturating.
  3. Task dependence: Tasks requiring integration of multiple non-salient features (like car wash) show the largest cache-size sensitivity.
  4. Head ablation: Ablating the high-γ (distributed) heads should eliminate the cache-size effect, since the benefit of extra slack runs through them.
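
Prediction 1 in particular is straightforward to operationalize if you can dump per-head attention maps. A minimal sketch of the quantity being measured (generic numpy; extracting attention weights from a specific inference stack is out of scope here):

# Sketch: Shannon entropy of one head's attention distribution, the H(h | C)
# quantity in the hypothesis above. Generic numpy only.
import numpy as np

def head_entropy(attn_weights, eps=1e-12):
    """Entropy of one head's attention over the real key positions."""
    p = np.asarray(attn_weights, dtype=np.float64)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

# The prediction: for high-gamma heads this value should rise with cache
# capacity C even though the prompt (and so the number of real keys) is fixed.
peaked = np.array([0.96, 0.02, 0.01, 0.01])  # local pattern-matcher-like head
spread = np.full(150, 1 / 150)               # distributed-integrator-like head
print(head_entropy(peaked), head_entropy(spread))  # ~0.21 vs ~5.01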

Key Takeaways

Summary of Findings

KV quantization to q8_0 costs nothing in reasoning quality. The bottleneck is attention dynamics, not cache precision: f16 vs q8_0 showed no measurable difference.

Add a STAR prompt. "Before answering, write down what you are trying to accomplish." Free improvement across most configs — gives distributed heads a stronger goal signal.

Context window size affects reasoning independent of prompt length. This is the core finding. Don't assume a small prompt means a small context is optimal.

The 256K config never fails. Zero variance across all 12 trials. Config D is the only configuration that handles every prompt variant correctly every time.

What This Means For You

I'm not saying everyone should run 256K. Speed matters. 46 t/s is noticeably slower than 74 t/s. 2.8 GB headroom on a 24 GB card is tight. But the trade-off is deeper than "how big does my prompt need to be." You're not just allocating storage for tokens. You're allocating the workspace the model uses to reason.

Context window size is not just a limit on how much text you can feed the model. It is a parameter of the model's inference-time computation that affects reasoning quality on every query — even ones that don't remotely threaten the context limit.

Reproduction

One RTX 3090, one GGUF, one systemd file. Every result is reproducible.

Model

MODEL ARCHITECTURE (llama-server startup log)
$ /home/dev/llama.cpp/build/bin/llama-server -m gemma-4-26B-A4B-it-ultra-uncensored-heretic-Q4_K_M.gguf -c 32768 -ngl 99

llama_model_loader: - kv   9:  gemma4.block_count          u32 = 30
llama_model_loader: - kv  10:  gemma4.context_length       u32 = 262144
print_info: model params = 25.23 B
print_info: n_ctx_train = 262144

Model: llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic-GGUF (Q4_K_M, 15.63 GB). 30 layers, 128 experts (8 active), 25.2B params, 262K native context. Decensored variant (Heretic v1.2.0).

CONFIG D — systemd service (256K q8_0): ~/.config/systemd/user/llama-server.service
[Service]
ExecStart=/home/dev/llama.cpp/build/bin/llama-server \
  -m /home/dev/models/gemma4-26B-moe/gemma-4-26B-A4B-it-ultra-uncensored-heretic-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8081 \
  -ngl 99 -c 262144 -t 8 --parallel 1 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --reasoning off

Note the --reasoning off flag. Without it, Gemma 4's output goes into reasoning_content instead of content; this was tracked down empirically.

STEP 1 — START SERVER (systemctl --user)
systemctl --user stop llama-server.service

# Edit ~/.config/systemd/user/llama-server.service with config above
systemctl --user daemon-reload
systemctl --user start llama-server.service
STEP 2 — RUN EVALUATOR (python3 /tmp/car_wash_eval.py)
# Tests all 4 configs (each requires server restart)
python3 /tmp/car_wash_eval.py

# Single config:
python3 /tmp/car_wash_eval.py --trial
STEP 3 — SPEED BENCHMARK (python3 /tmp/benchmark.py)
python3 /tmp/benchmark.py "my-label" 262144

Credits & Links

Resource                              Link
Original model (Google)               huggingface.co/google/gemma-4-26B-A4B-it
Decensored (Heretic)                  github.com/Ardhi/Heretic
GGUF by llmfan46                      huggingface.co/llmfan46/gemma-4-26B-A4B-it-ultra-uncensored-heretic-GGUF
car-wash-evals (benchmark harness)    github.com/ryan-allen/car-wash-evals
CMU Heuristic Dominance paper         arxiv.org/abs/2602.21814
Opper AI Roundtable                   opper.ai/ai-roundtable
llama.cpp                             github.com/ggml-org/llama.cpp
OhMyOpenAgent (orchestration)         github.com/code-yeongyu/oh-my-openagent