128K Context on a 24GB GPU: What I Got Wrong About VRAM
I spent some time recently looking at the math for running a 26B MoE model at 128K context. The KV cache alone should have been well over 30GB. On an RTX 3090 with only 24GB VRAM, my first-principles calculation was straightforward: impossible. The memory ceiling felt hard and unmovable. I assumed the server would OOM before it even finished loading.
I decided to push it anyway. Started the server with --flash-attn on -c 131072 and measured what actually happened. The result contradicted every number I'd calculated. Not only did it run — it achieved 142.7 tokens per second at 128K context.
How? Paged Attention. llama.cpp's flash-attention implementation swaps KV cache pages to CPU RAM seamlessly. The overhead is incredibly low. At 128K, only about 15% of the KV cache pages live in VRAM at any given moment — the rest are in system RAM on the other side of the PCIe bus. And the kernel orchestrates this so efficiently that decode speed stays flat from 32K all the way to 128K.
Context VRAM used Decode tok/s 32K 19,081 MB 140.9 65K 19,721 MB 141.7 98K 20,361 MB 141.5 128K 21,001 MB 142.7
The model: llmfan46's 26B MoE Q4_K_M ultra-uncensored-heretic. Abliterated (near-zero refusal). Vision via mmproj-BF16. Fits entirely in 21GB VRAM at 128K, leaving 3.6GB headroom on a 24GB card. The 31B alternative? At 8K context it was already at 93.5% VRAM with only 40 tok/s — no room to grow.
This is a reminder that first-principles approximations are powerful, but they're not the whole truth. The real bottleneck isn't raw VRAM capacity — it's how intelligently the kernels orchestrate memory across the PCIe bus. Paged attention turns a hard wall into a soft buffer. We have much more headroom on consumer GPUs than the pure math suggests, provided the implementation is clever enough.
26B MoE Q4_K_M, RTX 3090 24GB 하나로 128K context를 142.7 tok/s로 서빙하는 실험. 결론: paged attention이 VRAM 한계를 소프트 버퍼로 바꿔놓았다.
실험 설정: SOV 서버 (Ubuntu 26.04, RTX 3090 24GB, llama.cpp, flash-attn on). 26B MoE Q4_K_M — 16GB GGUF + 1.2GB mmproj. curl로 /v1/completions 호출, n_predict=100, temperature=0. 각 context length에서 3회 측정 후 평균.
틀렸던 가정: KV cache 공식은 맞다. 128K에서 KV cache만 30GB도 맞다. 하지만 "VRAM에 다 들어가야 한다"는 가정이 틀렸다. Paged attention은 필요할 때만 페이지를 VRAM에 로드한다. 나머지는 CPU RAM에 있다.
의미: 24GB GPU 하나로 128K context + vision + near-zero refusal 모델을 142 tok/s로 실시간 서빙할 수 있다. cloud API 없이 local에서 돌아간다. 프라이버시, 비용, 레이턴시 모두 유리하다.
실험은 Panel Squadron 세 명이 돌아가면서 진행했다. Karpathy가 실험 프로토콜을 설계하고, Dettmers가 quantization trade-off를 검증하고, Hotz가 실제로 서버를 내리고 올리면서 측정했다. 모든 raw 데이터는 로컬 리포트에 저장되어 있다.
v1.0 — 2026-05-13 — 최초 작성
