TurboQuant Explained: How Google's KV Cache Trick Cuts Memory 6x With Zero Quality Loss
Every time you send a message to a local LLM, the model stores information about every token it has read so far. That storage is the KV cache, and on a 24GB GPU running Qwen 3.5 27B at 32K context, it can eat 4-6GB of your VRAM – memory that could otherwise hold a larger model or a longer conversation.
Google published a paper called TurboQuant that compresses that cache down to 3-4 bits per element instead of the usual 16. The result: 4-6x less memory for context, 8x faster cache lookups, and zero accuracy loss on benchmarks. The paper dropped on arXiv in April 2025, Google blogged about it in March 2026, and it’s scheduled for presentation at ICLR 2026.
That was the story three days ago when I first published this article. Since then, TurboQuant has broken out of the KV cache and is now being applied to model weights — a much bigger deal. Meanwhile, Georgi Gerganov (the creator of llama.cpp) has opened a PR called attn-rot that delivers most of TurboQuant’s KV cache benefits with almost no downsides, and it’s about to merge into mainline. The timeline for “when can I use this” just moved from months to days.
Here’s how it all works, what it means for your hardware, and what’s changed since last week.
What the KV cache is (and why it eats your VRAM)
When a model processes your prompt, it doesn’t just read and forget. For every token it sees, it creates two vectors: a key (a label for that token’s role in context) and a value (the content the model pulls back in when it attends to that token). These get stored in a lookup table – the KV cache – so the model can refer back to earlier parts of the conversation without reprocessing everything.
The problem: this cache grows linearly with context length. More tokens in the conversation means more key-value pairs stored. At FP16 precision, an 8B model at 32K context uses about 4.6GB just for the cache. A 70B model at 100K context can burn through 20GB+ of cache alone.
Here’s what that looks like on real hardware:
| Model | Context length | KV cache (FP16) | KV cache (TurboQuant 4-bit) |
|---|---|---|---|
| Llama 3.1 8B | 4K | 640 MB | ~120 MB |
| Llama 3.1 8B | 32K | 4.6 GB | ~870 MB |
| Llama 3.1 8B | 128K | 18.4 GB | ~3.5 GB |
| Qwen 3.5 27B | 32K | ~5.4 GB | ~1.0 GB |
That 128K row is the one that matters. Without TurboQuant, running Llama 3.1 8B at its full 128K context on a 24GB GPU means the cache alone takes 18.4GB, leaving barely 6GB for model weights. With TurboQuant, the cache drops to 3.5GB, leaving 20GB for the model. That’s the difference between “impossible” and “comfortable.”
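If you want to sanity-check these numbers against your own model and context window, the arithmetic is simple. Here’s a rough calculator – the layer and head counts below are Llama 3.1 8B’s published config, and exact figures depend on how the runtime pads and books the cache, so don’t expect them to match the table to the megabyte:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_element):
    # one key vector and one value vector per token, per layer, per KV head
    elements = 2 * n_tokens * n_layers * n_kv_heads * head_dim
    return elements * bits_per_element / 8

# Llama 3.1 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
for ctx in (4_096, 32_768, 131_072):
    fp16 = kv_cache_bytes(ctx, 32, 8, 128, bits_per_element=16)
    tq = kv_cache_bytes(ctx, 32, 8, 128, bits_per_element=3.5)
    print(f"{ctx:>7} tokens: {fp16 / 2**30:5.1f} GB at FP16, {tq / 2**30:5.2f} GB at 3.5 bits")
```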
How TurboQuant works
TurboQuant has two parts, and the analogy that helped me understand it is giving directions.
PolarQuant: better coordinates
Standard quantization takes a vector of numbers and rounds each one to fit in fewer bits. Think of it like giving someone turn-by-turn directions: “go 3 blocks north, 2 blocks east, 1 block south.” Each step gets rounded, and the errors accumulate.
PolarQuant converts the vectors from Cartesian coordinates to polar coordinates first. Instead of step-by-step directions, you point at the destination and say “that direction, this far.” A direction (angle) and a distance (magnitude). After a random preconditioning step, these polar values follow a predictable statistical distribution, which means you can quantize them more efficiently with less distortion.
The practical win: PolarQuant eliminates the per-block normalization constants that traditional quantization methods need to store. No calibration data required. No fine-tuning. You just apply it at inference time and it works.
On its own, PolarQuant achieves 4.2x compression with quality that matches or beats existing methods like KIVI and KVQuant.
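To make the polar idea concrete, here’s a toy sketch – not the paper’s actual codebook design, just the shape of it. Randomly rotate the vector so its coordinates look Gaussian, pair them up, and store each pair as a quantized distance and direction (3 bits each per pair, so 3 bits per element):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR of a Gaussian matrix gives a random orthogonal matrix (the "preconditioning" step)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(v, rot, angle_bits=3, radius_bits=3):
    x = rot @ v                         # after rotation, coordinates look i.i.d. Gaussian
    pairs = x.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)   # magnitudes ~ Rayleigh: predictable, no stored scale needed
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angles ~ uniform (wrap-around at ±pi ignored here)
    theta_q = np.round((theta + np.pi) / (2 * np.pi) * (2**angle_bits - 1))
    r_codebook = np.linspace(0, 4.0, 2**radius_bits)  # fixed codebook, assumes ~unit-variance input
    r_q = np.argmin(np.abs(r[:, None] - r_codebook[None, :]), axis=1)
    return r_q.astype(np.uint8), theta_q.astype(np.uint8)

def polar_dequantize(r_q, theta_q, rot, angle_bits=3, radius_bits=3):
    theta = theta_q / (2**angle_bits - 1) * 2 * np.pi - np.pi
    r = np.linspace(0, 4.0, 2**radius_bits)[r_q]
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return rot.T @ pairs.reshape(-1)    # undo the rotation

d = 128
rot = random_rotation(d)
v = rng.standard_normal(d)
v_hat = polar_dequantize(*polar_quantize(v, rot), rot)
print("relative error:", np.linalg.norm(v - v_hat) / np.linalg.norm(v))
```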
QJL: 1-bit error correction
The second piece is the Quantized Johnson-Lindenstrauss transform, or QJL. It pushes each vector through a random projection and keeps only the sign of each output value (+1 or -1). Those sign bits give an unbiased estimate of the residual error that PolarQuant leaves behind.
Together, PolarQuant + QJL achieve 6x compression at 3.5 bits per element with zero accuracy loss on LongBench, needle-in-haystack, and other standard benchmarks.
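The estimator behind this is worth seeing, because it looks like it shouldn’t work. Below is the textbook 1-bit JL inner-product estimate in numpy – the core trick only, not TurboQuant’s exact pipeline, which applies it to the PolarQuant residual rather than the raw key:

```python
import numpy as np

rng = np.random.default_rng(0)

def qjl_encode(k, S):
    # store just the key's norm plus one sign bit per row of the random projection
    return np.linalg.norm(k), np.sign(S @ k)

def qjl_dot(q, k_norm, k_bits, S):
    # unbiased because E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k|| for Gaussian s
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * k_norm * (k_bits @ (S @ q)) / m

d, m = 64, 512
S = rng.standard_normal((m, d))   # in practice this is shared across all keys (one seed)
q, k = rng.standard_normal(d), rng.standard_normal(d)

k_norm, k_bits = qjl_encode(k, S)
print("true dot:", float(q @ k), " 1-bit estimate:", float(qjl_dot(q, k_norm, k_bits, S)))
```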
One interesting finding from the community: most implementers have dropped QJL entirely and put all available bits into better PolarQuant centroids instead. In practice, allocating bits to higher-quality centroids outperforms the theoretical elegance of the 1-bit error correction layer. The math is sound, but raw quality from better centroids wins.
What this actually means for your GPU
Let’s be specific about what TurboQuant does and doesn’t change.
What it does
Longer context on the same hardware. If you’re running Qwen 3.5 27B Q4_K_M on a 24GB GPU, the model weights take ~17GB. That leaves 7GB for the KV cache. At FP16, that’s roughly 40K tokens of context. With TurboQuant at 4-bit, you get 160K+ tokens in the same space.
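A quick back-of-the-envelope version of that math – the ~175 KB-per-token figure is my assumption for a 27B-class model’s FP16 cache cost, not a measured number:

```python
free_vram_bytes = (24 - 17) * 2**30      # 24GB card minus ~17GB of Q4_K_M weights
kb_per_token_fp16 = 175                  # assumed FP16 KV cost per token for a 27B-class model

for label, bits in (("FP16 cache", 16), ("TurboQuant 4-bit cache", 4)):
    bytes_per_token = kb_per_token_fp16 * 1024 * bits / 16
    print(f"{label}: ~{free_vram_bytes / bytes_per_token / 1000:.0f}K tokens of context")
```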
More room for larger models. If a model barely fits in your VRAM with no context headroom, TurboQuant gives you breathing room. A model that previously choked at 8K context might run comfortably at 32K.
Faster cache lookups on long conversations. Google claims 8x speedup on attention logit computation with 4-bit cache vs FP32 on H100s. On consumer GPUs the improvement will be smaller, but the direction is real – less data to read means faster reads.
What it doesn’t do
Does not compress model weights. Update (April 2, 2026): This was true when this article was first published. It’s no longer true — TurboQuant is now being applied to model weights, not just the KV cache. See the new section below for details.
Does not speed up training. This is inference-only.
The “8x speedup” is for cache operations, not overall tok/s. Your total token generation speed depends on many things – memory bandwidth, compute, model size. The cache speedup is one component. Real-world tok/s improvement is meaningful but not 8x across the board.
Community benchmarks
Google tested TurboQuant on H100s with Gemma, Mistral, and Llama. The community has since tested it on consumer hardware.
Official results (Google, H100)
| Method | KV bits | LongBench score | Needle-in-haystack |
|---|---|---|---|
| Full precision (FP16) | 16 | 50.06 | 0.997 |
| TurboQuant | 3.5 | 50.06 | 0.997 |
| TurboQuant | 2.5 | 49.44 | 0.997 |
At 3.5 bits, the scores are identical. At 2.5 bits, there’s a slight LongBench dip but needle-in-haystack accuracy holds.
Community results (consumer hardware)
RTX 4080, Qwen2.5 models (back2matching/turboquant):
| Model | Context | TQ 4-bit speed | FP16 speed | VRAM saved |
|---|---|---|---|---|
| Qwen2.5-3B | 4K | 7.4 tok/s | 2.5 tok/s | 1,048 MB |
| Qwen2.5-7B | 1.8K | 1.4 tok/s | OOM | 444 MB |
| Qwen2.5-0.5B | 8K | 19.8 tok/s | – | 2,070 MB |
That Qwen2.5-7B row is telling. At FP16 it ran out of memory entirely. With TurboQuant 4-bit it runs, period. Not fast, but it runs.
Apple Silicon, Llama 3 8B (helgklaizar/turboquant_mlx):
| Context | Uncompressed KV | TQ 3-bit | Compression |
|---|---|---|---|
| 4K | 64 MB | 12 MB | 5.3x |
| 64K | 1,024 MB | 192 MB | 5.3x |
| 128K | 2,048 MB | 384 MB | 5.3x |
Consistent 5.3x compression across all context lengths. On a MacBook with 36GB unified memory, that’s the difference between running Llama 3 8B at 64K context uncomfortably and running it at 128K context with room to spare.
Quality at different bit-widths
The community consensus after months of testing:
| Bits | Compression | Quality impact |
|---|---|---|
| 4-bit | 3.8x | Indistinguishable from FP16 on 3B+ models |
| 3.5-bit | 4.9x | Zero loss on 8B+, minor on smaller models |
| 3-bit | 4.9x | Noticeable on models under 8B |
| 2-bit | 7.1x | Visible degradation, fine for drafts |
| 1-bit | 12.8x | Research curiosity, not practical |
The sweet spot is 4-bit for reliability or 3.5-bit if you’re willing to accept tiny quality variance on smaller models.
TurboQuant for weights: bigger than the cache trick
Added April 2, 2026
The original TurboQuant paper only targeted the KV cache. The model weights themselves stayed in whatever quantization format you downloaded — Q4_K_M, Q5_K_M, etc. That’s changed.
Community developers have applied TurboQuant’s PolarQuant technique to model weights, and the early results are significant. Reddit user testing shows Qwen3.5-27B running at near-Q4_0 quality but in a file 10% smaller — small enough to fit entirely on a 16GB RTX 5060 Ti. That’s a 27-billion-parameter model on a midrange GPU that launched at $449.
This matters more than the cache compression. The KV cache only becomes a problem at long contexts. Model weights eat your VRAM from the moment you load the model. If TurboQuant weight quantization holds up across more models, it shifts the “what can I run on X VRAM” calculations that drive most GPU buying decisions.
TQ3_4S: the new weight format to watch
The llama.cpp community has introduced TQ3_4S, a new TurboQuant weight format. Early testing shows it’s 2x faster than the earlier TQ3_1S format with better output quality. Community testing is still ongoing, but the trajectory is clear — TurboQuant weight formats are iterating fast.
If you build llama.cpp from source and want to experiment, TQ3_4S is the format to try.
APEX: TurboQuant meets MoE
Added April 2, 2026
APEX is a new quantization technique designed specifically for Mixture-of-Experts architectures — models like Mixtral, DBRX, and DeepSeek-V3 where only a subset of parameters activate per token. APEX outperforms Unsloth Dynamic 2.0 on accuracy benchmarks while producing models that are 2x smaller for MoE architectures.
The performance numbers: 33% faster inference overall, with a 14% speedup in prompt processing attributed to TurboQuant-derived techniques.
This is relevant because MoE models are the most VRAM-hungry architectures in local AI. Mixtral 8x7B already pushes the limits of a 24GB card. If APEX delivers on its benchmarks, MoE models become dramatically more accessible on consumer hardware — exactly the kind of improvement that changes which models you should download.
Implementation status (where can you use it today?)
llama.cpp
The big update: attn-rot (PR #21038).
Georgi Gerganov — the creator of llama.cpp himself — has opened a PR called attn-rot that rotates activations for better KV cache quantization. This isn’t the full TurboQuant implementation, but the community consensus is that it delivers “80% of TurboQuant’s benefit with almost no downsides.” The practical result: Q8 KV cache is now approximately F16 quality.
This PR is about to merge into llama.cpp mainline. Not a community fork. Not a feature request with upvotes and no timeline. The project founder wrote it. Once it lands, every tool that builds on llama.cpp — Ollama, LM Studio, Open WebUI — gets these benefits automatically in their next update.
The full TurboQuant implementation is still tracked in discussion #20969 and feature request #20977 with 212 upvotes. But attn-rot changes the calculus: you don’t need to wait for full TurboQuant to get most of the benefit.
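Why does rotating activations help quantization at all? Keys tend to have a few outlier channels that blow up the quantization scale; an orthogonal rotation smears that energy evenly across channels, and because the same rotation is applied to queries, the attention dot products don’t change. attn-rot’s actual transform lives in the PR – the toy below just demonstrates the effect with a random rotation:

```python
import numpy as np

rng = np.random.default_rng(0)

def absmax_quantize(x, bits=4):
    # simple symmetric per-token quantization, roughly what a q4 KV cache does
    scale = np.abs(x).max() / (2**(bits - 1) - 1)
    return np.round(x / scale) * scale

d = 128
k = rng.standard_normal(d) * 0.5
k[7] = 8.0                               # one outlier channel dominates the scale

rot, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal rotation

plain = np.linalg.norm(k - absmax_quantize(k)) / np.linalg.norm(k)
rotated = np.linalg.norm(rot @ k - absmax_quantize(rot @ k)) / np.linalg.norm(k)
print(f"4-bit relative error without rotation: {plain:.2f}, with rotation: {rotated:.2f}")
```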
Several community forks with full TurboQuant still exist for those who want maximum compression:
- TheTom/llama-cpp-turboquant – the most mature fork, with Flash Attention support and turbo3/turbo4 KV cache types on Metal
- spiritbuun’s fork – CUDA support for RTX 3090+
- Madreag’s fork – RTX 5090 tested, 4.6x KV compression, ~98% of q8_0 prefill speed
Pure C implementation: A standalone C implementation of the TurboQuant paper has appeared — specifically the 1-bit key vector compression via randomized Hadamard transform. This is significant for embedded and edge deployments where Python dependencies aren’t practical.
TQ3_4S weight format: New TurboQuant weight quantization format in community forks, 2x faster than TQ3_1S with better quality. See the weights section above.
MLX (Apple Silicon)
Multiple working implementations:
- helgklaizar/turboquant_mlx – the most production-ready, with an OpenAI-compatible API, tested on DeepSeek R1 Distill 8B, Mistral Nemo 12B, Llama 3/3.2
- A HuggingFace model exists for Qwen3.5-35B with TurboQuant KV compression, showing exact-match quality from 8.5K to 64K context
If you’re on Apple Silicon and comfortable with MLX, this is usable now.
Ollama and LM Studio
Neither supports full TurboQuant yet. Ollama has a feature request (#15051) with 117 upvotes and no maintainer response.
But here’s what changed: both tools pull from llama.cpp, and attn-rot is about to merge into llama.cpp mainline. Once it does, Ollama and LM Studio will pick up the “80% of TurboQuant” benefit in their next llama.cpp sync — likely within weeks, not the Q2-Q3 timeline I originally estimated.
Current workaround: Use Q4_K_M quantization with OLLAMA_FLASH_ATTENTION=1 to reduce VRAM usage from the compute side. Not the same as TurboQuant, but it helps. Once attn-rot lands, Q8 KV cache with attn-rot will be the new default recommendation.
Python (pip install)
The fastest way to try TurboQuant today:
pip install turboquant
The back2matching/turboquant package drops into any HuggingFace model. Version 0.3.0 supports asymmetric K/V compression and layer-adaptive precision (keeps sensitive early/late layers at FP16). Works on any GPU with PyTorch.
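I haven’t verified the package’s exact call signature, so treat the snippet below as the shape of the integration rather than documented usage: the transformers calls are standard, and the commented-out turboquant line is a hypothetical stand-in for whatever hook the package actually exposes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
# import turboquant   # hypothetical usage below; check the package's own docs for real names

model_id = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Hypothetical: store new KV entries at ~4 bits/element, keeping the sensitive
# early/late layers at FP16 (the layer-adaptive precision mentioned above).
# model = turboquant.compress_kv_cache(model, bits=4, fp16_layers=(0, -1))

prompt = "Summarize the key points of this conversation so far."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```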
The Jevons Paradox angle
Memory chip stocks dropped 3-6% when Google blogged about TurboQuant. The logic: if you need 6x less cache memory, you need fewer GPUs. That logic is wrong.
Cheaper inference doesn’t mean less GPU demand. It means people run longer contexts, load bigger models, and find new use cases that weren’t worth the memory cost before. An RTX 3090 that could handle 32K context will now handle 128K – and users will immediately start using 128K. A 24GB card that forced you into Q4 for a 27B model now lets you use Q6 with the memory savings from cache compression.
This pattern – efficiency gains increasing total demand – plays out every time compute gets cheaper. For local AI users, TurboQuant is pure upside: same hardware, more capability. For the industry as a whole, it accelerates adoption.
What to do now (updated April 2, 2026)
The timeline has compressed dramatically since this article first went live three days ago.
If you build from source: You have options today. Grab a llama.cpp fork for full TurboQuant, try the TQ3_4S weight format, or pip install turboquant for the Python path. The community implementations are stable enough for testing.
If you use Ollama or LM Studio: Attn-rot is days from merging into llama.cpp mainline. Once it does, your tools will pick up Q8-as-good-as-F16 KV cache quality in their next update. You don’t need to do anything — just update when prompted. This is the “80% of TurboQuant for free” moment.
If you’re buying hardware: TurboQuant weight quantization changes the math. A 16GB card like the RTX 5060 Ti can now run Qwen3.5-27B — a model that previously needed 24GB. Don’t rush to buy a bigger GPU until the dust settles on TurboQuant weight formats. The “what fits on what” guides across this site (including the VRAM guide) will need updating once TQ weight formats stabilize.
The practical takeaway has expanded: TurboQuant started as a way to get longer context on the same hardware. It’s becoming a way to run larger models on smaller hardware. Both are free upgrades — no new silicon required.