RWKV-7: Infinite Context, Zero KV Cache — The Local-First Architecture
Related: Memory Leak in Long Conversations: Causes and Fixes · Context Length Explained · VRAM Requirements · Running AI Offline · Planning Tool
The number one complaint in local AI: “my model ran out of VRAM during a long conversation.” You start chatting, everything’s fast, and 30 minutes later your GPU is thrashing or the process crashes. The culprit is the KV cache, a data structure that every transformer builds during inference. It grows with every token in the conversation. More context, more memory, until something breaks.
RWKV-7 doesn’t have a KV cache. It processes each token with a fixed-size state that never grows. A 7B RWKV model uses the same memory at token 100 as it does at token 100,000. Context length has zero effect on VRAM usage. Read that again, because if you’ve spent any time managing memory on local hardware, that sentence should feel like a relief.
The catch: it’s not a transformer, which means the ecosystem is smaller, the community is thinner, and raw benchmark quality is slightly behind on some tasks. But for specific workloads, the no-KV-cache architecture is the better choice.
How RWKV works (without the math)
RWKV (pronounced “RwaKuv”) is an RNN at inference time and a transformer at training time. This hybrid approach means it trains efficiently on GPUs using parallelism (like a transformer) but runs inference sequentially with fixed memory (like an RNN).
Here’s how it differs from a transformer:
A transformer processes your entire conversation at once. Every token can “look at” every other token through the attention mechanism. This is powerful but expensive: the attention computation scales quadratically with sequence length (O(T²) time, and O(T²) memory for naive attention, though optimizations like FlashAttention reduce the memory cost). The KV cache stores the keys and values for every previous token so they don’t need to be recomputed, but that cache grows linearly with conversation length.
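A back-of-envelope sketch makes the linear growth concrete. The defaults below assume a Llama-2-7B-like configuration (32 layers, 32 KV heads of dimension 128, fp16 cache) — real models using grouped-query attention or a quantized cache need proportionally less, but the shape of the curve is the same:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):
    """Rough KV cache size: two tensors (K and V) per layer, per token.
    Defaults sketch a Llama-2-7B-like config with an fp16 cache; models
    with grouped-query attention (fewer KV heads) need much less."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# The cache grows linearly with conversation length:
for tokens in (2_000, 32_000, 128_000):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / 2**30:.1f} GiB")
# prints roughly 1.0, 15.6, and 62.5 GiB
```

Under these assumptions the cache costs about 512 KiB per token, which is why long conversations eat VRAM even when the model weights themselves fit comfortably.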
RWKV processes tokens one at a time and maintains a compressed “state” that summarizes everything it’s seen. This state has a fixed size regardless of how many tokens have passed through it. New tokens update the state, old information gets naturally compressed. Time complexity is O(T), and memory is O(1): the state occupies the same space whether it has absorbed a hundred tokens or a hundred thousand.
RWKV-7 (“Goose”) is the latest version. It introduces “Dynamic State Evolution,” which lets the model update its internal state more expressively than previous versions. The paper (March 2025, arXiv:2503.14456) shows it can do things that fixed-depth attention provably cannot, like state tracking and recognizing all regular languages.
In practice, RWKV trades the ability to look back at any arbitrary token in the conversation (what transformers do with attention) for constant memory and constant speed. It compensates by being smarter about what it puts into its state.
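The shape of the idea can be shown in a toy sketch. This is not RWKV-7’s actual update rule (which uses learned, data-dependent decay — the “dynamic state evolution” mentioned above), just the simplest possible fixed-size recurrence with the same memory profile:

```python
import numpy as np

STATE_DIM = 64  # fixed size; this never grows with context length

def step(state, token_embedding, decay=0.95):
    # Toy linear recurrence: old information decays, the new token mixes in.
    # RWKV-7's real rule is far more expressive, but like this sketch it
    # touches only a constant-size state per token.
    return decay * state + (1 - decay) * token_embedding

rng = np.random.default_rng(0)
tokens = rng.standard_normal((100_000, STATE_DIM))  # stand-ins for embeddings

state = np.zeros(STATE_DIM)
for emb in tokens:            # 100,000 tokens later...
    state = step(state, emb)

print(state.shape)            # ...the state is still (64,): memory is O(1)
```

Contrast this with attention, where answering a question about token 50 requires keeping token 50’s keys and values around forever; here, whatever survives of token 50 lives inside the compressed state.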
The VRAM comparison that matters
This is the table you came here for. Transformer VRAM grows with context length because of the KV cache. RWKV VRAM stays flat.
Approximate memory at different context lengths for 7B-class models (Q4 quantization, single GPU):
| Context length | Transformer 7B (Llama/Qwen) | RWKV-7 7B |
|---|---|---|
| 2K tokens | ~5.5 GB | ~4.5 GB |
| 8K tokens | ~7 GB | ~4.5 GB |
| 16K tokens | ~9 GB | ~4.5 GB |
| 32K tokens | ~12 GB | ~4.5 GB |
| 64K tokens | ~18 GB | ~4.5 GB |
| 128K tokens | ~30+ GB | ~4.5 GB |
The transformer column keeps climbing. The RWKV column doesn’t move. At 2K tokens the difference is small. At 32K the transformer needs roughly 2.7x the memory. At 128K the transformer needs a multi-GPU setup or CPU offloading while RWKV runs comfortably on a single 8GB card.
This is why RWKV matters for local AI. Not because it beats transformers on MMLU. Because it solves the memory growth problem that kills long conversations on consumer hardware.
An int8 RWKV 14B model can run on sequences of any length with about 3GB of VRAM. Try that with a 14B transformer at 32K context.
Available models
RWKV-7 ships in two series. G1 is the current recommendation:
| Model | Parameters | Training tokens | Notes |
|---|---|---|---|
| RWKV-7-G1 0.4B | 400M | ~3T | Tiny, good for experimentation |
| RWKV-7-G1 1.5B | 1.5B | ~5T | Runs on phones and Pi |
| RWKV-7-G1 2.9B | 2.9B | 5.6T | Best benchmarked size, 3B SoTA on multilingual |
| RWKV-7 7B | 7B | Training | Comparable to Llama 3.2 3B quality |
| RWKV-7 14B | 14B | Training | Largest available |
The 2.9B model is the star. It matches Qwen2.5-3B on English benchmarks (71.5% average vs 71.4%) while training on 5.6 trillion tokens compared to Qwen’s 18 trillion. That’s the same quality from 3x less training data, which speaks well for the architecture’s efficiency.
A Hacker News thread discussing these results cites RWKV-7 averaging 72.8% across standard benchmarks versus Llama 3.2-3B’s 69.7%, measured with fewer than a third of Llama’s training tokens.
All models are released under Apache 2.0. The RWKV team also released their 3.1 trillion token multilingual training corpus, which is unusual and appreciated.
GGUF files are available from the RWKV GGUF collection on HuggingFace, with quantizations from Q4_K_M through FP16. The RWKV wiki recommends Q8_0 or higher, noting that lower quantizations degrade quality more than they do for transformers.
Running RWKV-7 locally
Ollama (easiest)
```
ollama run mollysama/rwkv-7-g1:2.9b
```
That’s it. Ollama downloads the Q8_0 GGUF and starts a chat session. The G1 models support thinking mode by default (chain-of-thought traces before answering). Toggle it off with /set nothink if you want direct answers.
Other sizes available:
```
ollama run mollysama/rwkv-7-g1:1.5b
ollama run mollysama/rwkv-7-g1:2.9b-q6_k
```
llama.cpp
Download a GGUF from the RWKV collection, then:
```
./llama-cli -m models/rwkv-7-world-2.9b-Q8_0.gguf \
  -p "You are a helpful assistant" \
  -cnv -t 8 -ngl 99 -n 500
```
For a web interface:
```
./llama-server -m models/rwkv-7-world-2.9b-Q8_0.gguf -ngl 99
```
This gives you an OpenAI-compatible API at http://127.0.0.1:8080.
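Because the endpoint speaks the standard OpenAI chat format, any OpenAI client works against it. Here’s a minimal standard-library sketch — the URL is llama-server’s default, and the model name is informational since the server answers with whatever GGUF it loaded:

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8080/v1/chat/completions"  # llama-server default

def build_payload(prompt, max_tokens=200):
    # Standard OpenAI-style chat body; llama-server routes everything to the
    # loaded GGUF, so the "model" field here is informational.
    return {
        "model": "rwkv-7-world-2.9b-Q8_0",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt):
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# chat("Summarize RWKV in one sentence.")  # requires llama-server running
```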
Other options
RWKV also runs on KoboldCpp, Text Generation WebUI, and SillyTavern. There’s a dedicated rwkv Python package for direct integration, and ChatRWKV for a standalone chat interface. The native rwkv.cpp engine exists too, though llama.cpp has become the more practical choice since it now handles RWKV natively.
Where RWKV wins
Long document processing
Lawyers, researchers, anyone who feeds long documents into a model. A 50-page contract is roughly 25K tokens. With a transformer, that burns 25K tokens of KV cache before you’ve asked a single question. With RWKV, the document streams through the fixed-size state. VRAM doesn’t care how long the document is.
Edge deployment
RWKV-7 on an ARM Cortex-A76 (Raspberry Pi class) generates 16.39 tok/s. For comparison, Llama 2-7B with INT4 on a Raspberry Pi 4 manages 0.11 tok/s. That’s not a typo. RWKV is 150x faster on the same class of hardware because the constant-memory architecture maps naturally to devices with limited RAM and no GPU.
On a desktop NVIDIA RTX 5090, the 7B model at FP16 pushes 10,250+ tok/s in batched inference. Different hardware, same architectural advantage.
Always-on assistants
If you run a local AI assistant that stays active all day, memory stability matters. Transformer-based assistants slowly eat more memory as conversations grow, eventually needing a context reset or restart. RWKV-based assistants use the same memory at hour 8 as they did at hour 1. No memory leaks from the model side, no gradual slowdown, no surprise OOM crashes.
RAG with long contexts
RAG pipelines retrieve chunks of documents and stuff them into the context window. More chunks, more KV cache, more memory. With RWKV, you can feed in as many retrieved chunks as you want without worrying about VRAM limits. The bottleneck moves from memory to quality of the retrieved content, which is where it should be.
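The budgeting difference is easy to see in code. A hypothetical sketch with made-up token counts: a transformer pipeline must check that retrieved chunks, the system prompt, and the question all fit inside the window, while an RWKV pipeline just streams the chunks through its fixed-size state:

```python
def fits_transformer_window(chunk_token_counts, system_tokens=500,
                            question_tokens=100, window=8192):
    # With a fixed context window, retrieved chunks compete for space
    # with the system prompt and the question itself.
    used = system_tokens + question_tokens + sum(chunk_token_counts)
    return used <= window

chunks = [800] * 12                       # twelve ~800-token retrieved chunks
print(fits_transformer_window(chunks))    # False: 10,200 tokens > 8,192 window
# An RWKV pipeline has no such check: the same chunks stream through the
# fixed-size state, and the only question left is whether they were worth
# retrieving in the first place.
```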
Offline and air-gapped systems
RWKV’s predictable resource usage makes it ideal for air-gapped deployments where you can’t afford surprises. You know exactly how much memory the model will use, regardless of what users throw at it. No edge cases that suddenly double VRAM consumption.
Where transformers still win
Raw quality per parameter
At the same parameter count, transformers generally produce more accurate and nuanced outputs on short-context tasks. The attention mechanism’s ability to look back at any token is genuinely useful for complex reasoning where specific earlier details matter. RWKV’s compressed state loses some of that granularity.
The gap has narrowed significantly with RWKV-7 (matching 3B transformers on English benchmarks), but on tasks like creative writing, complex multi-step reasoning, and precise information retrieval from context, transformers still have an edge.
Ecosystem maturity
Every major tool is transformer-first: Ollama, LM Studio, vLLM, Open WebUI. They all support RWKV now (llama.cpp merged RWKV support), but the experience is less polished. You’ll find fewer pre-made Modelfiles, fewer community guides, and a smaller model selection.
Fine-tuning is the biggest gap. There are thousands of Llama and Qwen fine-tunes on HuggingFace. RWKV fine-tunes are scarce. If you need a model customized for your specific domain, transformers offer far more options.
Tool calling and structured output
Function calling, JSON mode, structured output — these features are well-tested on transformer models and less battle-tested on RWKV. If your application depends on reliable structured output, stick with a transformer until RWKV tooling catches up.
Community size
The RWKV Discord is active but small compared to r/LocalLLaMA’s transformer-focused community. When you hit a problem, there are fewer people who’ve solved it before you. The HuggingFace download counts tell the story: RWKV-7 models have hundreds of downloads. Comparable Llama and Qwen models have millions.
When to use which
| Your situation | Use this |
|---|---|
| Long conversations that crash from OOM | RWKV |
| Raspberry Pi or phone deployment | RWKV |
| Processing 50+ page documents | RWKV |
| Always-on assistant with stable memory | RWKV |
| Best quality on short tasks | Transformer |
| Need specific fine-tunes | Transformer |
| Tool calling and structured output | Transformer |
| First local AI setup (ecosystem support) | Transformer |
| RAG with large context windows | RWKV |
| Air-gapped/offline with predictable resources | RWKV |
The bigger picture
RWKV, Mamba, and LFM2 are all attacking the same problem from different angles: transformer inference is expensive, and KV cache growth is the specific bottleneck that hurts local users most. Each offers a different trade-off between quality and efficiency.
RWKV is the most mature of the three for local deployment. It has llama.cpp and Ollama support, GGUF files ready to download, and a 2.9B model that competes with 3B transformers on benchmarks. The 7B and 14B models push quality higher for users with more hardware.
The architecture doesn’t need to replace transformers everywhere. It needs to be good enough for the specific cases where constant memory matters more than peak quality. For long documents, edge devices, always-on assistants, and memory-constrained hardware, it already is.
Start with the 2.9B on Ollama. If the quality works for your use case, you just found a model that will never run out of VRAM no matter how long the conversation goes.
Check if your GPU can handle it with the VRAM Calculator. For more on why transformers eat memory over time, read Memory Leak in Long Conversations. And for the broader view on post-transformer architectures, see Beyond Transformers: 5 Architectures.