Best Local LLMs for Summarization in 2026
More on this topic: Best Local LLMs for RAG · Context Length Explained · VRAM Requirements · Local AI Privacy
Summarization is the use case where local AI makes the most sense. Not because local models are better than GPT-4 at condensing text; they're not. But because the documents worth summarizing are exactly the ones you shouldn't upload to a cloud API: legal briefs, medical records, financial statements, internal memos, client files.
A law firm can’t send privileged documents to OpenAI’s servers. A medical office can’t upload patient records to Claude. A financial advisor can’t pipe client portfolios through any API without compliance headaches. But a local model running on hardware you control? That’s your data staying on your machine, processed by software you own, with no third-party access.
The models have caught up enough to be useful. A 7B model on an 8GB GPU summarizes a 50-page PDF in under a minute with decent accuracy. A 70B model split across two 24GB cards produces summaries that rival GPT-3.5. Here's which model to run at each VRAM tier, how to handle documents that exceed your context window, and the failure modes to watch for.
What Makes a Good Summarization Model
Three things matter for summarization, in order of importance:
1. Context window. A standard page of text is roughly 400-500 tokens. A 30-page contract is ~15,000 tokens. A 200-page report is ~100,000 tokens. If your model’s context window is 8K tokens, you can summarize about 15 pages in a single pass. At 128K tokens, you can handle 200+ pages at once. Bigger context windows mean fewer chunks and better coherence.
2. Instruction following. Summarization prompts are specific: “Summarize the key findings,” “Extract action items,” “Condense to 3 bullet points.” Weak instruction followers ignore the format you request, produce summaries that are too long or too short, or add commentary you didn’t ask for. Strong instruction followers (Qwen 2.5, Gemma 3) stick to the brief.
3. Faithfulness. The model should compress, not invent. A hallucinated detail in a summary (a fabricated statistic, a misattributed quote, an invented clause in a contract) is worse than no summary at all. Larger models hallucinate less, and prompting with "only include information explicitly stated in the document" helps at any size.
How Many Pages Fit in One Pass?
At roughly 450 tokens per page of standard text:
| Context Window | Usable Tokens (~70%) | Pages in One Pass |
|---|---|---|
| 8K tokens | ~5,600 | ~12 pages |
| 16K tokens | ~11,200 | ~25 pages |
| 32K tokens | ~22,400 | ~50 pages |
| 128K tokens | ~89,600 | ~200 pages |
Why 70% usable? The system prompt, your instructions, and the output itself consume context. If you set aside 30% for overhead and generation, 70% is what's available for the actual document. In practice, models degrade when you fill the full context window; staying under 70% gives better results.
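To gauge whether a document fits in a single pass, a rough token estimate from the word count is usually close enough (English prose runs at roughly 1.3 tokens per word). A minimal sketch, assuming the text has already been extracted (pdftotext is covered later in this guide):
# Rough token estimate: ~1.3 tokens per word of English prose
words=$(pdftotext report.pdf - | wc -w)
echo "Estimated tokens: $(( words * 13 / 10 ))"
# Compare against ~70% of your model's context window (e.g. ~22,400 tokens for 32K)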
Best Models by VRAM Tier
8GB VRAM (RTX 3060, RTX 4060, Arc B580)
| Model | Context | VRAM (Q4_K_M) | Strength |
|---|---|---|---|
| Qwen 2.5 7B | 128K | ~4.7 GB | Best instruction following, longest context |
| Llama 3.1 8B | 128K | ~6 GB | Good all-rounder |
| Gemma 3 4B | 128K | ~3 GB | Smallest footprint, surprisingly capable |
| Mistral 7B v0.3 | 32K | ~4.5 GB | Fast but limited to ~50 pages |
Pick Qwen 2.5 7B. 128K context means you can summarize a 200-page document in a single pass without chunking. At Q4 it uses ~4.7GB, leaving room for the KV cache needed for long contexts. Instruction following is the strongest in the 7B class: it respects formatting requests and stays faithful to the source.
Gemma 3 4B is the budget option at just 3GB. The 128K context window is real and it handles summarization well for its size. If you’re on a laptop with shared VRAM or need to run an embedding model alongside it, this is the smallest option that still produces usable summaries.
Mistral 7B v0.3 is fastest for short documents but the 32K context limits you to ~50 pages. For anything longer you’ll need chunking, which the 128K models avoid.
# 8GB VRAM summarization setup
ollama pull qwen2.5:7b
# Summarize a file via CLI
cat document.txt | ollama run qwen2.5:7b "Summarize the following document in 5 bullet points:"
KV cache note: 128K context with a 7B Q4 model needs significant VRAM for the KV cache. On 8GB, you'll realistically max out around 32-64K tokens of actual input; for full 128K utilization, you need 12GB+. On 8GB VRAM, treat it as a ~50-page single-pass limit.
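Note that Ollama defaults to a small context setting regardless of what the model supports, so for long documents you need to raise num_ctx yourself. One way is a custom Modelfile (a sketch: the 32K value matches the practical 8GB ceiling above, and the derived model name is arbitrary):
# Build a long-context variant of the model by overriding num_ctx
cat > Modelfile <<'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 32768
EOF
ollama create qwen2.5-7b-32k -f Modelfile
# Use it exactly like the base model
pdftotext report.pdf - | ollama run qwen2.5-7b-32k "Summarize this report in 5 bullet points:"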
12-16GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB)
| Model | Context | VRAM (Q4_K_M) | Strength |
|---|---|---|---|
| Qwen 2.5 14B | 128K | ~9 GB | Best quality at this tier |
| Qwen 3 14B | 128K | ~9 GB | Hybrid thinking mode |
| Gemma 3 12B | 128K | ~7 GB | QAT models, leaves VRAM headroom |
| Qwen 3 30B-A3B | 128K | ~15 GB | MoE: 30B quality, 3B inference cost |
Pick Qwen 2.5 14B for 16GB. Pick Gemma 3 12B for 12GB.
The jump from 7B to 14B is noticeable for summarization. Longer summaries stay coherent. Key points are less likely to be dropped. Hallucinated details are rarer. At 16GB, you have enough VRAM for the model plus a full 128K KV cache: true 200-page single-pass summarization.
Qwen 3 14B adds hybrid thinking mode: it can reason step-by-step before summarizing, which helps with complex documents where the key points aren't obvious (financial reports, technical papers). The thinking tokens consume extra context but the summaries are more analytical.
Gemma 3 12B at ~7GB Q4 on a 12GB card leaves 5GB for KV cache, enough for ~50-80K tokens of input. Google's QAT (Quantization-Aware Trained) variants maintain near-full-precision quality at int4 sizes, which matters for faithfulness.
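If you want the QAT build specifically, Ollama hosts tagged variants; the tag below is the one listed in the library at the time of writing (check the gemma3 library page if it has changed):
# Pull the quantization-aware-trained variant instead of the default quant
ollama pull gemma3:12b-it-qat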
The sleeper pick: Qwen 3 30B-A3B. It's a mixture-of-experts model, with 30B total parameters but only 3B active per token. At Q4 it needs ~15GB. You get 30B-level summarization quality with 3B-level inference speed. The tradeoff: higher disk usage (~17GB download) and slightly slower loading.
# 16GB VRAM setup
ollama pull qwen2.5:14b
# Summarize a PDF (extract text first)
pdftotext report.pdf - | ollama run qwen2.5:14b "Summarize this report. Include: key findings, recommendations, and any numerical data mentioned."
24GB VRAM (RTX 3090, RTX 4090)
| Model | Context | VRAM (Q4_K_M) | Strength |
|---|---|---|---|
| Qwen 2.5 72B | 128K | ~42 GB | Needs offloading or 2x GPUs |
| Llama 3.3 70B | 128K | ~42 GB | Matches 405B quality on many tasks |
| Command R 35B | 128K | ~19 GB | Inline citations in summaries |
| Gemma 3 27B | 128K | ~14 GB | Leaves room for full 128K context |
Pick Command R 35B for cited summaries. Pick Gemma 3 27B for maximum context.
At 24GB, your choice is between a larger model with tight VRAM or a smaller model with room for longer contexts.
Command R 35B (~19GB at Q4) is built for grounded generation, producing summaries with inline citations pointing back to specific sections of the source document. For legal and medical use cases, this is significant: you can verify every claim in the summary against the original. No other local model does this natively.
Gemma 3 27B at ~14GB Q4 leaves 10GB for KV cache, enough for the full 128K context window. This means genuine 200-page single-pass summarization at 27B quality. For long documents where you want to avoid chunking entirely, this is the best option on a single 24GB card.
The 70B models (Qwen 2.5 72B, Llama 3.3 70B) produce the best summaries but at ~42GB Q4 they don’t fit on a single 24GB card. You’d need aggressive quantization (Q2/Q3), CPU offloading (slow), or two GPUs. If you have dual 24GB cards, 70B Q4 is the quality ceiling for local summarization.
# 24GB citation-focused summarization
ollama pull command-r:35b
# Summarize with source citations
cat contract.txt | ollama run command-r:35b "Summarize this contract. For each key point, cite the specific section or paragraph it comes from."
When Documents Exceed Your Context Window
A 500-page book is ~225,000 tokens. Even with 128K context, that doesn’t fit. You need chunked summarization.
Map-Reduce (Best for Most Cases)
Split the document into chunks that fit your context. Summarize each chunk independently. Then feed all chunk summaries into a final pass that produces the overall summary.
Document → [Chunk 1]   [Chunk 2]   [Chunk 3]  ...  [Chunk N]
                ↓           ↓           ↓             ↓
           [Summary 1] [Summary 2] [Summary 3] ... [Summary N]
                ↓           ↓           ↓             ↓
                └───────── Final Summary Pass ────────┘
                                    ↓
                            [Final Summary]
This works well because each chunk summary is independent: you can parallelize the work, and a failure in one chunk doesn't corrupt the others. The tradeoff: connections between distant parts of the document can be missed because no single pass sees the whole thing.
Practical implementation with Ollama:
# Split a large document into ~300-line chunks (for typical prose, a few thousand tokens each)
split -l 300 large_document.txt chunk_
# Summarize each chunk
for f in chunk_*; do
cat "$f" | ollama run qwen2.5:7b "Summarize the following section concisely:" > "summary_$f"
done
# Combine chunk summaries and produce final summary
cat summary_chunk_* | ollama run qwen2.5:7b "The following are summaries of sections from a larger document. Produce a coherent overall summary covering the key points from all sections:"
Refine (Best for Narrative Documents)
Process chunks sequentially. Start with a summary of chunk 1. Then pass that summary plus chunk 2 to produce an updated summary. Repeat through all chunks.
Chunk 1 → Summary v1
Summary v1 + Chunk 2 → Summary v2
Summary v2 + Chunk 3 → Summary v3
... → Final Summary
This preserves narrative flow and connections between sections better than map-reduce, but it’s sequential (can’t parallelize) and later chunks can cause the model to “forget” earlier details. Best for books, reports with narrative structure, or meeting transcripts.
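A minimal sketch of the refine loop, reusing the chunk files and Ollama CLI pattern from the map-reduce example above (model choice and prompt wording are illustrative):
# Refine loop: fold each new chunk into the running summary, one chunk at a time
summary=""
for f in chunk_*; do
  summary=$( { echo "Summary of the document so far:"; echo "$summary"; echo;
               echo "Next section:"; cat "$f"; } |
             ollama run qwen2.5:7b "Update the summary so far to incorporate the next section. Return only the updated summary:" )
done
echo "$summary" > final_summary.txt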
Hierarchical (Best for Very Long Documents)
Map-reduce but with multiple levels. Summarize chunks into section summaries, then summarize section summaries into chapter summaries, then summarize chapters into the final summary. Use this for 500+ page documents where even the chunk summaries don’t fit in one context window.
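Concretely, that means running the reduce step more than once. A sketch building on the earlier map-reduce script, grouping chunk summaries by the first letter of split's default aa/ab/... suffixes (the grouping is arbitrary; any batching that keeps each group under your context limit works):
# Second level: collapse groups of chunk summaries into section summaries
for prefix in a b c d; do
  ls summary_chunk_${prefix}* >/dev/null 2>&1 || continue   # skip empty groups
  cat summary_chunk_${prefix}* | \
    ollama run qwen2.5:7b "Summarize these section summaries into one concise section summary:" > "section_${prefix}.txt"
done
# Top level: final summary from the section summaries
cat section_*.txt | ollama run qwen2.5:7b "Produce a coherent overall summary from these section summaries:"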
Tools for Local Summarization
Ollama + Open WebUI (Easiest)
Open WebUI supports drag-and-drop file upload. Drop a PDF, type “summarize this document,” and it extracts the text, passes it to your Ollama model, and returns the summary. No coding required.
Setup:
# Install Ollama + model
ollama pull qwen2.5:14b
# Run Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Open localhost:3000, upload your PDF, and ask for a summary. Open WebUI handles text extraction and chunking automatically.
Critical setting: Increase Ollama's context size in Open WebUI's admin panel (Models > Advanced Parameters). The default 2,048 tokens is far too small for summarization. Set it to at least 16,000, or 32,000-128,000 if your VRAM supports it.
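Before raising the value, it can help to confirm the model's native maximum from the CLI (the model name here is the one pulled in the setup step above):
# Prints model metadata, including the context length the model was trained for
ollama show qwen2.5:14b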
For details, see our Open WebUI setup guide.
Ollama CLI (Fastest for Single Files)
Pipe text directly through Ollama. No GUI needed.
# Summarize a text file
cat meeting_notes.txt | ollama run qwen2.5:7b "Summarize in bullet points:"
# Summarize a PDF (requires pdftotext from poppler-utils)
pdftotext report.pdf - | ollama run qwen2.5:7b "Provide a 3-paragraph executive summary:"
# Summarize with specific instructions
cat legal_brief.txt | ollama run qwen2.5:14b "Summarize this legal brief. Include: parties involved, key arguments, cited precedents, and the requested relief."
LM Studio
Load a model, open the chat, paste your text. LM Studio doesn’t have built-in file upload but handles copy-pasted text well. The advantage: you can adjust temperature (lower = more faithful) and context size per session.
AnythingLLM
Upload documents into workspaces, then query them. AnythingLLM handles chunking, embedding, and retrieval โ making it better for asking questions about documents than pure summarization. But you can prompt it to summarize an entire workspace’s contents. See our RAG guide for setup.
Common Failures and How to Fix Them
Hallucinated Details
The model adds facts, statistics, or quotes that aren't in the original document. This is the most dangerous failure mode: a hallucinated clause in a legal summary or a fabricated number in a financial summary can cause real harm.
Fix: Use explicit instructions: “Only include information explicitly stated in the document. Do not add analysis, interpretation, or external knowledge.” Larger models (14B+) hallucinate less. Lower temperature (0.1-0.3) reduces creativity, which is what you want for summarization.
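One way to pin the temperature per request is Ollama's HTTP API, which accepts sampling options alongside the prompt. A minimal sketch, assuming jq is installed and document.txt is already plain text:
# Build the request with jq (keeps the document safely escaped inside the JSON),
# then post it to the local Ollama API with a low temperature for faithfulness
jq -n --rawfile doc document.txt '{
  model: "qwen2.5:14b",
  prompt: ("Only include information explicitly stated in the document. Do not add analysis, interpretation, or external knowledge. Summarize:\n\n" + $doc),
  options: { temperature: 0.2, num_ctx: 16384 },
  stream: false
}' | curl -s http://localhost:11434/api/generate -d @- | jq -r '.response'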
Lost in the Middle
Key information from the middle of long documents gets dropped. Models attend more strongly to the beginning and end of their context window. A critical finding on page 47 of a 100-page report may be omitted while trivial details from page 1 are preserved.
Fix: Use chunked summarization even when the document technically fits in context. Chunks of 20-30 pages produce more uniform coverage than a single 100-page pass. Or use a two-pass approach: first pass produces a summary, second pass checks the summary against the original and fills gaps.
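The two-pass approach is easy to script with the same CLI pattern (a sketch; file names and model are placeholders, and both the draft and the original must fit in context together):
# Pass 1: draft summary
cat report.txt | ollama run qwen2.5:14b "Summarize this report:" > draft_summary.txt
# Pass 2: audit the draft against the original and fill gaps
{ echo "DRAFT SUMMARY:"; cat draft_summary.txt; echo; echo "ORIGINAL DOCUMENT:"; cat report.txt; } | \
  ollama run qwen2.5:14b "Check the draft summary against the original document. List any key points that were missed, then output a revised summary that includes them:"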
Over-Compression
The summary is so generic it’s useless. “The document discusses various aspects of the project and makes several recommendations.” That’s a summary of every document ever written.
Fix: Be specific in your prompt. Instead of “summarize this,” use “summarize this in 500 words, including: specific numerical findings, named recommendations, deadlines mentioned, and open questions.” Give the model a structure to fill rather than a blank canvas.
Fabricated Quotes
The model attributes statements to people who didn’t make them, or rewrites quotes inaccurately. Especially common with meeting transcripts and interview summaries.
Fix: Instruct the model to paraphrase rather than quote: “Do not use direct quotes. Paraphrase all statements.” If you need exact quotes, use Command R 35B with its citation grounding, or manually verify every quoted passage.
The Privacy Use Case
This is where local summarization isn't just convenient; it's necessary.
| Sector | Why Local Matters |
|---|---|
| Legal | Attorney-client privilege. Uploading client documents to OpenAI’s servers may waive privilege in some jurisdictions. |
| Medical | HIPAA compliance. Patient records cannot be processed by cloud APIs without BAAs and extensive compliance. |
| Financial | Regulatory requirements (SEC, FINRA). Client financial data has strict handling rules. |
| Government | Classified or sensitive documents. Air-gapped local inference is the only option. |
| Corporate | Trade secrets, M&A documents, internal investigations. Cloud APIs create discovery risk. |
A law firm summarizing a 500-page deposition. A medical office condensing patient history for a referral. A financial advisor extracting key terms from a prospectus. In every case, the documents are too sensitive for cloud APIs, and manual summarization takes hours.
A local 14B model on a $300 GPU processes these documents in minutes, on hardware the organization controls, with no data leaving the building. The summaries aren't as polished as GPT-4's. But they're private, they're fast, and they're good enough to be useful, which, for sensitive documents, is all that matters.
For a deeper look at what’s actually private (and what leaks), see our local AI privacy guide.
The Recommendation
| Your Setup | Model | Approach |
|---|---|---|
| 8GB VRAM, short docs (<50 pages) | Qwen 2.5 7B (Q4) | Single pass |
| 8GB VRAM, long docs | Qwen 2.5 7B (Q4) | Map-reduce chunking |
| 12GB VRAM | Gemma 3 12B (QAT) | Single pass up to ~80 pages |
| 16GB VRAM | Qwen 2.5 14B (Q4) | Single pass up to ~200 pages |
| 24GB, need citations | Command R 35B (Q4) | Single pass with inline citations |
| 24GB, long docs | Gemma 3 27B (Q4) | Single pass, full 128K context |
| Dual 24GB GPUs | Llama 3.3 70B (Q4) | Best quality, single pass |
| Any VRAM, 500+ pages | Any model above | Hierarchical map-reduce |
Start with Ollama + Open WebUI. Upload a PDF, ask for a summary, see if the quality is good enough. If it is, you’re done. If the model hallucinates or drops key points, move up to the next tier or switch to chunked summarization.
The best summarization model is the largest one that fits your VRAM with room for context. Context window matters more than raw model size: a 7B model with 128K context that sees the whole document produces better summaries than a 70B model with 8K context that only sees the first 15 pages.