Best Local LLMs for Translation: What Actually Works
📚 Related: Qwen Models Guide · VRAM Requirements · Best LLMs for Chat · Voice Chat with Local LLMs
Translation is one of the strongest use cases for running AI locally. Your documents stay on your machine. No API costs per character. No rate limits when you’re processing a thousand files. And unlike chat or coding, translation has dedicated small models that punch well above their weight — you don’t need a 70B general model to get good results.
The catch: picking the right model depends entirely on which languages you need and what hardware you have. A 300MB Opus-MT model translates French-to-English faster and often better than a 14B parameter general LLM. But that same model can’t handle Swahili, while NLLB-200 covers it natively.
Here’s what actually works, what doesn’t, and which model fits your setup.
Two Approaches: General LLMs vs Dedicated Translation Models
Before picking a model, understand the split.
General LLMs (Qwen, Llama, Mistral) can translate because they were trained on multilingual text. You prompt them with “Translate this to Spanish” and they do it. They’re flexible — same model handles translation, summarization, chat, code. But they’re large, and their translation quality varies wildly by language pair.
Dedicated translation models (NLLB, MADLAD-400, Opus-MT) were built specifically for translation. They’re smaller, faster, and often better for the task — but translation is all they do. No chat, no summarization, no reasoning.
| Approach | Strengths | Weaknesses |
|---|---|---|
| General LLM (Qwen, Llama) | Flexible, handles context/idiom, good for high-resource pairs | Large, slower, weak on rare languages |
| Dedicated (NLLB, Opus-MT) | Small, fast, broad language coverage, purpose-optimized | Translation only, less natural with idioms |
If you’re already running a general LLM for other tasks, try it for translation first. If you need dedicated translation infrastructure — batch processing, rare languages, maximum speed — use a purpose-built model.
General LLMs for Translation
Qwen 2.5: Best All-Around
Qwen 2.5 is the strongest general LLM for translation work. Alibaba trained it with explicit multilingual focus — 29 languages at the 7B size, with strong coverage across Chinese, English, and European languages.
Best for: Chinese-English (best in class at this size), European language pairs, any translation where you also want the model for other tasks.
Quality: On high-resource pairs like English-Spanish or English-German, Qwen 2.5 7B produces translations that read naturally — not the stilted output you’d expect from a 7B model. For Chinese-English specifically, it outperforms models twice its size.
Setup:
ollama run qwen2.5:7b
Then prompt with:
Translate the following text to Spanish. Output only the translation, no explanations.
[your text here]
VRAM: ~5GB at Q4 (7B), ~20GB at Q4 (32B for best quality)
Llama 3: Strong European, Weaker Asian
Llama 3’s training data skewed heavily toward English and European languages. The result: excellent French, Spanish, German, Portuguese, and Italian translation. Weaker Japanese, Chinese, and Korean.
Best for: European language pairs where you want a non-Chinese-company model.
VRAM: ~5GB at Q4 (8B), ~20GB at Q4 (70B with multi-GPU)
Mistral/Mixtral: French Advantage
Mistral is a French company, and it shows. Mistral and Mixtral models handle French and European Romance languages particularly well. Mixtral 8x7B gives you strong multilingual performance if you have 24GB+ VRAM.
Best for: French-centric translation, European language pairs.
VRAM: ~5GB (Mistral 7B), ~26GB (Mixtral 8x7B at Q4)
Command-R: Built for Multilingual
Cohere’s Command-R was specifically trained for multilingual use across 10 languages. It handles nuance and idiomatic expressions better than most open models at similar sizes.
Best for: When natural-sounding output matters more than raw accuracy (marketing copy, creative content).
VRAM: ~20GB at Q4 (35B)
General LLM Translation Quality by Language Pair
| Language Pair | Best General LLM | Quality vs Google Translate |
|---|---|---|
| English ↔ Chinese | Qwen 2.5 32B | Comparable or better |
| English ↔ Spanish | Qwen 2.5 7B / Llama 3 8B | Comparable |
| English ↔ French | Mistral 7B / Llama 3 8B | Comparable |
| English ↔ German | Qwen 2.5 7B / Llama 3 8B | Comparable |
| English ↔ Japanese | Qwen 2.5 32B | Slightly below |
| English ↔ Korean | Qwen 2.5 14B | Below |
| English ↔ Arabic | Qwen 2.5 14B | Below |
| English ↔ Swahili | Any general LLM | Significantly below — use NLLB |
The pattern: high-resource European pairs work well even on 7B models. Asian languages need 14B+. Low-resource languages need dedicated models.
Dedicated Translation Models
NLLB-200: The Coverage King
Meta’s No Language Left Behind covers 200 languages — including dozens that general LLMs barely handle. Meta reports an average 44% BLEU improvement over the previous state of the art, with gains of 70%+ on some African and Indian languages.
Model sizes:
| Model | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|---|---|---|---|---|
| NLLB-200 Distilled 600M | 600M | ~2 GB | ~1 GB | Good for common pairs |
| NLLB-200 Distilled 1.3B | 1.3B | ~3 GB | ~2 GB | Sweet spot |
| NLLB-200 3.3B | 3.3B | ~8 GB | ~4 GB | Best quality |
| NLLB-200 54.5B (MoE) | 54.5B | ~110 GB | ~55 GB | Research only |
Setup (Python + HuggingFace):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_name = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Translate English to French
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
translated = model.generate(
**inputs,
forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn")
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
For faster inference, use CTranslate2 with INT8 quantization — 3-4x faster, half the memory:
pip install ctranslate2
# Pre-converted models available:
# OpenNMT/nllb-200-3.3B-ct2-int8
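A minimal usage sketch, assuming you convert the 1.3B model yourself with CTranslate2's converter (the output directory name nllb-1.3b-ct2 is arbitrary; translate_batch and target_prefix are CTranslate2's documented interface for NLLB):
import ctranslate2
import transformers

# One-time conversion (shell):
#   ct2-transformers-converter --model facebook/nllb-200-distilled-1.3B \
#       --output_dir nllb-1.3b-ct2 --quantization int8
translator = ctranslate2.Translator("nllb-1.3b-ct2", device="cuda")  # or device="cpu"
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-1.3B", src_lang="eng_Latn"
)

# CTranslate2 works on token strings, not token ids
source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, how are you?"))
results = translator.translate_batch([source], target_prefix=[["fra_Latn"]])
target_tokens = results[0].hypotheses[0][1:]  # drop the leading language token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))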
Best for: Low-resource languages, broad coverage, batch processing, privacy-critical workflows.
MADLAD-400: Broadest Coverage
Google’s MADLAD-400 supports 419 languages — more than double NLLB’s coverage. Quality is comparable to NLLB on common pairs but the real value is those extra 200+ languages.
Model sizes: 3B, 7.2B, 10.7B
Setup:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("google/madlad400-3b-mt")
tokenizer = AutoTokenizer.from_pretrained("google/madlad400-3b-mt")
# Prefix with target language code
inputs = tokenizer("<2pt> I love pizza!", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# Output: "Eu amo pizza!"
Best for: Extremely rare languages not covered by NLLB, research applications.
Trade-off vs NLLB: Broader coverage but slightly lower quality on common pairs. NLLB 3.3B generally outperforms MADLAD-400 3B on well-resourced language pairs.
Opus-MT: Lightweight Specialists
Helsinki-NLP’s Opus-MT takes the opposite approach: instead of one model for all languages, it provides 1,200+ individual models — one per language direction. Each model is roughly 300MB.
Why this matters: If you only need English-to-Spanish translation, why load a 3GB model that handles 199 other languages you’ll never use?
Setup:
from transformers import pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
result = translator("The weather is nice today.")
print(result[0]['translation_text'])
Best for: Single language pairs, embedded/edge devices, maximum speed, minimal resource usage.
Limitation: Each model handles one direction. English→Spanish and Spanish→English are separate models. For bidirectional work across multiple languages, NLLB is more practical.
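If you do need both directions for a single pair, the pragmatic fix is loading two of these small models side by side. A minimal sketch, assuming the standard Helsinki-NLP/opus-mt-en-es and Helsinki-NLP/opus-mt-es-en checkpoints:
from transformers import pipeline

# One ~300MB model per direction
en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

print(en_to_es("The invoice is attached.")[0]["translation_text"])
print(es_to_en("La factura está adjunta.")[0]["translation_text"])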
Seamless M4T: When You Need Speech Too
Meta’s Seamless M4T v2 handles text-to-text, speech-to-text, text-to-speech, and speech-to-speech translation in a single model. If your workflow involves audio — translating podcasts, meetings, video content — this is the only local option that does it all.
Model sizes: 281M (on-device), 1.2B (medium), 2.3B (large, ~6GB VRAM)
Language coverage: roughly 100 languages for speech input and text, with speech output in about 35 languages.
Quality: 20% BLEU improvement over previous speech translation models. Text-to-text quality matches NLLB 3.3B.
Best for: Workflows that mix speech and text translation. If you only need text, NLLB is simpler.
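For the text-to-text path, a minimal sketch using the transformers SeamlessM4Tv2 classes (the model name and three-letter language codes like eng/fra follow the model card; generate_speech=False returns text only):
from transformers import AutoProcessor, SeamlessM4Tv2Model

model_name = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_name)
model = SeamlessM4Tv2Model.from_pretrained(model_name)

# Text in, text out: skip speech generation entirely
inputs = processor(text="The meeting starts at noon.", src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="fra", generate_speech=False)
print(processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True))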
Quality Reality Check
How do local models actually compare to cloud translation services?
High-Resource Pairs (EN-ES, EN-FR, EN-DE)
Local quality: Excellent. Qwen 2.5 7B and NLLB 3.3B both produce translations that are usable without editing for most purposes. Professional translators would still find errors, but for business correspondence, documentation, and casual use, local models match Google Translate.
Where cloud still wins: Highly idiomatic text, marketing copy, literary translation. DeepL in particular handles tone and register better than any local model.
Medium-Resource Pairs (EN-JA, EN-ZH, EN-KO)
Local quality: Good but imperfect. Qwen 2.5 handles Chinese-English at near-cloud quality. Japanese and Korean are weaker — grammar structures that differ fundamentally from English trip up smaller models. A 32B model handles these noticeably better than a 7B.
Where cloud still wins: Honorifics, formality levels, cultural context. Japanese keigo (politeness levels) is particularly hard for local models under 32B.
Low-Resource Pairs (EN-SW, EN-TL, EN-AM)
Local quality: This is where dedicated models justify their existence. NLLB 1.3B translates Swahili, Tagalog, and Amharic at quality levels that general LLMs can’t touch — because general LLMs barely saw these languages in training.
Cloud comparison: Google Translate covers 249 languages but quality for low-resource pairs varies. NLLB often matches or beats Google on languages like Yoruba, Igbo, and Hausa where Meta specifically invested in training data.
The Honest Summary
| Language Tier | Best Local Option | vs Cloud Services |
|---|---|---|
| High-resource European | Qwen 2.5 7B or Llama 3 8B | 90-95% of cloud quality |
| Chinese-English | Qwen 2.5 14B+ | 95%+ of cloud quality |
| Japanese/Korean | Qwen 2.5 32B | 80-90% of cloud quality |
| Low-resource African/Asian | NLLB 3.3B | Matches or beats Google |
| Rare languages (400+) | MADLAD-400 | Often the only option |
VRAM Requirements Summary
| Model | VRAM Needed | Languages | Best For |
|---|---|---|---|
| Opus-MT (per direction) | ~300 MB | 1 pair per model | Single-pair, max speed |
| NLLB 600M (distilled) | ~2 GB | 200 | Low-resource, tight hardware |
| NLLB 1.3B (distilled) | ~3 GB | 200 | Best quality/size ratio |
| Seamless M4T Large | ~6 GB | ~100 (speech and text) | Speech + text translation |
| Qwen 2.5 7B (Q4) | ~5 GB | 29 | General-purpose + translation |
| NLLB 3.3B | ~8 GB | 200 | Best dedicated quality |
| MADLAD-400 3B | ~7 GB | 419 | Broadest coverage |
| Qwen 2.5 14B (Q4) | ~9 GB | 29 | Better Asian language quality |
| Qwen 2.5 32B (Q4) | ~20 GB | 29 | Best general LLM translation |
If you have 8GB of VRAM or less, dedicated translation models are your best option. You can run NLLB 1.3B and get 200-language coverage on a GTX 1060. A general LLM at that VRAM tier (7B at Q4) translates well for common pairs but can’t touch NLLB’s language breadth.
With 24GB of VRAM, you can run Qwen 2.5 32B for the highest-quality general translation, or NLLB 3.3B alongside other models for dedicated translation tasks.
Practical Setups
Simple Translation: Ollama + Qwen 2.5
The fastest way to start translating locally. Works for high-resource language pairs.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Qwen 2.5 7B
ollama pull qwen2.5:7b
# Translate via API
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "Translate to German. Output only the translation:\n\nThe project deadline has been moved to next Friday.",
"stream": false
}'
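The same endpoint works from any script. A minimal sketch for batch-translating a folder of .txt files with requests (the docs/ and docs_de/ folder names and the prompt wording are placeholders; /api/generate with "stream": false returns the full translation in the response field):
import pathlib
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def translate_text(text, target_lang="German"):
    prompt = f"Translate to {target_lang}. Output only the translation:\n\n{text}"
    resp = requests.post(
        OLLAMA_URL,
        json={"model": "qwen2.5:7b", "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

out_dir = pathlib.Path("docs_de")
out_dir.mkdir(exist_ok=True)
for src in pathlib.Path("docs").glob("*.txt"):
    translated = translate_text(src.read_text(encoding="utf-8"))
    (out_dir / src.name).write_text(translated, encoding="utf-8")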
Pros: Zero setup beyond Ollama. Same model handles translation and everything else. Cons: Slower than dedicated models. Weaker on rare languages.
Production Quality: NLLB via HuggingFace
For batch translation or language pairs where quality matters.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model_name = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(
model_name,
torch_dtype=torch.float16
).cuda()
def translate(text, src_lang, tgt_lang):
tokenizer.src_lang = src_lang
inputs = tokenizer(text, return_tensors="pt").to("cuda")
translated = model.generate(
**inputs,
forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
max_length=512
)
return tokenizer.batch_decode(translated, skip_special_tokens=True)[0]
# English to Swahili
print(translate("Hello, how are you?", "eng_Latn", "swh_Latn"))
# French to Japanese
print(translate("Bonjour le monde", "fra_Latn", "jpn_Jpan"))
Pros: 200 languages, optimized for translation, smaller than general LLMs. Cons: Python required. No chat capability.
Document Translation: Chunk and Reassemble
For translating long documents, you need to chunk the text to stay within context limits while keeping paragraph structure intact.
def translate_document(text, src_lang, tgt_lang, max_chunk=400):
paragraphs = text.split('\n\n')
translated_paragraphs = []
for paragraph in paragraphs:
if len(paragraph.split()) <= max_chunk:
translated_paragraphs.append(
translate(paragraph, src_lang, tgt_lang)
)
else:
# Split long paragraphs at sentence boundaries
sentences = paragraph.split('. ')
chunk = []
for sentence in sentences:
chunk.append(sentence)
if len(' '.join(chunk).split()) > max_chunk:
translated_paragraphs.append(
translate('. '.join(chunk[:-1]) + '.', src_lang, tgt_lang)
)
chunk = [sentence]
if chunk:
translated_paragraphs.append(
translate('. '.join(chunk), src_lang, tgt_lang)
)
return '\n\n'.join(translated_paragraphs)
Key gotcha: Don’t split mid-sentence. Translation models need sentence-level context to produce correct grammar. Splitting at word boundaries produces garbage.
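Naive splitting on '. ' also breaks on abbreviations like "Dr." or "e.g.", so for messier text a real sentence tokenizer is safer. A small sketch using NLTK's punkt as a drop-in replacement for the split above (recent NLTK versions fetch the data as punkt_tab, hence both downloads):
import nltk
from nltk.tokenize import sent_tokenize

# One-time download of the tokenizer data (older NLTK uses "punkt", newer "punkt_tab")
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def split_sentences(paragraph, language="english"):
    # Handles abbreviations far better than paragraph.split('. ')
    return sent_tokenize(paragraph, language=language)

print(split_sentences("Dr. Smith saw the patient. The visit went well."))
# ['Dr. Smith saw the patient.', 'The visit went well.']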
When to Use Cloud Instead
Local translation isn’t always the right call.
Legal or medical documents: Liability matters. Professional translation services provide certified translations with accountability. Running a legal contract through NLLB and calling it translated is asking for trouble.
50+ language pairs with high quality: If you need production-quality translation across dozens of language pairs simultaneously, DeepL or Google Translate’s API is more practical than managing multiple local models.
Real-time speech translation: Seamless M4T handles this locally, but cloud services (Google, Azure) are more polished for real-time bidirectional speech translation in production.
One-off translations: If you translate three sentences a month, installing models and writing Python scripts is overkill. Just use DeepL.
The break-even point: cloud translation APIs charge per character (on the order of $20 per million characters, with free tiers that cover casual use), so local models only save real money at volume, meaning millions of characters per month rather than thousands. Below that, the setup time isn’t worth it unless privacy is the priority.
The Bottom Line
Local translation is more practical than most people expect. You don’t need a 70B model or a 3090: a $200 GPU running NLLB 1.3B covers 200 languages, coming close to cloud quality on common pairs and matching or beating it on many low-resource ones.
Start here:
- Common European pairs: Ollama + Qwen 2.5 7B (5GB VRAM, zero setup)
- Broad language coverage: NLLB 1.3B distilled (3GB VRAM, 200 languages)
- Single language pair, max speed: Opus-MT (300MB, fastest option)
- Speech + text: Seamless M4T v2 (6GB VRAM)
- Best possible quality: Qwen 2.5 32B (20GB VRAM) for common pairs, NLLB 3.3B (8GB VRAM) for everything else
The models that surprise people most: NLLB at 1.3B parameters translating Yoruba better than a 14B general model, and Opus-MT at 300MB translating French faster than anything ten times its size. Bigger isn’t always better — especially for translation.
📚 More on this topic: Qwen Models Guide · VRAM Requirements · Best LLMs for Chat · Best Models for Summarization