📚 More on this topic: Run Your First Local LLM · CPU-Only LLMs · Quantization Explained

You don’t have a gaming GPU. Maybe you’re on a laptop with integrated graphics, a five-year-old desktop, a Raspberry Pi, or a phone. You’ve heard about people running AI locally and you’re wondering: is that even possible on my hardware?

Yes. And not in a “technically it loads” way – in a “this is genuinely useful” way. The small model landscape changed dramatically in 2024-2025. A 3B model today outperforms a 7B model from 2023 on most benchmarks. A 1.5B model fits in under 2GB of RAM and generates faster than you can read.

This guide covers the best models under 3 billion parameters, what hardware you actually need, and what these models can and can’t do.


Who This Is For

If any of these describe your situation, this guide is for you:

| Hardware | Typical RAM/VRAM | What You Can Run |
| --- | --- | --- |
| Laptop (no dedicated GPU) | 8-16GB RAM | 3B models comfortably, multiple at Q4 |
| Old GPU (GTX 1050 Ti, 1060) | 4-6GB VRAM | 3B models with room to spare |
| Raspberry Pi 5 | 8GB RAM | 1B-3B models at usable speeds |
| Phone (recent Android/iPhone) | 6-8GB RAM | 0.5B-1.5B models, some 3B |
| Chromebook / thin laptop | 4GB RAM | 0.5B-1.5B models at Q4 |
| Desktop with no GPU | 8-32GB RAM | Any sub-3B model, fast |

You’re not the person with an RTX 4090 looking for the optimal model. You’re the person wondering if local AI is even possible on what you’ve got. It is.


Why Small Models Matter Now

Two years ago, a 3B model was barely useful. It could complete sentences and sometimes follow instructions, but the output was rough. You needed at least 7B parameters for anything practical.

That changed fast. Three things happened:

Better training data. Model quality scales with data quality, not just size. Qwen 2.5 3B was trained on 18 trillion tokens of carefully curated data – more than many early 70B models saw.

Knowledge distillation. Smaller models now learn from larger ones during training. Llama 3.2 3B was distilled from Llama 3.1 70B, inheriting capabilities that would otherwise require far more parameters.

Architecture improvements. Grouped-query attention, better tokenizers, and improved positional encodings all help small models punch above their weight.

The result: Qwen 2.5 3B scores 65.6 on MMLU (a broad knowledge benchmark). For comparison, the original Llama 2 7B scored 45.3. A model less than half the size, beating one twice as large – that’s how far things have moved.


The Best Sub-3B Models, Ranked

1. Qwen 2.5 3B – Best All-Rounder

The strongest model in this size class, period. Qwen 2.5 3B matches or beats the previous-generation Qwen 2 7B on most benchmarks while using less than half the memory.

| Metric | Score |
| --- | --- |
| MMLU | 65.6 |
| GSM8K (math) | 79.1 |
| HumanEval (coding) | 42.1 |
| HellaSwag | 74.6 |
| RAM at Q4_K_M | ~2.5 GB |
| File size (Q4_K_M) | ~2.0 GB |

Strong at multilingual tasks, solid at coding, good instruction following. If you can run a 3B model, this is the default choice.

ollama pull qwen2.5:3b

2. Llama 3.2 3B – Best Instruction Following

Meta’s small model, distilled from the 70B. Where Qwen 2.5 3B leads on raw benchmarks, Llama 3.2 3B excels at doing what you ask it to do. It scores 77.4 on IFEval (instruction following) – the highest in its class.

| Metric | Score |
| --- | --- |
| MMLU | 63.4 |
| GSM8K (math) | 77.7 |
| ARC-C (reasoning) | 78.6 |
| IFEval | 77.4 |
| RAM at Q4_K_M | ~2.5 GB |
| File size (Q4_K_M) | ~2.0 GB |

Particularly good at tool use (BFCL V2: 67.0) and multilingual tasks (MGSM: 58.2). If you’re building something that needs reliable instruction following – a chatbot, an assistant, a workflow tool – Llama 3.2 3B is the pick.

ollama pull llama3.2:3b

3. Phi-3.5 Mini (3.8B) – The Overachiever

Technically 3.8B parameters – slightly over the 3B line – but it earns its spot here. Phi-3.5 Mini punches absurdly above its weight. It beats Mixtral 8x7B (a 46.7B MoE model) on math benchmarks and nearly matches GPT-3.5 on MMLU.

| Metric | Score |
| --- | --- |
| MMLU | 69.0 |
| GSM8K (math) | 86.2 |
| HumanEval (coding) | 62.8 |
| BBH (hard reasoning) | 69.0 |
| RAM at Q4_K_M | ~3.0 GB |
| File size (Q4_K_M) | ~2.3 GB |

Best coding and math performance under 4B parameters by a wide margin. The tradeoff: weaker on factual recall (TriviaQA: 64.0 vs GPT-3.5’s 85.8) and somewhat less natural in free-form conversation. If your tasks lean toward reasoning and code, Phi-3.5 Mini is the best you’ll find anywhere near this size.

ollama pull phi3.5

4. Qwen 2.5 1.5B – Best Under 2B

When 3B is too much for your hardware, Qwen 2.5 1.5B is where quality really starts. It scores 60.9 on MMLU – a number that would have been impressive for a 7B model not long ago.

| Metric | Score |
| --- | --- |
| MMLU | 60.9 |
| GSM8K (math) | 68.5 |
| HumanEval (coding) | 37.2 |
| HellaSwag | 67.9 |
| RAM at Q4_K_M | ~1.5 GB |
| File size (Q4_K_M) | ~1.1 GB |

This is the sweet spot for Raspberry Pi 5, phones, and machines with 4GB RAM. Fits easily, runs at 8-15 tok/s on a Pi 5, and handles Q&A, summarization, and simple tasks competently.

ollama pull qwen2.5:1.5b

5. Gemma 2 2B – Google’s Efficient Pick

Google’s entry uses knowledge distillation from larger Gemma models to pack capability into 2B parameters. Its strength is language understanding – strong HellaSwag (72.9), BoolQ (72.7), and factual recall (TriviaQA: 60.4).

| Metric | Score |
| --- | --- |
| MMLU | 52.2 |
| HellaSwag | 72.9 |
| Winogrande | 71.3 |
| TriviaQA | 60.4 |
| RAM at Q4_K_M | ~1.8 GB |
| File size (Q4_K_M) | ~1.1 GB |

Weak on math (GSM8K: 24.3) and coding (HumanEval: 20.1). Don’t pick Gemma 2 2B for those tasks. But for commonsense reasoning, entity extraction, and classification, it’s solid. It also has excellent KV cache efficiency, making it a good choice for serving multiple users.

ollama pull gemma2:2b

6. Llama 3.2 1B – The Ultralight

Meta’s smallest. At 1.24B parameters, it fits in under 1.5GB of RAM at Q4 and runs at 30-60+ tok/s on a desktop CPU. Not the smartest model on this list, but fast enough to feel instant.

| Metric | Score |
| --- | --- |
| MMLU | 49.3 |
| GSM8K (math) | 44.4 |
| ARC-C (reasoning) | 59.4 |
| IFEval | 59.5 |
| RAM at Q4_K_M | ~1.2 GB |
| File size (Q4_K_M) | ~800 MB |

Best for: quick answers, text classification, simple extraction tasks, and prototyping. At this size, you can run it alongside other applications without worry.

ollama pull llama3.2:1b

7. StableLM 2 1.6B – The Veteran

Released in early 2024 by Stability AI, StableLM 2 was state-of-the-art for sub-2B models at launch. It’s since been surpassed by Qwen 2.5 1.5B and Llama 3.2 1B on most benchmarks, but it still has a niche: multilingual support across 7 languages and strong language understanding (HellaSwag: 70.5).

| Metric | Score |
| --- | --- |
| MMLU | 41.8 (Zephyr) |
| HellaSwag | 70.5 |
| Winogrande | 64.6 |
| RAM at Q4_K_M | ~1.3 GB |
| File size (Q4_K_M) | ~1.0 GB |

Unless you specifically need its multilingual coverage, Qwen 2.5 1.5B is the better choice today.

ollama pull stablelm2:1.6b

8. Qwen 2.5 0.5B – The Absolute Minimum

Half a billion parameters. This model fits in under 1GB of RAM, downloads in seconds, and runs at 20+ tok/s on a Raspberry Pi 5. It’s the smallest model that produces coherent, useful output.

| Metric | Score |
| --- | --- |
| MMLU | 47.5 |
| GSM8K (math) | 41.6 |
| HumanEval (coding) | 30.5 |
| RAM at Q4_K_M | ~0.8 GB |
| File size (Q4_K_M) | ~400 MB |

It outperforms Gemma 2 2B on math and coding despite being 5x smaller – a testament to Qwen’s training pipeline. For edge devices, IoT applications, or situations where every megabyte counts, this is the floor.

ollama pull qwen2.5:0.5b

Head-to-Head Comparison

All models at Q4_K_M quantization:

| Model | Params | RAM | File Size | MMLU | GSM8K | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| Phi-3.5 Mini | 3.8B | ~3.0 GB | ~2.3 GB | 69.0 | 86.2 | Coding, math, reasoning |
| Qwen 2.5 3B | 3B | ~2.5 GB | ~2.0 GB | 65.6 | 79.1 | All-around, multilingual |
| Llama 3.2 3B | 3B | ~2.5 GB | ~2.0 GB | 63.4 | 77.7 | Instruction following, chat |
| Qwen 2.5 1.5B | 1.5B | ~1.5 GB | ~1.1 GB | 60.9 | 68.5 | Best quality under 2B |
| Gemma 2 2B | 2B | ~1.8 GB | ~1.1 GB | 52.2 | 24.3 | Classification, extraction |
| Llama 3.2 1B | 1.24B | ~1.2 GB | ~800 MB | 49.3 | 44.4 | Speed, prototyping |
| Qwen 2.5 0.5B | 0.5B | ~0.8 GB | ~400 MB | 47.5 | 41.6 | Edge devices, IoT |
| StableLM 2 1.6B | 1.6B | ~1.3 GB | ~1.0 GB | 41.8 | 34.8 | Multilingual (7 languages) |

What Small Models Are Good At

Sub-3B models won’t replace GPT-4. But for specific tasks, they’re more than good enough – and they do it locally, privately, and for free.

Tasks where sub-3B models deliver:

  • Quick Q&A – “What’s the capital of France?” “How do I reverse a list in Python?” Fast answers, no API call needed.
  • Summarization – Summarize a paragraph, an email, or a short document. Qwen 2.5 3B and Llama 3.2 3B handle this well.
  • Text classification – Sentiment analysis, topic categorization, spam detection. Fine-tuned small models hit 90%+ accuracy on classification tasks (a quick sketch follows this list).
  • Simple coding – Generate a function, fix a syntax error, explain a code snippet. Phi-3.5 Mini scores 62.8 on HumanEval – that’s real coding ability.
  • Translation – Simple translations work well, especially with Qwen (strong multilingual training) and Llama 3.2 (trained on 8 languages).
  • Data extraction – Pull names, dates, and structured fields from unstructured text. Gemma 2 2B is particularly good at this.
  • Autocomplete and suggestions – Fast enough for real-time text completion in editors.
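
For a concrete taste of the classification use case, here’s a minimal shell sketch. It assumes you’ve already pulled qwen2.5:1.5b with Ollama; the prompt wording and the example reviews are just illustrative.

# One-shot classification from the shell: `ollama run MODEL "PROMPT"` answers once and exits
ollama run qwen2.5:1.5b "Classify the sentiment of this review as positive, negative, or neutral. Answer with one word. Review: The battery died after two days."
ollama run qwen2.5:1.5b "Classify the sentiment of this review as positive, negative, or neutral. Answer with one word. Review: Setup took five minutes and it just works."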

What Small Models Can’t Do

Being honest about limits saves frustration.

Don’t expect these:

  • Complex multi-step reasoning – “Plan a two-week trip optimizing for budget and weather across five cities” will produce mediocre output. The model doesn’t have the capacity to hold complex chains of logic.
  • Long-form writing – Blog posts, essays, fiction beyond a few paragraphs. Coherence breaks down as output length increases.
  • Advanced math – Multi-step proofs, calculus, competition-level problems. Even Phi-3.5 Mini’s strong GSM8K score (86.2) drops hard on MATH (41.3) – the harder benchmark.
  • Nuanced analysis – Comparing legal documents, analyzing research papers, weighing subtle tradeoffs. These tasks need more parameters.
  • Large context processing – Most sub-3B models work best with 2048-4096 tokens of context. Feeding them 10-page documents produces unreliable results (see the context-window note after this list).
  • Code generation for complex projects – Small models generate individual functions, not multi-file architectures.
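
One related knob worth knowing: Ollama defaults to a fairly small context window, and you can raise it explicitly instead of letting long inputs get silently truncated. The 4096 value below is just an example; larger windows cost more RAM.

# Start an interactive session, then raise the context window with a REPL command
ollama run qwen2.5:3b
# at the >>> prompt:
/set parameter num_ctx 4096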

The rule of thumb: if a task requires you to think hard about it, a sub-3B model will struggle with it too. For those tasks, step up to 7B-8B models – they only need 4-5GB of RAM at Q4.


Hardware Requirements

Sub-3B models run on almost anything. Here’s exactly what each tier needs:

RAM Requirements (Q4_K_M Quantization)

| Model Size | Weights | Total with Context | Minimum RAM |
| --- | --- | --- | --- |
| 0.5B | ~400 MB | ~0.8 GB | 2 GB |
| 1B | ~600 MB | ~1.2 GB | 2 GB |
| 1.5B | ~900 MB | ~1.5 GB | 4 GB |
| 2B | ~1.1 GB | ~1.8 GB | 4 GB |
| 3B | ~1.7 GB | ~2.5 GB | 4 GB |

“Total with Context” includes the KV cache at 2048-4096 tokens plus runtime overhead. “Minimum RAM” is total system RAM – you need room for the OS and runtime too.
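
If you want to sanity-check the “Weights” column yourself, the arithmetic is simple: parameter count times average bits per weight, divided by 8. A rough sketch – the ~4.7 bits-per-weight figure is an approximation for Q4_K_M, and parameter counts vary slightly by model:

# Back-of-envelope weight footprint for a ~3B model at Q4_K_M
awk 'BEGIN {
  params = 3.1e9   # roughly 3.1 billion parameters
  bits   = 4.7     # Q4_K_M averages a bit under 5 bits per weight
  printf "~%.1f GB of weights, before KV cache and runtime overhead\n", params * bits / 8 / 1e9
}'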

Storage

Downloads are small. A 3B model at Q4 is about 2GB. A 0.5B model is 400MB. You can fit half a dozen sub-3B models in less space than a single 7B model.

| Model | Q4_K_M File Size |
| --- | --- |
| Qwen 2.5 0.5B | ~400 MB |
| Llama 3.2 1B | ~800 MB |
| Qwen 2.5 1.5B | ~1.1 GB |
| Gemma 2 2B | ~1.1 GB |
| Llama 3.2 3B | ~2.0 GB |
| Phi-3.5 Mini | ~2.3 GB |

Speed Expectations

Small models are fast. On most hardware, you’ll be reading slower than the model generates.

Desktop and Laptop CPUs

| CPU | 1B Model (Q4) | 3B Model (Q4) |
| --- | --- | --- |
| Intel i5/Ryzen 5 (laptop) | ~20-40 tok/s | ~8-15 tok/s |
| Intel i7/Ryzen 7 (desktop) | ~30-60 tok/s | ~12-25 tok/s |
| Apple M1/M2 | ~35-70 tok/s | ~15-30 tok/s |
| Apple M3 Pro+ | ~45-90 tok/s | ~20-40 tok/s |
| AMD Ryzen AI 9 (laptop) | ~50 tok/s | ~18-28 tok/s |

Memory bandwidth is the bottleneck, not CPU speed. Dual-channel DDR5 is noticeably faster than DDR4. Single-channel RAM can cut throughput by 50-70% – if your laptop has one RAM stick, that’s your limit.
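
Not sure whether your machine is running single- or dual-channel? One rough check on Linux is to count the populated RAM slots; two or more sticks usually (though not always) means dual-channel. This assumes dmidecode is installed.

# Each matching line is one populated memory slot (requires root)
sudo dmidecode -t memory | grep -Ec "^[[:space:]]*Size: [0-9]"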

For more on CPU inference, see our CPU-only LLM guide.

Raspberry Pi 5

| Model | tok/s | Usability |
| --- | --- | --- |
| Qwen 2.5 0.5B | ~20 | Fast – real-time chat |
| Qwen 2.5 1.5B | ~8-12 | Usable – slight pauses |
| Llama 3.2 3B | ~4-6 | Slow but functional |
| 7B models | ~2-5 | Painful – not recommended |

Stick to 1B-1.5B on a Pi 5 for a good experience. 3B is possible but you’ll feel the wait. Use active cooling – all four cores hit 100% during inference.
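
To confirm the Pi isn’t thermal-throttling mid-generation, run the stock Raspberry Pi OS tools in a second terminal while the model is working:

# Current SoC temperature and throttle flags
vcgencmd measure_temp      # e.g. temp=61.2'C
vcgencmd get_throttled     # throttled=0x0 means no throttling has occurred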

Old GPUs

If you have a dedicated GPU, even an old one, it helps:

| GPU | VRAM | What Fits | Advantage |
| --- | --- | --- | --- |
| GTX 1050 Ti | 4GB | 3B at Q4 comfortably | 2-3x faster than CPU-only |
| GTX 1060 | 6GB | 3B at Q8, or 7B at Q4 | Enough for 7B models |
| RX 580 | 8GB | 3B at FP16 | Full precision, no quantization needed |

Even a 4GB GPU can fully offload a 3B Q4 model (the weights are ~1.7GB), giving a significant speed boost over CPU-only inference.
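
Ollama offloads layers automatically, but if you drive llama.cpp directly the offload is an explicit flag. A sketch, assuming a GPU-enabled llama.cpp build – the binary location and the GGUF filename are placeholders that depend on how you built and what you downloaded:

# -ngl sets how many layers go to the GPU; 99 effectively means "as many as fit"
./llama-cli -m qwen2.5-3b-instruct-q4_k_m.gguf -ngl 99 \
  -p "Explain what a KV cache is in two sentences."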

Phones

| Phone Tier | Model | Speed |
| --- | --- | --- |
| Flagship (Snapdragon 8 Gen 3, A17 Pro) | 1B-3B at Q4 | 8-17 tok/s |
| Mid-range (Snapdragon 7 Gen 1) | 0.5B-1.5B at Q4 | 5-10 tok/s |
| Budget (6GB RAM or less) | 0.5B at Q4 | Barely loads |

Apps like SmolChat (Android) and MLC Chat (iOS/Android) make this straightforward. Be warned: sustained inference drains battery fast – comparable to a graphics-intensive game.


How to Run Them

Ollama (Easiest)

Ollama is one command to install, one command to run:

# Install Ollama (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull qwen2.5:3b
ollama run qwen2.5:3b

That’s it. Ollama auto-detects your hardware and optimizes accordingly. No GPU required.
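
Ollama also listens on a local HTTP API (port 11434), which is the easy path once you want to call the model from scripts rather than the interactive prompt. A minimal example; jq is optional and just extracts the answer field:

# stream=false returns a single JSON object with the full response
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:3b",
  "prompt": "Give me a one-line definition of quantization.",
  "stream": false
}' | jq -r '.response'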

LM Studio (GUI)

Prefer a visual interface? LM Studio gives you a ChatGPT-like UI for local models. Download, search for a model, click run. It handles GGUF quantization selection for you.

Raspberry Pi

On a Pi 5, Ollama works out of the box:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:1.5b
ollama run qwen2.5:1.5b

For better performance on a Pi, consider building llama.cpp with OpenBLAS – it’s 10-20% faster than Ollama for sustained inference.
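
A sketch of that build on a Pi, assuming a recent llama.cpp checkout (the CMake flag names have changed across versions, so check the repo’s build docs if these don’t match yours):

# Build llama.cpp with OpenBLAS on Raspberry Pi OS
sudo apt install -y build-essential cmake git libopenblas-dev
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release -j4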

Phones

  • Android: SmolChat, MLC Chat, or any app that supports GGUF models
  • iOS: MLC Chat or LLM Farm
  • Cross-platform apps: llama.rn (React Native bindings for llama.cpp)

When to Stay Small vs. Upgrade to 7B

This is the real question. Here’s the decision framework:

Stay with sub-3B if:

  • Your hardware maxes out at 4GB RAM/VRAM
  • You’re running on a Raspberry Pi, phone, or edge device
  • Your tasks are quick Q&A, classification, extraction, or simple code
  • Speed matters more than depth (you need real-time responses)
  • You want to run alongside other applications without memory pressure
  • Privacy/offline is the priority and quality is secondary

Step up to 7B-8B if:

  • You have 8GB+ RAM or any GPU with 6GB+ VRAM
  • You need multi-step reasoning, longer outputs, or complex coding
  • Quality per response matters more than speed
  • You’re hitting the limits of 3B output quality

The jump from 3B to 7B is the single biggest quality improvement in local AI. A Llama 3.1 8B at Q4 uses about 5GB of RAM and is dramatically more capable. If your hardware can handle it, it’s worth the step – see our 8GB VRAM guide for details.

But if your hardware can’t handle 7B, don’t feel locked out. A Qwen 2.5 3B today is more useful than a 7B model from two years ago. The floor has risen.


Recommendations by Use Case

| Use Case | Best Pick | Runner-Up | Why |
| --- | --- | --- | --- |
| General chat/Q&A | Qwen 2.5 3B | Llama 3.2 3B | Strongest overall quality |
| Coding assistance | Phi-3.5 Mini (3.8B) | Qwen 2.5 3B | 62.8 HumanEval – real coding ability |
| Math/reasoning | Phi-3.5 Mini (3.8B) | Qwen 2.5 3B | 86.2 GSM8K, untouchable at this size |
| Classification/extraction | Gemma 2 2B | Qwen 2.5 1.5B | Strong language understanding, efficient |
| Raspberry Pi 5 | Qwen 2.5 1.5B | Llama 3.2 1B | Best quality at comfortable Pi speeds |
| Phone | Llama 3.2 1B | Qwen 2.5 0.5B | Fast, low battery drain |
| Edge/IoT | Qwen 2.5 0.5B | Llama 3.2 1B | Under 1GB RAM, 400MB download |
| Multilingual | Llama 3.2 3B | Qwen 2.5 3B | 58.2 MGSM, 8 languages |
| Absolute minimum hardware | Qwen 2.5 0.5B | – | Runs on 2GB RAM, 400MB storage |

The Bottom Line

Small models are no longer a consolation prize. They’re a legitimate way to run AI locally on hardware you already own – no GPU required, no cloud dependency, no subscription.

The practical advice:

  1. Have 4GB+ RAM? Start with ollama pull qwen2.5:3b. You’ll have a working local AI assistant running at 10-25+ tok/s on CPU alone.
  2. Only 4GB of RAM, or on a Pi? Pull qwen2.5:1.5b. It’s 1.1GB to download, needs under 2GB of RAM while running, and handles most simple tasks.
  3. Building for edge/IoT? qwen2.5:0.5b is 400MB and runs on almost anything.
  4. Need coding or math? phi3.5 (3.8B) is the strongest small model for technical tasks.
  5. When you outgrow 3B – and you’ll know when the answers aren’t good enough – a 7B model on 8GB of VRAM is the next step.

Your laptop is more capable than you think. Try it.



Sources: Qwen 2.5 Technical Report, Meta Llama 3.2 Model Card, Phi-3 Technical Report, Gemma 2 Technical Report, StableLM 2 Technical Report, Raspberry Pi 5 LLM Benchmarks, CPU vs GPU LLM Performance