📚 More on this topic: Best Models for Coding · Best Models for Writing · VRAM Requirements

You want a local model you can just talk to. Ask it questions, bounce ideas off it, get help thinking through problems, without sending every thought to OpenAI’s servers.

Chat is where local models have improved the most. The Qwen 3.5 family (released February-March 2026) moved the bar again: the 9B model now has built-in vision, the 27B scores 95.0 on IFEval (the best instruction following in its class), and the 35B-A3B MoE fits on a 16GB GPU while matching models with twice its active parameter count. Meanwhile, Ollama 0.19 just dropped with MLX support that nearly doubles token generation speed on Apple Silicon.

Here’s what to download depending on your GPU.


Best models by VRAM tier

| VRAM | Model | Quant | Speed (est.) | Best for |
|---|---|---|---|---|
| 8 GB | Qwen 3.5 9B | Q4_K_M | ~38-45 tok/s | Best all-around, built-in vision |
| 8 GB | Llama 3.1 8B | Q4_K_M | ~38 tok/s | Widest ecosystem support |
| 8 GB | Gemma 3 12B | QAT int4 | ~35 tok/s | Fits in 6.6GB, multimodal |
| 12 GB | Qwen 3.5 27B | Q3_K_M | ~15-20 tok/s | Dense 27B quality on 12GB |
| 12 GB | Qwen3-14B | Q4_K_M | ~22-30 tok/s | Still strong, good speed |
| 16 GB | Qwen 3.5 35B-A3B | Q4_K_M | ~30-40 tok/s | MoE, 35B quality at 3B speed |
| 16 GB | Qwen 3.5 27B | Q4_K_M | ~18-22 tok/s | Dense, all parameters active |
| 24 GB | Qwen 3.5 27B | Q6_K | ~20-25 tok/s | Best dense model, IFEval 95.0 |
| 24 GB | Gemma 3 27B | QAT int4 | ~20-30 tok/s | High Chatbot Arena Elo, vision |
| 24 GB | Mistral Small 3.1 (24B) | Q4_K_M | ~30-50 tok/s | Fastest at this tier, multimodal |
| 48 GB+ | Llama 3.3 70B | Q4_K_M | ~8-12 tok/s | Best instruction following at 70B |
| 48 GB+ | Qwen 3.5 27B | FP16 | ~15-20 tok/s | Full precision, no quantization loss |

The generational leap: Qwen 3.5 models jumped again from Qwen3. The 9B now includes native vision (no separate VL model needed). The 27B is the only fully dense model in the family: all 27B parameters activate on every forward pass, giving it the highest per-token reasoning density at its size. The 35B-A3B routes only 3B parameters per token, which means 35B-level quality at inference speeds closer to a 3B model.
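If you want to sanity-check these pairings yourself, the back-of-envelope math is: parameters (in billions) × effective bits per weight ÷ 8 ≈ weight size in GB, plus headroom for the KV cache and runtime. A quick shell sketch; the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not measurements:

# params_B * bits_per_weight / 8 = weights in GB; x1.2 for KV cache + runtime
awk 'BEGIN { printf "Qwen 3.5 27B @ Q4_K_M: ~%.1f GB\n", 27 * 4.5 / 8 * 1.2 }'   # ~18 GB -> a 24GB card is comfortable
awk 'BEGIN { printf "Qwen 3.5 9B  @ Q4_K_M: ~%.1f GB\n",  9 * 4.5 / 8 * 1.2 }'   # ~6 GB  -> fits an 8GB card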

→ Check what fits your hardware with our Planning Tool.


What Makes a Good Chat Model?

Chat is different from coding or writing. A good conversational model needs to:

  • Follow instructions consistently: do what you ask without drifting
  • Handle multi-turn conversation: remember what you discussed earlier in the chat
  • Sound natural: not robotic, not overly formal, not stuffed with filler phrases
  • Know when it doesn’t know: hallucinate less, admit uncertainty
  • Be responsive: fast enough for interactive back-and-forth (15+ tok/s minimum)

Benchmarks that correlate with chat quality: Chatbot Arena Elo (blind human preference), IFEval (instruction following), and MT-Bench (multi-turn conversation). Raw MMLU scores don’t tell you much about how a model feels to talk to.


The 8GB tier

This is where most people start. An 8GB GPU runs 7-9B models at Q4 comfortably with room for context.

Qwen 3.5 9B: The new default

Qwen 3.5 9B (released March 2, 2026) replaced Qwen3-8B as the model to beat at this tier. It scores 81.7 on GPQA Diamond, ahead of the 71.5 posted by models 13x its size, and supports 256K context, 201 languages, and hybrid thinking/non-thinking modes.

The headline feature: native vision built in. Previous generations required a separate VL model for image understanding. Qwen 3.5 9B handles text and images in the same model: upload a screenshot, ask about a chart, describe a photo. No extra download, no second model eating your VRAM.

ollama pull qwen3.5:9b
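To try the vision side from the terminal, Ollama’s REST API accepts base64-encoded images alongside the message. A minimal sketch, assuming the default server on localhost:11434 and a chart.png in the current directory (use base64 -i chart.png on macOS):

IMG=$(base64 -w0 chart.png)
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "stream": false,
  "messages": [
    {"role": "user", "content": "What trend does this chart show?", "images": ["'"$IMG"'"]}
  ]
}'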

Llama 3.1 8B: The safe pick

If you want the model with the widest ecosystem support, most documentation, and the most third-party integrations, Llama 3.1 8B is still it. The quality is behind Qwen 3.5 9B on benchmarks, but the tooling maturity makes up for it if you’re building integrations that expect Llama-format outputs.

ollama pull llama3.1:8b

Gemma 3 12B: Multimodal on 8GB

Gemma 3 12B’s QAT int4 version fits in just 6.6GB of VRAM. That means it runs on an 8GB GPU with room for context. Multimodal (text + images), 128K context, and distillation from Google’s Gemini gives it surprisingly broad knowledge for its size.

ollama pull gemma3:12b

What to expect at 8GB

Be realistic: 9B models are good for casual chat, quick Q&A, brainstorming, and simple tasks. They struggle with complex multi-step reasoning, nuanced analysis, and maintaining coherence over very long conversations. If you find yourself thinking “this is useful but not quite smart enough,” the jump to 27B is where things get real.


The 12-16GB tier

This tier changed the most with Qwen 3.5. 12GB GPUs can now squeeze in Qwen 3.5 27B at aggressive quantization. 16GB GPUs run the 35B-A3B MoE model comfortably, and that model punches far above what you’d expect from 3B active parameters.

Qwen 3.5 35B-A3B: MoE on 16GB

This is the surprise of the Qwen 3.5 lineup. 35B total parameters, but only 3B activate per token thanks to Mixture-of-Experts routing. It fits on a 16GB GPU at Q4_K_M and generates tokens fast, closer to 3B speeds than 35B. The quality? It ties or beats Qwen3-235B-A22B (the previous flagship MoE) across most benchmarks.

For chat, the combination of speed and quality makes it the best option at this tier. You get fast, responsive conversation with 35B-level knowledge.

ollama pull qwen3.5:35b-a3b

Qwen 3.5 27B on 12GB: Tight but real

Qwen 3.5 27B at Q3_K_M fits in ~12GB. You lose some quality to the aggressive quantization, and context headroom is tight. But you’re running a 27B dense model on a $200 RTX 3060 12GB: every parameter activates on every forward pass, no MoE routing. If quality per token matters more than speed, this is the play.

ollama pull qwen3.5:27b
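Two practical notes for a fit this tight. Quant-specific tags usually follow Ollama’s standard naming, but the exact tag below is an assumption, so check the model page for what’s actually published. And once a chat is loaded, verify the whole model landed on the GPU:

ollama pull qwen3.5:27b-q3_K_M   # tag name is a guess -- verify on the model page
ollama ps                        # "100% GPU" means no CPU offload; any CPU share will be slower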

Qwen3-14B: Still solid

Qwen3-14B is now last-gen but still matches Qwen 2.5-32B on most benchmarks. ArenaHard score of 85.5, 128K context, hybrid thinking modes. At Q4 on 12GB it runs at ~22-30 tok/s, which is snappier than Qwen 3.5 27B at Q3. If speed matters more than peak quality, this is a reasonable pick.

ollama pull qwen3:14b

Phi-4 14B: Reasoning specialist

Phi-4 outperforms Llama 3.3 70B on GPQA and MATH benchmarks. But it has a 16K context limit and weaker instruction following (lower IFEval scores). Use it when you need a thinking partner for analytical problems, not as a general chat model.

ollama pull phi4:14b

The 24GB tier

This is where local chat gets genuinely excellent. A 24GB GPU runs 27-32B models that compete with cloud AI for most conversational tasks.

Qwen 3.5 27B: Best dense model

Qwen 3.5 27B is the only fully dense model in the 3.5 family. All 27B parameters activate on every forward pass; no MoE routing, no parameter sharing. The result: an IFEval score of 95.0 (the highest instruction following in its class), native vision, 256K context, and 201 languages.

At Q6_K on a 24GB card, it fits with room for long conversations. At Q4_K_M it’s more comfortable. Expect ~20-25 tok/s on an RTX 3090.

ollama pull qwen3.5:27b

Gemma 3 27B: Strong alternative

Gemma 3 27B still holds high Chatbot Arena Elo scores (1338-1339) and ranked 2nd on EQ-Bench for creative writing. The QAT int4 version fits in 14.1GB of VRAM, leaving massive headroom for context. Vision included. 128K context.

If you prefer Google’s style of output (broader knowledge distillation from Gemini, slightly more natural prose), it’s a legitimate alternative to Qwen 3.5 27B.

ollama pull gemma3:27b

Mistral Small 3.1 (24B): The speed pick

Mistral Small 3.1 is the fastest model at this tier: 30-50 tok/s quantized on an RTX 4090. MMLU 81%+, multimodal (vision), 128K context, Apache 2.0. If response speed matters more than peak quality, this is your pick.

ollama pull mistral-small3.1:24b

QwQ-32B: The deep thinker

QwQ-32B leads on coding, math, and logic problems with scores on par with DeepSeek R1. The thinking traces are more concise than DeepSeek’s (less verbose), but it’s still slower for casual chat because it reasons through everything. Best used as a thinking partner for complex problems rather than a general chatbot.

ollama pull qwq:32b

The 48GB+ tier

For dual-GPU setups, Mac Studio with 64GB+ unified memory, or dedicated AI machines.

Llama 3.3 70B: Best instruction following at 70B

IFEval score of 92.1%, still the highest among 70B-class open models. Clean, controllable prose. 128K context. The widest ecosystem support of any 70B model.

Requires ~40GB at Q4_K_M. Runs at ~8-12 tok/s on an RTX 4090 (with partial offloading). On a Mac Studio with 64GB+ unified memory and Ollama 0.19’s MLX support, performance improves significantly.

ollama pull llama3.3:70b
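Partial offloading is controlled by the num_gpu parameter, the number of layers placed on the GPU. The right value depends on your card and context size; the numbers below are illustrative starting points, not measured settings:

ollama run llama3.3:70b
>>> /set parameter num_gpu 48    # layers on the GPU; lower it if you hit out-of-memory
>>> /set parameter num_ctx 4096  # a modest context leaves more VRAM for layers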

Qwen 2.5 72B: Highest MT-Bench

MT-Bench score of 9.35, the highest among open models for multi-turn conversation. Stronger on math (MATH 83.1%), multilingual support (29 languages), and structured data handling. Apache 2.0 license.

ollama pull qwen2.5:72b

Chat-Specific Settings That Matter

Temperature

Temperature controls randomness. For chat, the sweet spot depends on what you want:

| Use Case | Temperature | Why |
|---|---|---|
| Factual Q&A | 0.2-0.3 | Consistent, accurate, minimal variation |
| Everyday conversation | 0.5-0.7 | Natural-sounding, some personality |
| Creative brainstorming | 0.8-1.0 | More varied, surprising ideas |

Set it in Ollama. The /set command is typed at the chat prompt inside a running session, not appended to the shell command:

ollama run qwen3.5:9b
>>> /set parameter temperature 0.7

Temperature 0 doesn’t make the model more accurate; it just makes it more consistent. If the model is wrong at temperature 0, it’ll be confidently wrong every time.
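The same knob is available per-request through Ollama’s REST API under options, which is handy when one model serves several use cases. A minimal sketch against the default server on localhost:11434 (the question is just a placeholder):

curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "stream": false,
  "options": {"temperature": 0.2},
  "messages": [{"role": "user", "content": "What is the boiling point of water at sea level?"}]
}'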

System Prompt

A good system prompt makes a measurable difference in chat quality:

You are a helpful, knowledgeable conversational partner. Keep responses
concise: 2-3 sentences for simple questions, longer only when the topic
needs depth. Use natural language with contractions. Never start with
"Great question!" or filler phrases. If you don't know something, say so
directly.

Set it in a Modelfile:

FROM qwen3:14b
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM """You are a helpful, knowledgeable conversational partner. Keep responses concise โ€” 2-3 sentences for simple questions, longer only when the topic needs depth. Use natural language with contractions. Never start with "Great question!" or filler phrases. If you don't know something, say so directly."""
ollama create my-chat -f Modelfile
ollama run my-chat
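If you’d rather not bake the prompt into a custom model, the same system message can be sent per-request through the chat API instead:

curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "stream": false,
  "messages": [
    {"role": "system", "content": "You are a concise, plain-spoken assistant. No filler phrases."},
    {"role": "user", "content": "Give me three angles for a post on local LLMs."}
  ]
}'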

Context Length for Long Chats

Chat conversations accumulate context fast. Each message (yours and the model’s) adds to the token count. When you hit the context limit, the oldest messages get dropped silently.
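You can watch this happen: Ollama’s non-streaming responses include prompt_eval_count (tokens in the context you sent) and eval_count (tokens generated), so you can check how close a session is to the limit. A quick probe, assuming jq is installed:

curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "stream": false,
  "messages": [{"role": "user", "content": "Summarize the plot of Hamlet in one paragraph."}]
}' | jq '{context_tokens: .prompt_eval_count, generated: .eval_count}'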

Set num_ctx based on your VRAM and how long you want conversations to last:

| num_ctx | Approx. Messages | VRAM Impact (14B model) |
|---|---|---|
| 2048 | ~10-15 back-and-forth | Minimal |
| 4096 | ~20-30 back-and-forth | ~0.5 GB |
| 8192 | ~40-60 back-and-forth | ~1 GB |
| 16384 | ~80-120 back-and-forth | ~2 GB |
| 32768 | Very long sessions | ~4 GB |

If you’re on tight VRAM, enable KV cache quantization (OLLAMA_KV_CACHE_TYPE=q8_0) to roughly double the context you can fit.
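In Ollama releases to date, KV cache quantization has required flash attention to be enabled on the server; assuming 0.19 keeps that requirement, the setup looks like this:

# Set these where the Ollama server runs, then restart it
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0   # halves KV cache memory vs the f16 default
ollama serve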


Community fine-tunes worth trying

The Qwen 3.5 family has spawned an active fine-tuning community. Two categories stand out for chat.

Claude-distilled variants

Jackrong’s Qwen3.5-9b-claude-4.6-opus-reasoning-distilled is one of the most downloaded community models on HuggingFace right now. The idea: take Qwen 3.5’s architecture and train it on Claude Opus 4.6 reasoning trajectories, the internal chain-of-thought Claude works through before answering. The v2 version uses 20% fewer tokens while hitting higher accuracy.

The 27B variant has stable tool-calling support. The 9B variant is faster but less reliable for function calling. Both are available as GGUF files on HuggingFace and through Ollama.

One caveat: these models sometimes have identity confusion. Ask “who are you?” and you might get a response about being Claude rather than a local assistant. It’s a training artifact, not a feature. Set a clear system prompt to override it.
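A Modelfile with an explicit identity line is usually enough to fix that. A sketch, assuming the model was pulled from HuggingFace via Ollama’s hf.co syntax (the exact repo path below is illustrative; check the actual listing):

FROM hf.co/Jackrong/Qwen3.5-9b-claude-4.6-opus-reasoning-distilled-GGUF
SYSTEM """You are a local assistant running on the user's own hardware.
Never claim to be Claude or any cloud-hosted service."""
ollama create local-distill -f Modelfile
ollama run local-distill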

Uncensored fine-tunes

DavidAU’s “heretic” and “ultra-uncensored” Qwen 3.5 fine-tunes remove safety restrictions from the base model. These are trained using Unsloth with Claude 4.6 distill datasets and come in 2B, 9B, and 40B (expanded from 27B) variants. The heretic versions even score slightly higher than base Qwen3.5-9B-Instruct on some benchmarks (ARC: 0.624 vs 0.571) while staying essentially on par on others (HellaSwag: 0.886 vs 0.895).

These exist for people who want unrestricted chat โ€” creative writing, roleplay, or just a model that won’t refuse topics. Available on HuggingFace from DavidAU. If you need an uncensored chat model, these are the current best option at the 9B tier.


Ollama 0.19: MLX on Apple Silicon

Ollama 0.19 shipped today (March 31, 2026) with MLX integration for Apple Silicon Macs. The performance improvement is significant:

| Metric | Ollama 0.18 | Ollama 0.19 (MLX) | Improvement |
|---|---|---|---|
| Prefill speed | 1,154 tok/s | 1,810 tok/s | 1.6x faster |
| Token generation | 58 tok/s | 112 tok/s | 1.9x faster |
| With int4 quant | n/a | 134 tok/s | n/a |

If you’re on a Mac with 32GB+ unified memory, update Ollama now. The same ollama run commands work; MLX acceleration happens automatically. M5, M5 Pro, and M5 Max chips get additional acceleration from the new GPU Neural Accelerators.

This makes Apple Silicon genuinely competitive with mid-range NVIDIA GPUs for local chat inference. A Mac Studio M4 Max running Qwen 3.5 35B-A3B through Ollama 0.19 with MLX is a legitimate daily-driver setup.


What’s coming: Qwen 3.6

Qwen 3.6 Plus Preview appeared on OpenRouter on March 30 with a 1,000,000 token context window, mandatory chain-of-thought reasoning, and tool use support. It’s API-only for now โ€” no open weights, no local inference. Currently free on OpenRouter.

No timeline for open weights. If the Qwen 3.5 release pattern holds (API first, open weights weeks later), we might see downloadable Qwen 3.6 models by late April or May. Worth watching, but nothing you can run locally yet.
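If you want to poke at the preview in the meantime, OpenRouter’s chat completions endpoint is OpenAI-compatible. The model slug below is a guess; check OpenRouter’s listing for the real one:

curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen/qwen3.6-plus-preview",
    "messages": [{"role": "user", "content": "How long a document can you actually hold in context?"}]
  }'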


Ollama quick start commands

Copy-paste these to start chatting immediately:

# 8GB VRAM - Best overall (with vision)
ollama run qwen3.5:9b

# 8GB VRAM - Widest ecosystem support
ollama run llama3.1:8b

# 12-16GB VRAM - MoE, fast and smart
ollama run qwen3.5:35b-a3b

# 16GB VRAM - Dense 27B quality
ollama run qwen3.5:27b

# 24GB VRAM - Best dense model at Q6
ollama run qwen3.5:27b

# 24GB VRAM - Fastest at this tier
ollama run mistral-small3.1:24b

# 48GB+ - Best instruction following
ollama run llama3.3:70b

Don’t have Ollama yet? Our getting started guide walks you through installation in under 15 minutes.


The bottom line

| VRAM | Get this | Why |
|---|---|---|
| 8 GB | Qwen 3.5 9B | Built-in vision, 256K context, best benchmarks at this size. |
| 12 GB | Qwen 3.5 27B at Q3 | Dense 27B on a budget card. Tight on VRAM, big on quality. |
| 16 GB | Qwen 3.5 35B-A3B | MoE: 35B quality at 3B inference speed. The sweet spot. |
| 24 GB | Qwen 3.5 27B at Q6 | Best dense model. IFEval 95.0, native vision, 256K context. |
| 48 GB+ | Llama 3.3 70B | Best instruction following, cleanest output. |
| CPU only | Qwen 3.5 35B-A3B | MoE model: 35B quality with only 3B active. 12-15 tok/s on CPU. |
| Apple Silicon | Qwen 3.5 35B-A3B + Ollama 0.19 | MLX acceleration nearly doubles speed. Update Ollama now. |

The quality gap between a local chat model and ChatGPT has never been smaller. Qwen 3.5 27B on a 24GB GPU handles everyday conversation, Q&A, brainstorming, and analysis at a level that would have required a $20/month subscription a year ago. On Apple Silicon, Ollama 0.19’s MLX support makes the experience even smoother. Download one, start chatting, and see for yourself.