Talking to your local LLM instead of typing is one of those things that sounds like a gimmick until you try it. Once you can just speak a question and hear the answer back, it changes how you interact with local AI entirely.

The pipeline is simpler than you’d think: Whisper listens to you, your LLM thinks, and a TTS engine reads the response aloud. Three pieces, all running locally, no cloud required.

Here’s how to set it up.


How the Voice Pipeline Works

Every local voice assistant follows the same three-step chain:

  1. STT (Speech-to-Text): Microphone audio β†’ Whisper β†’ text transcript
  2. LLM (Language Model): Text prompt β†’ Ollama/llama.cpp β†’ text response
  3. TTS (Text-to-Speech): Text response β†’ TTS engine β†’ speaker audio

The total latency is the sum of all three. On a decent GPU, you’re looking at:

| Stage | Typical Latency | What Drives It |
|---|---|---|
| STT (Whisper turbo) | 100-300 ms | Audio length, model size, GPU speed |
| LLM (time to first token) | 200-500 ms | Model size, context length, GPU speed |
| TTS (first audio chunk) | 100-300 ms | TTS engine, voice quality, GPU/CPU |
| Total round-trip | 500-1100 ms | Everything above combined |

That’s roughly 0.5-1.1 seconds from when you stop talking to when you start hearing the answer. Not quite real-time conversation, but fast enough to feel natural.


Step 1: Speech-to-Text with Whisper

OpenAI’s Whisper is the de facto standard for local speech recognition. It’s open-source, runs on consumer hardware, and is genuinely good β€” accurate across accents, handles background noise, supports 99 languages.

Whisper Model Sizes

| Model | Parameters | VRAM | Relative Speed | Accuracy |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | 32x | Rough β€” fine for commands |
| base | 74M | ~1 GB | 16x | Decent for clear speech |
| small | 244M | ~2 GB | 6x | Good general use |
| medium | 769M | ~5 GB | 2x | Great accuracy |
| large-v3 | 1.55B | ~10 GB | 1x | Best accuracy |
| turbo | 809M | ~6 GB | 6-8x | Near large-v3 quality |

The one to use: Whisper turbo. It’s 6-8x faster than large-v3 with minimal accuracy loss. Unless you’re transcribing heavily accented speech in a noisy room, turbo is the sweet spot.

Which Whisper Implementation?

You have three main options:

faster-whisper (recommended for GPU users): Uses CTranslate2 under the hood. 2-4x faster than OpenAI’s original code, uses 40-60% less VRAM. This is what most voice pipeline tools use internally.

pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.wav", beam_size=5)

for segment in segments:
    print(segment.text)

whisper.cpp (recommended for CPU-only): C/C++ port that runs well without a GPU. Great for laptops and low-power setups. Download the GGML model files and run from the command line:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make
./main -m models/ggml-base.en.bin -f audio.wav

Original OpenAI Whisper (the baseline): Works but is the slowest option. Only use this if you need a specific feature the others don’t support.

pip install openai-whisper
whisper audio.wav --model turbo

Real-Time Streaming

For voice chat, you don’t want to record a whole sentence and then transcribe it. You want streaming β€” transcribing as you speak. faster-whisper supports this with voice activity detection (VAD):

from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda")
# Use Silero VAD to detect speech segments automatically
segments, _ = model.transcribe("audio.wav", vad_filter=True)

For a microphone-to-text pipeline, pair faster-whisper with PyAudio or sounddevice to capture audio chunks, and feed them to the model as they come in.
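A minimal sketch of that loop with sounddevice, using fixed four-second chunks rather than true streaming (the chunk length and model name are placeholders to adjust for your setup):

import sounddevice as sd
from faster_whisper import WhisperModel

model = WhisperModel("turbo", device="cuda", compute_type="float16")
SAMPLE_RATE = 16000  # faster-whisper expects 16 kHz mono float32 audio

while True:
    # Grab one chunk from the default microphone
    chunk = sd.rec(4 * SAMPLE_RATE, samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    # transcribe() accepts numpy arrays directly; VAD skips the silent parts
    segments, _ = model.transcribe(chunk.flatten(), vad_filter=True)
    for segment in segments:
        print(segment.text)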


Step 2: Text-to-Speech Options

TTS has changed more than any other part of this pipeline since February. Two months ago, voice cloning meant compromises. Now there are three models that beat ElevenLabs in blind tests, and two of them clone voices from 3 seconds of audio.

TTS Comparison (March 2026)

| Engine | Size | Runs On | Quality | Latency | Voice Cloning | License |
|---|---|---|---|---|---|---|
| Voxtral TTS | 4B params | GPU (16 GB) or MLX (2.5 GB) | Best in blind tests β€” 62.8% preferred over ElevenLabs Flash v2.5 | 70ms TTFA (GPU), ~90ms (MLX) | Yes (3s reference) | CC BY-NC 4.0 |
| Qwen3-TTS | 1.7B params | GPU (~4 GB) | Excellent β€” 0.789 speaker similarity vs ElevenLabs’ 0.646 | 97ms end-to-end | Yes (3s reference) | Apache 2.0 |
| Kokoro | 82M params | CPU or GPU | Excellent β€” #1 on TTS Arena | Sub-300 ms | No (preset voices) | Apache 2.0 |
| Chatterbox | ~300M params | GPU (2-4 GB) | Excellent β€” beats ElevenLabs in blind tests | Sub-200 ms | Yes (10s reference clip) | Apache 2.0 |
| Piper | 15-20M params | CPU only | Good β€” clear and natural | Sub-100 ms | No (trained voices) | MIT |
| Coqui XTTS | ~1.5B params | GPU (~8 GB) | Very good, 17 languages | 300-500 ms | Yes (6s reference clip) | MPL 2.0 |
| edge-tts | Cloud | Internet required | Very good (Microsoft voices) | 100-300 ms | No | N/A |

Voxtral TTS (Best Quality)

Mistral released Voxtral TTS on March 26, 2026. In blind listening tests, 62.8% of human listeners preferred it over ElevenLabs Flash v2.5. For voice cloning specifically, that number hit 69.9%.

It’s actually three models stacked together β€” a 3.4B transformer decoder, a 390M flow-matching acoustic transformer, and a 300M audio codec β€” totaling about 4.1B parameters. Voice cloning works from 3 seconds of reference audio across 9 languages, with cross-lingual transfer (clone a French speaker’s voice, have it speak English with the accent intact).

The catch is hardware: 16GB VRAM on GPU, or about 2.5GB with the MLX 4-bit version on Apple Silicon. On an M-series Mac, the MLX path hits RTF 0.97 on short clips and 0.74 on longer ones β€” comfortably faster than real-time. The license is CC BY-NC 4.0, so personal and research use only.

For voice chat pipelines, Voxtral’s 70ms time-to-first-audio on GPU (or ~90ms on MLX) makes it the fastest high-quality option. vLLM serves it with an OpenAI-compatible /v1/audio/speech endpoint, so you can swap it into any existing pipeline.
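If you go the vLLM route, a request looks like any other OpenAI-style speech call. A rough sketch, assuming a server on localhost:8000 and the standard OpenAI request shape; the model ID and voice parameter here are placeholders, so check the Voxtral serving docs for the real names:

import requests

# POST to the OpenAI-compatible speech endpoint that vLLM exposes
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "voxtral-tts",                       # placeholder model ID
        "input": "Your local voice pipeline is ready.",
        "voice": "default",                           # or a cloned-voice reference, per the docs
    },
)
resp.raise_for_status()

with open("reply.wav", "wb") as f:
    f.write(resp.content)  # the response body is the raw audio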

Qwen3-TTS (Best for Voice Cloning)

Alibaba’s Qwen3-TTS (released January 2026, 1.7B params) scores 0.789 on speaker similarity benchmarks β€” higher than ElevenLabs’ 0.646. It clones from 3 seconds of reference audio, runs on about 4GB VRAM, and is Apache 2.0 licensed (commercial use allowed).

The community has built a full ecosystem around it: ComfyUI nodes for integration with image generation workflows, multiple WebUI frontends with audiobook generation, and a fine-tuning tool with a Gradio interface and Docker support. If you want to train the model on your own voice data, Qwen3-TTS is the only major open TTS model with active fine-tuning tooling.

One quirk: English preset voices have a subtle accent from training data bias toward dubbed animation content. Using voice cloning with a native English reference sample fixes this. The Crane inference engine provides a pure Rust path with an OpenAI-compatible API if you want zero Python dependencies.

Kokoro (Best Lightweight / No Cloning)

Kokoro is an 82M-parameter model that hit #1 on the TTS Arena leaderboard, beating commercial services. It’s fast, sounds natural, and runs on CPU or GPU.

pip install "kokoro>=0.9" soundfile
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code="a")  # "a" = American English
generator = pipeline("Hello! I'm your local AI assistant.", voice="af_heart")

for i, (gs, ps, audio) in enumerate(generator):
    # each chunk is 24 kHz audio; save it (or play it with sounddevice)
    sf.write(f"output_{i}.wav", audio, 24000)

Kokoro has multiple preset voices. No voice cloning, but the built-in voices sound genuinely good.

Piper (Best for CPU-Only)

If you don’t have a GPU β€” or your GPU is fully occupied by the LLM β€” Piper is the answer. It’s a neural TTS engine that runs entirely on CPU with near-instant output:

echo "Hello from your local assistant" | \
  piper --model en_US-lessac-medium --output_file response.wav

Piper has dozens of pre-trained voices across many languages. Quality is a step below Kokoro or Chatterbox, but speed on CPU is unbeatable.

Download voices from the Piper voices repository.

Chatterbox (Good Cloning, Low VRAM)

ResembleAI’s Chatterbox clones voices from a 10-second audio clip and beat ElevenLabs in blind listener tests when it launched. It needs more reference audio than Voxtral or Qwen3-TTS (10 seconds vs 3), but runs on just 2-4 GB VRAM:

pip install chatterbox-tts
import torchaudio
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Here's your local AI assistant, speaking in a cloned voice."
wav = model.generate(text, audio_prompt_path="reference_voice.wav")
torchaudio.save("output.wav", wav, model.sr)

The voice cloning is genuinely impressive for a local model.

Skip These

Bark: Can express emotions, laugh, and sing, but it’s painfully slow (2-5 seconds per sentence) and needs ~12 GB VRAM. With Voxtral and Chatterbox both producing better quality at lower latency, Bark is hard to justify for voice chat.

edge-tts: Good quality and free, but it sends audio through Microsoft’s servers. Not local. Defeats the purpose if you’re building a private voice pipeline.

Coqui XTTS v2: Still works for 17-language support (including CJK), but Coqui AI shut down in 2024 and the community fork won’t see a v3. If you need languages that Voxtral and Qwen3-TTS don’t cover, it’s still the only real option.


The Easy Way: Open WebUI Voice Chat

If you just want to talk to your local LLM without building anything, Open WebUI has built-in voice chat. It works with any Ollama model.

Setup

  1. Install Ollama and pull a chat model:
ollama pull qwen3:8b
  2. Install Open WebUI:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
  3. Open http://localhost:3000, create an account, select your model.

  4. Click the microphone icon in the chat input to use voice. Open WebUI uses your browser’s built-in speech recognition (Web Speech API) for STT and can use various TTS backends.

Configuring Better TTS

Open WebUI’s default TTS is basic. For better quality, go to Settings β†’ Audio and configure:

  • STT Engine: Set to “whisper (local)” if you want fully local transcription
  • TTS Engine: Point to a local TTS server. Open WebUI supports custom TTS backends through its API integration β€” point it at a vLLM instance running Voxtral or a Qwen3-TTS server, and you get voice chat with voice cloning in a browser.

This gives you a ChatGPT-like voice interface running entirely on your machine.


The DIY Way: Command-Line Voice Pipeline

For maximum control, you can wire up the pipeline yourself. Here’s a minimal working example using faster-whisper + Ollama + Kokoro:

import requests
import sounddevice as sd
from faster_whisper import WhisperModel
from kokoro import KPipeline

# Initialize models
whisper = WhisperModel("turbo", device="cuda", compute_type="float16")
tts = KPipeline(lang_code="a")

def record_audio(duration=5, sample_rate=16000):
    """Record from microphone."""
    print("Listening...")
    audio = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()

def transcribe(audio):
    """Speech to text with faster-whisper."""
    segments, _ = whisper.transcribe(audio, beam_size=5)
    return " ".join(s.text for s in segments).strip()

def query_llm(prompt):
    """Send text to Ollama and get response."""
    response = requests.post("http://localhost:11434/api/generate",
        json={"model": "qwen3:8b", "prompt": prompt, "stream": False})
    return response.json()["response"]

def speak(text):
    """Text to speech with Kokoro."""
    for _, _, audio in tts(text, voice="af_heart"):
        sd.play(audio, samplerate=24000)
        sd.wait()

# Main loop
while True:
    audio = record_audio()
    text = transcribe(audio)
    if not text:
        continue
    print(f"You: {text}")
    response = query_llm(text)
    print(f"AI: {response}")
    speak(response)

This is intentionally simple. For a real setup, you’d add:

  • Voice activity detection (VAD) instead of fixed-duration recording
  • Streaming LLM output to TTS (speak while generating)
  • Conversation history (multi-turn context)
  • A wake word or push-to-talk

Dedicated Voice Chat Apps

If you want something more polished than a script but more customizable than Open WebUI:

Moshi (Full-Duplex)

Moshi from Kyutai is the first open-source full-duplex voice model β€” it can listen and talk simultaneously, like a phone call. No STTβ†’LLMβ†’TTS chain; it’s a single speech-to-speech model.

  • 7B backbone with Mimi audio codec
  • 160-200 ms latency
  • Runs on a single GPU (needs ~16 GB VRAM)
  • Genuinely feels like talking to someone

The catch: it’s a fixed model. You can’t swap in your favorite LLM. The voice quality and intelligence are whatever Moshi provides. But for natural conversation flow, nothing else comes close locally.

pip install moshi
python -m moshi.server

RealtimeVoiceChat

An open-source project that wires up the full STT β†’ LLM β†’ TTS pipeline with a web interface and push-to-talk. Supports Ollama, faster-whisper, and multiple TTS engines.

Pipecat

A framework for building voice assistants. More complex to set up, but supports interruption handling, different conversation flows, and multiple backend options. Good if you’re building something production-grade.


Hardware Requirements

The big question: can your GPU handle the LLM and the voice pipeline simultaneously?

VRAM Budget

| Component | VRAM Needed |
|---|---|
| Whisper turbo (faster-whisper, FP16) | ~6 GB |
| Whisper turbo (faster-whisper, INT8) | ~3 GB |
| Whisper small (faster-whisper, INT8) | ~1 GB |
| Voxtral TTS (GPU, BF16) | ~16 GB |
| Voxtral TTS (MLX 4-bit, Apple Silicon) | ~2.5 GB unified RAM |
| Qwen3-TTS 1.7B | ~4 GB |
| Kokoro TTS | ~0.5 GB (or CPU) |
| Piper TTS | 0 GB (CPU only) |
| Chatterbox TTS | 2-4 GB |
| Your LLM | Depends on model |

The math for a 12 GB GPU (RTX 3060):

  • Whisper small INT8: 1 GB
  • Qwen3-TTS: 4 GB (or Piper on CPU for 0 GB)
  • Remaining for LLM: ~7 GB β†’ Qwen3-8B Q4_K_M fits
  • Or skip Qwen3-TTS, use Piper on CPU, and get ~10 GB for the LLM

The math for a 24 GB GPU (RTX 3090):

  • Whisper turbo INT8: 3 GB
  • Qwen3-TTS: 4 GB (with voice cloning)
  • Remaining for LLM: ~16 GB β†’ Qwen3-14B Q6_K or Qwen3-32B Q3_K_M
  • Or run Kokoro/Piper on CPU and get ~19 GB for the LLM

Apple Silicon (16 GB unified memory):

  • faster-whisper turbo INT8: ~1 GB
  • Voxtral MLX 4-bit: ~2.5 GB
  • Remaining for LLM: ~9 GB β†’ Qwen 3.5 9B Q4 fits with room to spare
  • Total pipeline latency: under 800ms

CPU-only setup (no GPU):

  • whisper.cpp with base model: runs fine on any modern CPU
  • Piper TTS: CPU-native, sub-100ms latency
  • LLM: Qwen3-4B Q4_K_M via llama.cpp (~4 GB RAM)
  • Total: workable but slower. Expect 2-4 second round-trips.

The Sharing Problem

Whisper and your LLM can both sit in VRAM, but they compete for compute if they run at the same time. In practice that doesn’t matter, because the pipeline is sequential anyway: Whisper finishes transcribing before the LLM starts generating.

TTS is where it gets tricky. If your TTS runs on GPU, it competes with the LLM for VRAM. The simplest fix: run TTS on CPU (Piper or Kokoro both handle this well) and keep the GPU for Whisper + LLM.
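One way to wire that split is to keep faster-whisper on the GPU and call the Piper CLI through a subprocess, since Piper is CPU-native anyway. A small sketch using the same voice model as the earlier Piper example:

import subprocess
from faster_whisper import WhisperModel

# Whisper gets the GPU; Piper synthesizes on the CPU, so the LLM keeps its VRAM
whisper = WhisperModel("turbo", device="cuda", compute_type="int8")

def speak_cpu(text, out_path="reply.wav"):
    """Synthesize speech with the Piper CLI, entirely on CPU."""
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium", "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )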

β†’ Use our Planning Tool to check exact VRAM for your setup.


Latency Optimization Tips

Want to get below 1 second total? Here’s what actually helps:

Use Whisper turbo, not large-v3. The accuracy difference is minimal, the speed difference is 6-8x.

Use INT8 quantization for Whisper. Cuts VRAM roughly in half with negligible accuracy loss:

model = WhisperModel("turbo", device="cuda", compute_type="int8")

Enable VAD (Voice Activity Detection). Faster-whisper’s built-in Silero VAD trims silence before transcription, so Whisper only processes actual speech:

segments, _ = model.transcribe(audio, vad_filter=True,
                                vad_parameters=dict(min_silence_duration_ms=500))

Stream the LLM output to TTS. Don’t wait for the full response. Start speaking the first sentence while the LLM generates the rest. This is the single biggest latency improvement β€” it turns a 3-second wait into perceived sub-second response.

Pick a TTS model that fits your VRAM budget. If you have 16+ GB VRAM or Apple Silicon, Voxtral gives the best quality at 70-90ms to first audio. If you need voice cloning on tighter VRAM, Qwen3-TTS fits in 4 GB. If you’re running the LLM on GPU and can’t spare VRAM for TTS, Kokoro and Piper are fast enough on CPU.

Use a smaller LLM. A Qwen3-4B model generates its first token in ~100 ms on a decent GPU. A 32B model takes 400+ ms. For voice chat, speed matters more than model intelligence β€” pick the smallest model that gives acceptable answers.
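If you’re unsure which size to pick, it only takes a few lines to measure time to first token against your own Ollama server (swap in whatever model you have pulled):

import json
import time
import requests

# Rough time-to-first-token check using Ollama's streaming API
start = time.time()
with requests.post("http://localhost:11434/api/generate",
                   json={"model": "qwen3:8b", "prompt": "Say hello.", "stream": True},
                   stream=True) as resp:
    for line in resp.iter_lines():
        if line:
            token = json.loads(line).get("response", "")
            print(f"First token after {time.time() - start:.2f}s: {token!r}")
            break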


Recommended Setups

Budget (8 GB VRAM or CPU-only)

  • STT: whisper.cpp with base model (CPU) or faster-whisper small INT8 (GPU)
  • LLM: Qwen3-4B Q4_K_M via Ollama
  • TTS: Piper (CPU)
  • Interface: Open WebUI or custom script
  • Expected latency: 1.5-3 seconds

Mid-Range (12 GB VRAM)

  • STT: faster-whisper turbo INT8 (GPU, ~3 GB)
  • LLM: Qwen3-8B Q4_K_M via Ollama (~5 GB)
  • TTS: Qwen3-TTS on GPU (~4 GB) for voice cloning, or Kokoro on CPU if you don’t need cloning
  • Interface: Open WebUI with local Whisper
  • Expected latency: 0.8-1.5 seconds

High-End (24 GB VRAM)

  • STT: faster-whisper turbo INT8 (GPU, ~3 GB)
  • LLM: Qwen3-14B Q6_K via Ollama (~13 GB)
  • TTS: Qwen3-TTS on GPU (~4 GB) with voice cloning, or Kokoro on CPU
  • Interface: Custom pipeline with streaming
  • Expected latency: 0.5-1.0 seconds

Apple Silicon (16 GB+ unified memory)

  • STT: faster-whisper turbo INT8 (~1 GB)
  • LLM: Qwen 3.5 9B Q4 via Ollama (~6 GB)
  • TTS: Voxtral MLX 4-bit (~2.5 GB) β€” 70-90ms to first audio, voice cloning from 3 seconds
  • Interface: Open WebUI with vLLM Voxtral backend
  • Expected latency: under 800ms

This Apple Silicon setup is the best local voice pipeline available right now. Under a second of latency, voice cloning, and the whole stack fits in 16 GB with room to spare.


Common Problems

“Whisper keeps transcribing silence.” Enable VAD filtering. Without it, Whisper tries to transcribe background noise and outputs garbage. Use vad_filter=True in faster-whisper.

“The TTS voice sounds robotic.” Switch from Piper to Kokoro, Voxtral, or Qwen3-TTS. If you’re on Piper, try a higher-quality voice model β€” the “medium” and “high” quality voices sound significantly better than “low.”

“Qwen3-TTS has a weird accent in English.” Known issue β€” the English preset voices have a subtle accent from training data. Use voice cloning with a native English speaker’s reference audio instead of the built-in presets.

“There’s a long pause before the response starts.” You’re probably waiting for the full LLM response before sending it to TTS. Stream the output and start speaking the first sentence immediately.

“My GPU runs out of memory.” Run TTS on CPU (Piper or Kokoro). Use Whisper small or base instead of turbo. Use a smaller LLM quantization (Q3_K_M instead of Q4_K_M).

“CUDA out of memory when switching between Whisper and LLM.” Some frameworks don’t release VRAM properly. Use del model; torch.cuda.empty_cache() between stages, or keep both models loaded if VRAM allows.


The Full Local Voice Pipeline in March 2026

The best local voice stack right now chains three pieces, all offline:

| Stage | Model | Time | VRAM |
|---|---|---|---|
| Speech-to-text | faster-whisper (turbo, INT8) | ~200ms | ~1-3 GB |
| Language model | Qwen 3.5 9B Q4 | ~500ms to first token | ~6 GB |
| Text-to-speech | Voxtral MLX 4-bit (Mac) or Qwen3-TTS (GPU) | ~70-97ms to first audio | ~2.5-4 GB |
| Total |  | ~770-800ms | ~9.5-13 GB |

Under a second. On a MacBook Pro with 16 GB, the Whisper + LLM + Voxtral stack fits with room to spare. On a desktop with an RTX 3090, you can run Whisper + a 14B model + Qwen3-TTS and still have VRAM left over.

Two months ago, when this article was first published, Kokoro and Chatterbox were the top options and voice cloning required trade-offs. Voxtral and Qwen3-TTS changed the math. Voxtral gives the highest quality at 70ms on Apple Silicon. Qwen3-TTS fits voice cloning into 4 GB VRAM with an Apache 2.0 license, and people are already building fine-tuning WebUIs and ComfyUI nodes for it.

Bottom Line

  1. Easiest start: Install Ollama + Open WebUI for instant voice chat with browser-based microphone input
  2. Best quality (Mac): faster-whisper + Ollama + Voxtral MLX 4-bit β€” under 800ms round-trip
  3. Best voice cloning (any GPU): faster-whisper + Ollama + Qwen3-TTS β€” 4 GB VRAM, Apache 2.0
  4. Lowest VRAM: faster-whisper + Ollama + Kokoro or Piper on CPU β€” saves all GPU memory for the LLM
  5. Most natural conversation: Moshi (full-duplex, 16 GB VRAM) β€” no pipeline, single speech-to-speech model

The gap between “local voice assistant” and “cloud voice assistant” closed this month. A personal voice pipeline that never phones home and sounds like a paid service runs on a MacBook with 16 GB or an RTX 3060. That wasn’t true eight weeks ago.