Building a Local AI Assistant: Your Private Jarvis
📚 Guides referenced: Run Your First LLM · Open WebUI Setup · Voice Chat with Local LLMs · Local RAG · Function Calling
Cloud assistants know what you ask, when you ask it, and what files you feed them. A local assistant doesn’t. Everything runs on your hardware, your data stays on your machine, and there’s no monthly bill.
This guide walks you through building one, piece by piece. Each level adds a capability. Stop wherever you’re satisfied — Level 1 alone gives you a working assistant in 15 minutes.
What you’re building
A local AI assistant chains several components:
- Wake word detection — listens for “hey Jarvis” (or whatever you pick)
- Speech-to-text (STT) — converts your voice to text (Whisper)
- LLM — generates the response (Ollama)
- Text-to-speech (TTS) — reads the response aloud (Kokoro or Piper)
- RAG — searches your documents for context before answering
- Tools — web search, calculations, file operations, API calls
- Home control — turns lights on, checks sensors, runs automations
You don’t need all seven; most people only want the first four. The guide is structured so each level builds on the previous one.
Platform comparison
Several tools bundle parts of this stack. Here’s what each handles:
| Capability | Open WebUI | AnythingLLM | Jan | LM Studio | DIY Python |
|---|---|---|---|---|---|
| Chat interface | Yes (web) | Yes (desktop/web) | Yes (desktop) | Yes (desktop) | You build it |
| Voice input/output | Yes (built-in) | No | No | No | Yes (manual) |
| Document RAG | Yes (built-in) | Yes (best RAG) | No | No | Yes (manual) |
| Function calling | Yes (BYOF editor) | Yes (agents) | Yes (MCP) | Yes (agent API) | Yes (manual) |
| Home automation | No | No | No | No | Yes (manual) |
| Multi-model switching | Yes | Yes | Yes | Yes | Yes |
| Setup difficulty | 1 Docker command | Desktop installer | Desktop installer | Desktop installer | Write code |
The recommendation: Open WebUI for levels 1-4. It handles chat, voice, RAG, and basic tools out of the box. Switch to AnythingLLM if you need better document management (workspaces, permissions, more chunking options). Go DIY Python only for the voice pipeline or home automation — those need custom code regardless.
Hardware requirements
Your assistant runs multiple models at once: the LLM, the Whisper model for speech recognition, the TTS model, and (if using RAG) an embedding model.
VRAM budget breakdown
| Component | VRAM | Notes |
|---|---|---|
| 7B LLM (Q4) | ~4-5 GB | Qwen 2.5 7B, Llama 3.1 8B |
| 14B LLM (Q4) | ~8-10 GB | Qwen 2.5 14B — needs 12GB+ |
| Whisper large-v3 | ~1.5 GB | Loaded only during speech input |
| Whisper turbo | ~0.8 GB | Faster, slightly less accurate |
| Kokoro TTS | ~0.3 GB | 82M parameters, runs on CPU too |
| Piper TTS | ~0.1 GB | Lighter, lower quality |
| Embedding model | ~0.3-0.5 GB | nomic-embed-text or similar |
What each GPU tier can run
| GPU | LLM | Voice | RAG | Simultaneous? |
|---|---|---|---|---|
| 8 GB (RTX 3060 Ti, 4060) | 7B Q4 | Whisper turbo + Kokoro | Yes | Tight but works. Close other apps. |
| 12 GB (RTX 3060, 3080 12GB) | 7B Q4 comfortably | Whisper large-v3 + Kokoro | Yes | Room to spare. Sweet spot. |
| 16 GB (RTX 4070 Ti Super) | 14B Q4 | Full Whisper + any TTS | Yes | Everything fits with room left over. |
| 24 GB (RTX 3090, 4090) | 14B Q6 or 32B Q4 | Everything | Yes | Run it all without thinking about VRAM. |
| CPU only (32GB+ RAM) | 7B Q4 (~3-5 tok/s) | Whisper on CPU (~4-8x realtime) | Yes | Works, just slow. Usable for text chat. |
Ollama handles model loading and unloading automatically. If Whisper and the LLM don’t fit in VRAM simultaneously, Ollama swaps them — adds a second or two of latency but doesn’t crash.
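Once Ollama is installed (Level 1 below), you can watch and control this yourself: the Python client’s keep_alive option sets how long a model stays resident after a request. A minimal sketch, assuming pip install ollama and the default server on localhost:11434:
import ollama

# Ask a question, then evict the model immediately so Whisper/TTS get the VRAM back.
# keep_alive=0 unloads right after the request; the default keeps the model loaded for ~5 minutes.
response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "One-sentence summary of what RAG is."}],
    keep_alive=0,
)
print(response["message"]["content"])

# See which models are currently loaded (same information as `ollama ps` on the CLI)
print(ollama.ps())
Passing keep_alive=-1 does the opposite and keeps the model loaded indefinitely.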
Level 1: Text chat (15 minutes)
If you already have Ollama installed, this takes 2 minutes.
Install Ollama and pull a model
# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model
ollama pull qwen2.5:7b
Qwen 2.5 7B is a good all-rounder for assistant tasks — strong at following instructions, good at reasoning, and small enough for 8GB VRAM. Llama 3.1 8B is the other solid choice.
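Before wiring up a UI, you can sanity-check the model with ollama run qwen2.5:7b on the command line, or from Python if you prefer to script it. A minimal sketch, assuming pip install ollama:
import ollama

# Stream a reply from the model pulled above; tokens print as they're generated
stream = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Name three things a local AI assistant can do."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()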
Start Open WebUI
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000, create an account (local only, no cloud), select your model, and start chatting. You now have a private ChatGPT-like interface.
For detailed setup (custom configs, GPU passthrough, connecting to remote Ollama), see the Open WebUI setup guide.
Level 2: Voice input and output (30 minutes)
Two paths: use Open WebUI’s built-in voice, or build a standalone pipeline.
Option A: Open WebUI voice (easier)
Open WebUI has voice chat built in. In Settings > Audio:
- STT engine — set to “Web API” (uses your browser’s speech recognition) or configure a local Whisper endpoint
- TTS engine — set to “Web API” (uses browser TTS) or point to a local TTS server
For fully local voice, configure Open WebUI to use a local Whisper server:
# Option 1: install the faster-whisper library and wrap it in your own API server
pip install faster-whisper
# Option 2: use a prebuilt faster-whisper server Docker image
docker run -d -p 8765:8765 \
fedirz/faster-whisper-server:latest-cuda
Then set the STT URL in Open WebUI’s audio settings.
The browser-based option works immediately but sends audio to your browser’s speech API (Google/Apple). For true privacy, run Whisper locally.
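Before pointing Open WebUI at a local endpoint, it’s worth confirming that faster-whisper runs on your hardware at all. A minimal sketch, assuming pip install faster-whisper and an audio file of your own (the filename is a placeholder):
from faster_whisper import WhisperModel

# Whisper turbo needs ~0.8 GB VRAM; use device="cpu", compute_type="int8" if you have no GPU
model = WhisperModel("turbo", device="cuda", compute_type="float16")

# Transcribe a local recording (replace with any WAV/MP3 you have lying around)
segments, info = model.transcribe("test_recording.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")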
Option B: Standalone voice setup
For a dedicated voice assistant (not browser-based), you need three pieces running independently. Our voice chat guide covers this in detail. The short version:
pip install faster-whisper # Speech-to-text
pip install kokoro # Text-to-speech (~82M params)
# Ollama is already running from Level 1
Latency breakdown (RTX 3060 12GB, 7B model):
| Stage | Time |
|---|---|
| Whisper turbo (STT) | ~0.3-0.5 sec |
| LLM first token | ~0.2-0.4 sec |
| LLM full response | ~1-3 sec (depends on length) |
| Kokoro TTS (first audio) | ~0.1-0.3 sec |
| Total to first spoken word | ~0.8-1.5 sec |
That’s comparable to Alexa’s response time, though the LLM’s answer takes longer to finish speaking than a canned Alexa response.
Level 3: Ask questions about your files (30 minutes)
RAG (Retrieval Augmented Generation) lets your assistant search your documents before answering. Feed it PDFs, notes, code, emails — it finds the relevant sections and includes them in the LLM’s context.
Our RAG guide covers three setup methods in depth. Here’s the fastest.
Open WebUI RAG (built-in)
Pull an embedding model:
ollama pull nomic-embed-text
In Open WebUI, go to Workspace > Knowledge, create a collection, and upload your files. Open WebUI handles chunking, embedding, and retrieval. When you chat, toggle the knowledge base on for that conversation.
Supported formats: PDF, TXT, DOCX, CSV, Markdown, and code files.
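Under the hood, this is all RAG does: embed your chunks, embed the question, find the closest chunks, and paste them into the prompt. A minimal sketch of that loop, assuming pip install ollama numpy and using an in-memory list where Open WebUI would use a vector database:
import numpy as np
import ollama

# Toy knowledge base: in Open WebUI these chunks come from your uploaded files
chunks = [
    "The Q3 budget review is scheduled for October 14th.",
    "The authentication module uses JWT tokens with a 24-hour expiry.",
    "Section 4.2 of the contract covers early termination fees.",
]

def embed(text):
    # nomic-embed-text is the embedding model pulled above
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

chunk_vectors = [embed(c) for c in chunks]

question = "When is the budget meeting?"
q = embed(question)

# Cosine similarity picks the chunk closest in meaning to the question
scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))) for v in chunk_vectors]
context = chunks[int(np.argmax(scores))]

# "Augmented generation": the retrieved chunk rides along in the prompt
response = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
)
print(response["message"]["content"])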
AnythingLLM (better for large document sets)
If you have hundreds or thousands of documents, AnythingLLM handles it better. It gives you workspaces (separate document collections), more chunking options, and a no-code agent builder.
docker pull mintplexlabs/anythingllm
docker run -d -p 3001:3001 \
-v anythingllm:/app/server/storage \
mintplexlabs/anythingllm
Point it at your Ollama instance, upload documents, and chat. AnythingLLM uses the same embedding models as Open WebUI but gives you more control over how documents are split and searched.
What to expect
RAG works well for:
- Answering questions about specific documents (“What does section 4.2 of this contract say?”)
- Searching across large collections (“Which meeting notes mentioned the Q3 budget?”)
- Code Q&A (“How does the authentication module work in this repo?”)
RAG struggles with:
- Summarizing entire books (context window too small for full content)
- Questions that need reasoning across many documents simultaneously
- Highly structured data (use a database, not RAG)
Level 4: Give it tools (1 hour)
A chat assistant can only answer from its training data and your documents. Tools let it take actions — search the web, run calculations, check APIs, read live data.
Our function calling guide covers the protocol in detail. Two ways to add tools:
Open WebUI tools (no code)
Open WebUI has a built-in Python function editor (BYOF — Bring Your Own Function). Go to Workspace > Tools, click “Create Tool,” and write a Python function. Open WebUI handles the function-calling protocol with the LLM automatically.
Example — a web search tool:
import requests
class Tools:
def search_web(self, query: str) -> str:
"""Search the web for current information.
:param query: The search query
:return: Search results as text
"""
# Use SearXNG or another local search engine
resp = requests.get(
"http://localhost:8888/search",
params={"q": query, "format": "json"}
)
results = resp.json()["results"][:3]
return "\n".join(
f"- {r['title']}: {r['content']}" for r in results
)
The LLM sees the function signature and docstring, decides when to call it, and Open WebUI executes the code and feeds results back.
Ollama function calling (API-level)
If you’re building a custom assistant in Python, Ollama supports function calling directly:
import ollama
response = ollama.chat(
model="qwen2.5:7b",
messages=[{"role": "user", "content": "What's the weather in Denver?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}
}]
)
# Model returns tool_calls — you execute them and feed results back
Qwen 2.5 7B is the best function-calling model at this size — 0.933 F1 for tool selection, close to GPT-4 level. See the function calling guide for the full agentic loop pattern with error handling.
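To close the loop, you execute whatever the model asked for and send the output back as a role "tool" message, then let the model answer in plain language. A minimal sketch, assuming the tools list from the example above is stored in a variable named tools and using a hypothetical get_weather stub:
def get_weather(location: str) -> str:
    # Hypothetical stub: swap in a real weather API call
    return f"Sunny and 72°F in {location}"

available = {"get_weather": get_weather}
messages = [{"role": "user", "content": "What's the weather in Denver?"}]

# tools: the same schema list passed to ollama.chat in the example above
response = ollama.chat(model="qwen2.5:7b", messages=messages, tools=tools)

# Keep going as long as the model wants to call tools
while response["message"].get("tool_calls"):
    messages.append(response["message"])
    for call in response["message"]["tool_calls"]:
        fn = available[call["function"]["name"]]
        result = fn(**call["function"]["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    response = ollama.chat(model="qwen2.5:7b", messages=messages, tools=tools)

print(response["message"]["content"])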
Practical tools worth adding
| Tool | What it does | Complexity |
|---|---|---|
| Web search | Answers questions about current events | Medium (needs SearXNG or similar) |
| Calculator | Math without hallucination | Easy |
| File reader | Read local files on demand | Easy |
| Calendar/reminders | Check schedule, set alerts | Medium |
| Shell commands | Run system commands (careful with this one) | Easy but risky |
| Home Assistant API | Control smart home devices | Medium (see Level 5) |
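The calculator from that table is the easiest one to add. A minimal sketch in the same Open WebUI Tools format as the search example, using Python’s ast module so the model gets exact arithmetic without being handed an unrestricted eval:
import ast
import operator

class Tools:
    def calculate(self, expression: str) -> str:
        """Evaluate an arithmetic expression exactly.
        :param expression: A math expression like "23.5 * 17 + 3"
        :return: The computed value as a string
        """
        ops = {
            ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv,
            ast.Pow: operator.pow, ast.USub: operator.neg,
        }

        def ev(node):
            # Walk the parsed expression, allowing only numbers and basic operators
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in ops:
                return ops[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in ops:
                return ops[type(node.op)](ev(node.operand))
            raise ValueError("unsupported expression")

        try:
            return str(ev(ast.parse(expression, mode="eval").body))
        except Exception as e:
            return f"Error: {e}"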
Level 5: Control your home (2+ hours)
This is where it gets Jarvis-like. Connect your assistant to Home Assistant and it can control lights, check temperatures, lock doors, and trigger automations.
Home Assistant + Ollama
Home Assistant has had a native Ollama integration since 2024. Setup:
- Settings > Devices & Services > Add Integration > Ollama
- Enter your Ollama server URL (e.g., http://192.168.1.100:11434)
- Select a model
- Under “Assist,” set it as your conversation agent
Now you can type or speak commands in Home Assistant and the local LLM handles them.
The home-llm approach (specialized model)
General-purpose LLMs aren’t great at home control out of the box. They don’t know your device names, and they hallucinate entity IDs. The home-llm project solves this with a 3B model fine-tuned specifically for smart home commands:
ollama pull fixt/home-3b-v3
This model understands Home Assistant’s entity format and generates valid service calls. It’s much more reliable for “turn off the kitchen lights” than a general 7B model, though it can’t hold a general conversation.
The practical setup: use your general-purpose model (Qwen 2.5 7B) for conversation and RAG, and route home-control commands to home-3b-v3. Home Assistant’s conversation pipeline supports this kind of routing.
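If you’d rather keep the routing in your own code (or expose home control as a Level 4 tool), Home Assistant’s REST API is straightforward to call directly. A minimal sketch with requests; the host, token, and entity IDs below are placeholders for your own setup:
import requests

HA_URL = "http://192.168.1.100:8123"        # your Home Assistant host (placeholder)
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # created under your HA user profile
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def call_service(domain: str, service: str, entity_id: str) -> str:
    """Call a Home Assistant service, e.g. light.turn_off on light.kitchen."""
    url = f"{HA_URL}/api/services/{domain}/{service}"
    requests.post(url, headers=HEADERS, json={"entity_id": entity_id}, timeout=10).raise_for_status()
    return f"Called {domain}.{service} on {entity_id}"

def get_state(entity_id: str) -> str:
    """Read an entity's current state, e.g. sensor.living_room_temperature."""
    data = requests.get(f"{HA_URL}/api/states/{entity_id}", headers=HEADERS, timeout=10).json()
    return f"{entity_id} is {data['state']} {data['attributes'].get('unit_of_measurement', '')}".strip()

print(call_service("light", "turn_off", "light.kitchen"))
print(get_state("sensor.living_room_temperature"))
Wrap these two functions in an Open WebUI Tools class and the LLM can call them the same way it calls search_web above.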
Wyoming protocol for voice satellites
Want to talk to your assistant from every room? The Wyoming protocol lets you place voice satellites (a Raspberry Pi with a microphone and speaker) around your house, all connected to your central Home Assistant + Ollama server.
Each satellite runs:
- openWakeWord — listens for your wake word
- Whisper (via Wyoming) — converts speech to text
- Piper TTS (via Wyoming) — speaks the response
The LLM processing happens on your main GPU machine. The satellites just handle audio I/O.
Hardware per satellite: Raspberry Pi 4/5 ($35-60), a USB microphone ($10), and a speaker. Total per room: under $80.
Honest assessment
Home automation is the most fragile part of this stack. Some realities:
- Simple commands work reliably: “Turn off the bedroom lights,” “What’s the temperature downstairs,” “Lock the front door.”
- Complex commands are hit-or-miss: “Turn on the lights in every room except the nursery” may or may not work depending on how your entities are named.
- The LLM doesn’t know your entity names unless you expose them explicitly. Expose too many and you eat context window. Expose too few and it can’t help.
- Latency for voice commands through a Wyoming satellite is 3-5 seconds end-to-end. Not instant like a cloud assistant.
- The home-llm 3B model is more reliable for device control but can’t handle follow-up conversation.
It works. It’s private. It’s also clearly a generation behind commercial voice assistants for smart home control. If you go in knowing that, you’ll be fine.
The full DIY voice pipeline
If you want a standalone voice assistant (not browser-based, not Home Assistant), the script below chains everything together.
Requirements
pip install ollama faster-whisper kokoro sounddevice numpy openwakeword
Minimal voice loop
import ollama
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel
from kokoro import KPipeline
# ── Init models ──────────────────────────────────────
whisper = WhisperModel("turbo", device="cuda", compute_type="float16")
tts = KPipeline(lang_code="a") # American English
SYSTEM_PROMPT = "You are a helpful assistant. Keep responses concise — under 3 sentences when possible."
conversation = [{"role": "system", "content": SYSTEM_PROMPT}]
def record_audio(duration=5, sample_rate=16000):
"""Record audio from microphone."""
print("Listening...")
audio = sd.rec(
int(duration * sample_rate),
samplerate=sample_rate,
channels=1,
dtype="float32"
)
sd.wait()
return audio.flatten()
def transcribe(audio):
"""Convert speech to text with faster-whisper."""
segments, _ = whisper.transcribe(audio, language="en")
return " ".join(s.text for s in segments).strip()
def speak(text):
"""Convert text to speech with Kokoro."""
    for _, _, audio in tts(text, voice="af_heart"):  # "af_heart" is a stock American English voice
sd.play(audio, samplerate=24000)
sd.wait()
def chat(user_message):
"""Send message to Ollama, get response."""
conversation.append({"role": "user", "content": user_message})
response = ollama.chat(
model="qwen2.5:7b",
messages=conversation
)
reply = response["message"]["content"]
conversation.append({"role": "assistant", "content": reply})
return reply
# ── Main loop ────────────────────────────────────────
print("Voice assistant ready. Press Ctrl+C to stop.")
while True:
try:
audio = record_audio(duration=5)
text = transcribe(audio)
if not text or len(text) < 2:
continue
print(f"You: {text}")
reply = chat(text)
print(f"Assistant: {reply}")
speak(reply)
except KeyboardInterrupt:
print("\nStopped.")
break
This is a minimal version — no wake word, no silence detection, fixed recording duration. For wake word support, add openwakeword and record in a continuous loop, triggering transcription only after detecting the wake phrase. The ollama-STT-TTS project on GitHub has a more complete implementation with silence detection via webrtcvad.
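For reference, here’s roughly what the wake-word half looks like with openWakeWord, which ships a pre-trained “hey jarvis” model. A hedged sketch, assuming a recent openwakeword release where the pre-trained models are fetched via openwakeword.utils.download_models():
import numpy as np
import sounddevice as sd
import openwakeword
from openwakeword.model import Model

openwakeword.utils.download_models()   # one-time fetch of the bundled pre-trained models
oww = Model()                          # loads all pre-trained wake words, including "hey jarvis"

SAMPLE_RATE = 16000
FRAME = 1280  # 80 ms chunks at 16 kHz, the frame size openWakeWord expects

def wait_for_wake_word(threshold=0.5):
    """Block until any loaded wake-word model scores above the threshold."""
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                        blocksize=FRAME) as stream:
        while True:
            frame, _ = stream.read(FRAME)
            scores = oww.predict(frame.flatten())
            if any(score > threshold for score in scores.values()):
                return

# In the main loop above, call wait_for_wake_word() before record_audio()
# so transcription only runs after the wake phrase is heard.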
Latency targets
| Setup | First spoken word | Full response |
|---|---|---|
| 8 GB GPU, 7B model | ~1.5-2.5 sec | ~3-6 sec |
| 12 GB GPU, 7B model | ~0.8-1.5 sec | ~2-4 sec |
| 24 GB GPU, 14B model | ~0.8-1.2 sec | ~2-4 sec |
| CPU only, 7B model | ~5-10 sec | ~15-30 sec |
The bottleneck is the LLM, not the speech processing. Whisper turbo transcribes in under 0.5 seconds on any modern GPU. Kokoro generates audio faster than real-time on CPU alone.
What local assistants still can’t do
Time for honesty. A local assistant in 2026 is genuinely useful, but it’s not Alexa or Siri in several ways.
Where local wins:
- Privacy. Nothing leaves your machine. No recordings stored on corporate servers.
- No subscription. No monthly fee, no API costs, no rate limits.
- Customization. You choose the model, the system prompt, the tools, the voice.
- Offline. Works without internet (after initial model download).
- Document search. Feed it your files — something Alexa will never do.
Where local still falls short:
- Response time. 1-2 seconds vs Alexa’s ~0.5 seconds for simple queries. CPU-only is much slower.
- Always-on listening. Cloud assistants run wake-word detection on dedicated low-power chips. A local setup needs a Raspberry Pi satellite or your PC running constantly.
- Ecosystem. No Spotify integration, no Amazon shopping, no thousands of “skills.” You build each integration yourself.
- Multi-room audio. Wyoming satellites work but setup is manual and finicky compared to dropping an Echo in each room.
- Accuracy for home control. Cloud assistants have been trained on millions of smart-home interactions. Local models are still catching up.
The gap is closing. A year ago, local voice latency was 5-10 seconds. Now it’s under 2. Models like Qwen 2.5 handle tool calling almost as well as GPT-4. But if your primary use case is “set a timer” and “play music,” a $30 Echo is still better at that specific job.
The local assistant wins when your use case involves private documents, custom tools, or anything you don’t want a corporation listening to.
The bottom line
Start with Level 1. Ollama + Open WebUI takes 15 minutes and gives you a private ChatGPT. Most people are surprised how capable this is on its own.
Add voice when you want hands-free interaction. Add RAG when you have documents to search. Add tools when you need live data. Add home automation if you’re already running Home Assistant.
You don’t need to build everything at once. Each level is useful on its own, and each one links to a detailed guide if you want to go deeper.
The full stack — voice, documents, tools, home control — runs on a single 12GB GPU. That’s a $200 used RTX 3060. Your private Jarvis costs less than a year of ChatGPT Plus.
Related guides
- Run Your First Local LLM
- Open WebUI Setup Guide
- Voice Chat with Local LLMs: Whisper + TTS
- Local RAG: Search Your Documents
- Function Calling with Local LLMs
- What Can You Run on 8GB VRAM?
- What Can You Run on 12GB VRAM?
Sources: Open WebUI Docs, Home Assistant Ollama Integration, home-llm, Wyoming Protocol, Kokoro TTS, faster-whisper, ollama-STT-TTS, AnythingLLM