Function Calling with Local LLMs: Tools, Agents, and Structured Output
Cloud APIs have had function calling for years. You give GPT-4 a list of tools, it decides which one to call, you execute it, feed the result back. It’s how every AI agent works under the hood.
Local models can do this now. Ollama added tool support, llama.cpp has native function calling handlers, and models like Qwen 2.5 match GPT-4’s accuracy at picking the right tool. The gap isn’t whether it works; it’s knowing which models to use, which failure modes to watch for, and how to structure the agentic loop so it doesn’t spiral.
This guide covers the practical side: working code, model recommendations, and the patterns that hold up when you move past demos.
What Function Calling Actually Is
Function calling is not the model running code. The model never executes anything. It writes a structured request asking you to run a function on its behalf.
The flow works like this:
- You send the model a message plus a list of available tools (name, description, parameters)
- The model decides whether to call a tool or respond with text
- If it wants a tool, it outputs a JSON object:
{"name": "get_weather", "arguments": {"city": "Tokyo"}} - Your code executes the real function and gets a result
- You send the result back to the model as a “tool” message
- The model uses the result to form a natural-language response
The model’s only job is to produce the right JSON. Your code handles everything else. This separation is what makes it work, and what makes it fail, because the model can produce JSON that’s syntactically valid but semantically wrong.
Function calling is different from structured output. Structured output means “give me valid JSON in a specific schema.” Function calling means “choose a tool from this list and provide the right arguments.” Function calling uses structured output under the hood, but adds the decision layer of which tool and when.
Which Models Support It
Not every local model can do function calling. The model needs to be trained on tool-use data with the right special tokens and chat template. These are the ones worth using:
| Model | Size | VRAM (Q4) | Tool Accuracy | Best For |
|---|---|---|---|---|
| Qwen 2.5 7B | 7B | ~5-6 GB | 0.933 F1 | Default pick. Best accuracy per GB. |
| Qwen 2.5 14B | 14B | ~9-10 GB | 0.971 F1 | Near-GPT-4 accuracy. Multi-step chains. |
| Llama 3.1 8B | 8B | ~5-6 GB | 89% overall | Native tool calling. Good all-rounder. |
| Llama 3.3 70B | 70B | ~42 GB | 94%+ | Best accuracy if you have the VRAM. |
| Mistral 7B v0.3 | 7B | ~5 GB | Good | Fastest inference. 457 tok/s. |
| Mistral Nemo 12B | 12B | ~7-8 GB | Good | 128K context. Solid mid-range. |
| Mistral Small 24B | 24B | ~15 GB | Strong | Best agentic capabilities at this size. |
| Hermes 2 Pro 8B | 8B | ~5-6 GB | Good | XML-based tool format. vLLM/SGLang support. |
| Mixtral 8x7B | 56B (13B active) | ~24 GB | 88% overall | Expert routing. Good multilingual. |
| Functionary v3 | 8B | ~5-6 GB | Strong | Purpose-built for function calling. |
The default recommendation: Start with Qwen 2.5 7B. In Docker’s practical evaluation, Qwen 3 14B hit 0.971 F1 for tool selection, just 0.003 behind GPT-4’s 0.974. The 7B variant at 0.933 is still excellent and runs on an 8GB GPU.
Llama 3.1 8B is the runner-up. Meta baked tool calling into the training, including three built-in tools (web search, Wolfram Alpha, code interpreter) that activate when you put Environment: ipython in the system prompt. It uses special tokens like <|python_tag|> and an ipython role for tool results.
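Activating those built-ins is a prompt convention, not an API parameter. A minimal sketch with Ollama; the tool names follow Meta’s Llama 3.1 prompt-format guide, and whether a given runtime passes the system prompt through verbatim is an assumption worth verifying:

import ollama

# Sketch: "Environment: ipython" enables code-interpreter-style output, and the Tools:
# line opts into the built-in search tools (names per Meta's prompt-format docs).
messages = [
    {"role": "system", "content": "Environment: ipython\nTools: brave_search, wolfram_alpha"},
    {"role": "user", "content": "Search for the latest llama.cpp release notes."},
]
response = ollama.chat(model="llama3.1:8b", messages=messages)
print(response.message.content)  # may contain a built-in tool call rather than prose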
Hermes models use a different format: XML tags (<tool_call>, <tool_response>) instead of the OpenAI-style JSON. This works well with vLLM and SGLang’s automatic tool parsers but can confuse tools that expect the OpenAI format. Qwen 2.5 actually borrowed the Hermes format, so they’re compatible.
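For reference, a Hermes-style tool call wraps the same JSON in those XML tags, and your tool result goes back wrapped in <tool_response> tags. An illustrative example (exact whitespace varies by template):

<tool_call>
{"name": "get_weather", "arguments": {"location": "Tokyo"}}
</tool_call>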
How It Works in Ollama
Ollama’s tools API follows the OpenAI format. If you’ve used OpenAI’s function calling, the interface is almost identical.
curl Example
curl http://localhost:11434/api/chat -s -d '{
"model": "qwen2.5:7b",
"messages": [
{"role": "user", "content": "What is the weather in Tokyo?"}
],
"stream": false,
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. Tokyo"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
]
}'
The model responds with tool_calls instead of content:
{
"message": {
"role": "assistant",
"content": "",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": {
"location": "Tokyo",
"unit": "celsius"
}
}
}
]
}
}
Your code executes get_weather("Tokyo", "celsius"), then sends the result back as a tool role message.
Python with Auto-Schema
The Ollama Python SDK has a nice trick: pass Python functions directly and it builds the tool schema from the function signature and docstring. No manual JSON schema needed.
import ollama
import json
def get_weather(location: str, unit: str = "celsius") -> str:
"""Get current weather for a city.
Args:
location: City name, e.g. Tokyo
unit: Temperature unit, celsius or fahrenheit
"""
# Your real API call goes here
return json.dumps({"temperature": 22, "unit": unit, "condition": "clear"})
response = ollama.chat(
model="qwen2.5:7b",
messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
tools=[get_weather], # Pass the function directly
)
if response.message.tool_calls:
for call in response.message.tool_calls:
print(f"Model wants to call: {call.function.name}")
print(f"With arguments: {call.function.arguments}")
The SDK inspects your type hints and Google-style docstring to create the schema automatically. This is the fastest way to prototype.
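From here, one more round trip completes the exchange. A minimal continuation of the example above, using the same tool-message shape (role "tool" plus tool_name) as the agentic loop later in this guide:

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
messages.append(response.message)  # keep the assistant's tool call in the history

for call in response.message.tool_calls or []:
    result = get_weather(**call.function.arguments)  # execute the real function
    messages.append({"role": "tool", "content": result, "tool_name": call.function.name})

# Second request: the model now sees the tool result and answers in plain language
final = ollama.chat(model="qwen2.5:7b", messages=messages)
print(final.message.content)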
Tool Calling vs JSON Mode
These are different features that people mix up:
| Feature | What It Does | When to Use |
|---|---|---|
| tools parameter | Model decides whether to call a tool and outputs structured tool calls | When the model needs to take actions (API calls, DB queries, calculations) |
| format: "json" | Forces the model to output valid JSON (no schema) | When you need raw JSON output, not tool decisions |
| format: {schema} | Forces output to match a specific JSON schema | When you need structured data extraction. See the structured output guide. |
Tool calling includes the decision: should I call a tool, and if so, which one? JSON mode just forces the output format. They solve different problems.
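In Ollama terms, the difference is just which parameter you pass. A quick side-by-side sketch (the schema contents are illustrative; get_weather is the function defined above):

import ollama

# Structured output: the reply itself must be JSON matching this schema. No tool decision.
extraction = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "It's 22C and clear in Tokyo right now."}],
    format={
        "type": "object",
        "properties": {"city": {"type": "string"}, "temperature": {"type": "number"}},
        "required": ["city", "temperature"],
    },
)

# Tool calling: the model may answer directly or ask you to run get_weather.
decision = ollama.chat(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=[get_weather],
)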
How It Works in llama.cpp
If you’re running llama.cpp directly instead of Ollama, function calling works through the server’s OpenAI-compatible API.
Starting the Server
llama-server --jinja -fa \
-hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
The --jinja flag enables chat template processing (required for tool calling). The -fa flag enables flash attention.
Tool Calling Request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5",
"messages": [
{"role": "user", "content": "What is 42 * 17?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "calculator",
"description": "Evaluate a math expression",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "Math expression, e.g. 42 * 17"
}
},
"required": ["expression"]
}
}
}
]
}'
llama.cpp has native handlers optimized for specific models: Llama 3.x, Qwen 2.5, Hermes, Mistral Nemo, Functionary, and Command R7B. Other models fall back to a generic handler that still works but may be less reliable.
GBNF Grammars for Custom Constraints
llama.cpp’s grammar system constrains output at the token level. The model cannot generate tokens that violate the grammar. This is the same mechanism Ollama uses under the hood when you pass a JSON schema.
For tool calling, you rarely need to write grammars manually โ the server handles it. But if you want a custom output format that isn’t standard JSON, GBNF gives you token-level control:
llama-cli -m model.gguf \
--grammar-file grammars/json.gbnf \
-p 'Output the result as JSON:'
When to Use llama.cpp Over Ollama
Ollama wraps llama.cpp and adds model management, easier setup, and a friendlier API. For most tool-calling use cases, Ollama is simpler.
Use llama.cpp directly when you need:
- Custom GBNF grammars beyond JSON schemas
- More control over sampling parameters during tool calls
- To avoid Ollama’s overhead in high-throughput scenarios
- A specific model that Ollama doesn’t support yet
Building an Agentic Loop
A single tool call is easy. The hard part is the loop: the model calls a tool, gets a result, decides whether to call another tool, and eventually produces a final answer. Here’s a pattern that handles the failure modes.
import ollama
import json
# --- Define your tools ---
def search_products(query: str, max_results: int = 5) -> str:
"""Search the product database.
Args:
query: Search term
max_results: Maximum results to return
"""
    # Simulated result; replace with a real DB call
return json.dumps({
"results": [
{"name": "Widget Pro", "price": 29.99, "in_stock": True},
{"name": "Widget Basic", "price": 9.99, "in_stock": False},
]
})
def calculate(expression: str) -> str:
"""Evaluate a math expression safely.
Args:
expression: Math expression like '29.99 * 1.08'
"""
allowed = set("0123456789+-*/.() ")
if not all(c in allowed for c in expression):
return json.dumps({"error": "Invalid characters"})
try:
return json.dumps({"result": round(eval(expression), 2)})
except Exception as e:
return json.dumps({"error": str(e)})
# --- Tool registry ---
tools = [search_products, calculate]
tool_map = {
"search_products": search_products,
"calculate": calculate,
}
# --- Agentic loop ---
def run_agent(user_message: str, model: str = "qwen2.5:7b", max_steps: int = 10):
messages = [{"role": "user", "content": user_message}]
for step in range(max_steps):
response = ollama.chat(model=model, messages=messages, tools=tools)
        # No tool calls: the model is done
if not response.message.tool_calls:
return response.message.content
messages.append(response.message)
for call in response.message.tool_calls:
func_name = call.function.name
func_args = call.function.arguments
# Guard: hallucinated function name
if func_name not in tool_map:
messages.append({
"role": "tool",
"content": json.dumps({"error": f"Unknown function: {func_name}"}),
"tool_name": func_name,
})
continue
# Guard: wrong argument types
try:
result = tool_map[func_name](**func_args)
except TypeError as e:
result = json.dumps({"error": f"Bad arguments: {e}"})
except Exception as e:
result = json.dumps({"error": f"Failed: {e}"})
messages.append({
"role": "tool",
"content": result,
"tool_name": func_name,
})
return "Agent hit max steps without producing a final answer."
# --- Use it ---
answer = run_agent(
"Find the Widget Pro and tell me the price with 8% sales tax."
)
print(answer)
The key details that make this work in practice:
- Max iterations. Without a cap, a confused model loops forever. 10 is a reasonable default.
- Unknown function guard. Models hallucinate function names, especially smaller ones. Check the name against your registry before executing.
- TypeError catch. Models pass wrong argument types or miss required fields. Return a clear error so the model can retry.
- The loop exits when tool_calls is empty. That’s how the model signals “I have enough information to answer.”
Local vs Cloud Function Calling
The accuracy gap has mostly closed. The latency gap hasn’t.
Accuracy
Docker ran a practical evaluation pitting local models against cloud APIs on real tool-selection tasks:
| Model | F1 Score (Tool Selection) | Type |
|---|---|---|
| GPT-4 | 0.974 | Cloud |
| Qwen 3 14B | 0.971 | Local |
| Claude 3 Haiku | 0.933 | Cloud |
| Qwen 3 8B | 0.933 | Local |
Qwen 14B is within 0.003 of GPT-4. For single tool calls with clear intent, local models match cloud.
Latency
| | Cloud API | Local (RTX 3090) |
|---|---|---|
| Single tool call | 3-5 sec | 5-15 sec |
| Multi-step (3 tools) | 10-15 sec | 30-60 sec |
| Tokens/sec | 80-90 (GPT-4o) | 50-112 (7B-8B Q4) |
Local is slower, mostly because of lower tokens/sec on consumer hardware. The gap shrinks with faster GPUs and smaller models.
Where Local Wins
- Privacy. Your function arguments never leave your machine. If you’re querying internal databases or processing customer data, that matters.
- Cost. Zero per-call cost after hardware. GPT-4 function calling at scale gets expensive.
- No rate limits. No throttling during peak hours. No surprise API changes.
- Offline. Works on air-gapped networks, planes, bad WiFi.
Where Cloud Still Wins
- Multi-step reasoning. GPT-4 and Claude handle 5+ step tool chains reliably. Local 7B models start losing coherence after 2-3 steps.
- Knowing when NOT to call a tool. Local models, especially smaller ones, tend to call tools eagerly, even for questions they can answer directly. Docker’s evaluation flagged this as the biggest weakness.
- Parallel tool calls. Some cloud APIs support calling multiple tools in one turn. Fewer local models handle this.
- Recovery from errors. Cloud models self-correct better when a tool call fails. Local models often repeat the same broken call or enter a degenerate loop.
For single-tool use cases (weather lookup, database query, calculator), local is ready. For complex agentic workflows with branching decisions, larger models (14B+) or cloud APIs are still more reliable.
Common Failures and How to Fix Them
These are the issues you’ll hit in practice, not in demos.
Hallucinated Function Names
The model invents a function that doesn’t exist in your tool list. This happens more with smaller models and when you have many tools defined.
Fix: Always validate tool_call.function.name against your tool registry before executing. Return an error message so the model can try again.
Eager Tool Invocation
The model calls a tool when it shouldn’t, like calling search_web in response to “Hello, how are you?” Small models are worst at this.
Fix: Add explicit instructions in your system prompt: “Only call tools when you need external data you don’t already have.” Validate whether the user’s question actually needs a tool before executing. Ollama doesn’t support tool_choice yet, so you can’t disable or force tool use at the API level.
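A minimal sketch of the prompt-level guard, reusing the tools from the agentic loop above (the exact wording is just an example):

SYSTEM_PROMPT = (
    "You are a helpful assistant. Only call tools when you need external data "
    "you don't already have. For greetings and general knowledge, answer directly."
)

response = ollama.chat(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Hello, how are you?"},
    ],
    tools=[search_products, calculate],
)
# With the guard in place, a well-behaved model replies in content instead of
# emitting a tool call for small talk; you still validate tool_calls either way.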
The Bad-State Loop
After a failed tool call, the model repeats the user’s input, produces empty responses, or keeps calling the same broken function. This happens across Llama 3, Hermes, and Qwen models.
Fix: Set max iterations on your agentic loop. If the model produces an empty response or repeats itself, break the loop and return a fallback. Don’t let it spin.
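One way to implement that break is a small helper called at the top of each loop iteration. A sketch against the same Ollama response shape used above (the heuristics are assumptions, not a standard API):

def is_stuck(response, last_calls):
    """Detect the two common degenerate states: empty output and repeated tool calls."""
    empty = not response.message.tool_calls and not (response.message.content or "").strip()
    calls = [(c.function.name, dict(c.function.arguments))
             for c in (response.message.tool_calls or [])]
    repeated = bool(calls) and calls == last_calls
    return (empty or repeated), calls

Carry last_calls across iterations in run_agent and return a fallback message as soon as the helper reports True.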
Context Pressure with Many Tools
Each tool definition costs 50-150 tokens depending on how detailed the description and parameters are. Ten tools can consume 1,000+ tokens of your context window before the conversation starts.
Fix: Keep tool counts under 5-10 for 7B models. Use concise descriptions. For large tool sets, consider injecting only relevant tools based on the user’s message rather than loading everything every turn.
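A sketch of that last idea: pick tools per turn with naive keyword routing before calling the model. The trigger lists are a stand-in for whatever routing fits your domain (embeddings, categories, a cheap classifier):

TOOL_TRIGGERS = {
    "search_products": ["find", "search", "product", "widget", "stock"],
    "calculate": ["calculate", "tax", "total", "price", "%", "*"],
}

def select_tools(user_message: str, registry: dict) -> list:
    """Send only the tools whose trigger words appear in the message."""
    text = user_message.lower()
    picked = [registry[name] for name, words in TOOL_TRIGGERS.items()
              if any(word in text for word in words)]
    return picked or list(registry.values())  # nothing matched: fall back to everything

tools_for_turn = select_tools("Find the Widget Pro and add 8% sales tax.", tool_map)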
Wrong Parameters
The model passes the right function name but wrong argument types, missing required fields, or values that don’t match the enum.
Fix: Validate arguments against the schema before executing. Return specific error messages (“Missing required parameter: location”) rather than generic errors. The model uses error messages to correct its next attempt.
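A minimal sketch of pre-execution validation. It uses the third-party jsonschema package (a dependency chosen for this example, not something Ollama requires) against the same parameter schema you already send in the tools list:

import json

from jsonschema import ValidationError, validate  # pip install jsonschema

GET_WEATHER_PARAMS = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

def validated_call(func, args: dict, schema: dict) -> str:
    """Check the model's arguments against the tool's schema before running it."""
    try:
        validate(instance=args, schema=schema)
    except ValidationError as e:
        # Specific, actionable error the model can use to correct its next attempt
        return json.dumps({"error": f"Invalid arguments: {e.message}"})
    return func(**args)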
KV Cache Quantization
This one is subtle. If you’re running llama.cpp with aggressive KV cache quantization (-ctk q4_0), tool calling accuracy degrades. The precision loss in the attention cache affects the model’s ability to track tool schemas.
Fix: Use Q8 or higher for the KV cache (for example -ctk q8_0 instead of -ctk q4_0) when doing tool calling. The VRAM savings from a Q4 KV cache aren’t worth the reliability hit.
Grammar Guarantees Structure, Not Semantics
Ollama’s format parameter and llama.cpp’s grammar enforcement guarantee valid JSON. But the model can still fill valid JSON with wrong data: {"temperature": 999} is valid JSON with a nonsensical value.
Fix: Validate the returned values in your code, not just the structure. Treat tool call arguments the same way you’d treat user input: sanitize and bounds-check.
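A short sketch of that bounds check, using the weather example from this guide (the plausible range is an arbitrary assumption):

import json

def check_weather_result(raw: str) -> dict:
    """Treat tool output and model-filled values like user input: parse, then bounds-check."""
    data = json.loads(raw)
    temp = data.get("temperature")
    if not isinstance(temp, (int, float)) or not -90 <= temp <= 60:
        raise ValueError(f"Implausible temperature: {temp!r}")
    return data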
The Bottom Line
Function calling with local LLMs works. Not “sort of works with caveats”; it actually works, with accuracy matching cloud APIs for single tool calls.
The setup in three steps:
# 1. Install Ollama and pull a model with tool support
ollama pull qwen2.5:7b
# 2. Use the tools API (curl, Python, or any OpenAI-compatible client)
# 3. Build the agentic loop with guards (see code above)
For simple tool use (one function, clear intent): Qwen 2.5 7B on 8GB VRAM. It just works.
For multi-step agents (chaining tools, branching logic): Qwen 2.5 14B on 12GB, or Mistral Small 24B on 24GB. Smaller models lose coherence after 2-3 steps.
For maximum reliability: Llama 3.3 70B or Qwen 2.5 72B if you have the VRAM. These approach cloud API quality across the board.
The pattern stays the same at every scale: define tools, run the model, validate the output, execute the function, feed results back. The only thing that changes is how many guardrails you need.
Related Guides
- Structured Output from Local LLMs
- llama.cpp vs Ollama vs vLLM
- Qwen Models Guide
- Mistral & Mixtral Guide
- Ollama vs LM Studio
Sources: Ollama Tool Calling Docs, llama.cpp Function Calling, Docker Blog: Local LLM Tool Calling Evaluation, Llama 3.1 Prompt Format, NousResearch Hermes Function Calling, BFCL Leaderboard