Razer AIKit Guide: Multi-GPU Local AI on Your Desktop
📚 More on this topic: GPU Buying Guide · VRAM Requirements · llama.cpp vs Ollama vs vLLM · Fine-Tuning on Consumer Hardware
Running one model on one GPU is a solved problem. Ollama handles that in a single command. But the moment you want to split a 70B model across two GPUs, fine-tune it on your own data, and monitor token throughput in real time — you’re stitching together vLLM, Ray, LlamaFactory, Prometheus, and Grafana by hand. It works, but it takes a full weekend.
Razer — yes, the gaming peripherals company — open-sourced AIKit in late 2025. It’s a Docker-based stack that bundles all of those tools into one container with a CLI that auto-detects your GPUs and configures everything. Apache 2.0 license, supports 280K+ HuggingFace models, and runs on Windows via WSL2 or Ubuntu.
This guide covers what AIKit actually includes, who should use it, how to set it up, what you can do with it, and when simpler tools are the better choice.
What Razer AIKit Is
AIKit is a Docker container that packages a complete AI development environment:
| Component | What It Does | Version (v0.2.0) |
|---|---|---|
| vLLM | High-throughput inference engine with OpenAI-compatible API | 0.14.1 |
| LlamaFactory | Fine-tuning framework (LoRA, QLoRA, full) | 0.9.4 |
| Ray | Distributed computing for multi-GPU/multi-node clusters | 2.50.1 |
| Grafana | Real-time dashboard for GPU utilization, throughput, latency | Pre-configured |
| Prometheus | Metrics collection (5-second scrape interval) | Pre-configured |
| Open WebUI | ChatGPT-like web interface for chatting with models | Pre-configured |
| Jupyter Lab | Interactive notebooks for inference, fine-tuning, and experiments | Pre-configured |
| PyTorch | Deep learning framework | 2.8.0 + CUDA 12.9 |
The rzr-aikit CLI wraps everything with commands like model run, model generate, cluster run, and cluster join. It auto-discovers your GPUs and suggests optimal tensor parallelism settings for whatever model you want to run.
License: Apache 2.0 — fully open source, commercial use allowed.
GitHub: razerofficial/aikit
Who This Is For (and Who Should Skip It)
Use AIKit if you:
- Have 2+ NVIDIA GPUs and want to run models that don’t fit on one card
- Want fine-tuning and inference in the same environment
- Need production-grade serving with monitoring and an OpenAI-compatible API
- Plan to cluster multiple machines for distributed inference
- Want a managed Docker stack instead of wiring up tools manually
Skip AIKit if you:
- Have one GPU and just want to chat with models — Ollama is simpler by an order of magnitude
- Run GGUF-quantized models and want maximum single-GPU speed — llama.cpp is lighter
- Don’t want Docker overhead
- Have an AMD GPU — AIKit requires NVIDIA (compute capability 7.0+)
AIKit is a power-user tool. It adds real value when you need multi-GPU distribution, fine-tuning, and monitoring in one stack. For single-GPU inference, it’s unnecessary complexity.
Requirements
Hardware
- GPU: NVIDIA with compute capability 7.0+ (Turing or newer)
- Minimum: RTX 2060 / GTX 1650 Ti (Turing) with 8GB+ VRAM
- Recommended: RTX 3080+ / RTX 4070+ with 24GB+ total VRAM across GPUs
- Also supports datacenter cards: A100, L40S, H100
- RAM: 8GB minimum, 32GB+ recommended
- Storage: 50GB free (200GB+ recommended — models are large)
- CPU: Any modern x86_64
Supported GPUs
| Architecture | Cards | Compute Capability |
|---|---|---|
| Turing | RTX 2060/2070/2080, Titan RTX, T4 | 7.5 |
| Ampere | RTX 3060/3070/3080/3090, A100, A6000 | 8.0 / 8.6 |
| Ada Lovelace | RTX 4060/4070/4080/4090, L40S | 8.9 |
| Blackwell | RTX 5090, RTX PRO 6000 | 12.0 |
Software
- Windows 11 with WSL2, or Ubuntu 22.04/24.04
- Docker Engine
- NVIDIA GPU drivers (latest recommended)
- NVIDIA Container Toolkit
- HuggingFace account + access token (for downloading models)
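Before moving on to setup, it's worth confirming that Docker can actually see your GPUs. A quick sanity check, assuming the NVIDIA Container Toolkit is already installed and configured (the CUDA image tag is just an example; any recent nvidia/cuda base tag will do):
# Host driver check: should list every GPU
nvidia-smi
# The same output from inside a container confirms the toolkit is wired up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi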
Setup Walkthrough
Option A: Quick Start (Single Container)
The fastest way to try AIKit:
# Create cache directory for model downloads
mkdir -p $HOME/.cache/huggingface
# Run the container
docker run -it \
--restart=unless-stopped \
--gpus all \
--ipc host \
--network host \
--mount type=bind,source=$HOME/.cache/huggingface,target=/home/Razer/.cache/huggingface \
--env HUGGING_FACE_HUB_TOKEN=<YOUR_TOKEN> \
razerofficial/aikit:latest
Once inside the container:
# Run your first model
rzr-aikit model run deepseek-ai/deepseek-coder-1.3b-instruct
# Generate text
rzr-aikit model generate "Write a Python function to add two numbers."
Option B: Full Stack with Monitoring (Docker Compose)
This gives you Grafana, Prometheus, Open WebUI, and Jupyter alongside the inference engine:
git clone https://github.com/razerofficial/aikit.git && cd aikit
mkdir -p $HOME/.cache/huggingface
export HUGGING_FACE_HUB_TOKEN=<YOUR_TOKEN>
docker compose -f docker_compose/docker-compose.yaml up -d --pull always
After the containers start, you get:
| Service | URL | Notes |
|---|---|---|
| Open WebUI | http://localhost:1919 | Chat interface (connects to vLLM) |
| Grafana | http://localhost:3000 | Monitoring dashboard (admin/admin) |
| Jupyter Lab | http://localhost:8888 | Interactive notebooks |
| vLLM API | http://localhost:8000 | OpenAI-compatible endpoints |
| Ray Dashboard | http://localhost:8265 | Cluster management |
| Prometheus | http://localhost:9090 | Raw metrics |
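If you want to confirm the stack came up cleanly before opening any dashboards, the vLLM endpoint is the quickest thing to probe. Two checks against standard OpenAI-compatible routes (the model name in the second call is just an example; use whatever you loaded with rzr-aikit model run):
# List the models the server is currently serving
curl http://localhost:8000/v1/models
# Minimal chat completion against a loaded model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/deepseek-coder-1.3b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'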
Windows (WSL2) Setup
Windows requires a few extra steps:
- Install latest NVIDIA GPU drivers on Windows
- Install WSL2:
wsl --install -d Ubuntu-24.04
- Enable mirrored networking — create %USERPROFILE%\.wslconfig with:
[wsl2]
networkingMode=mirrored
- Allow firewall access:
Set-NetFirewallHyperVVMSetting -Name '{40E0AC32-46A5-438A-A0B2-2B479E8F2E90}' -DefaultInboundAction Allow
- Inside WSL, install Docker Engine and the NVIDIA Container Toolkit
- Follow Option A or B from above
WSL2 gotcha: After starting a model, you may need to run python -m http.server --bind 0.0.0.0 8000 to trigger the WSL2 network bridge. It will error (port occupied) — that’s expected. Then rzr-aikit model generate works. This is a known WSL2 networking quirk.
What You Can Do With It
Distributed Inference Across GPUs
This is AIKit’s primary value. Split large models across multiple GPUs with tensor parallelism:
# Run a 14B model across 2 GPUs
rzr-aikit model run Qwen/Qwen3-14B \
--tensor-parallel-size 2
# Run a 70B model across 4 GPUs with Ray backend
rzr-aikit model run meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--distributed-executor-backend ray
The CLI’s gpu-select command analyzes your hardware and the model to recommend settings:
# Let AIKit figure out the optimal configuration
rzr-aikit gpu-select meta-llama/Llama-3.1-70B-Instruct
For multi-machine clusters, start a Ray head node on one machine and join workers from others:
# Machine 1 (head)
rzr-aikit cluster run --ifname eth0
# Machine 2 (worker)
rzr-aikit cluster join --ifname eth0 --address 192.168.1.100:6379
Fine-Tuning with LlamaFactory
LoRA fine-tuning is built in. No separate install, no dependency conflicts:
llamafactory-cli train \
--stage sft --do_train \
--model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--template qwen \
--finetuning_type lora \
--lora_rank 8 --lora_alpha 16 --lora_dropout 0.05 \
--lora_target q_proj,k_proj,v_proj,o_proj \
--dataset alpaca_gpt4 \
--dataset_dir ~/fine-tuning/data \
--output_dir ~/fine-tuning/adapters/lora_alpaca_gpt4 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-4 \
--fp16
Then serve your fine-tuned model with the LoRA adapter:
rzr-aikit model run Qwen/Qwen2.5-0.5B-Instruct \
--enable-lora \
--lora-modules alpaca=~/fine-tuning/adapters/lora_alpaca_gpt4
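Once the adapter is registered, requests can target it by the name you gave it (alpaca above) through the OpenAI-compatible API. A sketch, assuming AIKit passes --enable-lora and --lora-modules straight through to vLLM's server:
# Request the LoRA adapter by its registered name instead of the base model
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "alpaca", "messages": [{"role": "user", "content": "Summarize LoRA in one sentence."}]}'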
LlamaFactory supports LoRA, QLoRA (4/8-bit), full parameter fine-tuning, DoRA, and more. It covers 100+ model families including LLaMA, Qwen, Mistral, DeepSeek, and Phi. For more on fine-tuning methods, see our fine-tuning guide.
Monitoring with Grafana
The pre-configured Grafana dashboard (port 3000) tracks:
- GPU utilization — per-GPU percentage in real time
- GPU memory usage — used vs total VRAM per card
- Output token throughput — tokens generated per second
- Time to First Token (TTFT) — latency before generation starts
- Inter-Token Latency (ITL) — time between each generated token
- Request load — requests per second hitting the API
Prometheus scrapes metrics every 5 seconds from vLLM and Ray. This matters because multi-GPU setups can have unbalanced load — one GPU maxed out while others idle — and you need visibility to diagnose it.
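If you'd rather pull raw numbers than read the dashboard, both vLLM and Prometheus expose them over HTTP. Two hedged examples (vLLM's metric names carry a vllm: prefix, but the exact set varies by version, so check /metrics for what your build actually exports):
# Raw metrics straight from the inference server
curl -s http://localhost:8000/metrics | grep "^vllm:"
# Ask Prometheus for one series by name (metric name shown is an example)
curl -s http://localhost:9090/api/v1/query --data-urlencode "query=vllm:num_requests_running"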
OpenAI-Compatible API
vLLM exposes standard OpenAI endpoints at port 8000. Any tool that speaks the OpenAI API works:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="sk-dummy" # Required but not validated
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Explain tensor parallelism in one paragraph."}]
)
print(response.choices[0].message.content)
Interactive Notebooks
AIKit ships 8 Jupyter notebooks covering single-GPU inference, distributed inference (head + worker), LoRA fine-tuning (local and distributed), OpenAI API integration, and semantic search. Accessible at port 8888.
AIKit vs Ollama vs Plain vLLM
| | Razer AIKit | Ollama | vLLM (Direct) |
|---|---|---|---|
| Setup | Docker Compose, 5-10 min | curl + ollama run, 2 min | Manual Python env, 30+ min |
| Multi-GPU | Tensor + pipeline parallelism | No | Yes (manual config) |
| Multi-Node | Ray clusters | No | Yes (manual config) |
| Fine-Tuning | LlamaFactory built in | No | No (separate tool) |
| Monitoring | Grafana + Prometheus | None | Manual setup |
| Web UI | Open WebUI + Jupyter | None (CLI only) | None |
| API | OpenAI-compatible (vLLM) | OpenAI-compatible | OpenAI-compatible |
| Model Format | HuggingFace native | GGUF (llama.cpp) | HuggingFace native |
| Throughput (batched) | High (PagedAttention) | Low (single request) | High (PagedAttention) |
| VRAM Efficiency | Good (FP16/BF16) | Best (GGUF quantized) | Good (FP16/BF16) |
| Complexity | Medium-High | Low | High |
| Best For | Multi-GPU dev environments | Single-GPU hobbyists | Production serving |
Choose Ollama if you have one GPU and want to chat with models. It’s simpler, uses GGUF quantization for better VRAM efficiency, and works in 2 minutes. See our Ollama vs LM Studio comparison or llama.cpp vs Ollama vs vLLM.
Choose AIKit if you have multiple GPUs, want fine-tuning in the same environment, or need production-grade monitoring and API serving.
Choose plain vLLM if you already know vLLM and want full control without Docker wrapping your stack.
Multi-GPU VRAM Considerations
Tensor parallelism splits model weights across GPUs. Two 24GB GPUs give you roughly 48GB of usable VRAM for model weights, minus overhead for KV cache and activations. As a rule of thumb, FP16 weights need about 2 bytes per parameter, so a 70B model is roughly 140 GB of weights alone, which is why anything that size needs quantization or several GPUs.
| GPU Setup | Total VRAM | Can Run (FP16) | Can Run (Q4) |
|---|---|---|---|
| 1x RTX 3090 | 24 GB | Up to ~12B | Up to ~24B |
| 2x RTX 3090 | 48 GB | Up to ~24B | Up to ~48B |
| 2x RTX 4090 | 48 GB | Up to ~24B | Up to ~48B |
| 4x RTX 4090 | 96 GB | Up to ~48B | Up to ~96B |
| 1x A100 80GB | 80 GB | Up to ~40B | Up to ~80B |
The PCIe reality check: Consumer GPUs connect through PCIe, not NVLink. This means inter-GPU communication for tensor parallelism runs at ~64 GB/s (PCIe 4.0 x16) instead of 600-900 GB/s (NVLink). You’ll see diminishing returns with more GPUs, especially for smaller models where communication overhead dominates.
Practical guidance:
- 2 GPUs over PCIe works well. The throughput penalty is modest (~10-20%) compared to the benefit of running a larger model.
- 4 GPUs over PCIe works, but communication overhead grows. Use pipeline parallelism (--pipeline-parallel-size) instead of tensor parallelism when possible — it communicates less between GPUs (see the sketch after this list).
- Mixed GPU setups (different models or different VRAM) are supported in AIKit v0.2.0 with GPU ordering and load balancing.
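A sketch of the pipeline-parallel option on four cards, assuming rzr-aikit forwards vLLM's --pipeline-parallel-size flag the same way it forwards --tensor-parallel-size:
# Two pipeline stages, each stage sharded across 2 GPUs (2 x 2 = 4 GPUs total)
rzr-aikit model run meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 2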
For GPU buying recommendations, see our GPU buying guide and VRAM requirements guide.
Known Limitations
AIKit is version 0.2.0. Keep these in mind:
Early project. 74 GitHub stars, no independent reviews or benchmarks published. The performance claims are Razer’s own. The community is essentially nonexistent — no Reddit threads, no Hacker News discussions.
NVIDIA only. No AMD ROCm support. If you have AMD GPUs, this isn’t for you.
Large Docker image. It contains vLLM, Ray, LlamaFactory, PyTorch, CUDA, Jupyter, Conda, and Open WebUI. The v0.2.0 update reduced image size by 24.3%, but it’s still substantial.
WSL2 networking quirks. The port-triggering workaround for Windows is hacky. If you’re on Windows and this matters to you, expect friction.
Low VRAM (<10GB) on Windows. You'll need to cap vLLM's VRAM allocation with --gpu-memory-utilization 0.8 to avoid OOM errors.
No GGUF support. AIKit uses HuggingFace-format models through vLLM, not GGUF. If you want GGUF quantization (which fits more model into less VRAM), use Ollama or llama.cpp instead.
The Bottom Line
Razer AIKit solves a real problem: setting up multi-GPU inference with fine-tuning, monitoring, and an API is genuinely tedious without it. The Docker stack eliminates hours of dependency wrangling, and the gpu-select auto-configuration is helpful for choosing tensor parallelism settings.
But it’s for a specific audience. If you run one GPU and just want to chat with models, AIKit is a sledgehammer for a thumbtack — Ollama does that in two minutes. If you already have a working vLLM setup, AIKit wraps it in Docker without adding much new capability.
Use AIKit when:
- You have 2+ NVIDIA GPUs and want to run models that don’t fit on one card
- You want fine-tuning and inference in one managed environment
- You need Grafana monitoring for throughput and GPU utilization
- You’re setting up a multi-machine cluster with Ray
Skip AIKit when:
- You have one GPU — use Ollama
- You have AMD GPUs — not supported
- You want maximum VRAM efficiency — GGUF via llama.cpp is better
- You prefer minimal overhead — plain vLLM gives you more control
AIKit is early (v0.2.0) and unproven in the wild. But the stack it assembles is solid — vLLM, Ray, and LlamaFactory are all battle-tested individually. Razer packaged them well. If multi-GPU local AI is your use case, it’s worth trying before building the same thing from scratch.