The Intel Arc B580 is the cheapest way to get 12GB of VRAM in a new card. At ~$250 street price, only a used RTX 3060 12GB comes in cheaper, and 12GB is enough memory to run any 7-9B model at Q4 with headroom for context.

The problem isn’t the hardware. The hardware is fine. The problem is that NVIDIA has had a decade to build CUDA into the default path for everything, and Intel is still catching up. Running LLMs on an Arc card means picking your way through software stacks that change every few months, dealing with setup steps that CUDA users never think about, and occasionally hitting bugs that make you question your life choices.

I still think it’s worth considering, if you go in with the right expectations.


What the B580 Brings to LLM Inference

| Spec | Value | Why It Matters |
|---|---|---|
| VRAM | 12GB GDDR6 | Fits 7-9B models at Q4-Q8, some 14B at Q4 |
| Memory bandwidth | 456 GB/s | Higher than RTX 3060 (360 GB/s) |
| XMX engines | 160 | Intel's matrix math units, analogous to CUDA Tensor Cores |
| TDP | 150W | Reasonable; single 8-pin power connector |
| Street price | ~$249 | Cheapest 12GB card available new |

The bandwidth number is the one that matters most for LLM inference. Token generation speed is bottlenecked by how fast the GPU can read model weights from VRAM. The B580’s 456 GB/s is 27% more than the RTX 3060’s 360 GB/s. In theory, that should translate to faster generation. In practice, software overhead eats some of that advantage.
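
A rough way to see why bandwidth bounds generation speed: each generated token requires reading roughly every weight once, so an upper bound on tokens per second is bandwidth divided by model size in bytes. The sketch below uses the figures quoted in this article (456 GB/s, ~5.5GB for Llama 3.1 8B at Q4_K_M); real throughput falls far short of the ceiling, which is exactly what "software overhead" means here.

```python
# Back-of-envelope ceiling on token generation speed:
# each token reads ~all model weights once from VRAM.

def max_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical ceiling: memory bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_size_gb

# B580: 456 GB/s; Llama 3.1 8B at Q4_K_M is ~5.5 GB
ceiling = max_tok_per_s(456, 5.5)
print(f"Theoretical ceiling: {ceiling:.0f} tok/s")  # ~83 tok/s
print("Measured (Vulkan):   ~13 tok/s")
# The gap between ~83 and ~13 is the software-overhead tax.
```

The same arithmetic puts the RTX 3060's ceiling at ~65 tok/s, so the B580's hardware advantage is real even if the stack doesn't fully exploit it yet.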


The Three Software Paths

There are three ways to run LLMs on an Arc B580. They are not equally good.

Path 1: llama.cpp with Vulkan (Simplest)

The Vulkan backend in llama.cpp is the simplest path. Vulkan is cross-platform, doesn’t require Intel’s oneAPI toolkit, and works with standard GPU drivers.

Download a llama.cpp release with Vulkan support, point it at a GGUF model, and go:

# Download the latest llama.cpp release built with Vulkan support.
# A Vulkan build uses the Vulkan backend automatically; -ngl 99 offloads all layers.
./llama-server -m qwen3.5-9b-q4_k_m.gguf -ngl 99

Community reports and Phoronix benchmarks consistently show Vulkan outperforming SYCL on the B580. This is ironic given that SYCL is Intel’s own stack, but it’s the reality as of early 2026.

Path 2: llama.cpp with SYCL (Intel’s Stack)

The SYCL backend uses Intel’s oneAPI toolkit and gives access to the XMX engines for matrix operations. Setup is heavier:

# Install oneAPI Base Toolkit (2025.1+)
# Then build llama.cpp with SYCL
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j

# Run with environment variables
ZES_ENABLE_SYSMAN=1 ONEAPI_DEVICE_SELECTOR=level_zero:0 \
  ./build/bin/llama-server -m model.gguf -ngl 99

SYCL should be faster in theory because it can use the XMX engines directly. In practice, community testing shows Vulkan consistently beating SYCL on the B580 for llama.cpp inference. Intel is actively working on this, and the gap may close with future oneAPI releases. For now, start with Vulkan.

Path 3: Ollama via IPEX-LLM (Most Setup)

Ollama doesn’t natively support Intel Arc GPUs. You can run it through Intel’s IPEX-LLM bridge, which wraps Ollama with Intel-optimized backends.

The setup involves:

  1. Install oneAPI toolkit
  2. Create a conda environment
  3. Install IPEX-LLM Ollama package
  4. Set environment variables (OLLAMA_NUM_GPU=999, ZES_ENABLE_SYSMAN=1)
  5. Run init-ollama to configure
  6. Start Ollama with IPEX-LLM backend
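
Sketched as a shell session, the steps look roughly like this. Treat it as a sketch, not a recipe: the pip package name, Python version, and environment name are assumptions based on Intel's IPEX-LLM documentation at the time of writing, and they change between releases.

```shell
# IPEX-LLM Ollama setup sketch -- verify exact package names against
# Intel's current IPEX-LLM docs; they change between releases.
source /opt/intel/oneapi/setvars.sh        # 1. load the oneAPI environment
conda create -n llm python=3.11 -y         # 2. isolated conda environment
conda activate llm
pip install --pre --upgrade 'ipex-llm[cpp]'  # 3. IPEX-LLM Ollama/llama.cpp package (name may vary)
export OLLAMA_NUM_GPU=999                  # 4. offload all layers to the Arc GPU
export ZES_ENABLE_SYSMAN=1
init-ollama                                # 5. set up the IPEX-LLM ollama binary
./ollama serve                             # 6. start Ollama with the IPEX-LLM backend
```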

It works. Users have reported running Qwen3-8B at ~33 tok/s through this path with 16K context. But there are known issues: model loading failures on some driver versions, bus errors on newer Linux kernels, and the setup is fragile enough that a driver update can break things.

If you need the Ollama interface specifically, this path exists. If you just need to run models, use llama.cpp Vulkan and skip the complexity.


Real Benchmarks

These numbers come from Phoronix/OpenBenchmarking testing llama.cpp with the Vulkan backend on Linux. Text generation at 128 tokens output:

| Model | B580 tok/s | Notes |
|---|---|---|
| Llama 3.1 8B (Tulu-3) | 12.9 | Standard 8B performance |
| Mistral 7B | 14.4 | Slightly faster due to architecture |
| Qwen3 8B | 13.3 | Consistent with other 8B models |
| DeepSeek-R1-Distill 8B | 13.7 | Reasoning model, same speed tier |
| Granite 3B | 58.2 | Small models fly |
| GPT-OSS 20B | 26.6 | MoE architecture, not dense 20B |

Prompt processing (prefill) is much faster: 590-640 tok/s for 8B models, over 2,300 tok/s for 3B. The B580’s compute is solid for prefill. It’s the generation phase where bandwidth becomes the bottleneck.
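
To translate those prefill and generation numbers into felt latency, here is an illustrative calculation using the 8B Vulkan figures above (~600 tok/s prefill, ~13 tok/s generation). The prompt and reply lengths are made up for the example:

```python
def chat_latency(prompt_tokens, output_tokens, prefill_tps, gen_tps):
    """Split response time into time-to-first-token and generation time."""
    ttft = prompt_tokens / prefill_tps   # prefill runs once over the whole prompt
    gen = output_tokens / gen_tps        # generation emits one token at a time
    return ttft, gen

# 2,000-token prompt, 300-token reply, B580 8B Vulkan numbers from this article
ttft, gen = chat_latency(2000, 300, prefill_tps=600, gen_tps=13)
print(f"Time to first token: {ttft:.1f}s")  # ~3.3s
print(f"Generation time:     {gen:.1f}s")   # ~23.1s
# Generation dominates the wait -- bandwidth, not compute, is the bottleneck.
```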

For the IPEX-LLM/OpenVINO path, Intel’s own benchmarks claim significantly higher numbers (150+ tok/s for 7B models). Those numbers appear to come from batched or optimized pipeline scenarios that don’t reflect typical single-user interactive chat. The Vulkan numbers above are what you’ll see in normal use.


B580 vs RTX 3060 12GB vs RX 7700 XT

| | Arc B580 | RTX 3060 12GB | RX 7700 XT |
|---|---|---|---|
| VRAM | 12GB GDDR6 | 12GB GDDR6 | 12GB GDDR6 |
| Bandwidth | 456 GB/s | 360 GB/s | 432 GB/s |
| Price (new/used) | ~$249 new | ~$170-220 used | ~$350 new |
| 7-9B tok/s | 13-15 | 12-20 (CUDA) | 15-22 (ROCm) |
| Software stack | Vulkan/SYCL | CUDA | ROCm |
| Ollama support | Via IPEX-LLM bridge | Native | Native (recent) |
| Setup difficulty | Medium-High | Easy | Medium |
| Driver maturity | Young | Mature | Improving |

The RTX 3060 12GB wins on software maturity. CUDA just works. Every guide and troubleshooting post assumes NVIDIA. If you’re buying used and want the smoothest experience, the 3060 is the safer pick at $170-220.

The B580 wins on bandwidth per dollar. 456 GB/s at $249 new is better value than either competitor on paper. In practice, software overhead partially negates the bandwidth advantage, but the gap should narrow as Intel’s stack matures.

The RX 7700 XT has the best raw performance but costs $100 more. ROCm support has improved a lot since 2024, and Ollama added native AMD support. If you’re willing to spend $350, the 7700 XT is the strongest 12GB option for LLM inference today.

My take: If you already own a B580 for gaming and want to try local LLMs, go for it. It works. If you’re buying specifically for LLM inference, a used RTX 3060 12GB at $180 gets you there with less friction. The B580 makes sense if you want a new card that handles both gaming and LLMs at $250 and you don’t mind spending an afternoon on setup.


What Runs on 12GB

With 12GB VRAM, here’s what fits (see our VRAM requirements guide for the full table):

| Model | Quant | VRAM | Verdict |
|---|---|---|---|
| Qwen 3.5 9B | Q4_K_M | 6.6GB | Runs easily, room for 32K+ context |
| Qwen 3.5 9B | Q8_0 | 11GB | Fits tight, ~8K context |
| Llama 3.1 8B | Q4_K_M | 5.5GB | Comfortable fit |
| Mistral 7B | Q4_K_M | 5GB | Plenty of headroom |
| Qwen 2.5 Coder 14B | Q4_K_M | 9.5GB | Fits, limited context |
| Phi-4 14B | Q4_K_M | 9GB | Fits, limited context |
| Qwen 3.5 27B | Q4_K_M | 17GB | Does not fit |
| Llama 3.1 70B | Any | 40GB+ | Does not fit |

The sweet spot on 12GB is a 7-9B model at Q4_K_M. You get the model plus enough headroom for a useful context window. At Q8 you can squeeze in a 9B model but context will be tight. Anything above 14B at Q4 won’t fit.
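
A rough rule of thumb behind the fits/doesn't-fit calls in the table: VRAM needed ≈ quantized model file size, plus KV cache (which grows with context length), plus about 1GB of runtime overhead. This sketch encodes that heuristic; the constants are approximations, not measurements, and the KV-cache figure varies with architecture and cache quantization.

```python
def fits_in_vram(model_gb, context_tokens, vram_gb=12.0,
                 kv_gb_per_1k_tokens=0.12, overhead_gb=1.0):
    """Heuristic fit check: weights + KV cache + runtime overhead.

    kv_gb_per_1k_tokens is a rough figure for an 8B-class model with an
    FP16 KV cache; it varies with architecture and cache quantization.
    """
    need = model_gb + (context_tokens / 1000) * kv_gb_per_1k_tokens + overhead_gb
    return need <= vram_gb, round(need, 1)

print(fits_in_vram(5.5, 8_000))   # Llama 3.1 8B Q4_K_M, 8K context -> fits
print(fits_in_vram(17.0, 0))      # 27B Q4: weights alone exceed 12GB
```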

For more on what to run, check our llama.cpp vs Ollama vs vLLM comparison.


Known Issues and Workarounds

Driver versions matter a lot. Intel’s GPU drivers update frequently and updates can break things. Pin a working driver version and don’t update unless you have a reason to.

On newer Linux kernels, some users report “bus error (core dumped)” with Intel Arc. The IPEX-LLM team tracks these in their GitHub issues. If you hit this, try an older kernel or check for a fixed driver version. The SYCL/Level Zero backend can also fail to load model tensors with I/O errors, which is a known issue. Switching to Vulkan usually fixes both problems.

One thing that trips people up: Vulkan consistently beats SYCL on the B580. Counterintuitive, since SYCL is Intel’s own stack, but if you’re getting poor performance with SYCL, try Vulkan before spending hours debugging.

Ollama doesn’t natively support Intel Arc the way it supports NVIDIA (CUDA) and AMD (ROCm). The IPEX-LLM bridge works but adds complexity. If your workflow depends on Ollama specifically, this is a real friction point. If you’re fine with llama.cpp directly, it’s a non-issue.

Flash Attention is also limited. The SYCL implementation may not work on all Arc models, and the Vulkan backend's flash attention support is newer and may not cover Arc. This affects prefill speed on long prompts but doesn't change generation speed.


Setup Checklist (Vulkan Path)

The fastest way to get running:

  1. Install the latest Intel Arc GPU driver from Intel’s download center
  2. Download a llama.cpp release with Vulkan support
  3. Download a GGUF model (start with Qwen 3.5 9B Q4_K_M from HuggingFace)
  4. Run: ./llama-server -m model.gguf -ngl 99 -c 8192
  5. Open http://localhost:8080 in your browser

That’s it. No oneAPI, no conda, no environment variables. If this works and the performance is acceptable, you’re done. Only go down the SYCL or IPEX-LLM path if you need something Vulkan can’t provide.

If the model isn’t using the GPU, check that Vulkan is detecting your Arc card: vulkaninfo --summary should show the B580. If it doesn’t, your driver isn’t installed correctly. See our Ollama not using GPU guide for general GPU detection troubleshooting, though the NVIDIA-specific fixes won’t apply.


Bottom Line

The Arc B580 is a real budget option for local LLMs: 12GB of VRAM at $249 with more memory bandwidth than the RTX 3060. The value per dollar is there.

The tax you pay is in software maturity. You’ll spend more time on setup and more time in GitHub issues when something breaks. Intel’s stack is improving fast, but it’s not at CUDA’s level and probably won’t be for another year or two.

Buy it if you want a new 12GB card for gaming that also does LLM inference, and you don’t mind reading GitHub issues when something breaks. Skip it if you want the smoothest possible path to running local models, in which case a used RTX 3060 12GB at $180 is the better call.