DFlash vs MTP on RTX 3090: I Tested Both Locally
๐ More on this topic: DFlash on RTX 3090 (April 30 bench) ยท Best Way to 2x Token Output on RTX 3090 ยท Qwen 3.6 Complete Guide ยท Speculative Decoding Explained
I ran DFlash and MTP on the same RTX 3090 against the same Qwen 3.6-27B target. Both work. The numbers below are firsthand from Miu, my workstation 3090. Where they diverge from each other, and from the published claims, is the article.
DFlash mean 2.56x. MTP mean 1.50x. Same RTX 3090, same Qwen 3.6-27B Q4_K_M โ DFlash leads on raw decode, MTP leads on ergonomics. Below: the numbers, the methodology caveat, and the practical recommendation.
Why this comparison matters today
PR #22673 โ the GitHub pull request adding MTP support to llama.cpp โ is the most active speculative-decoding development right now. A wave of reproductions of am17an’s gist bench has landed on r/LocalLLaMA in the past day โ V100, Strix Halo, RTX 5090, dual R9700, RTX 3090 + RTX 3060. The numbers are converging on ~1.85x for Qwen 3.6-27B with ~76% acceptance, ~2.49 GiB MTP-layer overhead, and ~0.51x prefill cost. Multi-backend support landed May 5: Vulkan and Metal kernels both work alongside CUDA, though the PR is still draft pending structural rework.
DFlash has been the speedup leader on Qwen 3.6-27B since the Luce-Org port landed in late April. My April 30 reproduction put it at 2.56x mean on a single 3090, with the 3.6 draft still maturing. MTP is the mainline-friendly alternative: single GGUF, no separate draft repo, lands in stock llama.cpp once PR #22673 merges.
Nobody has published a firsthand head-to-head on identical hardware. This article is that. Same RTX 3090, same Qwen 3.6-27B Q4_K_M, run through am17an’s gist โ the same harness cturan, the dual-R9700 user, and the Strix Halo and M1 Ultra runs all used. The numbers below compose with that community thread.
The setup
Hardware. Miu โ single RTX 3090 24GB on Linux, CUDA 12, sm_86. No dual-GPU configuration. Same box as the April 30 DFlash run.
DFlash side. Numbers come from the April 30 article. Same bench, same hardware, same target weights. Mean 2.56x; HumanEval 2.81x, GSM8K 2.25x, Math500 2.61x. Q4_K_M target paired with the z-lab/Qwen3.6-27B-DFlash BF16 draft, ~3.5 GiB draft VRAM, custom Luce-Org fork.
MTP side. PR #22673 at am17an’s mtp-clean branch, commit 267f8afe857b7bd1a49e4fde9138ab0f7be36625 (b9030, May 6). Target: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF, 15.35 GiB. Run with --spec-type mtp --spec-draft-n-max 3 against the OpenAI-compatible llama-server endpoint. No separate draft model โ the MTP layer ships in the same GGUF.
Methodology. One bench: am17an’s gist, the community reference. The original Luce-Org bench_llm.py doesn’t port to mainline llama.cpp without rewriting the bench wrappers, so I couldn’t run the same harness against PR #22673. The next section unpacks why โ and what that means for comparing today’s MTP numbers to the April 30 DFlash numbers.
MTP results โ am17an’s gist bench
Per-prompt, single RTX 3090, Qwen 3.6-27B Q4_K_M, baseline (autoregressive) vs MTP enabled:
| Prompt | Baseline (tok/s) | MTP (tok/s) | Speedup | Acceptance |
|---|---|---|---|---|
| code_python | 42.42 | 68.74 | 1.62x | 0.764 |
| code_cpp | 42.41 | 65.10 | 1.53x | 0.705 |
| explain_concept | 42.39 | 61.98 | 1.46x | 0.646 |
| summarize | 41.83 | 64.85 | 1.55x | 0.726 |
| qa_factual | 42.36 | 66.44 | 1.57x | 0.722 |
| translation | 43.73 | 57.02 | 1.30x | 0.542 |
| creative_short | 42.36 | 57.11 | 1.35x | 0.576 |
| stepwise_math | 42.36 | 67.28 | 1.59x | 0.764 |
| long_code_review | 42.09 | 62.73 | 1.49x | 0.694 |
| MEAN | 42.44 | 63.47 | 1.50x | 0.690 |
Aggregate over 9 prompts (1421 predicted tokens): baseline wall-time 35.82s, MTP wall-time 25.19s, wall-time speedup 1.42x.
The pattern is clean. Coding and stepwise-math hit 1.59-1.62x at 70-76% acceptance. Translation and short creative drop to 1.30-1.35x at 54-58% acceptance. Acceptance rate tracks speedup directly โ when the MTP layer’s predictions match the target, the speedup compounds; when they diverge, the verify pass eats the savings.
Why am17an’s gist, not bench_llm.py
The original DFlash article used bench_llm.py from the Luce-Org repo โ HumanEval, GSM8K, Math500. That harness ships as part of the Luce fork’s build system; it isn’t portable to PR #22673’s mainline llama.cpp branch without rewriting the bench wrappers, which is its own project.
am17an’s gist runs against any llama.cpp-compatible OpenAI server. It targets the PR branch directly, with a mixed prompt set covering coding, math, factual QA, summarization, translation, and creative work. cturan, the dual-R9700 user, the Strix Halo run, and the M1 Ultra run all used the same gist. The numbers compose with the community thread โ but they do not compose directly with the DFlash article’s per-bench numbers.
Different prompt mix produces different baselines. The MTP bench’s autoregressive baseline ran ~42 tok/s on Miu’s RTX 3090. The DFlash article’s autoregressive baseline ran ~33 tok/s on the same hardware. Same model, same quant, same GPU โ am17an’s gist prompts run faster autoregressive because they’re shorter and structurally different from the academic suites. That’s the methodological caveat to keep in mind reading the side-by-side table below.
Side-by-side: DFlash vs MTP on Qwen 3.6-27B
Same hardware, same target, two strategies.
| Metric | DFlash (April 30) | MTP (today) |
|---|---|---|
| Mean speedup | 2.56x | 1.50x |
| Coding subset | HumanEval 2.81x | code_python 1.62x, code_cpp 1.53x |
| Math subset | GSM8K 2.25x, Math500 2.61x | stepwise_math 1.59x |
| Acceptance rate | not measured per-bench | 69% mean (54-76% range) |
| Memory overhead | ~3.5 GiB BF16 draft | ~2.49 GiB MTP layer (per PR) |
| Prefill cost | no measured impact | ~0.51x at long prompts (per PR) |
| Backend | CUDA via Luce fork | CUDA mainline (PR #22673 draft) |
| Setup complexity | Custom fork build | Single GGUF, mainline flags |
Prompt suites differ โ DFlash used HumanEval/GSM8K/Math500 academic benches, MTP used am17an’s mixed-prompt gist. The autoregressive baselines differ accordingly (~33 tok/s for DFlash bench prompts, ~42 tok/s for MTP bench prompts on the same GPU). Treat as a directional comparison, not a one-to-one head-to-head.
DFlash leads on every comparable axis where I have firsthand numbers โ coding, math, mean speedup. MTP catches up on the dimensions that aren’t speedup: memory overhead, mainline compatibility, single-GGUF deployment. Acceptance rate distribution on MTP (54-76%) tracks the per-prompt speedup distribution exactly, which is what you’d expect from a draft-verify mechanism.
What surprised me
Three things from the bench worth flagging.
MTP came in lower than the PR thread suggested. cturan’s RTX 3090+3060 dual-GPU bench posted 1.85x on Qwen 3.6-27B Q6_K. The PR thread targets “75% steady-state acceptance, 2x+ speedup.” On single 3090 with Q4_K_M and am17an’s mixed-prompt gist, my mean was 1.50x with 69% acceptance. Some of the gap is hardware (single vs dual), some is quant (Q4 vs Q6), some is prompt mix.
--spec-draft-n-max 3 got truncated to 2 mid-bench. Server logs showed draft size 3 exceeds max 2, truncating on several prompts during the run. MTP wasn’t running at the requested draft budget on every prompt. Some of the speedup ceiling I left on the table here. The same may be capping cturan’s number too โ worth flagging in the PR thread.
The DFlash gap is meaningful but coding-narrowed. Mean DFlash 2.56x against MTP 1.50x is the headline. Compare coding-only โ DFlash HumanEval 2.81x against MTP code_python 1.62x โ and the gap holds at ~1.7x. Translation and creative_short prompts dragged MTP’s mean down hardest at 54-58% acceptance. Coding-heavy workloads see less delta than the headline suggests.
Which one should you use?
The honest answer depends on three axes: how much speedup you actually need, how willing you are to run a fork, and what hardware you’re on.
Use DFlash today if you want maximum decode speedup, you don’t mind running a custom Luce-Org fork build, and your workload is coding or math-heavy. The 2.56x mean โ 2.81x on HumanEval โ is the headline and it’s reproducible. The 3.5 draft is fully trained; the 3.6 draft is still maturing.
Use MTP today if you’re on mainline llama.cpp, you don’t want a fork, you can wait for the PR to merge before treating it as production-stable, or you need long-context workflows where DFlash’s tree verifier hits 24GB memory pressure. Single-GGUF deployment is a real ergonomic win for production.
Wait if PR #22673 is still draft when you’re reading this. ggerganov asked for handle_mtp_for_ubatch to be lifted out of llama_context before merge. Numbers may shift slightly post-rework. The merged build is worth waiting for if you’re not in a hurry.
The composition question. MTP draft tokens fed into DDTree verification โ in-checkpoint draft proposing into tree verify โ is the obvious next experiment. If both speedups stack even partially, 3x+ territory on the same hardware. Nobody has built it. It’s on my bench list.
Methodology footnotes
Build details. Both runs on Miu โ single RTX 3090 24GB on Linux, CUDA 12, sm_86. DFlash via Luce-Org/lucebox-hub at the April 30 commit. MTP via PR #22673, am17an’s mtp-clean branch, commit 267f8afe857b7bd1a49e4fde9138ab0f7be36625 (b9030, May 6).
MTP weights. RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF โ 15.35 GiB, Q4_K_M target with the MTP layer baked in.
Server flags (MTP on). --spec-type mtp --spec-draft-n-max 3 -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt. Baseline run identical with --spec-type mtp --spec-draft-n-max 3 removed.
Bench script. am17an’s gist โ same harness as cturan’s bench, the dual-R9700 reproduction, and the Strix Halo run.
Truncation note. Server logs showed draft size 3 exceeds max 2, truncating on several prompts during the MTP run. Effective draft budget was 2 not 3 on those prompts. MTP’s measured speedup is therefore a lower bound on what’s achievable with the requested config โ relevant to anyone reproducing.
DFlash baseline. Numbers verbatim from DFlash on RTX 3090: I Built It and Tested It, April 30, 2026.
What’s next
When PR #22673 merges to mainline llama.cpp, I’ll re-run the suite against the merged build to confirm the numbers hold post-rework. ggerganov has asked for handle_mtp_for_ubatch to be lifted out of llama_context first, so structural changes are coming. The DFlash + MTP composition experiment is the bench after that. Newsletter Issue 8 will summarize for the non-technical readers.
If you’ve got numbers from your own hardware, post them in the PR #22673 thread on GitHub. That’s the fastest way to converge on the real shape of what these two strategies do on consumer hardware โ and where the maintainers are watching.
Get notified when we publish new guides.
Subscribe โ free, no spam