DFlash vs MTP on RTX 3090: I Tested Both Locally

📚 More on this topic: DFlash on RTX 3090 (April 30 bench) · Best Way to 2x Token Output on RTX 3090 · Qwen 3.6 Complete Guide · Speculative Decoding Explained

I ran DFlash and MTP on the same RTX 3090 against the same Qwen 3.6-27B target. Both work. The numbers below are firsthand from Miu, my workstation 3090. Where they diverge from each other, and from the published claims, is the article.

DFlash mean 2.56x. MTP mean 1.50x. Same RTX 3090, same Qwen 3.6-27B Q4_K_M — DFlash leads on raw decode, MTP leads on ergonomics. Below: the numbers, the methodology caveat, and the practical recommendation.

Update (July 2026): This head-to-head was benched against am17an’s mtp-clean PR branch on May 6 (commit b9030). PR #22673 has since merged to mainline llama.cpp — May 16. MTP now ships in stock builds, no branch checkout needed, enabled with --spec-type draft-mtp --spec-draft-n-max N (the branch flag --spec-type mtp was renamed to draft-mtp on merge). The bench numbers below are the branch measurement and they stand — the merge changed the plumbing, not what a 3090 clocks. One note on the earliest merged builds: for the few days after May 16, --spec-draft-p-min shipped with a 0.75 default and slowed some configs (issue #23230); ggerganov’s PR #23269 restored the 0.0 default on May 19, so a current build needs nothing — pass --spec-draft-p-min 0 only if you’re pinned to a build from that window. (Separate CPU, Metal, and SYCL slowdowns with their own causes exist too, so verify on your own backend.)

Why this comparison matters today

PR #22673 — the GitHub pull request adding MTP support to llama.cpp — is the most active speculative-decoding development right now. A wave of reproductions of am17an’s gist bench has landed on r/LocalLLaMA in the past day — V100, Strix Halo, RTX 5090, dual R9700, RTX 3090 + RTX 3060. The numbers are converging on ~1.85x for Qwen 3.6-27B with ~76% acceptance, ~2.49 GiB MTP-layer overhead, and ~0.51x prefill cost. Multi-backend support landed May 5: Vulkan and Metal kernels both work alongside CUDA. After the structural rework the maintainers asked for, PR #22673 merged to mainline on May 16.

DFlash has been the speedup leader on Qwen 3.6-27B since the Luce-Org port landed in late April. My April 30 reproduction put it at 2.56x mean on a single 3090, with the 3.6 draft still maturing. MTP is the mainline-friendly alternative: single GGUF, no separate draft repo, and as of May 16 it ships in stock llama.cpp.

Nobody has published a firsthand head-to-head on identical hardware. This article is that. Same RTX 3090, same Qwen 3.6-27B Q4_K_M, run through am17an’s gist — the same harness cturan, the dual-R9700 user, and the Strix Halo and M1 Ultra runs all used. The numbers below compose with that community thread.

The setup

Hardware. Miu — single RTX 3090 24GB on Linux, CUDA 12, sm_86. No dual-GPU configuration. Same box as the April 30 DFlash run.

DFlash side. Numbers come from the April 30 article. Same bench, same hardware, same target weights. Mean 2.56x; HumanEval 2.81x, GSM8K 2.25x, Math500 2.61x. Q4_K_M target paired with the z-lab/Qwen3.6-27B-DFlash BF16 draft, ~3.5 GiB draft VRAM, custom Luce-Org fork.

MTP side. PR #22673 at am17an’s mtp-clean branch, commit 267f8afe857b7bd1a49e4fde9138ab0f7be36625 (b9030, May 6). Target: RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF, 15.35 GiB. Run with --spec-type mtp --spec-draft-n-max 3 against the OpenAI-compatible llama-server endpoint (on the merged mainline this flag is now --spec-type draft-mtp). No separate draft model — the MTP layer ships in the same GGUF.

Methodology. One bench: am17an’s gist, the community reference. The original Luce-Org bench_llm.py doesn’t port to mainline llama.cpp without rewriting the bench wrappers, so I couldn’t run the same harness against PR #22673. The next section unpacks why — and what that means for comparing today’s MTP numbers to the April 30 DFlash numbers.

MTP results — am17an’s gist bench

Per-prompt, single RTX 3090, Qwen 3.6-27B Q4_K_M, baseline (autoregressive) vs MTP enabled:

Prompt	Baseline (tok/s)	MTP (tok/s)	Speedup	Acceptance
code_python	42.42	68.74	1.62x	0.764
code_cpp	42.41	65.10	1.53x	0.705
explain_concept	42.39	61.98	1.46x	0.646
summarize	41.83	64.85	1.55x	0.726
qa_factual	42.36	66.44	1.57x	0.722
translation	43.73	57.02	1.30x	0.542
creative_short	42.36	57.11	1.35x	0.576
stepwise_math	42.36	67.28	1.59x	0.764
long_code_review	42.09	62.73	1.49x	0.694
MEAN	42.44	63.47	1.50x	0.690

Aggregate over 9 prompts (1421 predicted tokens): baseline wall-time 35.82s, MTP wall-time 25.19s, wall-time speedup 1.42x.

The pattern is clean. Coding and stepwise-math hit 1.59-1.62x at 70-76% acceptance. Translation and short creative drop to 1.30-1.35x at 54-58% acceptance. Acceptance rate tracks speedup directly — when the MTP layer’s predictions match the target, the speedup compounds; when they diverge, the verify pass eats the savings.

Why am17an’s gist, not bench_llm.py

The original DFlash article used bench_llm.py from the Luce-Org repo — HumanEval, GSM8K, Math500. That harness ships as part of the Luce fork’s build system; it isn’t portable to PR #22673’s mainline llama.cpp branch without rewriting the bench wrappers, which is its own project.

am17an’s gist runs against any llama.cpp-compatible OpenAI server. It targets the PR branch directly, with a mixed prompt set covering coding, math, factual QA, summarization, translation, and creative work. cturan, the dual-R9700 user, the Strix Halo run, and the M1 Ultra run all used the same gist. The numbers compose with the community thread — but they do not compose directly with the DFlash article’s per-bench numbers.

Different prompt mix produces different baselines. The MTP bench’s autoregressive baseline ran ~42 tok/s on Miu’s RTX 3090. The DFlash article’s autoregressive baseline ran ~33 tok/s on the same hardware. Same model, same quant, same GPU — am17an’s gist prompts run faster autoregressive because they’re shorter and structurally different from the academic suites. That’s the methodological caveat to keep in mind reading the side-by-side table below.

Side-by-side: DFlash vs MTP on Qwen 3.6-27B

Same hardware, same target, two strategies.

Metric	DFlash (April 30)	MTP (today)
Mean speedup	2.56x	1.50x
Coding subset	HumanEval 2.81x	code_python 1.62x, code_cpp 1.53x
Math subset	GSM8K 2.25x, Math500 2.61x	stepwise_math 1.59x
Acceptance rate	not measured per-bench	69% mean (54-76% range)
Memory overhead	~3.5 GiB BF16 draft	~2.49 GiB MTP layer (per PR)
Prefill cost	no measured impact	~0.51x at long prompts (per PR)
Backend	CUDA via Luce fork	CUDA mainline (PR #22673, merged May 16)
Setup complexity	Custom fork build	Single GGUF, mainline flags

Prompt suites differ — DFlash used HumanEval/GSM8K/Math500 academic benches, MTP used am17an’s mixed-prompt gist. The autoregressive baselines differ accordingly (~33 tok/s for DFlash bench prompts, ~42 tok/s for MTP bench prompts on the same GPU). Treat as a directional comparison, not a one-to-one head-to-head.

DFlash leads on every comparable axis where I have firsthand numbers — coding, math, mean speedup. MTP catches up on the dimensions that aren’t speedup: memory overhead, mainline compatibility, single-GGUF deployment. Acceptance rate distribution on MTP (54-76%) tracks the per-prompt speedup distribution exactly, which is what you’d expect from a draft-verify mechanism.

What surprised me

Three things from the bench worth flagging.

MTP came in lower than the PR thread suggested. cturan’s RTX 3090+3060 dual-GPU bench posted 1.85x on Qwen 3.6-27B Q6_K. The PR thread targets “75% steady-state acceptance, 2x+ speedup.” On single 3090 with Q4_K_M and am17an’s mixed-prompt gist, my mean was 1.50x with 69% acceptance. Some of the gap is hardware (single vs dual), some is quant (Q4 vs Q6), some is prompt mix.

--spec-draft-n-max 3 got truncated to 2 mid-bench. Server logs showed draft size 3 exceeds max 2, truncating on several prompts during the run. MTP wasn’t running at the requested draft budget on every prompt. Some of the speedup ceiling I left on the table here. The same may be capping cturan’s number too — worth flagging in the PR thread.

The DFlash gap is meaningful but coding-narrowed. Mean DFlash 2.56x against MTP 1.50x is the headline. Compare coding-only — DFlash HumanEval 2.81x against MTP code_python 1.62x — and the gap holds at ~1.7x. Translation and creative_short prompts dragged MTP’s mean down hardest at 54-58% acceptance. Coding-heavy workloads see less delta than the headline suggests.

Which one should you use?

The honest answer depends on three axes: how much speedup you actually need, how willing you are to run a fork, and what hardware you’re on.

Use DFlash today if you want maximum decode speedup, you don’t mind running a custom Luce-Org fork build, and your workload is coding or math-heavy. The 2.56x mean — 2.81x on HumanEval — is the headline and it’s reproducible. The 3.5 draft is fully trained; the 3.6 draft is still maturing.

Use MTP today if you’re on mainline llama.cpp, you don’t want a fork, or you need long-context workflows where DFlash’s tree verifier hits 24GB memory pressure. Since the May 16 merge that’s the low-friction path — a stock build plus an MTP-capable GGUF, no branch checkout. Single-GGUF deployment is a real ergonomic win for production.

One post-merge caveat, since fixed. The very first merged builds (May 16–19) came in slower than the branch on some configs (issue #23230): --spec-draft-p-min had shipped with a 0.75 default where the branch effectively used 0.0. ggerganov’s PR #23269 restored the 0.0 default on May 19, so a current build needs no flag — pass --spec-draft-p-min 0 only if you’re pinned to a build in that window. That’s not the whole post-merge story, though: separate slowdowns on the CPU (#23698), Apple Metal (#23011), and SYCL (#23203) backends have their own causes. The MTP path moves commit to commit, so verify on your own build rather than trusting a number from any single date.

The composition question. MTP draft tokens fed into DDTree verification — in-checkpoint draft proposing into tree verify — is the obvious next experiment. If both speedups stack even partially, 3x+ territory on the same hardware. Nobody has built it. It’s on my bench list.

Methodology footnotes

Build details. Both runs on Miu — single RTX 3090 24GB on Linux, CUDA 12, sm_86. DFlash via Luce-Org/lucebox-hub at the April 30 commit. MTP via PR #22673, am17an’s mtp-clean branch, commit 267f8afe857b7bd1a49e4fde9138ab0f7be36625 (b9030, May 6).

MTP weights. RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF — 15.35 GiB, Q4_K_M target with the MTP layer baked in.

Server flags (MTP on). --spec-type mtp --spec-draft-n-max 3 -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt. Baseline run identical with --spec-type mtp --spec-draft-n-max 3 removed. On the merged mainline the flag is --spec-type draft-mtp (renamed from mtp); the brief --spec-draft-p-min regression in the first post-merge builds was fixed in #23269 on May 19 (default back to 0.0).

Bench script. am17an’s gist — same harness as cturan’s bench, the dual-R9700 reproduction, and the Strix Halo run.

Truncation note. Server logs showed draft size 3 exceeds max 2, truncating on several prompts during the MTP run. Effective draft budget was 2 not 3 on those prompts. MTP’s measured speedup is therefore a lower bound on what’s achievable with the requested config — relevant to anyone reproducing.

DFlash baseline. Numbers verbatim from DFlash on RTX 3090: I Built It and Tested It, April 30, 2026.

What’s next

PR #22673 merged to mainline on May 16, after the handle_mtp_for_ubatch rework ggerganov asked for. The numbers here are the branch measurement (May 6, commit b9030) — a merged-build re-run is on the list. The first post-merge builds had a --spec-draft-p-min regression (#23230, fixed May 19 in #23269), and separate backend-specific slowdowns are still shaking out, so the merged numbers are worth re-measuring directly. The DFlash + MTP composition experiment is the bench after that. Newsletter Issue 8 summarized this for the non-technical readers.

If you’ve got numbers from your own hardware, post them in the PR #22673 thread on GitHub. That’s the fastest way to converge on the real shape of what these two strategies do on consumer hardware — and where the maintainers are watching.