Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon

    Tech Note
  • Artificial Intelligence

Introduction

We present Fast-dLLM-mlx, a project that brings the core Fast-dLLM ideas to Dream-style diffusion language models running on Apple Silicon.

The project adapts the training-free acceleration strategy introduced in Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding to the MLX ecosystem, with a practical focus on local inference, reproducibility, and benchmarking on macOS.

Rather than introducing a new model family, this project ports an inference strategy: it aims to make diffusion LLM decoding faster on Apple hardware by combining KV-cache reuse and confidence-based parallel token finalization in MLX.

Background

Diffusion language models differ from standard autoregressive LLMs in how they generate text. Instead of predicting the next token one step at a time, they iteratively refine a partially masked sequence across multiple denoising steps. This gives them a different decoding profile, but it also makes inference more expensive if implemented naively.
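To make the cost of the naive approach concrete, here is a toy sketch of that refinement loop in NumPy. It is purely illustrative, not the Dream implementation: `model_logits` is a hypothetical stand-in for the transformer forward pass, `MASK` is a made-up mask token id, and the loop re-runs the full model every step while committing only one token at a time.

```python
import numpy as np

MASK = -1  # hypothetical mask token id for this sketch

def model_logits(tokens, vocab_size=16, rng=None):
    # Stand-in for a transformer forward pass: per-position logits over the vocab.
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal((len(tokens), vocab_size))

def naive_diffusion_decode(prompt, gen_len, steps, vocab_size=16):
    # Start from a fully masked generation region appended to the prompt.
    tokens = np.array(prompt + [MASK] * gen_len)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Full recompute of every position at every refinement step: the expense
        # that KV-cache reuse and parallel finalization are meant to remove.
        logits = model_logits(tokens, vocab_size, rng)
        exp = np.exp(logits[masked])
        probs = exp / exp.sum(axis=1, keepdims=True)
        # Commit only the single most confident masked position per step.
        best = masked[np.argmax(probs.max(axis=1))]
        tokens[best] = int(np.argmax(logits[best]))
    return tokens
```

With `gen_len` positions and one commit per step, the model runs `gen_len` full forward passes; the optimizations described below attack exactly this redundancy.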

In this note, Dream-style diffusion language models refers to diffusion LMs built around the Dream family of iterative text generation methods, introduced in Dream 7B: Diffusion Large Language Models (https://arxiv.org/abs/2508.15487), where a transformer repeatedly revises masked token positions instead of committing tokens strictly left-to-right. In practice, that means they keep the diffusion-style refinement loop, but often reuse much of the standard LLM transformer stack, making them a useful bridge between familiar autoregressive architectures and non-autoregressive generation.

The Fast-dLLM paper:

Wu, C., Zhang, H., Xue, S., Liu, Z., Diao, S., Zhu, L., Luo, P., Han, S., & Xie, E. (2025). Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. https://arxiv.org/abs/2505.22618

proposes a training-free way to speed up diffusion LLM inference by reusing attention state and decoding multiple positions in parallel instead of fully recomputing the sequence at every refinement step.

The original reference implementation is available at: https://github.com/NVlabs/Fast-dLLM

Motivation

Most experimental work around diffusion LLMs is still centered on Python stacks tuned primarily for CUDA environments. That leaves a gap for developers and researchers who want to:

  • Run diffusion LLM inference locally on Apple Silicon
  • Benchmark new decoding strategies in MLX
  • Compare diffusion decoding against mlx_lm autoregressive baselines
  • Avoid introducing training or architecture changes just to test inference improvements

Fast-dLLM-mlx addresses this gap as a project focused on inference-time acceleration only. It is designed for experimentation with decoding behavior rather than for changing the underlying model weights.

This project brings these ideas to MLX, Apple's machine learning framework for Apple Silicon, and applies them to Dream-style diffusion language model inference.

Fast-dLLM for Dream Models in MLX

This project implements Dream architecture inference in MLX and adds an initial MLX adaptation of the Fast-dLLM approach for that setting.

The current implementation includes:

  • Dream architecture inference in MLX
  • Dual-cache support for prompt-prefill and cache-aware token updates
  • Parallel token generation with probability thresholding
  • Confidence-threshold token finalization to lock in easy tokens early
  • Benchmark scripts for comparing Fast-dLLM-mlx, Dream-MLX, and mlx_lm baselines

In practice, the Fast-dLLM path attempts to finalize multiple masked positions in parallel when their predicted confidence is high enough. This reduces redundant computation relative to repeatedly reprocessing the refinement state from scratch.

Example benchmark invocation:

uv run python -m benchmarks.fast_dllm_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --block-length 32 \
  --threshold 0.9 \
  --warmup

Design Principles

Fast-dLLM-mlx follows several practical design principles:

  1. Training-free acceleration: improve inference without retraining the model
  2. Apple-native execution: target MLX and Apple Silicon directly
  3. Architectural restraint: adapt the decoding strategy without overcomplicating the model stack
  4. Benchmarkability: make comparisons with baseline MLX paths straightforward
  5. Research utility: keep the project clear enough for further experimentation

Project Structure

The implementation consists of three main pieces:

  1. MLX inference of Dream-style architecture.
  2. Dual KV-cache.
  3. Parallel decoding with confidence thresholding.

First, it provides an MLX implementation of the Dream-style transformer stack, including attention, RoPE handling, RMSNorm, and the language modeling head.
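For reference, the RMSNorm used in such stacks normalizes each activation vector by its root-mean-square rather than subtracting a mean, then applies a learned scale. A NumPy sketch of the standard formulation (the MLX stack would use its own tensor ops, e.g. `mlx.nn.RMSNorm`, but the math is the same):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize each row by its root-mean-square, then apply a learned
    # per-dimension scale; no mean subtraction, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```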

Second, it extends the standard cache behavior with a dual KV-cache design. This cache supports both:

  • appending prompt and decoded states in the normal way
  • replacing cached spans in place during iterative refinement

This matters because Fast-dLLM-style decoding reuses previously computed attention state while iteratively updating masked positions, rather than recomputing every token position from scratch.
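A minimal sketch of that dual behavior, using plain NumPy arrays in place of MLX tensors. This is a toy single-layer cache for illustration only; the actual implementation also has to track per-layer, per-head state and sequence offsets.

```python
import numpy as np

class DualKVCache:
    """Toy per-layer KV cache: supports append (prefill/decode) and
    in-place span replacement (iterative refinement)."""

    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Normal path: extend the cache with newly computed states.
        self.keys = np.concatenate([self.keys, k])
        self.values = np.concatenate([self.values, v])

    def replace(self, start, k, v):
        # Refinement path: overwrite a cached span in place instead of
        # recomputing and re-appending the whole sequence.
        end = start + len(k)
        self.keys[start:end] = k
        self.values[start:end] = v
```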

Third, the generator adds confidence-based parallel decoding. During refinement, the implementation:

  • predicts candidate tokens for masked positions
  • computes per-position confidence
  • finalizes tokens whose confidence exceeds a threshold
  • guarantees progress by forcing at least one update when masked positions remain
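The four steps above can be sketched as a single refinement update. The NumPy function below is a simplification of the actual MLX path, operating directly on raw logits; the 0.9 threshold matches the benchmark invocation shown earlier.

```python
import numpy as np

def refine_step(tokens, logits, mask_id, threshold=0.9):
    """One Fast-dLLM-style update: finalize all sufficiently confident
    masked positions in parallel, forcing at least one if none qualify."""
    masked = np.flatnonzero(tokens == mask_id)
    if masked.size == 0:
        return tokens
    # Numerically stable softmax over the vocab for each masked position.
    z = logits[masked] - logits[masked].max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    candidates = probs.argmax(axis=-1)   # predicted token per position
    confidence = probs.max(axis=-1)      # per-position confidence
    accept = confidence >= threshold     # parallel finalization set
    if not accept.any():
        accept[confidence.argmax()] = True  # guarantee progress
    tokens = tokens.copy()
    tokens[masked[accept]] = candidates[accept]
    return tokens
```

Low-threshold settings finalize more positions per step (fewer model calls, more risk of locking in wrong tokens); high thresholds approach one-token-per-step decoding.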

This mechanism preserves the spirit of the original Fast-dLLM approach while fitting the execution model of MLX.

The repository currently depends on:

  • MLX / MLX-LM for tensor execution and model loading
  • Transformers for tokenizer support
  • Apple Silicon hardware as the primary target environment

The benchmark examples in the repository focus on:

  • mlx-community/DiffuCoder-7B-cpGRPO-8bit for diffusion-style generation
  • mlx-community/Qwen2.5-Coder-7B-Instruct-8bit as an autoregressive mlx_lm baseline

Performance

The repository includes dedicated benchmark entrypoints for three comparison paths:

  • fast_dllm_mlx_benchmark.py
  • dream_mlx_benchmark.py
  • qwen_mlx_lm_benchmark.py

These scripts evaluate prompts from the local prompts/ set and record per-prompt runtime summaries in CSV and JSON form. The current benchmark was run on a limited number of samples, not on a full dataset.
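As an illustration of the kind of summary those scripts emit, per-prompt records can be reduced to category averages and written in both formats. The field names here (`category`, `tokens_per_sec`) are hypothetical, not the repository's actual schema:

```python
import csv
import json
import statistics
from collections import defaultdict

def summarize(records, csv_path, json_path):
    # records: [{"category": ..., "tokens_per_sec": ...}, ...] (hypothetical schema)
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r["tokens_per_sec"])
    summary = {c: round(statistics.mean(v), 3) for c, v in sorted(by_cat.items())}
    with open(json_path, "w") as f:
        json.dump(summary, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["category", "avg_tokens_per_sec"])
        w.writerows(summary.items())
    return summary
```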

The benchmark slice currently used in the repository compares three prompt categories (coding, general, and math) across:

  • autoregressive Qwen baselines in mlx_lm
  • Dream MLX diffusion decoding
  • Fast-dLLM MLX decoding on top of the same Dream-style model family

For the MLX diffusion path, the experiments use DiffuCoder, and we include Qwen2.5-Coder as an autoregressive baseline because the Dream architecture used here is based on Qwen2.5, making the comparison more meaningful.

Average throughput by category, in tokens per second (higher is better), is shown below:

| Category | Qwen MLX 4-bit | Qwen MLX 8-bit | Dream MLX 4-bit | Dream MLX 6-bit | Dream MLX 8-bit | Fast-dLLM MLX 4-bit | Fast-dLLM MLX 6-bit | Fast-dLLM MLX 8-bit |
|----------|---------------:|---------------:|----------------:|----------------:|----------------:|--------------------:|--------------------:|--------------------:|
| coding   | 50.664 | 30.334 | 13.567 | 13.521 | 12.973 | 38.227 | 37.492 | 38.015 |
| general  | 50.648 | 30.164 | 13.554 | 13.477 | 13.016 | 36.983 | 39.137 | 35.729 |
| math     | 50.103 | 30.096 | 11.748 | 11.489 | 10.989 | 34.381 | 34.887 | 37.022 |
| average  | 50.472 | 30.198 | 12.956 | 12.829 | 12.326 | 36.530 | 37.172 | 36.922 |

The main result is straightforward: Fast-dLLM on MLX is much faster than plain Dream MLX decoding across all measured categories and quantization variants.

Using the average row as a summary:

  • Fast-dLLM MLX 4-bit reaches 36.53 versus 12.96 for Dream MLX 4-bit
  • Fast-dLLM MLX 6-bit reaches 37.17 versus 12.83 for Dream MLX 6-bit
  • Fast-dLLM MLX 8-bit reaches 36.92 versus 12.33 for Dream MLX 8-bit

This places the Fast-dLLM MLX variants at roughly 2.8x to 3.0x faster than the corresponding Dream MLX baselines in this benchmark slice.

The autoregressive mlx_lm Qwen baselines still set the upper bound in this comparison, especially in 4-bit form. However, the important outcome is that Fast-dLLM substantially narrows the gap between diffusion decoding and autoregressive generation while preserving the diffusion-style inference setup.

These numbers should still be treated as directional rather than exhaustive. The benchmark was run on a limited number of samples rather than a full dataset, and the measurements are best interpreted as evidence that the Fast-dLLM decoding strategy transfers effectively to MLX and Apple Silicon.

Applications and Impact

Fast-dLLM-mlx is useful as both a project and a practical starting point for on-device experimentation with diffusion LLM inference on Apple hardware.

It enables:

  • Faster local experimentation with diffusion decoding strategies
  • Side-by-side comparison between diffusion and autoregressive MLX inference
  • Exploration of cache-aware and parallel diffusion decoding techniques on Apple Silicon
  • A clearer path toward privacy-preserving, local-first diffusion LLM tooling on macOS

More broadly, this project shows that Fast-dLLM-style inference optimizations are not limited to CUDA-first environments. They can also be adapted to Apple's local inference stack in a way that remains faithful to the original idea while being practical for MLX-based workflows.

Code

@online{yemets-2026-fast-dllm-mlx,
  author = {Kyrylo Yemets},
  title = {Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon},
  note = {\emph{Online.} \url{https://research.macpaw.com/publications/fast-dllm-mlx}},
  month = {Apr},
  year = {2026},
}
