Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon

    Tech Note
  • Artificial Intelligence

Introduction

We present Fast-dLLM-mlx, a project that brings the core Fast-dLLM ideas to Dream-style diffusion language models running on Apple Silicon.

The project adapts the training-free acceleration strategy introduced in Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding to the MLX ecosystem, with a practical focus on local inference, reproducibility, and benchmarking on macOS.

Rather than introducing a new model family, this project ports an inference strategy: it aims to make diffusion LLM decoding faster on Apple hardware by combining KV-cache reuse and confidence-based parallel token finalization in MLX.

Background

Diffusion language models differ from standard autoregressive LLMs in how they generate text. Instead of predicting the next token one step at a time, they iteratively refine a partially masked sequence across multiple denoising steps. This gives them a different decoding profile, but it also makes inference more expensive if implemented naively.
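To make the cost of the naive approach concrete, here is a toy sketch of that refinement loop in NumPy. It is purely illustrative, not the Dream implementation: `model_logits` is a hypothetical stand-in for the transformer forward pass, `MASK` is a made-up mask token id, and the loop re-runs the full model every step while committing only one token at a time.

```python
import numpy as np

MASK = -1  # hypothetical mask token id for this sketch

def model_logits(tokens, vocab_size=16, rng=None):
    # Stand-in for a transformer forward pass: per-position logits over the vocab.
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal((len(tokens), vocab_size))

def naive_diffusion_decode(prompt, gen_len, steps, vocab_size=16):
    # Start from a fully masked generation region appended to the prompt.
    tokens = np.array(prompt + [MASK] * gen_len)
    rng = np.random.default_rng(0)
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Full recompute of every position at every refinement step: the expense
        # that KV-cache reuse and parallel finalization are meant to remove.
        logits = model_logits(tokens, vocab_size, rng)
        exp = np.exp(logits[masked])
        probs = exp / exp.sum(axis=1, keepdims=True)
        # Commit only the single most confident masked position per step.
        best = masked[np.argmax(probs.max(axis=1))]
        tokens[best] = int(np.argmax(logits[best]))
    return tokens
```

With `gen_len` positions and one commit per step, the model runs `gen_len` full forward passes; the optimizations described below attack exactly this redundancy.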

In this note, Dream-style diffusion language models refers to diffusion LMs built around the Dream family of iterative text generation methods, introduced in Dream 7B: Diffusion Large Language Models (https://arxiv.org/abs/2508.15487), where a transformer repeatedly revises masked token positions instead of committing tokens strictly left-to-right. In practice, that means they keep the diffusion-style refinement loop, but often reuse much of the standard LLM transformer stack, making them a useful bridge between familiar autoregressive architectures and non-autoregressive generation.

The Fast-dLLM paper:

Wu, C., Zhang, H., Xue, S., Liu, Z., Diao, S., Zhu, L., Luo, P., Han, S., & Xie, E. (2025). Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding. https://arxiv.org/abs/2505.22618

proposes a training-free way to speed up diffusion LLM inference by reusing attention state and decoding multiple positions in parallel instead of fully recomputing the sequence at every refinement step.

The original reference implementation is available at: https://github.com/NVlabs/Fast-dLLM

Motivation

Most experimental work around diffusion LLMs is still centered on Python stacks tuned primarily for CUDA environments. That leaves a gap for developers and researchers who want to:

  • Run diffusion LLM inference locally on Apple Silicon
  • Benchmark new decoding strategies in MLX
  • Compare diffusion decoding against mlx_lm autoregressive baselines
  • Avoid introducing training or architecture changes just to test inference improvements

Fast-dLLM-mlx addresses this gap as a project focused on inference-time acceleration only. It is designed for experimentation with decoding behavior rather than for changing the underlying model weights.

This project brings these ideas to MLX, Apple's machine learning framework for Apple Silicon, and applies them to Dream-style diffusion language model inference.

Fast-dLLM for Dream Models in MLX

This project implements Dream architecture inference in MLX and adds an initial MLX adaptation of the Fast-dLLM approach for that setting.

The current implementation includes:

  • Dream architecture inference in MLX
  • Dual-cache support for prompt-prefill and cache-aware token updates
  • Parallel token generation with probability thresholding
  • Confidence-threshold token finalization to lock in easy tokens early
  • Benchmark scripts for comparing Fast-dLLM-mlx, Dream-MLX, and mlx_lm baselines

In practice, the Fast-dLLM path attempts to finalize multiple masked positions in parallel when their predicted confidence is high enough. This reduces redundant computation relative to repeatedly reprocessing the refinement state from scratch.

Example benchmark invocation:

uv run python -m benchmarks.fast_dllm_mlx_benchmark \
  --model mlx-community/DiffuCoder-7B-cpGRPO-8bit \
  --trust-remote-code \
  --max-new-tokens 128 \
  --steps 20 \
  --block-length 32 \
  --threshold 0.9 \
  --warmup

Design Principles

Fast-dLLM-mlx follows several practical design principles:

  1. Training-free acceleration: improve inference without retraining the model
  2. Apple-native execution: target MLX and Apple Silicon directly
  3. Architectural restraint: adapt the decoding strategy without overcomplicating the model stack
  4. Benchmarkability: make comparisons with baseline MLX paths straightforward
  5. Research utility: keep the project clear enough for further experimentation

Project Structure

The implementation consists of three main pieces:

  1. MLX inference of Dream-style architecture.
  2. Dual KV-cache.
  3. Parallel decoding with confidence thresholding.

First, it provides an MLX implementation of the Dream-style transformer stack, including attention, RoPE handling, RMSNorm, and the language modeling head.
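For reference, the RMSNorm used in such stacks normalizes each activation vector by its root-mean-square rather than subtracting a mean, then applies a learned scale. A NumPy sketch of the standard formulation (the MLX stack would use its own tensor ops, e.g. `mlx.nn.RMSNorm`, but the math is the same):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize each row by its root-mean-square, then apply a learned
    # per-dimension scale; no mean subtraction, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight
```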

Second, it extends the standard cache behavior with a dual KV-cache design. This cache supports both:

  • appending prompt and decoded states in the normal way
  • replacing cached spans in place during iterative refinement

This matters because Fast-dLLM-style decoding reuses previously computed attention state while iteratively updating masked positions, rather than recomputing every token position from scratch.
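A minimal sketch of that dual behavior, using plain NumPy arrays in place of MLX tensors. This is a toy single-layer cache for illustration only; the actual implementation also has to track per-layer, per-head state and sequence offsets.

```python
import numpy as np

class DualKVCache:
    """Toy per-layer KV cache: supports append (prefill/decode) and
    in-place span replacement (iterative refinement)."""

    def __init__(self, head_dim):
        self.keys = np.empty((0, head_dim))
        self.values = np.empty((0, head_dim))

    def append(self, k, v):
        # Normal path: extend the cache with newly computed states.
        self.keys = np.concatenate([self.keys, k])
        self.values = np.concatenate([self.values, v])

    def replace(self, start, k, v):
        # Refinement path: overwrite a cached span in place instead of
        # recomputing and re-appending the whole sequence.
        end = start + len(k)
        self.keys[start:end] = k
        self.values[start:end] = v
```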

Third, the generator adds confidence-based parallel decoding. During refinement, the implementation:

  • predicts candidate tokens for masked positions
  • computes per-position confidence
  • finalizes tokens whose confidence exceeds a threshold
  • guarantees progress by forcing at least one update when masked positions remain
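The four steps above can be sketched as a single refinement update. The NumPy function below is a simplification of the actual MLX path, operating directly on raw logits; the 0.9 threshold matches the benchmark invocation shown earlier.

```python
import numpy as np

def refine_step(tokens, logits, mask_id, threshold=0.9):
    """One Fast-dLLM-style update: finalize all sufficiently confident
    masked positions in parallel, forcing at least one if none qualify."""
    masked = np.flatnonzero(tokens == mask_id)
    if masked.size == 0:
        return tokens
    # Numerically stable softmax over the vocab for each masked position.
    z = logits[masked] - logits[masked].max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    candidates = probs.argmax(axis=-1)   # predicted token per position
    confidence = probs.max(axis=-1)      # per-position confidence
    accept = confidence >= threshold     # parallel finalization set
    if not accept.any():
        accept[confidence.argmax()] = True  # guarantee progress
    tokens = tokens.copy()
    tokens[masked[accept]] = candidates[accept]
    return tokens
```

Low-threshold settings finalize more positions per step (fewer model calls, more risk of locking in wrong tokens); high thresholds approach one-token-per-step decoding.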

This mechanism preserves the spirit of the original Fast-dLLM approach while fitting the execution model of MLX.

The repository currently depends on:

  • MLX / MLX-LM for tensor execution and model loading
  • Transformers for tokenizer support
  • Apple Silicon hardware as the primary target environment

The benchmark examples in the repository focus on:

  • mlx-community/DiffuCoder-7B-cpGRPO-8bit for diffusion-style generation
  • mlx-community/Qwen2.5-Coder-7B-Instruct-8bit as an autoregressive mlx_lm baseline

Performance

The repository includes dedicated benchmark entrypoints for three comparison paths:

  • fast_dllm_mlx_benchmark.py
  • dream_mlx_benchmark.py
  • qwen_mlx_lm_benchmark.py

These scripts evaluate prompts from the local prompts/ set and record per-prompt runtime summaries in CSV and JSON form. The current benchmark was run on a limited number of samples, not on a full dataset.
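As an illustration of the kind of summary those scripts emit, per-prompt records can be reduced to category averages and written in both formats. The field names here (`category`, `tokens_per_sec`) are hypothetical, not the repository's actual schema:

```python
import csv
import json
import statistics
from collections import defaultdict

def summarize(records, csv_path, json_path):
    # records: [{"category": ..., "tokens_per_sec": ...}, ...] (hypothetical schema)
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r["tokens_per_sec"])
    summary = {c: round(statistics.mean(v), 3) for c, v in sorted(by_cat.items())}
    with open(json_path, "w") as f:
        json.dump(summary, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["category", "avg_tokens_per_sec"])
        w.writerows(summary.items())
    return summary
```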

The benchmark slice currently used in the repository compares three prompt categories (coding, general, and math) across:

  • autoregressive Qwen baselines in mlx_lm
  • Dream MLX diffusion decoding
  • Fast-dLLM MLX decoding on top of the same Dream-style model family

For the MLX diffusion path, the experiments use DiffuCoder, and we include Qwen2.5-Coder as an autoregressive baseline because the Dream architecture used here is based on Qwen2.5, making the comparison more meaningful.

Average throughput by category, in tokens per second (higher is better), is shown below:

| Category | Qwen MLX 4-bit | Qwen MLX 8-bit | Dream MLX 4-bit | Dream MLX 6-bit | Dream MLX 8-bit | Fast-dLLM MLX 4-bit | Fast-dLLM MLX 6-bit | Fast-dLLM MLX 8-bit |
|----------|---------------:|---------------:|----------------:|----------------:|----------------:|--------------------:|--------------------:|--------------------:|
| coding   | 50.664 | 30.334 | 13.567 | 13.521 | 12.973 | 38.227 | 37.492 | 38.015 |
| general  | 50.648 | 30.164 | 13.554 | 13.477 | 13.016 | 36.983 | 39.137 | 35.729 |
| math     | 50.103 | 30.096 | 11.748 | 11.489 | 10.989 | 34.381 | 34.887 | 37.022 |
| average  | 50.472 | 30.198 | 12.956 | 12.829 | 12.326 | 36.530 | 37.172 | 36.922 |

The main result is straightforward: Fast-dLLM on MLX is much faster than plain Dream MLX decoding across all measured categories and quantization variants.

Using the average row as a summary:

  • Fast-dLLM MLX 4-bit reaches 36.53 versus 12.96 for Dream MLX 4-bit
  • Fast-dLLM MLX 6-bit reaches 37.17 versus 12.83 for Dream MLX 6-bit
  • Fast-dLLM MLX 8-bit reaches 36.92 versus 12.33 for Dream MLX 8-bit

This places the Fast-dLLM MLX variants at roughly 2.8x to 3.0x faster than the corresponding Dream MLX baselines in this benchmark slice.

The autoregressive mlx_lm Qwen baselines still set the upper bound in this comparison, especially in 4-bit form. However, the important outcome is that Fast-dLLM substantially narrows the gap between diffusion decoding and autoregressive generation while preserving the diffusion-style inference setup.

These numbers should still be treated as directional rather than exhaustive. The benchmark was run on a limited number of samples rather than a full dataset, and the measurements are best interpreted as evidence that the Fast-dLLM decoding strategy transfers effectively to MLX and Apple Silicon.

Applications and Impact

Fast-dLLM-mlx is useful as both a project and a practical starting point for on-device experimentation with diffusion LLM inference on Apple hardware.

It enables:

  • Faster local experimentation with diffusion decoding strategies
  • Side-by-side comparison between diffusion and autoregressive MLX inference
  • Exploration of cache-aware and parallel diffusion decoding techniques on Apple Silicon
  • A clearer path toward privacy-preserving, local-first diffusion LLM tooling on macOS

More broadly, this project shows that Fast-dLLM-style inference optimizations are not limited to CUDA-first environments. They can also be adapted to Apple's local inference stack in a way that remains faithful to the original idea while being practical for MLX-based workflows.

Code

@online{yemets-2026-fast-dllm-mlx,
  author = {Kyrylo Yemets},
  title = {Fast-dLLM on MLX: Training-Free Acceleration for Diffusion Language Models on Apple Silicon},
  note = {\emph{Online.} \url{https://research.macpaw.com/publications/fast-dllm-mlx}},
  month = {Apr},
  year = {2026},
}
