Google DiffusionGemma: 4x Faster AI Text Generation Explained

On June 10, 2026, Google DeepMind quietly dropped the strangest open-weights model of the year. DiffusionGemma is a 26-billion-parameter mixture-of-experts language model that doesn’t predict the next token at all. It starts with a canvas of 256 random placeholder tokens — literal noise — and denoises the whole block into readable text in parallel.

The result, according to Google (blog.google, June 10, 2026) and confirmed by NVIDIA’s same-day technical blog (blogs.nvidia.com, June 10, 2026), is up to 4× faster text generation on GPUs — with the FP8 quantization hitting 1,288 tokens/second on a single H200 in vLLM benchmarks (vllm.ai, June 10, 2026). That’s roughly six times a standard autoregressive baseline running the same 26B-A4B backbone.

I want to walk you through what’s actually new here, why Google thinks it matters, and where the trade-offs hit you in the face.

What is DiffusionGemma, exactly?

DiffusionGemma is Google’s first open-weight diffusion language model (dLLM), built on the Gemma 4 architecture and released under Apache 2.0 on June 10, 2026. It’s not an incremental tweak to next-token prediction. It’s a different generation paradigm.

Every LLM you’ve used — ChatGPT, Claude, Gemini, Llama, DeepSeek — writes the way a typewriter does. One token, then the next, each conditioned on everything before it. DiffusionGemma writes the way Stable Diffusion paints: start with noise, then iteratively refine the entire image in parallel. Applied to text, that means starting with a 256-token block of random placeholders and denoising them all at once across a handful of passes.

Per the official Hugging Face model card, the headline specs are:

Spec	Value
Total parameters	25.2B
Active parameters (per step)	3.8B
Experts	8 active / 128 total + 1 shared
Layers	30
Canvas length	256 tokens
Context length	Up to 256K tokens
Vocabulary	262K
Sliding window	1,024 tokens
Modalities (input)	Text, image, video
License	Apache 2.0

Because it’s a Mixture-of-Experts model, only 3.8B parameters activate per step. Quantized, it fits comfortably in 18 GB of VRAM on a consumer RTX 4090 or 5090 — Google’s launch post says so explicitly.

Why diffusion for text? The typewriter vs. the printing press

Here’s the part that clicked for me. Brendan O’Donoghue and Sebastian Flennerhag, the research scientists who led the launch, framed it perfectly in their Google blog post:

“It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.”

Autoregressive language models are memory-bandwidth bound. On a local GPU, the model spends most of its time waiting for the next token load. The Tensor Cores sit mostly idle. In a cloud server with massive batching, you can hide that latency. On your desk with a single user, you can’t.

Diffusion flips the equation. Pulling a full 256-token block through the transformer in parallel is compute-bound — exactly what GPUs are built for. The Tensor Cores stay fed. NVIDIA’s developer blog (June 10, 2026) puts it bluntly: “Generating one token at a time is fundamentally a memory-bound problem… Diffusion flips the equation.”

The speed numbers, cross-verified across Google, NVIDIA, vLLM, and Ars Technica:

Hardware	Throughput	Notes
NVIDIA H100 (FP8)	1,008 tok/s	vLLM benchmark, batch size 1
NVIDIA H200 (FP8)	1,288 tok/s	~6× AR baseline
NVIDIA H100 (BF16)	1,000+ tok/s	Google + NVIDIA
NVIDIA RTX 5090	700+ tok/s	Google + NVIDIA
NVIDIA DGX Spark	150 tok/s	NVIDIA
NVIDIA DGX Station	2,000 tok/s	NVIDIA

Pull quote: “On a single H100 it streams 1,000+ tokens a second — Google’s own chart shows 1,107 tok/s against 303 tok/s for the same-size autoregressive Gemma 4. That is a 3.7× throughput jump from the same 26B-A4B backbone.” — Towards AI, June 11, 2026

How text diffusion actually works

The mental model isn’t hard once you let go of next-token thinking. Per the Google Developers Blog guide (June 10, 2026) and the vLLM engineering write-up, here’s the loop:

The canvas. The model initializes a 256-token block with random placeholder tokens. Call it a blank page covered in scribbles.
Iterative refinement. Every pass, the model runs bidirectional attention over the full canvas. It samples a candidate token at each position and measures confidence (entropy) per position. The most confident tokens get locked in. The rest get re-noised.
Convergence. Once the model’s best-guess tokens stop changing for two consecutive passes and the average entropy falls below threshold — or you hit a hard step cap — the block is “committed.”
Block-autoregressive extension. For outputs longer than 256 tokens, the committed block is written to the KV cache, then a fresh noisy canvas is initialized and the loop continues. The whole output is built block by block.

That re-noising step is the secret sauce. It’s called entropy-bound sampling — Google uses an “entropy budget” of 0.1, accepting the lowest-entropy tokens first and re-randomizing the rest. Between steps, self-conditioning feeds the previous step’s full probability distribution back into the next pass through a gated MLP, so re-noised positions don’t lose all context.

The vLLM team had to invent new plumbing to serve this. Standard vLLM assumes a batch is either all causal or all bidirectional. DiffusionGemma mixes both per request — a prefill pass uses causal attention to write the KV cache, then a denoise pass uses bidirectional attention over the canvas. They built dynamic per-sequence causal attention to handle that, with patches to both the Triton Attention and FlashAttention 4 backends. (vllm.ai, June 10, 2026)

What DiffusionGemma does that Gemma 4 can’t

This is the part that genuinely surprised me. DiffusionGemma has structural capabilities that autoregressive models literally cannot replicate without hacks:

Bidirectional attention. Every token on the canvas sees every other token, including ones that haven’t been “written” yet. Sudoku? The digit in cell 1 can inform cell 81 in the same forward pass.
Self-correction. If the model loses confidence in a token mid-generation, it can re-noise it and try again. An autoregressive model is stuck with whatever it committed.
Global constraint propagation. Problems where every output depends on every other output — code infilling, math graphs, molecular sequences — are structurally easier when you can see the whole block at once.

Google’s launch post (blog.google) shows this concretely with Sudoku. The base DiffusionGemma solves roughly 0% of puzzles. After fine-tuning with their Hackable Diffusion JAX recipe, accuracy jumps to 80% — and convergence drops from 48 steps to 12. The developer guide walks through the recipe.

That’s a task where autoregressive models structurally fumble. Each digit depends on intersecting row, column, and 3×3 box constraints. You can’t write it left-to-right.

It also does multimodal work. Per the Hugging Face model card, DiffusionGemma takes text, image, and video inputs (no audio), supports variable image resolution through a configurable visual token budget (70, 140, 280, 560, or 1120 tokens), has a 256K context window, and ships with configurable “thinking mode” reasoning.

The honest trade-offs

Google is refreshingly upfront here. DiffusionGemma is experimental, and they tell you on the launch page: “DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.”

The Hugging Face model card benchmarks (instruction-tuned, with the recommended entropy-bound sampler) make the gap concrete:

Benchmark	DiffusionGemma 26B A4B	Gemma 4 26B A4B
MMLU Pro	77.6%	82.6%
AIME 2026 (no tools)	69.1%	88.3%
LiveCodeBench v6	69.1%	77.1%
GPQA Diamond	73.2%	82.3%
BigBench Extra Hard	47.6%	64.8%
Codeforces ELO	1429	1718
HLE no tools	11.0%	8.7%
MRCR v2 long-context	32.0%	44.1%
MMMU Pro (vision)	54.3%	73.8%

You’re trading 5 to 19 percentage points on most reasoning and knowledge benchmarks for a 4× speedup. The Reddit thread on r/LocalLLaMA captured the mood in its title: “Diffusion Gemma is 4x faster, but makes 6x more mistakes!” (reddit.com, June 12, 2026).

Where the speed wins and where it doesn’t, per VentureBeat’s analysis (June 11, 2026):

Wins: Local inference, single-user apps, low-concurrency serving, interactive editing, code infilling, structured data generation.
Doesn’t win: High-throughput cloud serving at hundreds of concurrent requests. Autoregressive models already saturate compute there, and DiffusionGemma’s parallel decoding offers diminishing returns — sometimes higher serving cost.
Apple Silicon caveat: Google’s launch post adds a footnote: unified-memory Macs may not see the same speedup because of their lower compute-to-memory ratio.

How it fits in the broader dLLM landscape

DiffusionGemma didn’t emerge from nowhere. The roots go back to a brief Gemini Diffusion preview at Google I/O 2025, which Simon Willison clocked at 857 tokens/second (simonwillison.net, May 21, 2025). That research matured into this Apache 2.0 release.

Outside Google, the diffusion-LLM space is heating up:

Inception Labs’ Mercury launched commercially in 2025 as the “world’s first commercial-scale diffusion LLM,” claiming up to 10× faster than speed-optimized LLMs (inceptionlabs.ai). Mercury 2 followed in February 2026 with reasoning support.
NVIDIA’s Nemotron-Labs Diffusion introduced a related parallel-decoding dLLM approach (Hugging Face blog, May 22, 2026).
Fast-dLLM from NVIDIA Labs showed up to 27.6× throughput improvements with minimal accuracy loss (nvlabs.github.io).

DiffusionGemma’s specific edge is that it’s the first dLLM natively supported in vLLM — meaning day-zero serving with the same OpenAI-compatible API stack most teams already run. (vllm.ai, June 10, 2026)

How to actually run it

The ecosystem support landed on day one. Per Google’s launch post and the NVIDIA RTX AI Garage blog:

Weights. Download google/diffusiongemma-26B-A4B-it from Hugging Face under Apache 2.0. Pre-quantized NVFP4 and FP8-dynamic checkpoints are on the Red Hat AI Hub.
Inference. Serve via vLLM (day-zero support), Hugging Face Transformers, SGLang, or MLX on Apple Silicon. Official llama.cpp support is “coming soon.”
Fine-tuning. Use Hackable Diffusion (JAX), Unsloth, or NVIDIA NeMo AutoModel.
Cloud deploys. Try it free on NVIDIA NIM or Google Cloud’s Model Garden.

Within days of release, the Unsloth team posted a GGUF quantization that hit 2,000+ tokens/sec on suitable hardware. Community spaces on Hugging Face — including code generation, 3D generation, and OCR correction demos — popped up almost immediately. (huggingface.co)

So what does this actually mean?

Here’s my honest take after digging through all the docs and benchmarks.

DiffusionGemma is not a drop-in replacement for Gemma 4. The benchmark gap is real, and you should not ship it to production chatbots where factual accuracy matters. Google says as much themselves.

What it is is a new tool for a specific job: latency-critical, single-user, locally-deployed AI workflows. If you’re building a coding copilot that needs to feel instant, an in-line editor that fills in code mid-keystroke, a constrained generation pipeline for molecular sequences, or a research tool that needs to explore many candidate outputs fast — the 4× speedup is real, and the bidirectional architecture unlocks things autoregressive models can’t do structurally.

The bigger story is the vLLM integration. The new ModelState interface the vLLM team built for this release is designed to be a blueprint for adding future dLLMs without forking the runner. That means the next diffusion model — whether it’s from Google, Inception Labs, NVIDIA, or someone we haven’t heard of yet — can land in the same serving stack with minimal friction.

DiffusionGemma is the first major open-weight dLLM that runs at production scale on consumer hardware with mainstream tooling support. It’s an experimental model with real trade-offs, but it’s also a glimpse of where text generation is heading. The typewriter era might be ending.

Sources cited inline: Google DeepMind launch blog (Jun 10, 2026), Google Developers Blog diffusion guide (Jun 10, 2026), Google AI for Developers documentation, NVIDIA RTX AI Garage blog (Jun 10, 2026), NVIDIA Developer Blog (Jun 10, 2026), vLLM engineering blog (Jun 10, 2026), Ars Technica (Jun 10, 2026), VentureBeat (Jun 11, 2026), InfoWorld (Jun 12, 2026), Hugging Face model card, Simon Willison’s Weblog (Jun 10, 2026), Towards AI (Jun 11, 2026), Inception Labs blog (2025–2026).

Google DiffusionGemma: 4x Faster AI Text Generation Explained

Key Takeaways

Summarize with AI

Google DiffusionGemma: 4x Faster AI Text Generation Explained

What is DiffusionGemma, exactly?

Why diffusion for text? The typewriter vs. the printing press

How text diffusion actually works

What DiffusionGemma does that Gemma 4 can’t

The honest trade-offs

How it fits in the broader dLLM landscape

How to actually run it

So what does this actually mean?

Get our weekly AI digest

AIUnpacker Editorial Team

More in AI Models & Releases

GLM-5.2 Released: New Long-Context AI Model for Agents and Coding

Kimi K2.7 Code Released: Is This the Best Open AI Coding Model?

Claude Fable 5 and Mythos 5 Released: Anthropic's Biggest AI Update Yet

Cohere North Mini Code: New Open-Source AI Coding Model for Developers