Diffusion Models: A Comprehensive Survey of Methods and Applications in Generative AI

🔑 Key Takeaways

  • What Are Diffusion Models? Foundations and Core Principles — At their core, diffusion models are a family of probabilistic generative models that learn to create new data samples by reversing a gradual noise-adding process.
  • Efficient Sampling Strategies for Diffusion Models — One of the most significant practical challenges with diffusion models is their slow sampling speed.
  • Improving Likelihood Estimation in Diffusion Models — While diffusion models excel at sample quality, optimizing likelihood — a measure of how well the model explains the training data — is crucial for applications like data compression, out-of-distribution detection, and model comparison.
  • Handling Special Data Structures with Diffusion Models — Standard diffusion models are designed for continuous data in Euclidean space, but many real-world data types require special treatment.
  • Connections Between Diffusion Models and Other Generative Frameworks — Diffusion models do not exist in isolation — they share deep connections with other generative model families, and hybrid approaches often yield the best results.

What Are Diffusion Models? Foundations and Core Principles

At their core, diffusion models are a family of probabilistic generative models that learn to create new data samples by reversing a gradual noise-adding process. The fundamental idea is elegantly simple: take a clean data sample, progressively corrupt it with Gaussian noise until it becomes indistinguishable from random noise, and then train a neural network to reverse this corruption step by step.

This two-phase approach — a forward diffusion process that destroys data structure and a learned reverse process that reconstructs it — gives diffusion models their remarkable generative capabilities. Unlike Generative Adversarial Networks (GANs) that require delicate adversarial training, or Variational Autoencoders (VAEs) that sometimes produce blurry outputs, diffusion models offer stable training dynamics and high-fidelity generation.

The survey identifies three foundational formulations that underpin all modern diffusion model research:

  • Denoising Diffusion Probabilistic Models (DDPMs): Originally proposed by Sohl-Dickstein et al. (2015) and refined by Ho et al. (2020), DDPMs define a Markov chain forward process and learn the reverse transitions using a simplified variational bound.
  • Score-Based Generative Models (SGMs): Developed by Song and Ermon (2019), these models estimate the score function — the gradient of the log probability density — at multiple noise scales, then use Langevin dynamics to generate samples.
  • Stochastic Differential Equations (Score SDEs): Song et al. (2021) unified DDPMs and SGMs by showing both are discretizations of continuous-time SDEs, providing a powerful mathematical framework for analysis and algorithm design.

This unification through Score SDEs was a pivotal moment in the field, revealing that apparently different approaches were manifestations of the same underlying continuous process. It opened the door to importing decades of research on differential equations into generative modeling, leading to faster samplers and better theoretical understanding.

Denoising Diffusion Probabilistic Models (DDPMs) Explained

DDPMs form the bedrock of practical diffusion model implementations. The forward process is defined as a Markov chain that gradually adds Gaussian noise to data x₀ over T timesteps according to a variance schedule β₁, β₂, …, β_T. At each step, a small amount of noise is added, and after sufficient steps, the data distribution converges to a standard Gaussian.

The critical insight of DDPMs is that the forward process admits a closed-form expression for any intermediate noisy version of the data, allowing efficient training without sequential computation. The reverse process is parameterized by a neural network (typically a U-Net architecture) that predicts either the added noise, the clean data, or the score function at each timestep.

Training optimizes a simplified objective that amounts to denoising score matching: the network learns to predict the noise that was added at each step. Ho et al. (2020) showed that this simplified loss, despite dropping certain weighting terms from the full variational bound, produces substantially better sample quality. This discovery was instrumental in making diffusion models practical and competitive with GANs for the first time.

Technical Detail: The noise prediction parameterization ε_θ(x_t, t) has become the standard in practice. The training loss simplifies to: L = E[||ε – ε_θ(√ᾱ_t x₀ + √(1-ᾱ_t) ε, t)||²], where ᾱ_t is the cumulative noise schedule product.
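This closed-form property and the simplified loss fit in a few lines. The sketch below, in plain Python on a 1-D toy, is illustrative only: `eps_theta` stands in for the U-Net noise predictor, and the function names are hypothetical.

```python
import math
import random

def make_alpha_bar(betas):
    """Cumulative product ᾱ_t = ∏_{s≤t} (1 − β_s) from a variance schedule."""
    a, out = 1.0, []
    for b in betas:
        a *= 1.0 - b
        out.append(a)
    return out

def q_sample(x0, t, alpha_bar, eps):
    """Closed-form forward marginal: x_t = √ᾱ_t · x0 + √(1−ᾱ_t) · ε  (1-D toy)."""
    ab = alpha_bar[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

def ddpm_loss_single(x0, t, alpha_bar, eps_theta):
    """Simplified DDPM objective on one sample: ||ε − ε_θ(x_t, t)||²."""
    eps = random.gauss(0.0, 1.0)
    x_t = q_sample(x0, t, alpha_bar, eps)
    return (eps - eps_theta(x_t, t)) ** 2

# Linear schedule as in Ho et al. (2020): β from 1e-4 to 0.02 over 1000 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]
alpha_bar = make_alpha_bar(betas)
# With a dummy "model" that always predicts zero noise, the loss is just ε².
loss = ddpm_loss_single(1.0, 500, alpha_bar, eps_theta=lambda x, t: 0.0)
```

Note how training needs only a random timestep and one forward-marginal sample, with no sequential simulation of the chain.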

Score-Based Generative Models and Langevin Dynamics

Score-based generative models take a different but mathematically equivalent approach. Instead of learning reverse Markov transitions, they directly estimate the score function — ∇_x log p(x) — which points in the direction of increasing data probability. Once the score is known at all noise levels, samples can be generated using annealed Langevin dynamics, which iteratively follows the score while adding controlled noise.

The key challenge in score estimation is that the score function is ill-defined in low-density regions of the data space. Song and Ermon addressed this through Noise Conditional Score Networks (NCSNs), which perturb data with multiple levels of Gaussian noise and train a single network to estimate the score at each noise level. Starting from high noise (where the score is well-defined everywhere) and gradually reducing it creates a coarse-to-fine generation process.
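A toy version of this loop makes the coarse-to-fine behavior concrete. For N(0, 1) data perturbed with noise level σ, the perturbed density is N(0, 1 + σ²), so the score is known exactly; the step-size rule and constants below are illustrative choices, not the NCSN paper's exact settings.

```python
import math
import random

def score(x, sigma):
    """Exact score of N(0, 1) data perturbed with noise level σ:
    perturbed density is N(0, 1 + σ²), so ∇_x log p(x) = −x / (1 + σ²)."""
    return -x / (1.0 + sigma * sigma)

def annealed_langevin(sigmas, steps_per_level=50, step_scale=0.2):
    """Annealed Langevin dynamics: start at the highest noise level and
    follow the score downhill, injecting fresh noise at every step."""
    x = random.gauss(0.0, sigmas[0])          # initialize at the coarsest level
    for sigma in sigmas:                      # coarse-to-fine over noise levels
        alpha = step_scale * sigma * sigma    # step size shrinks with σ
        for _ in range(steps_per_level):
            x += 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * random.gauss(0.0, 1.0)
    return x

sigmas = [4.0 * 0.5 ** i for i in range(8)]   # geometric sequence, 4.0 down to ~0.03
sample = annealed_langevin(sigmas)
```

Starting at high noise sidesteps the ill-defined low-density score: early levels need only a coarse direction, and later levels refine it.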

The connection between score-based models and DDPMs becomes clear through the Score SDE framework: the DDPM forward process is a discretization of a variance-preserving SDE, while the NCSN process corresponds to a variance-exploding SDE. Both admit reverse-time SDEs whose drift depends on the score function, unifying the two approaches.

Efficient Sampling Strategies for Diffusion Models

One of the most significant practical challenges with diffusion models is their slow sampling speed. Standard DDPMs require hundreds to thousands of sequential neural network evaluations to generate a single sample, making them orders of magnitude slower than GANs. The survey catalogues a rich ecosystem of acceleration techniques divided into learning-free and learning-based approaches.

Learning-Free Sampling Methods

SDE Solvers: Advanced numerical methods for solving the reverse-time SDE can significantly reduce the required number of steps. Techniques from numerical analysis — including higher-order Runge-Kutta methods, predictor-corrector schemes, and adaptive step-size control — have been adapted for diffusion sampling. These methods achieve better accuracy per function evaluation without any additional training.

ODE Solvers (DDIM and Beyond): Song et al. (2020) showed that DDPMs admit a deterministic counterpart — the probability flow ODE — whose trajectories map noise to data without injected randomness. The DDIM sampler exploits this, enabling generation in 10-50 steps instead of 1000. DPM-Solver and DPM-Solver++ further improved upon DDIM by using exponential integrators specifically designed for the diffusion ODE structure, achieving high-quality samples in as few as 10-20 neural function evaluations.
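The deterministic DDIM update itself is short. The sketch below covers the η = 0 case on a scalar; `eps_pred` stands in for the network's noise prediction, and the ᾱ values are assumed to come from the schedule.

```python
import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev):
    """One deterministic DDIM update (η = 0).
    First recover the clean-data estimate x̂0 implied by the noise prediction,
    then re-noise it to the earlier (lower) noise level ab_prev."""
    x0_hat = (x_t - math.sqrt(1.0 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps_pred

# Sanity check: if eps_pred is the *true* noise used to build x_t,
# the step lands exactly on the corresponding less-noisy point.
x0, eps, ab_t, ab_prev = 2.0, 0.3, 0.4, 0.7
x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps
x_prev = ddim_step(x_t, eps, ab_t, ab_prev)
```

Because the update is an exact map given a perfect noise prediction, timesteps can be skipped aggressively, which is what enables 10-50 step sampling.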

Learning-Based Sampling Methods

Knowledge Distillation: Progressive distillation trains a student model to match two teacher steps in one, halving the step count iteratively. After several rounds, this can compress 1024 steps down to 4-8 steps with minimal quality loss. Consistency models (Song et al., 2023) push this further by learning to map any point on a diffusion trajectory directly to its origin, enabling single-step generation.

Truncated Diffusion: Rather than starting from pure Gaussian noise, truncated diffusion methods begin the reverse process from a partially noised version of a reference image or a learned prior distribution. This reduces the number of steps needed since the model only needs to denoise from an intermediate noise level.

Optimized Discretization: Instead of using uniformly spaced timesteps, these methods learn or optimize the selection of which timesteps to use during sampling. Dynamic programming, reinforcement learning, and gradient-based optimization have all been applied to find step schedules that maximize sample quality for a given computational budget.

Improving Likelihood Estimation in Diffusion Models

While diffusion models excel at sample quality, optimizing likelihood — a measure of how well the model explains the training data — is crucial for applications like data compression, out-of-distribution detection, and model comparison. The survey identifies three main research directions for improving likelihood in diffusion models.

Noise Schedule Optimization: The choice of noise schedule β_t significantly impacts both sample quality and likelihood. Research has shown that cosine schedules, learned schedules, and variance-preserving formulations can substantially improve the evidence lower bound (ELBO). The optimal schedule depends on the data distribution and the number of diffusion steps.
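For concreteness, the cosine schedule of Nichol and Dhariwal (2021) defines ᾱ_t in closed form and derives β_t from consecutive ratios; the sketch below follows the published formula with the usual s = 0.008 offset and β clipping.

```python
import math

def cosine_alpha_bar(T, s=0.008):
    """Cosine schedule: ᾱ_t = f(t)/f(0), f(t) = cos²(((t/T + s)/(1 + s)) · π/2)."""
    def f(t):
        return math.cos(((t / T + s) / (1.0 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(T + 1)]

T = 1000
ab = cosine_alpha_bar(T)
# Derive per-step variances, clipped to avoid singularities near t = T.
betas = [min(1.0 - ab[t] / ab[t - 1], 0.999) for t in range(1, T + 1)]
```

Compared with a linear schedule, this keeps ᾱ_t from collapsing too quickly, spreading the destruction of information more evenly across timesteps.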

Reverse Variance Learning: Standard DDPMs fix the reverse process variance, but learning it jointly with the mean yields measurable likelihood improvements. Nichol and Dhariwal (2021) showed that interpolating between the two natural choices for reverse variance — using a learned mixing coefficient — closes most of the gap to the theoretical optimum.

Exact Likelihood Computation: Through the probability flow ODE, diffusion models can be viewed as continuous normalizing flows, enabling exact likelihood computation via the instantaneous change of variables formula. This connects diffusion models to the flow-based model literature and enables precise density estimation, though at higher computational cost.

Handling Special Data Structures with Diffusion Models

Standard diffusion models are designed for continuous data in Euclidean space, but many real-world data types require special treatment. The survey covers three important categories of structured data.

Discrete Data

Text, categorical variables, and graph structures are inherently discrete. Researchers have developed discrete diffusion processes that corrupt data through token masking, uniform corruption, or absorbing states rather than Gaussian noise. D3PM (Austin et al., 2021) generalized DDPMs to categorical distributions, while Multinomial Diffusion operates directly on probability simplices. These methods enable diffusion-based text generation and molecular graph design.
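A minimal flavor of discrete corruption, here the uniform-transition variant (one of several transition choices in D3PM), can be sketched as follows; the single-step, token-level framing and function names are illustrative only.

```python
import random

def uniform_corrupt_step(token, vocab_size, keep_prob):
    """One forward step of uniform discrete diffusion:
    keep the token with probability keep_prob, otherwise
    resample it uniformly from the vocabulary."""
    if random.random() < keep_prob:
        return token
    return random.randrange(vocab_size)

def corrupt_sequence(tokens, vocab_size, keep_prob):
    """Apply one corruption step independently to every position."""
    return [uniform_corrupt_step(t, vocab_size, keep_prob) for t in tokens]

# After many steps the sequence approaches the uniform distribution,
# the discrete analogue of converging to a standard Gaussian.
noisy = corrupt_sequence([1, 2, 3, 4], vocab_size=10, keep_prob=0.9)
```

Masking and absorbing-state variants follow the same pattern but corrupt toward a special token rather than the uniform distribution.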

Data with Invariant Structures

Many scientific datasets possess symmetry properties — molecular conformations are invariant to rotation and translation, and physical simulations respect conservation laws. Equivariant diffusion models incorporate these symmetries directly into the network architecture and diffusion process, ensuring that generated samples respect physical constraints without requiring data augmentation. This has proven especially valuable in computational chemistry and physics simulations.

Data on Manifolds

When data naturally lives on a non-Euclidean manifold — such as protein backbone angles on a torus, or orientations on SO(3) — standard Gaussian diffusion is inappropriate. Riemannian diffusion models define the noising process using the heat kernel on the manifold, with the reverse process learned in the tangent space. This enables generation on spheres, tori, hyperbolic spaces, and Lie groups while respecting the manifold geometry.

Connections Between Diffusion Models and Other Generative Frameworks

Diffusion models do not exist in isolation — they share deep connections with other generative model families, and hybrid approaches often yield the best results. The survey maps these relationships comprehensively.

VAE Connections: DDPMs can be interpreted as hierarchical VAEs with fixed encoders and a specific factorization structure. This perspective enables techniques from the VAE literature — such as learned priors, auxiliary variables, and importance-weighted bounds — to be transferred to diffusion models.

GAN Connections: Denoising diffusion GANs replace the learned Gaussian reverse transitions with GAN-based conditional generators, enabling high-quality generation in very few steps. This hybrid approach combines the training stability of diffusion models with the fast sampling of GANs.

Normalizing Flow Connections: The probability flow ODE establishes a direct link between diffusion models and continuous normalizing flows. Flow matching — a recent innovation — simplifies diffusion training by directly regressing on a vector field that transports noise to data along straight paths, offering simpler objectives and faster training.

Autoregressive Connections: Some works combine diffusion models with autoregressive generation, using diffusion for local detail and autoregression for global structure. This is particularly effective for sequences like audio and video, where temporal coherence matters across long ranges.

Energy-Based Model Connections: Score-based models are intimately related to energy-based models, since the score function is the gradient of the (negative) energy. This connection enables techniques like classifier guidance, where a separately trained classifier’s gradient is used to steer diffusion sampling toward desired classes or attributes.

Applications of Diffusion Models in Computer Vision

Computer vision has been the primary proving ground for diffusion models, with applications spanning an extraordinary range of tasks. The results have been nothing short of transformative.

Image Generation: Diffusion models have set new state-of-the-art results on ImageNet generation, with models like ADM (Dhariwal and Nichol, 2021) and DiT (Peebles and Xie, 2023) achieving FID scores that surpass the best GANs. Classifier-free guidance has become the standard technique for balancing sample quality and diversity.

Image Editing and Inpainting: By starting the reverse process from a partially noised version of an input image, diffusion models enable powerful editing capabilities — from inpainting missing regions to global style transfer, all guided by text prompts or reference images.

Super-Resolution and Restoration: SR3 and similar models use diffusion processes conditioned on low-resolution inputs to produce photorealistic upscaling, outperforming traditional approaches on perceptual metrics. The same framework applies to deblurring, denoising, and artifact removal.

Video Generation: Extending diffusion models to the temporal dimension enables video synthesis. Models like Video Diffusion Models and Make-A-Video generate temporally coherent video from text descriptions, while others tackle video prediction, interpolation, and editing.

3D Generation: DreamFusion pioneered the use of diffusion models for 3D content creation by optimizing a Neural Radiance Field (NeRF) using a diffusion model as a critic through Score Distillation Sampling (SDS). This “2D-to-3D lifting” approach has spawned numerous follow-up works.

Applications Beyond Vision: NLP, Audio, and Science

While computer vision dominates the diffusion model literature, the framework has proven remarkably versatile across domains.

Natural Language Processing: Diffusion-based text generation models like Diffusion-LM (Li et al., 2022) operate in continuous embedding space, diffusing word embeddings and rounding to discrete tokens. This enables fine-grained control over generated text properties — sentiment, syntax, length — through gradient-based guidance, offering capabilities that autoregressive models struggle to provide.

Audio and Speech: WaveGrad and DiffWave apply diffusion models to raw audio waveform generation, producing high-fidelity speech synthesis. Grad-TTS uses diffusion for the mel-spectrogram generation stage, while newer models tackle music generation, sound effect synthesis, and audio super-resolution.

Molecular Design and Drug Discovery: Diffusion models have become a leading approach for generating molecular conformations, protein structures, and drug candidates. Equivariant diffusion models can generate valid 3D molecular geometries that satisfy chemical constraints, dramatically accelerating the drug discovery pipeline. ICML 2022 featured multiple papers on this topic.

Time-Series and Temporal Data: TimeGrad and CSDI apply diffusion models to time-series forecasting and imputation, handling missing data and uncertainty quantification more naturally than deterministic methods. This has applications in finance, healthcare monitoring, and climate modeling.

Robotics and Planning: Diffuser (Janner et al., 2022) frames reinforcement learning as a sequence modeling problem solved with diffusion, generating entire trajectories conditioned on desired outcomes. This enables flexible, multi-modal planning that gracefully handles constraints and objectives.

Guidance and Conditioning Techniques in Diffusion Models

The ability to control what diffusion models generate — through text prompts, class labels, images, or other conditions — is central to their practical utility. The survey covers the main conditioning paradigms.

Classifier Guidance: Dhariwal and Nichol (2021) introduced classifier guidance, which uses the gradient of a separately trained classifier to bias the diffusion sampling process toward a target class. This dramatically improves sample quality at the cost of diversity, controlled by a guidance scale parameter.

Classifier-Free Guidance: Ho and Salimans (2022) eliminated the need for a separate classifier by jointly training conditional and unconditional diffusion models. During sampling, the unconditional prediction is subtracted from the conditional one and amplified, achieving the same quality-diversity tradeoff without an external classifier. This technique underlies virtually all modern text-to-image systems.
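Numerically, classifier-free guidance is a one-line combination of the two predictions; the formula below is standard, with variable names chosen for illustration.

```python
def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: ε̃ = ε_uncond + w · (ε_cond − ε_uncond).
    w = 1 recovers the purely conditional model; w > 1 extrapolates past it,
    trading diversity for adherence to the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# With scale 1 the unconditional term cancels; larger scales amplify
# the direction in which the condition pulls the prediction.
guided = cfg_noise(eps_cond=1.0, eps_uncond=0.2, w=7.5)
```

In practice this means one extra network evaluation per step (the unconditional pass), usually batched together with the conditional one.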

Text Conditioning: Large-scale text-to-image models like DALL·E 2, Imagen, and Stable Diffusion condition diffusion on text embeddings from pretrained language models (CLIP, T5). The quality of the text encoder is often as important as the diffusion model architecture itself, and cross-attention mechanisms have become the standard way to inject text information into the denoising network.

Industry Impact: Classifier-free guidance combined with CLIP text conditioning has enabled an entirely new creative industry. Tools like Midjourney, Stable Diffusion, and DALL·E 3 serve millions of users, generating billions of images and transforming graphic design, advertising, and digital art.

Current Challenges and Future Directions for Diffusion Models

Despite remarkable progress, the survey identifies several open challenges and promising research directions that will shape the field’s evolution.

Sampling Speed: While significant progress has been made with distillation and fast solvers, diffusion models remain slower than GANs for real-time applications. Consistency models, rectified flows, and architectural innovations continue to push toward single-step generation without quality loss.

Scalability: Training state-of-the-art diffusion models requires enormous computational resources. Latent diffusion (operating in a compressed latent space rather than pixel space) has been the primary solution, but further efficiency gains are needed as models scale to higher resolutions, longer videos, and 3D content.

Evaluation Metrics: FID, IS, and CLIP scores each capture different aspects of generation quality but none fully align with human judgment. Developing better evaluation frameworks remains an open problem for the generative AI community.

Controllability: Fine-grained control over generated content — spatial layout, object relationships, style attributes, physical plausibility — is still imperfect. ControlNet and IP-Adapter represent steps forward, but robust compositional generation remains challenging.

Safety and Ethics: Diffusion models raise concerns around deepfakes, copyright infringement, and harmful content generation. Watermarking, content detection, and training data governance are active research areas with significant societal implications.

Theoretical Understanding: While the SDE framework provides mathematical rigor, many practical observations — such as why classifier-free guidance works so well, or how diffusion models memorize versus generalize — lack complete theoretical explanations.

Frequently Asked Questions

What are diffusion models in machine learning?

Diffusion models are a class of deep generative models that learn to generate data by reversing a gradual noising process. They work by first adding Gaussian noise to training data step by step until the data becomes pure noise, then learning to reverse this process to generate new samples from noise. The two main formulations are Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs).

How do diffusion models compare to GANs and VAEs?

Diffusion models have surpassed GANs in image generation quality on benchmarks like ImageNet while offering more stable training without mode collapse. Unlike VAEs, diffusion models don’t require an explicit encoder and achieve sharper, higher-fidelity outputs. However, diffusion models are generally slower at inference due to their iterative sampling process, though techniques like DDIM and distillation are closing this gap.

What are the main applications of diffusion models?

Diffusion models power a wide range of applications including text-to-image generation (DALL·E, Stable Diffusion, Midjourney), video synthesis, 3D shape generation, audio and music creation, molecular design for drug discovery, medical image analysis, natural language processing, and time-series forecasting. Their versatility stems from the ability to model complex probability distributions across diverse data types.

What is the difference between DDPM and score-based generative models?

DDPMs (Denoising Diffusion Probabilistic Models) define a forward Markov chain that adds noise and learn the reverse chain to generate data, optimizing a variational lower bound on the log-likelihood. Score-based generative models instead estimate the score function (gradient of the log probability density) at each noise level and use Langevin dynamics for sampling. Both approaches were unified under the framework of Stochastic Differential Equations (Score SDEs), showing they are different perspectives of the same underlying process.

How can diffusion models be made faster at inference?

Several techniques accelerate diffusion model sampling: learning-free methods like DDIM use deterministic ODE solvers to skip steps; knowledge distillation trains a student model to match the teacher in fewer steps; truncated diffusion starts the reverse process from a partially noised image rather than pure noise; and optimized discretization selects the most informative timesteps. These approaches can reduce sampling from thousands of steps to as few as 1-4 steps while maintaining quality.
