1. Foreword

Diffusion models have emerged in recent years as a powerful approach to generating images and videos, offering clear advantages over earlier generative models such as GANs in sample quality and temporal consistency [1]. This article provides a straightforward introduction to the core principles of diffusion models and explores their potential applications in facial interaction generation, with a focus on project-specific implementations.

2. Theoretical Foundations

A diffusion model consists of two main processes [1, 5]: a forward diffusion process and a reverse denoising process. The forward process incrementally adds noise to an image until it becomes pure Gaussian noise; the reverse process then learns to remove that noise step by step, recovering a clean image.

2.1. Forward Diffusion Process

The forward diffusion process is a Markov chain that starts from the clean image at time step $t=0$ and progressively adds noise until, at the final time step $t=T$, the image is indistinguishable from pure noise. The relationship between the image at time step $t$, denoted $X_t$, and the image at the previous time step, $X_{t-1}$, is given by:

$$
q(X_t \mid X_{t-1}) = \mathcal{N}(X_t; \sqrt{1 - \beta_t} X_{t-1}, \beta_t \mathbf{I}) \tag{1},
$$
where $\mathcal{N}$ represents a normal distribution, $\beta_t$ is the amount of noise added at each time step, and $\mathbf{I}$ is the identity matrix.
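
As a concrete illustration, the following NumPy sketch applies Eq. (1) step by step; the 8×8 array and the linear $\beta$ schedule are placeholder choices for illustration only, not the settings of any particular paper.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One forward diffusion step q(X_t | X_{t-1}) from Eq. (1):
    X_t = sqrt(1 - beta_t) * X_{t-1} + sqrt(beta_t) * noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative linear noise schedule
x = np.ones((8, 8))                  # stands in for a clean image X_0
for beta_t in betas:                 # after T steps, x is close to pure Gaussian noise
    x = forward_step(x, beta_t, rng)
```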

This Markov chain is illustrated in the seminal DDPM paper [2]. To understand the entire process, we can write the joint distribution of all noisy states $X_1, \dots, X_T$ conditioned on the clean image $X_0$:

$$
q(X_{1:T} \mid X_0) = \prod_{t=1}^T q(X_t \mid X_{t-1}) \tag{2}.
$$

To simplify sampling, one can derive $X_t$ directly from $X_0$ by introducing the parameter $\alpha_t = 1 - \beta_t$ and its cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Thus, $X_t$ can be expressed as:

$$
q(X_t \mid X_0) = \mathcal{N}(X_t; \sqrt{\bar{\alpha}_t} X_0, (1 - \bar{\alpha}_t) \mathbf{I}) \tag{3}.
$$

This formula indicates that, given the initial image $X_0$, the image at time step $t$ can be sampled in a single step as $X_t = \sqrt{\bar{\alpha}_t}\, X_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, i.e. a rescaled $X_0$ plus time-dependent Gaussian noise.
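
Because of Eq. (3), $X_t$ can therefore be drawn in one step instead of looping over $t$ individual noise additions. A minimal NumPy sketch (the schedule and toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)       # same illustrative schedule as above
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # alpha_bar_t = product of alpha_s for s <= t

def q_sample(x0, t):
    """Sample X_t directly from X_0 using Eq. (3)."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = rng.standard_normal((8, 8))         # toy clean "image"
x_500 = q_sample(x0, 500)                # jump straight to time step t = 500
```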

2.2. Reverse Denoising Process

In the reverse denoising process, we start from a fully noisy image $X_T$ and gradually denoise it to recover a clean image $X_0$. This is also known as the “generation process”: we aim to learn a parameterised model $p_\theta(X_{t-1} \mid X_t)$ that denoises the image iteratively.

The fundamental equation of the reverse process is:

$$
p_\theta(X_{t-1} \mid X_t) = \mathcal{N}(X_{t-1}; \mu_\theta(X_t, t), \Sigma_\theta(X_t, t)) \tag{4}.
$$

Here, $\mu_\theta(X_t, t)$ is the denoising mean predicted by a neural network, such as a U-Net [3], and $\Sigma_\theta(X_t, t)$ is the denoising variance. In practical implementations, $\Sigma_\theta(X_t, t)$ is often fixed to $\sigma_t^2 \mathbf{I}$ (for example $\sigma_t^2 = \beta_t$) to simplify the model’s learning process.

To learn the denoising mean $\mu_\theta(X_t, t)$, we optimise the model’s parameters using the variational lower bound (VLB). Specifically, our objective is to maximise the data likelihood $p_\theta(X_0)$, which is equivalent to minimising an upper bound on its negative log-likelihood:

$$
-\log p_\theta(X_0) \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(X_{0:T})}{q(X_{1:T} \mid X_0)}\right] = L_{\text{VLB}} \tag{5}.
$$

By expanding the variational lower bound term by term, we obtain the following loss function:

$$
L_{\text{VLB}} = \mathbb{E}_q\!\left[ D_{\mathrm{KL}}\big(q(X_T \mid X_0) \,\|\, p(X_T)\big) + \sum_{t=2}^{T} D_{\mathrm{KL}}\big(q(X_{t-1} \mid X_t, X_0) \,\|\, p_\theta(X_{t-1} \mid X_t)\big) - \log p_\theta(X_0 \mid X_1) \right] \tag{6}.
$$

The Kullback-Leibler (KL) divergence terms in this loss measure, at each time step, the difference between the true denoising posterior $q(X_{t-1} \mid X_t, X_0)$ and the model’s prediction $p_\theta(X_{t-1} \mid X_t)$. By minimising this difference, the model learns to denoise progressively, which ultimately yields high-quality images.
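
In practice, Ho et al. [2] show that minimising these KL terms reduces, up to a weighting factor, to a simple mean squared error between the added noise and the noise predicted by the network. The PyTorch sketch below illustrates one training step with a small stand-in network in place of a full U-Net; the architecture, the random toy batch, and the hyperparameters are placeholders, not the settings of any particular system.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                   # illustrative schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class TinyEpsNet(nn.Module):
    """A small stand-in for the U-Net noise predictor eps_theta(X_t, t)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)          # crude scalar time feature
        return self.net(torch.cat([x_t, t_feat], dim=-1))

model = TinyEpsNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.randn(32, 64)                                # toy batch standing in for clean images
t = torch.randint(0, T, (32,))                          # a random time step for each sample
eps = torch.randn_like(x0)                              # the true noise
# Forward-diffuse x0 to x_t in one step, as in Eq. (3).
x_t = alpha_bars[t].sqrt().unsqueeze(-1) * x0 + (1 - alpha_bars[t]).sqrt().unsqueeze(-1) * eps

loss = ((eps - model(x_t, t)) ** 2).mean()              # simplified noise-prediction objective
loss.backward()
opt.step()
```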

2.3. Noise Prediction and Sampling Process

In the practical denoising process, we use a neural network $\epsilon_\theta(X_t, t)$ to predict the noise $\epsilon$ in the image. The sampling formula for the reverse process is given by:

$$
X_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( X_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(X_t, t) \right) + \sqrt{\beta_t} \mathbf{z} \tag{7},
$$
where $\mathbf{z}$ represents noise sampled from a standard normal distribution (with $\mathbf{z} = \mathbf{0}$ at the final step).

To further understand this equation, we can break the denoising process into two steps:

  1. Noise Prediction: The model learns to predict the noise term $\epsilon_\theta(X_t, t)$ in the input image $X_t$, and uses the predicted noise to correct the image, gradually approaching a noiseless version.

  2. Denoising Update: At each time step $t$, the model computes the denoising mean $\mu_\theta(X_t, t)$ and adds a scaled random noise term $\sqrt{\beta_t}\,\mathbf{z}$ to obtain $X_{t-1}$, gradually reducing the noise until an approximation of the original image $X_0$ is reached (see the sketch below).
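
Putting Eq. (7) into code, generation is a loop from the final time step down to the first. The sketch below assumes a trained noise predictor `eps_model` with the same call signature as the stand-in network above; the schedule is again illustrative, and, following [2], no noise is added at the final step.

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas):
    """Reverse (generation) loop implementing Eq. (7)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # X_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)                      # predicted noise eps_theta(X_t, t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z                       # X_{t-1} from Eq. (7)
    return x                                                 # an approximation of X_0

# Example usage with the stand-in network from the previous sketch:
# x0_hat = sample(model, (32, 64), torch.linspace(1e-4, 0.02, 1000))
```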

2.4. Time Encoding in the Denoising Process

In diffusion models, time encoding is a crucial component: because a single network is shared across all time steps, the encoding tells the model which step of the denoising process it is currently operating on. Time encoding is typically implemented using a combination of sine and cosine functions:

$$
\text{PE}(\tau, 2i) = \sin\!\left(\frac{\tau}{10000^{2i/d}}\right),
\quad
\text{PE}(\tau, 2i+1) = \cos\!\left(\frac{\tau}{10000^{2i/d}}\right) \tag{8},
$$
where $\tau$ is the time step, $i$ indexes the sine and cosine pairs, and $d$ is the embedding dimension.

This encoding scheme effectively captures periodic variations between time steps, allowing the model to maintain temporal consistency and coherence when generating sequential data, such as video frames.
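
A minimal implementation of Eq. (8); the embedding dimension of 128 is an arbitrary example value:

```python
import numpy as np

def timestep_embedding(t, d=128):
    """Sinusoidal time encoding from Eq. (8): sin on even indices, cos on odd indices."""
    i = np.arange(d // 2)
    freqs = 1.0 / (10000 ** (2 * i / d))   # one frequency per sine/cosine pair
    angles = t * freqs
    emb = np.empty(d)
    emb[0::2] = np.sin(angles)
    emb[1::2] = np.cos(angles)
    return emb

emb_t50 = timestep_embedding(50)   # one 128-dim vector fed to the denoising network
```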

3. Facial Interaction (Project Implementation)

The following section describes specific extensions implemented in our project, which may not be applicable to all diffusion models and interaction generation tasks but are unique to our approach.

In facial interaction generation, diffusion models need to handle multiple characters within a video sequence while ensuring synchronisation between each character’s facial expressions and corresponding audio. To achieve this, we incorporated the following key modules into our diffusion model:

Reference Frame and Previous Frames: The frame generated at each step depends on a reference frame and several previously generated frames, which ensures visual consistency across the video. Let $\mathcal{F}_t$ be the current noisy frame, $\mathcal{F}_r$ the reference frame, and $\{\mathcal{F}_{\text{prev}}\}$ the set of previous frames. The generation process can then be written schematically as:

$$
\mathcal{F}_{t-1} = f\big(\mathcal{F}_t, \mathcal{F}_r, \{\mathcal{F}_{\text{prev}}\}, t\big) + \epsilon,
$$

where $f$ is the learned denoising function and $\epsilon$ is the noise term. In our project, 68 facial landmarks are extracted for each face.
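
One common way to realise this kind of conditioning is to concatenate the reference frame and the previous frames with the noisy frame along the channel dimension before they enter the denoising network. The PyTorch sketch below is a hypothetical illustration of that idea rather than the exact interface of our model; the `denoiser` module, the channel counts, and the number of previous frames are placeholders.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Hypothetical wrapper: condition the noise predictor on a reference frame and
    K previous frames via channel-wise concatenation (shapes are illustrative only)."""

    def __init__(self, denoiser, channels=3, k_prev=2):
        super().__init__()
        # Fuse noisy frame + reference frame + K previous frames back down to `channels`.
        self.fuse = nn.Conv2d(channels * (2 + k_prev), channels, kernel_size=1)
        self.denoiser = denoiser

    def forward(self, noisy_frame, ref_frame, prev_frames, t):
        # noisy_frame, ref_frame: (B, C, H, W); prev_frames: list of K tensors (B, C, H, W)
        cond = torch.cat([noisy_frame, ref_frame, *prev_frames], dim=1)
        return self.denoiser(self.fuse(cond), t)   # predicted noise for the current frame
```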

Interactive Guidance Module (IGM): To generate interactive videos, the IGM dynamically adjusts the generation parameters for different characters to ensure natural and synchronised interactions. This module optimises the generation process by comparing the dialogue content, facial expression synchronisation, and audio-video coherence between characters.

Loss Function Application: To ensure the quality of the generated results, multiple loss functions were introduced during the training of the diffusion model, including mean squared error (MSE) loss, lip-sync loss, and variational lower bound loss. By minimising the lip-sync loss, the model ensures that the lip movements in the generated video are consistent with the audio content, resulting in a more realistic video.
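
As a schematic illustration of how these terms could be combined during training, the weighted sum below is hypothetical: the weights and the precomputed `sync_loss` and `vlb_loss` terms are placeholders rather than the values used in our project.

```python
def training_loss(eps, eps_hat, sync_loss, vlb_loss, w_sync=0.1, w_vlb=0.001):
    """Hypothetical weighted combination of the three loss terms (weights are illustrative).
    eps, eps_hat: tensors of true and predicted noise; sync_loss, vlb_loss: scalar tensors."""
    mse = ((eps - eps_hat) ** 2).mean()            # noise-prediction MSE
    return mse + w_sync * sync_loss + w_vlb * vlb_loss
```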

4. Conclusion and Future Prospects

By delving into the mathematical principles and derivations of diffusion models, we can see their powerful capabilities for generating images and videos. Diffusion models operate by progressively adding and then removing noise through the forward diffusion and reverse denoising processes, eventually producing high-quality images. Building on the idea presented in [4], which draws an analogy between diffusion models and evolutionary algorithms, there remains significant potential for further exploring the information contained within diffusion models.

In the future, diffusion models are expected to play a greater role in generating multimodal content, particularly in areas such as virtual assistants, virtual social interactions, and remote education, which require interactive content generation. By combining them with large language models, diffusion models can better understand and respond to user input, generating more intelligent and personalised content. Moreover, these models hold promise for advancing research on agents and Markov chains.

Copyright Notice
This article, except for the referenced content below, is the original work of Junhao. The author retains the exclusive rights to its final interpretation. If there are any issues regarding copyright infringement, please contact me for removal. Reproduction or distribution of this content without my explicit permission is prohibited.

5. References

[1]. Stypułkowski, M., Vougioukas, K., He, S., Zięba, M., Petridis, S. and Pantic, M., 2024. Diffused heads: Diffusion models beat GANs on talking-face generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 5091-5100).

[2]. Ho, J., Jain, A. and Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, pp.6840-6851.

[3]. Ronneberger, O., Fischer, P. and Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18 (pp. 234-241). Springer International Publishing.

[4]. Zhang, Y., Hartl, B., Hazan, H. and Levin, M., 2024. Diffusion Models are Evolutionary Algorithms. arXiv preprint arXiv:2410.02543.

[5]. Chen, M., Mei, S., Fan, J. and Wang, M., 2024. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv preprint arXiv:2404.07771.