MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

A post-training approach for diffusion models that leverages slowed interpolation mixture to reduce the training–testing discrepancy (exposure bias).

Overview

This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem in diffusion models. During training, the input to the prediction network at a given timestep is the corresponding ground-truth noisy data, an interpolation of the noise and the data; during testing, the input is instead the generated noisy data. We present a novel post-training approach, named MixFlow, that improves generation performance.

Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the matched ground-truth timestep lags behind (is slower than) the sampling timestep.
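
To make the observation concrete, it can be written as follows (the notation is introduced here only for illustration and is an assumption about the parameterization: \(x_0\) denotes the noise, \(x_1\) the data, \(x_s = s\,x_1 + (1-s)\,x_0\) the ground-truth interpolation at timestep \(s\), and \(\hat{x}_t\) the generated noisy data at sampling timestep \(t\)):

\[
m_t \;=\; \arg\min_{s \in [0,1]} \bigl\| \hat{x}_t - x_s \bigr\|, \qquad \text{Slow Flow: } m_t < t \ \text{(the matched timestep lies on the higher-noise side)}.
\]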

MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, to post-train the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 FID (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 FID (with guidance) at 512 x 512.

Performance overview.

Slow Flow phenomenon

Slow Flow illustration

Illustration of (1) the Slow Flow phenomenon during the sampling process: the timestep (y-axis) corresponding to the ground-truth noisy data nearest to the generated noisy data at sampling timestep \( t \) (x-axis) is slower (has higher noise), i.e., the shaded area lies under the line \(x=y\); and (2) the effectiveness of MixFlow training: the range of slowed timesteps after MixFlow training is smaller and closer to the sampling timesteps than under standard training, indicating that MixFlow training effectively alleviates the training-testing discrepancy.

MixFlow

MixFlow diagram

MixFlow samples the training timestep \(t\) from a Beta distribution \(\operatorname{Beta}(2,1)\), and samples the slowed timestep \(m_t\) from a uniform distribution \(\mathcal{U}[(1-\gamma)t, t]\). The prediction network is post-trained on the slowed interpolation mixture at each training timestep to reduce the training-testing discrepancy.
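
In the same illustrative notation as above (an assumed linear-interpolation parameterization, not necessarily the paper's exact one), one MixFlow training example is drawn as

\[
t \sim \operatorname{Beta}(2,1), \qquad m_t \mid t \sim \mathcal{U}\bigl[(1-\gamma)t,\; t\bigr], \qquad x_{m_t} = m_t\,x_1 + (1 - m_t)\,x_0,
\]

and the prediction network is evaluated on the slowed interpolation \(x_{m_t}\) while keeping the original training timestep \(t\) and its regression target, as in the code changes below.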

Easy Implementation

Only 5 lines of code are modified to integrate MixFlow into the RAE implementation. Concretely, we update two functions in transport.py, shown below: (a)/(c) for RAE and (b)/(d) for MixFlow + RAE.

Sample function

(a) RAE
def sample(self, x1):
    """Sampling x0 & t based on shape of x1 (if needed)
      Args:
        x1 - data point; [batch, *dim]
    """
    
    x0 = th.randn_like(x1)
    dist_options = self.time_dist_type.split("_")
    t0, t1 = self.check_interval(self.train_eps, self.sample_eps)
    if dist_options[0] == "uniform":
        t = th.rand((x1.shape[0],)) * (t1 - t0) + t0
    # ...


    t = t.to(x1)
    t = self.time_dist_shift * t / (1 + (self.time_dist_shift - 1) * t)
    return t, x0, x1
(b) MixFlow
def sample(self, x1):
    """Sampling x0 & t based on shape of x1 (if needed)
      Args:
        x1 - data point; [batch, *dim]
    """
    
    x0 = th.randn_like(x1)
    dist_options = self.time_dist_type.split("_")
    t0, t1 = self.check_interval(self.train_eps, self.sample_eps)
    if dist_options[0] == "uniform":
        t = th.rand((x1.shape[0],)) * (t1 - t0) + t0
    # ...
    # Sample t from Beta(2,1)
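    # (inverse-CDF sampling: for u ~ U(0,1), th.sqrt(u) follows Beta(2,1); the "1 - ..." maps it into this file's time parameterization)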
    t = 1 - th.sqrt(t) 
    t = t.to(x1)
    t = self.time_dist_shift * t / (1 + (self.time_dist_shift - 1) * t)
    return t, x0, x1

Training Losses function

(c) RAE
def training_losses(self, model, x1, model_kwargs=None):
    """Loss for training the score model
    Args:
    - model: backbone model; could be score, noise, or velocity
    - x1: datapoint
    - model_kwargs: additional arguments for the model
    """
    if model_kwargs == None:
        model_kwargs = {}
    
    t, x0, x1 = self.sample(x1)

    t, xt, ut = self.path_sampler.plan(t, x0, x1)




    model_output = model(xt, t, **model_kwargs)
    B, *_, C = xt.shape
    assert model_output.size() == (B, *xt.size()[1:-1], C)

    terms = {}
    terms['pred'] = model_output
    if self.model_type == ModelType.VELOCITY:
        terms['loss'] = mean_flat(((model_output - ut) ** 2))
    else: 
        # ...
            
    return terms
(d) MixFlow
def training_losses(self, model, x1, gamma=0.4, model_kwargs=None):
    """Loss for training the score model
    Args:
    - model: backbone model; could be score, noise, or velocity
    - x1: datapoint
    - model_kwargs: additional arguments for the model
    """
    if model_kwargs == None:
        model_kwargs = {}
    
    t, x0, x1 = self.sample(x1)
    # Discard the standard interpolation xt (keep t and the target ut)
    t, _, ut = self.path_sampler.plan(t, x0, x1)
    # Sample slowed timestep mt
    mt = t + th.rand_like(t) * gamma * (1 - t)
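    # mt is uniform over [t, t + gamma * (1 - t)), so gamma controls the width of the slowed-timestep range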
    # Compute slowed interpolation
    _, xt, __ = self.path_sampler.plan(mt, x0, x1)
    model_output = model(xt, t, **model_kwargs)
    B, *_, C = xt.shape
    assert model_output.size() == (B, *xt.size()[1:-1], C)

    terms = {}
    terms['pred'] = model_output
    if self.model_type == ModelType.VELOCITY:
        terms['loss'] = mean_flat(((model_output - ut) ** 2))
    else: 
        # ...
            
    return terms
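
For orientation, below is a minimal sketch of how the modified training_losses could be called in a post-training step. The transport object, velocity model, and optimizer are placeholders assumed for illustration; this is not the repository's actual training script.

def train_step(transport, model, x1, optimizer, gamma=0.4, model_kwargs=None):
    """One MixFlow post-training step (illustrative sketch, not the official script)."""
    # training_losses samples t and mt internally and builds the slowed interpolation
    terms = transport.training_losses(model, x1, gamma=gamma, model_kwargs=model_kwargs)
    loss = terms['loss'].mean()  # mean_flat returns a per-sample loss; average over the batch
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()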

Results

Visualization

MixFlow generates high-fidelity images with fine-grained details and textures.

Visualization results generated by MixFlow

Class-conditional image generation

Post-training prior SOTA class-conditional models (SiT, REPA, RAE) with MixFlow improves ImageNet gFID to 1.43 (no guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 / 1.10 at 512 x 512.

Class-conditional image generation results

Various sampling steps

MixFlow yields larger gains when the number of sampling steps is small, where the Slow Flow phenomenon is more pronounced and the sampler approximation is coarser, so alleviating the training-testing discrepancy matters more.

Results across various sampling steps

Text-to-image generation

Applying MixFlow to text-to-image diffusion models also boosts generation quality, showing that slowed-interpolation training generalizes beyond the class-conditional setting.

Text-to-image generation results

Citation

@article{mixflow2025,
  title={MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture},
  author={Hui Li and Jiayue Lyu and Fu-yun Wang and Kaihui Cheng and Siyu Zhu and Jingdong Wang},
  journal={arXiv preprint arXiv:2512.19311},
  year={2025}
}