brown rice porridge

April 23, 2025


i woke up in the middle of night sweating from a exam nightmare.

i talked to B about potential handover work for ucsf. 0

i also talked to B about brats challenge. i always learn so much when i talk to him. i should talk to him more.

i learned about the ZeRO in DeepSpeed. the idea is you reduce memory and copmute requirements of each GPU device by partitioning model training states (weights, gradients, optimizer) across avaiable devices (GPUs and CPUs)

and also about flow matching, which is a different training approach for diffusion models for better quality and efficiency.

started watching fast.ai lectures on diffusion to better understand from scratch what is going on.

core components

  • VAE (encoder) : compresses images into latents
  • Diffusion model (U-net): predicts and removes noise in latent space
  • VAE (decoder): converts denoised latents back to pixel images
  • CLIP: provides text guidance that influence the denoising process

how to train diffusion models?

  1. use pre-trained VAE (typically) and encode training data to obtain latents
  2. add noise to latents
    • for each training step, sample random timestep t
    • add corresponding noise to latent representation
    • amount of noise is determined by noise schedule
  3. train unet on noisy latents:
    • input: noisy latent vector + timestep embedding (+ CLIP)
    • output: exact noise added to clean latent
    • loss function: MSE (or perceptual loss)
  4. integrate clip for conditional generation
    • process text prompts through CLIP encoder to get embeddings
    • feed these embeddings to U-net via cross-attention layers
    • train model to predict noise while conditioned on these embeddings

generate?

  1. encode text with clip
  2. start with random noise in latent space
  3. gradually denoise using U-net guided by clip
  4. decode final latent representation to pixel

advance variants

  1. latent diffusion: operate in VAE latent space (standard)
  2. wavelet diffusion: wavelete decomposition instead of VAE (better for 3d)
  3. cascade models: chain multiple diffusion models at increasing resolution