brown rice porridge

i woke up in the middle of the night sweating from an exam nightmare.

i talked to B about potential handover work for ucsf.

i also talked to B about the BraTS challenge. i always learn so much when i talk to him. i should talk to him more.

i learned about ZeRO in DeepSpeed. the idea is to reduce the memory and compute requirements of each GPU by partitioning the model training states (weights, gradients, optimizer states) across the available devices (GPUs and CPUs).
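
to make the partitioning concrete for myself, here is a rough per-GPU memory estimate for the ZeRO stages. it assumes mixed-precision Adam with the byte counts used in the ZeRO paper's accounting (2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and buffers. `zero_memory_gb` is a name i made up for this sketch, not a DeepSpeed API.

```python
def zero_memory_gb(n_params, n_devices, stage):
    """Approximate per-GPU memory (GB) for model states under ZeRO.

    Assumes mixed-precision Adam: 2 bytes/param for fp16 weights, 2 for fp16
    gradients, 12 for fp32 optimizer states (master weights, momentum, variance).
    Activations and temporary buffers are ignored.
    """
    weights, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params
    if stage >= 1:   # stage 1: partition optimizer states
        optim /= n_devices
    if stage >= 2:   # stage 2: also partition gradients
        grads /= n_devices
    if stage >= 3:   # stage 3: also partition weights
        weights /= n_devices
    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # stage 0 means plain data parallelism (everything replicated)
    for stage in range(4):
        print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

for a 7.5B-parameter model on 64 GPUs this gives roughly 120, 31.4, 16.6 and 1.9 GB for stages 0 to 3, so most of the saving at stage 1 already comes from the optimizer states.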

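i also learned about flow matching, an alternative training objective for diffusion-style generative models: instead of predicting the added noise, the network regresses a velocity field that transports noise to data, aiming for better quality and efficiency.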
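
a minimal sketch of how i understand the flow matching training target, assuming the simplest variant: a straight-line path between a noise sample and a data sample (as in rectified flow). other path choices exist. the function names and the all-zeros "model" are placeholders of mine, not from any library.

```python
import numpy as np


def flow_matching_pair(x1, rng):
    """Build one training example for linear-path flow matching.

    x1: batch of clean data (or latents), shape (batch, dim).
    Returns (x_t, t, target_velocity).
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # one time in [0, 1) per example
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path noise -> data
    target = x1 - x0                          # constant velocity along that path
    return x_t, t, target


def flow_matching_loss(pred_velocity, target):
    """MSE between predicted and target velocity."""
    return float(np.mean((pred_velocity - target) ** 2))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))              # pretend data batch
x_t, t, target = flow_matching_pair(x1, rng)
# placeholder "model" that always predicts zero velocity
loss = flow_matching_loss(np.zeros_like(target), target)
```

the only change from the noise-prediction setup is the regression target: a velocity (data minus noise) instead of the noise itself.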

started watching fast.ai lectures on diffusion to better understand from scratch what is going on.

core components

VAE (encoder): compresses images into latents
Diffusion model (U-net): predicts and removes noise in latent space
VAE (decoder): converts denoised latents back to pixel images
CLIP: provides text guidance that influences the denoising process

how to train diffusion models?

use pre-trained VAE (typically) and encode training data to obtain latents
add noise to latents
- for each training step, sample random timestep t
- add corresponding noise to latent representation
- amount of noise is determined by noise schedule
train U-net on noisy latents:
- input: noisy latent vector + timestep embedding (+ CLIP text embedding)
- output: predicted noise (target: the exact noise added to the clean latent)
- loss function: MSE (or perceptual loss)
integrate CLIP for conditional generation
- process text prompts through CLIP encoder to get embeddings
- feed these embeddings to U-net via cross-attention layers
- train model to predict noise while conditioned on these embeddings
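
the training steps above can be sketched roughly like this. assumptions of mine: a DDPM-style linear noise schedule with 1000 steps, latents flattened to vectors, and a dummy function standing in for the U-net (it ignores the timestep and text embedding). a real setup would use a trained VAE, a CLIP text encoder and backprop through an actual network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear noise schedule (an assumption)
alpha_bars = np.cumprod(1.0 - betas)    # cumulative signal fraction per timestep


def add_noise(latents, t, noise):
    """Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise."""
    a = alpha_bars[t][:, None]
    return np.sqrt(a) * latents + np.sqrt(1.0 - a) * noise


def training_step(latents, eps_model, text_emb, rng):
    """One loss evaluation; a real loop would backprop this through the U-net."""
    t = rng.integers(0, T, size=latents.shape[0])   # random timestep per example
    noise = rng.standard_normal(latents.shape)
    noisy = add_noise(latents, t, noise)
    pred = eps_model(noisy, t, text_emb)            # U-net stand-in
    return float(np.mean((pred - noise) ** 2))      # MSE against the true noise


def dummy_eps_model(noisy, t, text_emb):
    """Placeholder, not a real U-net: always predicts zero noise."""
    return np.zeros_like(noisy)


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16))    # pretend these came from the VAE encoder
text_emb = rng.standard_normal((4, 32))   # pretend these came from CLIP
loss = training_step(latents, dummy_eps_model, text_emb, rng)
```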

how to generate?

encode text with CLIP
start with random noise in latent space
gradually denoise using U-net guided by the CLIP embeddings
decode the final latent representation back to pixels with the VAE decoder
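
a rough sketch of the generation loop, assuming a DDPM-style ancestral sampler with a linear schedule (one of several possible samplers). the U-net is replaced by a toy function that gives the exact noise prediction for a dataset whose clean latents are all zero, so the loop should end near zero. the CLIP embedding is random and unused by the toy model, and the VAE decode step is left out. all names here are mine.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (an assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def toy_eps_model(x, t, text_emb):
    """Stand-in for the U-net: exact noise prediction if every clean latent were 0."""
    return x / np.sqrt(1.0 - alpha_bars[t])


def sample(eps_model, text_emb, shape, rng):
    x = rng.standard_normal(shape)                 # start from pure noise
    for t in reversed(range(T)):
        eps = eps_model(x, t, text_emb)            # predict the noise in x
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # add fresh noise on every step except the last one
        z = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * z
    return x   # final latent; a VAE decoder would map this to pixels


rng = np.random.default_rng(0)
text_emb = rng.standard_normal((1, 32))   # pretend CLIP embedding of the prompt
latent = sample(toy_eps_model, text_emb, (1, 16), rng)
```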

advanced variants

latent diffusion: operate in VAE latent space (standard)
wavelet diffusion: wavelet decomposition instead of VAE (better for 3d)
cascade models: chain multiple diffusion models at increasing resolution