BENEDICT NEO梁耀恩

sdpa

January 12, 2026



baking models today; learned the options for attn_implementation:

  1. Eager: computes the full softmax(QKᵀ/√d) matrix and stores it in GPU memory, causing O(N²) memory
  2. SDPA (scaled dot-product attention): picks the best kernel for your hardware; if flash attention is available, it picks that.
  3. Flash attention: breaks the Q, K, V matrices into tiles that fit in SRAM (on-chip memory), computes attention for each tile, and accumulates the result without ever materializing the full N×N attention matrix. O(N) memory
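to make the eager case concrete, here's a minimal NumPy sketch of what "eager" attention does (the function name and toy shapes are mine, not from any library): it builds the full N×N score matrix up front, which is exactly the O(N²) memory that SDPA/flash attention avoid.

```python
import numpy as np

def eager_attention(Q, K, V):
    """Naive 'eager' attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) -- the O(N^2) memory cost
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d_v) output, N x N matrix discarded

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
out = eager_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

in practice you don't write this yourself; in Hugging Face transformers you just pass `attn_implementation="eager" | "sdpa" | "flash_attention_2"` to `from_pretrained`, and SDPA dispatches to the best kernel for you.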

mood: 3/5

wake: 10:00
sleep: 02:00

meals:

  • breakfast: tofu, spinach, smoked salmon, pumpkin seeds, yoghurt
  • lunch: hu tieu kho, shrimp, pork belly & crispy shrimp dumpling soup
  • dinner: leftovers

til:

  • sdpa attention
  • continuous batching, prefill-decode disaggregation, kv cache, prefix aware and expert layer routing for llm inference
  • when model training, imagine you’re trying to steer a car to stay in the center of a lane.
  • Gradient = the steering instruction (“turn left/right”)
  • Grad norm = how hard you’re yanking the wheel overall
  • Hủ tiếu = noodle, khô = dry
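the steering analogy above can be made concrete: the grad norm is just the L2 norm taken over every gradient element across all parameters. a minimal pure-Python sketch (function name and toy gradients are mine):

```python
import math

def global_grad_norm(grads):
    """Grad norm = sqrt of the sum of squares of every gradient element
    across all parameters -- 'how hard you're yanking the wheel overall'."""
    return math.sqrt(sum(g * g for param in grads for g in param))

# toy per-parameter gradients (the "steering instructions")
grads = [[3.0, 4.0], [0.0]]
print(global_grad_norm(grads))  # 5.0
```

this is the same quantity trainers log as grad_norm, and what gradient clipping (e.g. torch.nn.utils.clip_grad_norm_) caps so you never yank the wheel too hard in one step.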

links:

grateful for: noodles