BENEDICT NEO梁耀恩

sdpa

January 12, 2026



baking models today; learned the options for attn_implementation:

  1. Eager: computes the full softmax(QKᵀ/√d) matrix and stores it in GPU memory, causing O(N²) memory
  2. SDPA (scaled dot-product attention): picks the best kernel for your hardware; if flash attention is available, it picks that.
  3. Flash attention: breaks the Q, K, V matrices into tiles that fit in SRAM (on-chip memory), computes attention for each tile, and accumulates the result without ever materializing the full N×N attention matrix. O(N) memory
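to make the eager case concrete, here's a minimal NumPy sketch of what "eager" attention does (the function name and toy shapes are mine, not from any library): it builds the full N×N score matrix up front, which is exactly the O(N²) memory that SDPA/flash attention avoid.

```python
import numpy as np

def eager_attention(Q, K, V):
    """Naive 'eager' attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) -- the O(N^2) memory cost
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (N, d_v) output, N x N matrix discarded

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
out = eager_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

in practice you don't write this yourself; in Hugging Face transformers you just pass `attn_implementation="eager" | "sdpa" | "flash_attention_2"` to `from_pretrained`, and SDPA dispatches to the best kernel for you.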

mood: 3/5

wake: 10:00
sleep: 02:00

meals:

  • breakfast: tofu, spinach, smoked salmon, pumpkin seeds, yoghurt
  • lunch: hu tieu kho, shrimp, pork belly & crispy shrimp dumpling soup
  • dinner: leftovers

til:

  • sdpa attention
  • continuous batching, prefill-decode disaggregation, kv cache, prefix aware and expert layer routing for llm inference
  • when model training, imagine you’re trying to steer a car to stay in the center of a lane.
  • Gradient = the steering instruction (“turn left/right”)
  • Grad norm = how hard you’re yanking the wheel overall
  • Hủ tiếu = noodle, khô = dry
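the steering analogy above can be made concrete: the grad norm is just the L2 norm taken over every gradient element across all parameters. a minimal pure-Python sketch (function name and toy gradients are mine):

```python
import math

def global_grad_norm(grads):
    """Grad norm = sqrt of the sum of squares of every gradient element
    across all parameters -- 'how hard you're yanking the wheel overall'."""
    return math.sqrt(sum(g * g for param in grads for g in param))

# toy per-parameter gradients (the "steering instructions")
grads = [[3.0, 4.0], [0.0]]
print(global_grad_norm(grads))  # 5.0
```

this is the same quantity trainers log as grad_norm, and what gradient clipping (e.g. torch.nn.utils.clip_grad_norm_) caps so you never yank the wheel too hard in one step.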

links:

grateful for: noodles