
source: unknown
baking models today, learned the choices for attn_implementation:
- Eager: computes the full N×N attention matrix and stores it in GPU memory, causing O(N²) memory usage
- SDPA: scaled dot product attention. PyTorch picks the best kernel for your hardware; if flash attention is available, it picks that.
- Flash attention: breaks the Q, K, V matrices into tiles that fit in SRAM (on-chip memory), computes attention for each tile, and accumulates the result without ever materializing the full N×N attention matrix. O(N) memory
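a toy sketch of the tiling idea in plain Python (not the real kernel — single query, no batching, tile size and helper names are my own), showing how a running max/normalizer/accumulator (online softmax) gives the same answer as materializing all the scores:

```python
import math

def naive_attention(q, K, V):
    # reference: materialize all scores, softmax, then weight the values
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    d = len(V[0])
    return [sum(e * v[j] for e, v in zip(exps, V)) / z for j in range(d)]

def tiled_attention(q, K, V, tile=2):
    # flash-attention style: walk K/V in tiles, keeping a running max (m),
    # running normalizer (z), and running weighted sum (acc); the full
    # score vector is never held in memory at once
    d = len(V[0])
    m, z = float("-inf"), 0.0
    acc = [0.0] * d
    for start in range(0, len(K), tile):
        K_t, V_t = K[start:start + tile], V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in K_t]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z *= scale                      # rescale old normalizer to the new max
        acc = [a * scale for a in acc]  # rescale old accumulator too
        for s, v in zip(scores, V_t):
            e = math.exp(s - m_new)
            z += e
            acc = [a + e * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

both paths agree to floating-point precision, which is the whole trick: the rescaling step lets each tile be folded in without revisiting earlier ones.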
mood: 3/5
wake: 10:00
sleep: 02:00
meals:
- breakfast: tofu, spinach, smoked salmon, pumpkin seeds, yoghurt
- lunch: hu tieu kho, shrimp, pork belly & crispy shrimp dumpling soup
- dinner: leftovers
til:
- sdpa attention
- continuous batching, prefill-decode disaggregation, kv cache, prefix-aware and expert-layer routing for llm inference
- when training a model, imagine you're trying to steer a car to stay in the center of a lane.
- Gradient = the steering instruction (“turn left/right”)
- Grad norm = how hard you’re yanking the wheel overall
- Hủ tiếu = noodle, khô = dry
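the steering analogy can be made concrete with a toy example (my own made-up loss, a linear fit with squared error — function names are mine, not from any library): the gradient says which way to turn each parameter, the grad norm measures how hard the overall yank is, and clipping caps the yank while keeping the direction:

```python
import math

def grad(params, xs, ys):
    # gradient of mean squared error for y = w*x + b: the "turn left/right" signal
    w, b = params
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return [dw, db]

def grad_norm(g):
    # "how hard you're yanking the wheel": L2 norm over all gradient entries
    return math.sqrt(sum(gi * gi for gi in g))

def clip(g, max_norm=1.0):
    # gradient clipping: shrink the yank to max_norm but keep the direction
    n = grad_norm(g)
    return g if n <= max_norm else [gi * max_norm / n for gi in g]
```

a spiking grad norm is the training-log symptom of wheel-yanking; clipping is the usual seatbelt.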
links:
- timecapsule LLM
- Making Software
- If you have multiple interests, do not waste the next 2-3 years
- build an idea museum
- write 1 idea 1000 different ways
grateful for: noodles