ref: support causal mask mode in flash attention #101

Closed
Connor1996 wants to merge 2 commits into skyzh:main from Connor1996:codex/remove-flash-attention-debug-fill
Conversation

@Connor1996 (Collaborator)

Summary

  • Add the flash attention and quantized matmul extension implementations for the week 2 tasks
  • Wire the attention path updates for causal-mask-capable flash attention
  • Remove a leftover debug sentinel write (`-233.0`) from `src/extensions_ref/src/flash_attention.metal`
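
For reference, causal-mask mode means each query position may attend only to itself and earlier positions. A minimal NumPy sketch of that semantics (an illustration of the masking behavior, not the Metal kernel in this PR; the name `causal_attention` is hypothetical):

```python
import numpy as np

def causal_attention(q, k, v):
    """Naive scaled dot-product attention with a causal mask.

    Reference semantics only: position i attends to positions j <= i.
    A real flash attention kernel computes the same result in tiles
    with an online softmax, never materializing the full score matrix.
    """
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask strictly upper-triangular entries (j > i) to -inf so that
    # softmax assigns them zero weight.
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the unmasked positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

The `no_mask` test variant mentioned under Testing would correspond to skipping the `np.where` step.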

Testing

  • `pdm run test --week 2 --day 4 -- -k 'task_2_flash_attention_cpu_small and no_mask'`
  • `pdm run test --week 2 --day 4 -- -k task_3_flash_attention_gpu`
    • runtime error: `[metal::Device] Unable to load kernel flash_attention`
    • the causal case also hits `ValueError: Invalid type str received in array initialization.` in `src/tiny_llm/attention.py:149`

Signed-off-by: Connor1996 <zbk602423539@gmail.com>
@Connor1996 Connor1996 closed this Mar 3, 2026