Residual Connections Rethought: How Kimi’s ‘Attention Residuals’ Fixed a 10-Year-Old Transformer Flaw
[Figure: two panels. Left, "Standard residuals" (fixed, uniform weights): Embedding h₁, Layer 1, Layer 2, Layer 3; each layer only sees the accumulated sum. Right, "Attention Residuals" (learned, input-dependent): Embedding h₁, Layer 1, Layer 2, Layer 3; Layer 3 selectively attends to any earlier layer.]

Residual connections are one of those rare ideas in deep learning that became so successful, …
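The contrast in the figure can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the method from the Kimi work: the function names, the per-layer `query` vector, and the scaled dot-product scoring are all hypothetical choices made here to show the difference between a fixed sum and a learned, input-dependent mix of earlier layer outputs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def standard_residual_input(history):
    """Standard residuals: the next layer sees the plain sum of the
    embedding and every earlier layer output (fixed weight 1 each)."""
    return np.sum(history, axis=0)

def attention_residual_input(history, query):
    """Hypothetical attention-style alternative (an assumption, not the
    published formulation): score each earlier output against a learned
    query vector, softmax the scores, and return the weighted mix."""
    H = np.stack(history)                      # shape (n_sources, d)
    scores = H @ query / np.sqrt(H.shape[1])   # one score per earlier output
    weights = softmax(scores)                  # depend on the outputs themselves
    return weights @ H, weights

# Toy history: embedding h1 plus two layer outputs, each of width 4.
history = [np.ones(4), 2 * np.ones(4), 3 * np.ones(4)]
summed = standard_residual_input(history)
mixed, weights = attention_residual_input(history, query=np.ones(4))
```

In the first function every earlier output contributes with the same weight regardless of its content. In the second, the weights change whenever the earlier outputs change, which is the sense in which the mix is input-dependent; in a real model the query would be a trained parameter rather than a constant.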