TurboQuant Explained: How Google’s “Random Rotation” Trick Shrinks AI Memory by 6x

[Feature image: rotated vectors compressed into KV cache memory blocks]

[Chart: KV cache compression, recall vs. memory. Needle-in-Haystack benchmark on Llama-3.1-8B-Instruct, context up to 104k tokens. Best recall 0.997 with TurboQuant, matching full precision; 6x smaller KV cache at 3.5-bit; 8x GPU speedup on H100 at 4-bit. Axes: Needle-in-Haystack recall score vs. KV cache size (bits).]

… Read more
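To make the headline idea concrete, here is a minimal sketch of rotation-then-quantization for a KV cache: apply a random orthogonal rotation to each cached vector so no single coordinate dominates, then quantize uniformly at low bit width. This is an illustrative toy under my own assumptions, not TurboQuant's actual algorithm; all function names and parameters below are hypothetical.

```python
# Illustrative sketch: random orthogonal rotation before low-bit KV quantization.
import torch

def random_rotation(dim: int, seed: int = 0) -> torch.Tensor:
    """Sample a random orthogonal matrix via QR decomposition."""
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(dim, dim, generator=g)
    q, _ = torch.linalg.qr(a)
    return q

def quantize_uniform(x: torch.Tensor, bits: int = 4):
    """Symmetric per-tensor uniform quantization to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = x.abs().max() / levels
    q = torch.clamp(torch.round(x / scale), -levels, levels)
    return q.to(torch.int8), scale

def compress_kv(kv: torch.Tensor, rot: torch.Tensor, bits: int = 4):
    """Rotate each KV vector (rows of a (tokens, dim) matrix), then quantize."""
    return quantize_uniform(kv @ rot, bits)

def decompress_kv(q: torch.Tensor, scale: torch.Tensor, rot: torch.Tensor):
    """Dequantize and undo the rotation (rot is orthogonal, so rot.T inverts it)."""
    return (q.float() * scale) @ rot.T

# Usage: a toy 128-dim KV cache of 1,000 tokens
kv = torch.randn(1000, 128)
rot = random_rotation(128)
q, scale = compress_kv(kv, rot, bits=4)
approx = decompress_kv(q, scale, rot)
print("mean reconstruction error:", (kv - approx).abs().mean().item())
```

The rotation costs nothing in information (it is invertible), but it spreads outlier values across coordinates, which is why a simple uniform quantizer loses far less accuracy than it would on the raw keys and values.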

Residual Connections Rethought: How Kimi’s ‘Attention Residuals’ Fixed a 10-Year-Old Transformer Flaw

[Feature image: attention-based depth routing in transformer layers]

[Diagram: standard residuals use fixed, uniform weights (Embedding h₁ → Layer 1 → Layer 2 → Layer 3; each layer only sees the accumulated sum), whereas attention residuals use learned, input-dependent weights (Layer 3 selectively attends to any earlier layer).]

Residual connections are one of those rare ideas in deep learning that became so successful, … Read more
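The diagram's contrast can be sketched in a few lines: a standard residual block only ever sees the running sum, while an "attention residual" block forms a learned, input-dependent mixture over every earlier hidden state. This is a minimal sketch under my own assumptions, not Kimi's actual architecture; the class names and weighting scheme are hypothetical.

```python
# Illustrative contrast between a standard residual stream and an
# attention-style residual that mixes all earlier hidden states.
import torch
import torch.nn as nn

class StandardResidualBlock(nn.Module):
    """h_{l+1} = h_l + f(h_l): each layer only sees the accumulated sum."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, h):
        return h + self.f(h)

class AttentionResidualBlock(nn.Module):
    """Mixes all earlier hidden states with learned, input-dependent weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, history):
        # history: list of (batch, dim) hidden states from the embedding onward
        stack = torch.stack(history, dim=1)             # (batch, layers, dim)
        q = self.query(history[-1]).unsqueeze(1)        # (batch, 1, dim)
        scores = (q * stack).sum(-1) / stack.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=1)          # attention over depth
        mixed = (weights.unsqueeze(-1) * stack).sum(1)  # weighted combination
        return mixed + self.f(mixed)

# Usage: three layers, each attending over the full history rather than a fixed sum
dim, batch = 64, 2
layers = nn.ModuleList(AttentionResidualBlock(dim) for _ in range(3))
history = [torch.randn(batch, dim)]   # the embedding h1
for layer in layers:
    history.append(layer(history))
print(history[-1].shape)              # torch.Size([2, 64])
```

The key difference is that the depth-wise weights are computed from the current input, so a late layer can emphasize an early representation when that is more useful than the accumulated sum.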