TurboQuant Explained: How Google’s “Random Rotation” Trick Shrinks AI Memory by 6x
[Figure: KV Cache Compression: Recall vs. Memory. Needle-in-Haystack recall score vs. KV cache size (bits), tested on Llama-3.1-8B-Instruct with context up to 104k tokens. Best recall: 0.997, with TurboQuant matching full precision. At 3.5-bit, the KV cache is 6x smaller; at 4-bit, GPU speedup is 8x on an H100.]