NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache (arXiv:2505.18231, May 23, 2025)
Gaussian Weight Sampling for Scalable, Efficient and Stable Pseudo-Quantization Training (arXiv:2505.11170, May 16, 2025)
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (arXiv:2407.02490, Jul 2, 2024)
Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs (arXiv:2503.00979, Mar 2, 2025)
TokenSkip: Controllable Chain-of-Thought Compression in LLMs (arXiv:2502.12067, Feb 17, 2025)
Think Before Recommend: Unleashing the Latent Reasoning Power for Sequential Recommendation (arXiv:2503.22675, Mar 28, 2025)