
Latest News

  • May 12 2025: Users can now control whether, and which, operations on tensors held in RAM are offloaded to the GPU. See PR 405
  • May 12 2025: Compatibility issues with mainline llama.cpp GGUFs for DeepSeek models with MLA enabled were resolved in PR 394. The prompt processing performance lost when using llama.cpp-style MLA GGUFs was recovered in PR 409.
  • May 11 2025: 🚀 Slightly faster flash attention for DeepSeek models on CUDA, along with extending compatibility to Turing or newer GPUs. See PR 408
  • May 9 2025: Support for LLaMA-3-Nemotron models added, see PR 377
  • May 7 2025: πŸš€ Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. See PR 386 for details. Caveat: Ampere or newer Nvidia GPU required
  • May 4 2025: 🚀 Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see PR 370
  • April 29 2025: Qwen3 support added, see PR 355
  • April 26 2025: GLM-4 support added, see PR 344
  • April 26 2025: Command-A support added, see PR 341
  • April 22 2025: Support for the latest Microsoft Bitnet model added, see PR 337
  • April 21 2025: ik_llama.cpp builds and runs successfully on Android (using Termux), see PR 336
  • April 17 2025: πŸš€ Better CPU Flash Attention token generation performance, see PR 332
  • April 13 2025: IQ1_M quantization improvements, see PR 327
  • April 10 2025: LLaMA-4 support added, see PR 321. The PR also provides some custom quantization recipes for L4-Scout.
  • April 7 2025: IQ2_XS quantization improvements, see PR 312
  • April 3 2025: πŸš€ Much faster MoE implementation on Metal, see PR 307
  • April 1 2025: Quantization improvements for Q2_K, Q4_K, Q5_K, Q4_1, Q5_1, see PR 302
  • March 28 2025: Quantization improvements for Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL, see PR 295
  • March 25 2025: πŸš€ Better MoE performance on CUDA
  • March 23 2025: πŸš€ Better batched processing speed for DeepSeek models
  • March 22 2025: Gemma3 support added
  • March 21 2025: πŸš€ FlashMLA-3: fastest CPU-only inference for DeepSeek models
  • March 18 2025: Reduce compute buffer size
  • March 17 2025: πŸš€ FlashMLA-2 performance improvements
  • March 12 2025: Allow Q8_0 KV cache with FlashMLA-2 on CUDA
  • March 10 2025: πŸš€ Better TG performance for MoE models on CUDA
  • March 9 2025: πŸš€ FlashMLA on CUDA
  • March 8 2025: πŸš€ Faster FlashMLA CPU implementation
  • March 7 2025: Custom quantization mixes using regular expressions
  • March 5 2025: πŸš€ FlashMLA on CUDA
  • March 3 2025: πŸš€ Introducing FlashMLA - MLA with Flash Attention
  • March 1 2025: Smart Expert Reduction for faster DeepSeek inference
  • Feb 27 2025: MLA without transposed cache
  • Feb 25 2025: Tensor overrides for better control over where model weights are stored (GPU or CPU)
  • Feb 23 2025: πŸš€ Fused FFN ops for faster MoE inference
  • Feb 23 2025: sweep-bench - better performance benchmarking
  • Feb 20 2025: πŸš€ Fast GEMM/GEMV for IQ1_S
  • Feb 19 2025: Q8_KV - new type for 8-bit KV-cache quantization
  • Feb 13 2025: Allow Q8_0 quantized cache with MLA
  • Feb 11 2025: πŸš€ Flash Attention support for DeepSeek models
  • Feb 9 2025: πŸš€ MLA for DeepSeek models
  • Jan 23 2025: DeepSeek-V3 support added