
Milestones


  • Introduce a Triton-only Transformer execution path in vLLM
    - Beyond NVIDIA GPUs, achieve end-to-end support on other chips, such as Cambricon, AMD, and Ascend. Where a model cannot run, identify the gaps blocking end-to-end support.
    - Improve Triton kernel performance to match CUDA kernel performance.
    - Fix the RoPE Triton kernel to support Llama models.

    No due date
    18/18 issues closed
  • When milestone #1 is done, we are able to:
    - Successfully run OPT-125M or Llama2-7B with vLLM on L4.
    - Successfully run example/offline_inference.py with results that make sense.
    - Ensure the code path is CUDA-free.
    - Successfully run Llama2-7B with vLLM on AMD MI2xx or MI3xx.

    Out of scope:
    - Performance
    - Quantization
    - MoE
    - swap_blocks
    - copy_blocks

    All the tasks include:
    - paged_attention_v2_kernel (FlagAttention)
    - paged_attention_v2_reduce_kernel (FlagAttention)
    - fused matrix multiplication operations (FlagGems)
    - activation_kernels (FlagGems)
    - fused rms_norm_kernel (FlagGems)
    - rotary_embedding_kernel (FlagGems)
    - reshape_and_cache_kernel
    - Paged/flash attention decode => decode_with_kv_cache

    Due by November 27, 2024
    6/6 issues closed
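Both milestones call out the rotary embedding (RoPE) kernel. As a sketch of the semantics such a Triton kernel must reproduce, here is a minimal NumPy reference for interleaved (GPT-J style) RoPE. The function name, the interleaved layout, and the `base=10000.0` default are illustrative assumptions, not vLLM's actual implementation (vLLM also supports the NeoX rotate-half layout), so a real kernel port should be validated against the upstream `rotary_embedding_kernel` rather than this sketch.

```python
import numpy as np

def rope_reference(x, positions, base=10000.0):
    """Hypothetical reference for interleaved (GPT-J style) RoPE.

    x:         (num_tokens, head_dim) query or key vectors, head_dim even
    positions: (num_tokens,) token positions
    Each adjacent pair (x[2i], x[2i+1]) is rotated by pos * inv_freq[i].
    """
    d = x.shape[-1]
    # Per-pair inverse frequencies: 1 / base^(2i/d)
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    angles = positions[:, None] * inv_freq[None, :]   # (num_tokens, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin              # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

A Triton port would compute the same per-pair rotation, with each program instance loading one token's head vector and the precomputed cos/sin cache; a useful correctness check is that position 0 is the identity and that rotation preserves vector norms.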