- Introduce a Triton-only Transformer execution path in vLLM.
- Beyond NVIDIA GPUs, provide end-to-end support on other chips, such as Cambricon, AMD, and Ascend. Where a model cannot run, identify the gap that blocks end-to-end support.
- Improve Triton kernel performance to match CUDA kernel performance.
- Fix the RoPE Triton kernel to support the Llama model.
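To make the RoPE task above concrete, below is a minimal NumPy reference of the rotary position embedding that a Llama-style `rotary_embedding_kernel` is expected to reproduce. The function name, the base of 10000, and the interleaved (GPT-J-style) pair layout are illustrative assumptions, not this project's kernel interface; vLLM's kernel also supports a non-interleaved (NeoX-style) layout.

```python
import numpy as np

def rope_reference(x, positions, base=10000.0):
    """Illustrative RoPE reference, interleaved-pair style (an assumption).

    x:         (num_tokens, head_dim) query or key vectors, head_dim even.
    positions: (num_tokens,) integer token positions.
    """
    head_dim = x.shape[-1]
    # Per-pair inverse frequencies, as in the original RoPE formulation.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = positions[:, None] * inv_freq[None, :]      # (tokens, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # interleaved pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, the transform preserves vector norms and is the identity at position 0, which makes a cheap sanity check when validating a Triton implementation against this reference.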
No due date • 18/18 issues closed

When milestone #1 is done, we are able to:
- Successfully run OPT-125M or Llama2-7B with vLLM on L4.
- Successfully run example/offline_inference.py with results that make sense.
- Ensure the code path is CUDA-free.

Out of scope:
- Performance
- Quantization
- MoE
- swap_blocks
- copy_blocks

Successfully run Llama2-7B with vLLM on AMD MI2xx or AMD MI3xx. All the tasks include:
- paged_attention_v2_kernel (FlagAttention)
- paged_attention_v2_reduce_kernel (FlagAttention)
- fused matrix multiplication operations (FlagGems)
- activation_kernels (FlagGems)
- fused rms_norm_kernel (FlagGems)
- rotary_embedding_kernel (FlagGems)
- reshape_and_cache_kernel
- Page/flash attention decode => decode_with_kv_cache
Overdue by 1 year • Due by November 27, 2024 • 6/6 issues closed
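For the fused rms_norm_kernel in the task list above, a NumPy sketch of the math the fused kernel computes can serve as a correctness oracle when porting to Triton. The function name and the eps default are illustrative assumptions; check them against the actual kernel signature.

```python
import numpy as np

def rms_norm_reference(x, weight, eps=1e-6):
    """Illustrative RMSNorm reference: x / sqrt(mean(x^2) + eps) * weight.

    x:      (..., hidden_size) activations.
    weight: (hidden_size,) learned per-channel scale.
    """
    # Root-mean-square over the hidden dimension; no mean subtraction,
    # unlike LayerNorm.
    variance = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight
```

With `weight` set to all ones, the output rows have root-mean-square approximately 1, which is an easy invariant to assert when comparing a Triton kernel's output against this reference.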