# Vortex

Vortex is a lightweight, modular framework for building **custom sparse attention algorithms** for LLM inference.
It makes it easy for researchers and engineers to **prototype**, **extend**, and **deploy** advanced sparsity patterns on modern inference backends such as SGLang, without modifying core model code.

Vortex lets you express novel sparse attention behaviors concisely while relying on an optimized execution engine.

---

## ✨ Key Features

- **Easy Programming**
  Program sparse attention with a PyTorch-like frontend, viewing every tensor as if `batch_size = 1`. No need to worry about batching, caching, or paged attention.

- **High Performance**
  Built to work with FlashInfer, CUDA Graph, and Radix Attention for efficient LLM inference.

---

## 🚀 Installation

```bash
git clone -b v1 --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch

# Install the SGLang dependency (SGLang 0.4.9 is supported)
cd third_party/sglang
bash install.sh
cd ../../

# Install Vortex
pip install -e .
```
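
A quick way to confirm the install (a minimal sanity check; `__version__` and `__file__` are standard module attributes, not Vortex-specific APIs):

```python
# Sanity-check that both packages import cleanly
import sglang
import vortex_torch

print("SGLang:", sglang.__version__)
print("Vortex installed at:", vortex_torch.__file__)
```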

---

## 🧩 Quick Example: Custom Sparse Attention

```python
import torch
from typing import Dict

# Vortex primitives used below (exact import path may differ across versions)
from vortex_torch import CMean, ContextBase, GeMV, register, topK, vFlow


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):

    def __init__(self):
        super().__init__()
        # Indexer-side ops: score pages and select the top-k
        self.gemv = GeMV()
        self.output_func = topK()

        # Cache-side ops: summarize each finished page
        self.reduction = CMean(dim=1)

    def forward_indexer(
        self,
        q: torch.Tensor,                 # viewed as [1, H_q, D]
        o: torch.Tensor,                 # output buffer for the selection
        cache: Dict[str, torch.Tensor],  # viewed as [S, r, c] depending on create_cache()
        ctx: ContextBase,
    ):
        # Average the query heads, score every page against its centroid,
        # and write the top-k selection into `o`.
        q_mean = q.mean(dim=1, keepdim=True)
        score = self.gemv(q_mean, cache["centroids"], ctx=ctx)
        self.output_func(score, o, ctx=ctx)

    def forward_cache(
        self,
        cache: Dict[str, torch.Tensor],  # viewed as [B, r, c] depending on create_cache()
        loc: torch.Tensor,
        ctx: ContextBase,
    ):
        # Triggered only when a page is finished: reduce the page's keys
        # into the centroid used by the indexer.
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, page_size: int, head_dim: int):
        # One auxiliary per-page tensor: a single centroid of width head_dim
        return {
            "centroids": (1, head_dim),
        }
```
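
Here `create_cache` declares one auxiliary tensor per KV page, so each page of `page_size` tokens carries a single centroid of width `head_dim`. A back-of-envelope sketch of what that implies (semantics assumed from the example above, not from the Vortex API):

```python
# Rough cache growth for the "centroids" tensor declared above,
# assuming one (1, head_dim) centroid per finished KV page.
import math

page_size, head_dim, seq_len = 16, 128, 8192
num_pages = math.ceil(seq_len / page_size)   # 512 pages for an 8K sequence
centroid_floats = num_pages * 1 * head_dim   # 65,536 values per layer/KV head
print(f"{num_pages} pages -> {centroid_floats} centroid values")
```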

---

## 🏃 Using Your Sparse Attention with SGLang

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    disable_cuda_graph=False,
    page_size=16,
    vortex_topk_val=30,
    disable_overlap_schedule=True,    # Mandatory
    attention_backend="flashinfer",   # Mandatory
    enable_vortex_sparsity=True,      # Otherwise full attention is used
    vortex_page_reserved_bos=1,
    vortex_page_reserved_eos=1,
    vortex_layers_skip=list(range(1)),  # Full attention for layer 0
    vortex_module_path="path/to/custom_sparse_attention.py",
    vortex_module_name="custom_sparse_attention",  # the registered name of your algorithm
    vortex_max_seq_lens=8192,
    mem_fraction_static=0.85,
)
```
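
Once the engine is up, generation works like any standard SGLang offline engine; a minimal sketch (prompt and sampling parameters are illustrative):

```python
prompts = ["Explain sparse attention in one sentence."]
sampling_params = {"temperature": 0.6, "max_new_tokens": 128}

# Sparse attention is applied transparently during decoding.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()  # release GPU resources when done
```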

If `vortex_module_path` is not provided, Vortex automatically searches for the registered name in `vortex_torch.flow.algorithms`.

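For instance, if the algorithm from the example above shipped inside `vortex_torch.flow.algorithms`, the path argument could be dropped (a sketch; which names ship built-in is version-dependent):

```python
# Omit vortex_module_path: the registered name is resolved from
# vortex_torch.flow.algorithms instead of a user-supplied file.
llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    attention_backend="flashinfer",   # Mandatory
    disable_overlap_schedule=True,    # Mandatory
    enable_vortex_sparsity=True,
    vortex_module_name="custom_sparse_attention",
)
```
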
---

## 🤖 AI-Generated Sparse Attention

Vortex is designed not only for hand-crafted sparsity patterns but also for AI-generated sparse attention.

Our demo shows how to use state-of-the-art agents such as OpenHands (https://openhands.dev/) to generate sparse attention algorithms.

```bash
export LLM_API_KEY=YOUR_API_KEY
python openhands_gen.py
```

The installation and usage guide for OpenHands can be found at https://docs.openhands.dev/sdk.
Note: some operators are not yet fused or fully optimized, which can increase memory usage and slow generation during inference. Lower `mem_fraction_static` if you hit a CUDA out-of-memory error.
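
For example, rebuild the engine with a smaller fraction (0.7 is an illustrative value; tune for your GPU):

```python
llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    attention_backend="flashinfer",
    disable_overlap_schedule=True,
    enable_vortex_sparsity=True,
    vortex_module_name="custom_sparse_attention",
    mem_fraction_static=0.7,  # lowered from the 0.85 used earlier
)
```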

## 📘 API Reference

👉 https://infini-ai-lab.github.io/vortex_torch/

---

## Citation

If you find Vortex useful or relevant to your project or research, please cite our paper:

```bibtex
@software{chen2025vortex,
  title = {Vortex: A Flexible and Efficient Sparse Attention Framework},
  author = {Chen, Zhuoming and Yang, Zhou and Chen, Beidi},
  year = {2025},
  publisher = {Infini AI Lab},
  url = {https://github.com/Infini-AI-Lab/vortex_torch},
  version = {v0.2}
}
```