Commit 9f19447 ("readme.md", parent 33b12fb)

3 files changed: +216 -8 lines

README.md (140 additions, 1 deletion)

@@ -1 +1,140 @@
# Vortex

Vortex is a lightweight, modular framework for building **custom sparse attention algorithms** for LLM inference.
It exists to make it easy for researchers and engineers to **prototype**, **extend**, and **deploy** advanced sparsity patterns on modern inference backends such as SGLang, without modifying core model code.

Vortex lets you express novel sparse attention behaviors concisely while relying on an optimized execution engine.

---
## ✨ Key Features

- **Easy Programming**
  Program sparse attention with a PyTorch-like frontend and view every tensor as if `batch_size = 1`. No need to worry about batching, KV caching, or paged attention (see the Quick Example below).

- **High Performance**
  Built to work with FlashInfer, CUDA Graph, and Radix Attention for efficient LLM inference.

---
## 🚀 Installation

```bash
git clone -b v1 --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch

# Install the SGLang dependency (SGLang 0.4.9 is supported)
cd third_party/sglang
bash install.sh
cd ../../

# Install Vortex
pip install -e .
```
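
As a quick sanity check (a minimal sketch, assuming the editable install succeeds and the package is importable as `vortex_torch`, the same name used by `vortex_torch.flow.algorithms` below):

```python
# Sanity check: confirm the editable install is importable.
import vortex_torch
print(vortex_torch.__file__)
```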

---

## 🧩 Quick Example: Custom Sparse Attention

```python
import torch
from typing import Dict

# register, vFlow, GeMV, topK, CMean, and ContextBase come from the vortex_torch API
# (import path omitted here).

@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):

    def __init__(self):
        super().__init__()
        # Indexer-side ops
        self.gemv = GeMV()
        self.output_func = topK()

        # Cache-side ops
        self.reduction = CMean(dim=1)

    def forward_indexer(
        self,
        q: torch.Tensor,                  # viewed as [1, H_q, D]
        o: torch.Tensor,
        cache: Dict[str, torch.Tensor],   # viewed as [S, r, c] depending on create_cache()
        ctx: ContextBase,
    ):
        # Score each page by the dot product between the mean query and its centroid,
        # then write the top-k page indices into the output buffer.
        q_mean = q.mean(dim=1, keepdim=True)
        score = self.gemv(q_mean, cache["centroids"], ctx=ctx)
        self.output_func(score, o, ctx=ctx)

    def forward_cache(
        self,
        cache: Dict[str, torch.Tensor],   # viewed as [B, r, c] depending on create_cache()
        loc: torch.Tensor,
        ctx: ContextBase,
    ):
        # Triggered only when a page is finished: summarize its keys into a centroid.
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, page_size: int, head_dim: int):
        # Per-page auxiliary cache: one centroid vector of size head_dim.
        return {
            "centroids": (1, head_dim),
        }
```

---

## 🏃 Using Your Sparse Attention with SGLang

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    disable_cuda_graph=False,
    page_size=16,
    vortex_topk_val=30,
    disable_overlap_schedule=True,      # Mandatory
    attention_backend="flashinfer",     # Mandatory
    enable_vortex_sparsity=True,        # Otherwise full attention is used
    vortex_page_reserved_bos=1,
    vortex_page_reserved_eos=1,
    vortex_layers_skip=list(range(1)),  # Full attention for layer 0
    vortex_module_path="path/to/custom_sparse_attention.py",
    vortex_module_name="custom_sparse_attention",  # The registered name of your algorithm
    vortex_max_seq_lens=8192,
    mem_fraction_static=0.85,
)
```

If `vortex_module_path` is not provided, Vortex will automatically search in `vortex_torch.flow.algorithms`.
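
Once the engine is constructed, generation goes through the standard SGLang offline engine API. A minimal usage sketch (the prompt, sampling parameters, and shutdown call below are illustrative, not part of Vortex):

```python
# Minimal usage sketch for the engine configured above (illustrative values).
prompts = ["Explain sparse attention in one sentence."]
sampling_params = {"temperature": 0.0, "max_new_tokens": 64}

outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()
```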

---

## 🤖 AI-Generated Sparse Attention

Vortex is designed not only for hand-crafted sparsity patterns but also for AI-generated sparse attention.

Our demo shows how to use the state-of-the-art agent framework OpenHands (https://openhands.dev/) to generate sparse attention algorithms:

```bash
export LLM_API_KEY=YOUR_API_KEY
python openhands_gen.py
```

The installation and usage guide for OpenHands can be found at https://docs.openhands.dev/sdk.

Note: some operators are not yet fused or fully optimized, which may increase memory usage and can also slow generation during inference. If you hit a CUDA out-of-memory error, lower `mem_fraction_static`.
## 📘 API Reference

👉 https://infini-ai-lab.github.io/vortex_torch/

---

## Citation

If you find Vortex useful or relevant to your project or research, please kindly cite our work:

```bibtex
@software{chen2025vortex,
  title     = {Vortex: A Flexible and Efficient Sparse Attention Framework},
  author    = {Chen, Zhuoming and Yang, Zhou and Chen, Beidi},
  year      = {2025},
  publisher = {Infini AI Lab},
  url       = {https://github.com/Infini-AI-Lab/vortex_torch},
  version   = {v0.2}
}
```

docs/index.rst (48 additions, 7 deletions)

@@ -1,4 +1,4 @@
Vortex
============

Vortex is a lightweight, modular framework for building custom sparse attention algorithms for LLM inference on backends such as SGLang.
@@ -8,14 +8,55 @@ Installation

.. code-block:: bash

   git clone -b v1 --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
   cd vortex_torch

   # Install the SGLang dependency (SGLang 0.4.9 is supported)
   cd third_party/sglang
   bash install.sh
   cd ../../

   # Install Vortex
   pip install -e .

Quick Example
-------------

.. code-block:: python

   @register("custom_sparse_attention")
   class CustomSparseAttention(vFlow):

       def __init__(self):
           super().__init__()
           # Indexer-side ops
           self.gemv = GeMV()
           self.output_func = topK()

           # Cache-side ops
           self.reduction = CMean(dim=1)

       def forward_indexer(
           self,
           q: torch.Tensor,                  # viewed as [1, H_q, D]
           o: torch.Tensor,
           cache: Dict[str, torch.Tensor],   # viewed as [S, r, c] depending on create_cache()
           ctx: ContextBase,
       ):
           q_mean = q.mean(dim=1, keepdim=True)
           score = self.gemv(q_mean, cache["centroids"], ctx=ctx)
           self.output_func(score, o, ctx=ctx)

       def forward_cache(
           self,
           cache: Dict[str, torch.Tensor],   # viewed as [B, r, c] depending on create_cache()
           loc: torch.Tensor,
           ctx: ContextBase,
       ):
           # Computation is triggered only when a page is finished.
           self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

       def create_cache(self, page_size: int, head_dim: int):
           return {
               "centroids": (1, head_dim),
           }

.. code-block:: python
@@ -24,13 +65,13 @@ Quick Example
   disable_cuda_graph=False,
   page_size=16,
   vortex_topk_val=30,
   disable_overlap_schedule=True,      # Mandatory
   attention_backend="flashinfer",     # Mandatory
   enable_vortex_sparsity=True,        # Otherwise full attention is used
   vortex_page_reserved_bos=1,
   vortex_page_reserved_eos=1,
   vortex_layers_skip=list(range(1)),  # Full attention for layer 0
   vortex_module_path="path/to/custom_sparse_attention.py",  # If not specified, Vortex searches vortex_torch.flow.algorithms
   vortex_module_name="custom_sparse_attention",
   vortex_max_seq_lens=8192,
   mem_fraction_static=0.6

openhands_gen.py (28 additions, 0 deletions)

@@ -0,0 +1,28 @@

import os

from openhands.sdk import LLM, Agent, Conversation, Tool
from openhands.tools.file_editor import FileEditorTool
from openhands.tools.task_tracker import TaskTrackerTool
from openhands.tools.terminal import TerminalTool


llm = LLM(
    model="openhands/gpt-5-2025-08-07",
    api_key=os.getenv("LLM_API_KEY"),
)

agent = Agent(
    llm=llm,
    tools=[
        Tool(name=TerminalTool.name),
        Tool(name=FileEditorTool.name),
        Tool(name=TaskTrackerTool.name),
    ],
)

cwd = os.getcwd()
conversation = Conversation(agent=agent, workspace=cwd)

conversation.send_message(
    "Hi, I wrote a framework called Vortex that provides an abstraction for implementing "
    "sparse attention in SGLang, which used to be very hard to do. Based on the example file "
    "in this directory, vortex_torch/flow/algorithms.py, could you propose a better dynamic "
    "sparse attention algorithm? Then add your algorithm's registered name as the first entry "
    "in vortex_torch/examples/verify_algo.sh."
)
conversation.run()
print("All done!")
