# Vortex

Vortex is a lightweight, modular framework for building **custom sparse attention algorithms** for LLM inference.
It makes it easy for researchers and engineers to **prototype**, **extend**, and **deploy** advanced sparsity patterns on modern inference backends such as SGLang, without modifying core model code.

Vortex lets you express novel sparse attention behaviors concisely while relying on an optimized execution engine.

---

## ✨ Key Features

- **Easy Programming**
  Program sparse attention with a PyTorch-like frontend, viewing every tensor as if `batch_size = 1`. No need to worry about batching, caching, or paged attention.

- **High Performance**
  Built to work with FlashInfer, CUDA Graph, and Radix Attention for efficient LLM inference.

---

## 🚀 Installation

```bash
git clone -b v1 --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch

# Install the SGLang dependency (SGLang 0.4.9 is supported)
cd third_party/sglang
bash install.sh
cd ../../

# Install Vortex
pip install -e .
```
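
A quick way to confirm the install (a minimal sanity check; `__version__` and `__file__` are standard module attributes, not Vortex-specific APIs):

```python
# Sanity-check that both packages import cleanly
import sglang
import vortex_torch

print("SGLang:", sglang.__version__)
print("Vortex installed at:", vortex_torch.__file__)
```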

---

## 🧩 Quick Example: Custom Sparse Attention

```python
import torch
from typing import Dict

# Vortex primitives used below (exact import path may differ across versions)
from vortex_torch import CMean, ContextBase, GeMV, register, topK, vFlow


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):

    def __init__(self):
        super().__init__()
        # Indexer-side ops: score pages and select the top-k
        self.gemv = GeMV()
        self.output_func = topK()

        # Cache-side ops: summarize each finished page
        self.reduction = CMean(dim=1)

    def forward_indexer(
        self,
        q: torch.Tensor,                 # viewed as [1, H_q, D]
        o: torch.Tensor,                 # output buffer for the selection
        cache: Dict[str, torch.Tensor],  # viewed as [S, r, c] depending on create_cache()
        ctx: ContextBase,
    ):
        # Average the query heads, score every page against its centroid,
        # and write the top-k selection into `o`.
        q_mean = q.mean(dim=1, keepdim=True)
        score = self.gemv(q_mean, cache["centroids"], ctx=ctx)
        self.output_func(score, o, ctx=ctx)

    def forward_cache(
        self,
        cache: Dict[str, torch.Tensor],  # viewed as [B, r, c] depending on create_cache()
        loc: torch.Tensor,
        ctx: ContextBase,
    ):
        # Triggered only when a page is finished: reduce the page's keys
        # into the centroid used by the indexer.
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, page_size: int, head_dim: int):
        # One auxiliary per-page tensor: a single centroid of width head_dim
        return {
            "centroids": (1, head_dim),
        }
```
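
Here `create_cache` declares one auxiliary tensor per KV page, so each page of `page_size` tokens carries a single centroid of width `head_dim`. A back-of-envelope sketch of what that implies (semantics assumed from the example above, not from the Vortex API):

```python
# Rough cache growth for the "centroids" tensor declared above,
# assuming one (1, head_dim) centroid per finished KV page.
import math

page_size, head_dim, seq_len = 16, 128, 8192
num_pages = math.ceil(seq_len / page_size)   # 512 pages for an 8K sequence
centroid_floats = num_pages * 1 * head_dim   # 65,536 values per layer/KV head
print(f"{num_pages} pages -> {centroid_floats} centroid values")
```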

---

## 🏃 Using Your Sparse Attention with SGLang

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    disable_cuda_graph=False,
    page_size=16,
    vortex_topk_val=30,
    disable_overlap_schedule=True,    # Mandatory
    attention_backend="flashinfer",   # Mandatory
    enable_vortex_sparsity=True,      # Otherwise full attention is used
    vortex_page_reserved_bos=1,
    vortex_page_reserved_eos=1,
    vortex_layers_skip=list(range(1)),  # Full attention for layer 0
    vortex_module_path="path/to/custom_sparse_attention.py",
    vortex_module_name="custom_sparse_attention",  # the registered name of your algorithm
    vortex_max_seq_lens=8192,
    mem_fraction_static=0.85,
)
```
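
Once the engine is up, generation works like any standard SGLang offline engine; a minimal sketch (prompt and sampling parameters are illustrative):

```python
prompts = ["Explain sparse attention in one sentence."]
sampling_params = {"temperature": 0.6, "max_new_tokens": 128}

# Sparse attention is applied transparently during decoding.
outputs = llm.generate(prompts, sampling_params)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", out["text"])

llm.shutdown()  # release GPU resources when done
```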

If `vortex_module_path` is not provided, Vortex automatically searches for the registered name in `vortex_torch.flow.algorithms`.

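For instance, if the algorithm from the example above shipped inside `vortex_torch.flow.algorithms`, the path argument could be dropped (a sketch; which names ship built-in is version-dependent):

```python
# Omit vortex_module_path: the registered name is resolved from
# vortex_torch.flow.algorithms instead of a user-supplied file.
llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    attention_backend="flashinfer",   # Mandatory
    disable_overlap_schedule=True,    # Mandatory
    enable_vortex_sparsity=True,
    vortex_module_name="custom_sparse_attention",
)
```
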
---

## 🤖 AI-Generated Sparse Attention

Vortex is designed not only for hand-crafted sparsity patterns but also for AI-generated sparse attention.

Our demo shows how to use state-of-the-art agents such as OpenHands (https://openhands.dev/) to generate sparse attention algorithms.

```bash
export LLM_API_KEY=YOUR_API_KEY
python openhands_gen.py
```

The installation and usage guide for OpenHands can be found at https://docs.openhands.dev/sdk.
Note: some operators are not yet fused or fully optimized, which can increase memory usage and slow generation during inference. Lower `mem_fraction_static` if you hit a CUDA out-of-memory error.
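
For example, rebuild the engine with a smaller fraction (0.7 is an illustrative value; tune for your GPU):

```python
llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    attention_backend="flashinfer",
    disable_overlap_schedule=True,
    enable_vortex_sparsity=True,
    vortex_module_name="custom_sparse_attention",
    mem_fraction_static=0.7,  # lowered from the 0.85 used earlier
)
```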

## 📘 API Reference

👉 https://infini-ai-lab.github.io/vortex_torch/

---

## Citation

If you find Vortex useful or relevant to your project or research, please cite our paper:

```bibtex
@software{chen2025vortex,
  title = {Vortex: A Flexible and Efficient Sparse Attention Framework},
  author = {Chen, Zhuoming and Yang, Zhou and Chen, Beidi},
  year = {2025},
  publisher = {Infini AI Lab},
  url = {https://github.com/Infini-AI-Lab/vortex_torch},
  version = {v0.2}
}
```