Changes from all commits (30 commits):
| Commit | Message | Author | Date |
|---|---|---|---|
| 88efd6b | Reorganize and refactor Suffix Decoding (#182) | sfc-gh-aqiao | Sep 16, 2025 |
| e8d252f | Add environment variable to skip version check (#186) | sfc-gh-aqiao | Sep 18, 2025 |
| e27ae09 | Enable SwiftKV when FlashInfer is not available (#187) | sfc-gh-pjoziak | Sep 18, 2025 |
| 9ecce3a | Fix hybrid mode (spec decoding + suffix) crash on structured_output (#… | sfc-gh-yewang | Sep 23, 2025 |
| 8b1d693 | Make Arctic Inference plugin opt-in instead of opt-out (#188) | sfc-gh-aqiao | Sep 23, 2025 |
| e5aa688 | Simplify min_score selection logic, correct type hint for `propose_su…` | CptTZ | Sep 25, 2025 |
| 02a8a51 | Add op_builder for jitting the kernels (#193) | sfc-gh-reyazda | Sep 25, 2025 |
| 87e2f77 | Update links in README for Shift Parallelism (#196) | sfc-gh-mhidayetoglu | Sep 26, 2025 |
| 8452500 | bump to v0.0.10 (#194) | sfc-gh-jrasley | Sep 26, 2025 |
| 170082f | init | sfc-gh-yewang | Sep 29, 2025 |
| 5a410f1 | Revert "init" | sfc-gh-yewang | Sep 29, 2025 |
| cbeb679 | Move SwiftKV ops to JIT-build (#198) | sfc-gh-yewang | Sep 29, 2025 |
| c4cb213 | Add @sfc-gh-reyazda as code owner (#199) | sfc-gh-yewang | Sep 29, 2025 |
| 1408c80 | Explicitly initialize CUDA buffers for next tokens (#201) | sfc-gh-yewang | Oct 6, 2025 |
| 353e102 | Port suffix decoding to nanobind (#206) | sfc-gh-aqiao | Oct 13, 2025 |
| e1d1ff2 | upgrade to vllm 0.10.1 (#162) | sfc-gh-yewang | Oct 13, 2025 |
| 6adb69f | Suffix decoding: break out of speculate loop early (#207) | sfc-gh-aqiao | Oct 13, 2025 |
| 1bc893f | Suffix decoding speculation optimization (#211) | sfc-gh-aqiao | Oct 17, 2025 |
| 2a14b27 | reshape_and_cache_flash fp4 kernel (#210) | sfc-gh-yewang | Oct 17, 2025 |
| 8d5d124 | More suffix decoding optimizations (#212) | sfc-gh-aqiao | Oct 20, 2025 |
| 3988caf | remove ulysses moe patch (#213) | sfc-gh-mhidayetoglu | Oct 21, 2025 |
| 3145f23 | Bump version from 0.0.10 to 0.1.0 (#214) | sfc-gh-jrasley | Oct 21, 2025 |
| d88d4de | Silence logging if plugin is disabled (#221) | sfc-gh-aqiao | Nov 6, 2025 |
| db56537 | Bump version from 0.1.0 to 0.1.1 (#222) | sfc-gh-jrasley | Nov 6, 2025 |
| d096fdf | Communication Fusing (#224) | sfc-gh-mhidayetoglu | Nov 19, 2025 |
| 5e08f0f | patch for running traces with timestamps (#228) | sfc-gh-mhidayetoglu | Dec 5, 2025 |
| 0ea6a68 | rebase to vllm 0.11.0 (#216) | sfc-gh-yewang | Dec 30, 2025 |
| c6bee37 | Bump version from 0.1.1 to 0.1.2 | sfc-gh-yewang | Jan 24, 2026 |
| d223cb5 | Bump version from 0.1.2 to 0.1.3 | sfc-gh-yewang | Jan 24, 2026 |
| cf431e4 | Reproducibility extension (#239) | sfc-gh-mhidayetoglu | Jan 26, 2026 |
`.github/CODEOWNERS` (1 addition, 1 deletion)

```diff
@@ -1 +1 @@
-* @sfc-gh-aqiao @sfc-gh-jrasley @sfc-gh-mhidayetoglu @sfc-gh-yewang @sfc-gh-goliaro
+* @sfc-gh-aqiao @sfc-gh-jrasley @sfc-gh-mhidayetoglu @sfc-gh-yewang @sfc-gh-goliaro @sfc-gh-reyazda
```
`README.md` (12 additions, 4 deletions)

````diff
@@ -36,10 +36,10 @@ Arctic Inference achieves high throughput and low latency through a wholistic se
 <tbody>
 <tr>
 <td align="left">
-Arctic Ulysses (<a href="https://www.snowflake.com/en/engineering-blog/ulysses-low-latency-llm-inference/">blog</a>,
-<a href="https://arxiv.org/abs/2507.11830">paper</a>)
+Arctic Ulysses (<a href="https://www.snowflake.com/en/engineering-blog/ulysses-low-latency-llm-inference/">blog</a>)
 <br>
-Shift Parallelism (<a href="https://www.snowflake.com/en/engineering-blog/arctic-inference-shift-parallelism/">blog</a>)
+Shift Parallelism (<a href="https://www.snowflake.com/en/engineering-blog/arctic-inference-shift-parallelism/">blog</a>,
+<a href="https://arxiv.org/abs/2509.16495">paper</a>)
 </td>
 <td align="left">
 Arctic Speculator (<a href="https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/">blog</a>)
@@ -105,7 +105,7 @@ By using the examples below, you can get benefits from Shift Parallelism, Specul
 #### Serving
 
 ```console
-vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
+ARCTIC_INFERENCE_ENABLED=1 vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
 --quantization "fp8" \
 --tensor-parallel-size 1 \
 --ulysses-sequence-parallel-size 2 \
@@ -121,6 +121,8 @@ vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
 
 #### Offline
 
+Save the following script to `arctic_example.py`:
+
 ```python
 import vllm
 from vllm import LLM, SamplingParams
@@ -156,6 +158,12 @@ outputs = llm.chat(conversation, sampling_params=sampling_params)
 print(outputs[0].outputs[0].text)
 ```
 
+Run the script with Arctic Inference enabled:
+
+```console
+ARCTIC_INFERENCE_ENABLED=1 python arctic_example.py
+```
+
 ## Citation
 ```
 @misc{arcticinference2025,
```
````
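The diff collapses the middle of the offline script between the two hunks. For orientation, a minimal runnable sketch of such a script is below; only the imports, the `llm.chat(...)` call, and the final `print` are taken from the visible diff lines, and the model name comes from the serving example above. The conversation contents and sampling settings are illustrative assumptions, not the README's actual values.

```python
# arctic_example.py -- minimal sketch of the README's offline example.
# Conversation and sampling values here are illustrative assumptions.
import vllm  # noqa: F401  (imported in the README example)
from vllm import LLM, SamplingParams

# Model name taken from the README's serving example.
llm = LLM(model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct")

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)

conversation = [
    {"role": "user", "content": "Write a haiku about fast inference."},
]

# These two lines match the ones visible in the diff.
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

Because this PR series makes the plugin opt-in (#188), the script only activates Arctic Inference when launched as `ARCTIC_INFERENCE_ENABLED=1 python arctic_example.py`.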
`arctic_inference/envs.py` (8 additions, 0 deletions)

```diff
@@ -20,10 +20,18 @@
 ARCTIC_INFERENCE_SKIP_SPEC_MODEL_CHECK: bool = False
 
 environment_variables: dict[str, Callable[[], Any]] = {
+    "ARCTIC_INFERENCE_ENABLED":
+    lambda: os.getenv("ARCTIC_INFERENCE_ENABLED", "0") == "1",
     "ARCTIC_INFERENCE_SKIP_PLATFORM_CHECK":
     lambda: os.getenv("ARCTIC_INFERENCE_SKIP_PLATFORM_CHECK", "0") == "1",
     "ARCTIC_INFERENCE_SKIP_SPEC_MODEL_CHECK":
     lambda: os.getenv("ARCTIC_INFERENCE_SKIP_SPEC_MODEL_CHECK", "0") == "1",
+    "ARCTIC_INFERENCE_SKIP_VERSION_CHECK":
+    lambda: os.getenv("ARCTIC_INFERENCE_SKIP_VERSION_CHECK", "0") == "1",
 }
 
+# temporary workaround for gpt-oss model
+ARCTIC_INFERENCE_SKIP_SPEC_MODEL_CHECK = 1
+
 def __getattr__(name: str) -> Any:
     if name in environment_variables:
```