Commit f57263b: feat: Add NoWait skill (#155)
3 files changed, +590 −0 lines
---
name: nowait-reasoning-optimizer
description: Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning, Qwen3, Kimi-VL, QvQ), reducing chain-of-thought token usage by 27-51% while preserving accuracy. Triggers on "optimize reasoning", "reduce thinking tokens", "efficient inference", "suppress reflection tokens", or when working with verbose CoT outputs.
---

# NOWAIT Reasoning Optimizer

Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).

## Overview

NOWAIT is a training-free, inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by **27-51%** without compromising model utility.

## When to Use

- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining

## Supported Models

| Model Series | Type | Token Reduction |
|--------------|------|-----------------|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |

**Important**: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.

## Quick Start

### 1. Basic Implementation

```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize the processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768
)
```

### 2. Keywords Suppressed

See `references/keywords.md` for the complete list. Core keywords:

```
wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah
```

## How It Works

1. **Initialize Keywords**: Identify reflection keywords from empirical analysis
2. **Expand to Token Variants**: Map each keyword to all of its token variants in the vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
3. **Suppress During Inference**: Set the logits of reflection tokens to large negative values during decoding

```
Logits (Before)        Logits (After)
Wait   0.8      →      Wait   -inf
First  0.6      →      First   0.6
Hmm    0.5      →      Hmm    -inf
Let    0.4      →      Let     0.4
```
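The three steps above can be sketched framework-agnostically. This is a minimal illustration, not the actual `scripts/nowait_processor.py` API; the function names and the toy vocabulary are assumptions for the example:

```python
# Minimal sketch of steps 1-3 over a plain dict vocabulary; the real processor
# wraps the same logic so it can run per decoding step inside generate().
KEYWORDS = ["wait", "hmm", "alternatively"]  # step 1 (abbreviated list)

def build_suppressed_ids(vocab, keywords):
    """Step 2: expand keywords to every vocabulary entry containing them."""
    lowered = [k.lower() for k in keywords]
    return {
        token_id
        for token_text, token_id in vocab.items()
        if any(k in token_text.lower() for k in lowered)
    }

def suppress(logits, suppressed_ids, value=float("-inf")):
    """Step 3: overwrite reflection-token logits before sampling."""
    return [value if i in suppressed_ids else x for i, x in enumerate(logits)]

# Toy vocabulary mirroring the diagram above
vocab = {"Wait": 0, "First": 1, "Hmm": 2, "Let": 3}
ids = build_suppressed_ids(vocab, KEYWORDS)
print(suppress([0.8, 0.6, 0.5, 0.4], ids))  # [-inf, 0.6, -inf, 0.4]
```

With the reflection logits at `-inf`, sampling can never pick "Wait" or "Hmm", so decoding continues along the non-reflective branch ("First", "Let").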
## Key Findings

### Why It Works

- NOWAIT doesn't eliminate self-reflection entirely; it guides models to skip **unnecessary** "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths

### RL vs Distilled Models

| Model Type | NOWAIT Effect | Recommendation |
|------------|---------------|----------------|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |

Distilled models rely heavily on the CoT structure of their training data; removing reflection tokens disrupts their reasoning patterns.

## Integration Examples

### HuggingFace Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

processor = NOWAITLogitProcessor(tokenizer)

response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7
)
```

### vLLM

```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids
)
```
## Expected Results

| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|-----------|-----------------|---------------|-----------|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |

## Limitations

- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning

## References

- Paper: arXiv:2506.08343v2
- Complete keyword list: `references/keywords.md`
- Implementation: `scripts/nowait_processor.py`
# NOWAIT Keywords Reference

Complete reference for the reflection keywords used in the NOWAIT technique.

## Primary Keywords (from paper)

These keywords were empirically identified from 32 independent runs of QwQ-32B on AIME 2025, using `\n\n` as a delimiter to extract the most frequent monolingual transition words.

### Core Suppression List

```python
KEYWORDS = [
    "wait",           # Most common reflection trigger
    "alternatively",  # Indicates exploring a different approach
    "hmm",            # Hesitation marker
    "but",            # Contradiction/reconsideration
    "however",        # Contradiction/reconsideration
    "alternative",    # Exploring options
    "another",        # Switching approach
    "check",          # Verification trigger
    "double-check",   # Re-verification
    "oh",             # Realization marker
    "maybe",          # Uncertainty/reconsideration
    "verify",         # Verification trigger
    "other",          # Exploring alternatives
    "again",          # Repetition/re-check
    "now",            # Transition marker
    "ah",             # Realization marker
    "any",            # Exploring possibilities
]
```

## Excluded Patterns

These patterns should NOT be suppressed, as they are false positives:

```python
EXCLUDED = [
    "ohio",       # Contains "oh" but is a proper noun
    "butane",     # Contains "but" but is a chemical
    "button",     # Contains "but" but is a UI element
    "butterfly",  # Contains "but" but is a noun
    "checkout",   # Contains "check" but is a noun/verb
    "checksum",   # Contains "check" but is a technical term
    "another's",  # Possessive form, often necessary
]
```

## Token Expansion

For each keyword, the processor expands to all vocabulary variants:

| Keyword | Expanded Variants |
|---------|-------------------|
| wait | wait, Wait, WAIT, " wait", " Wait", ".wait", ",wait", etc. |
| hmm | hmm, Hmm, HMM, " hmm", "...hmm", etc. |
| alternatively | alternatively, Alternatively, " Alternatively", etc. |
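The fan-out in the table can be illustrated with a small helper. This is hypothetical: the actual processor derives variants by scanning the tokenizer vocabulary rather than generating strings, and the prefix set here is just the handful of leading characters shown above:

```python
# Hypothetical generator for the surface forms in the table: case variants
# combined with common leading characters.
PREFIXES = ["", " ", ".", ","]

def surface_variants(keyword):
    cases = {keyword.lower(), keyword.capitalize(), keyword.upper()}
    return sorted(prefix + form for prefix in PREFIXES for form in cases)

print(surface_variants("wait"))  # 12 variants, e.g. ' Wait', '.wait', 'WAIT'
```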
## Model-Specific Tuning

Different models may benefit from adjusted keyword lists:

### QwQ-32B / DeepSeek-R1
- Use the full default list
- High reduction potential (30%+)

### Phi4-Reasoning-Plus
- Use the full default list
- Consider adding: "let me think", "I wonder"

### Kimi-VL (Multimodal)
- Use the full default list
- Very high reduction (40-60%)
- May need domain-specific additions for visual tasks

### Qwen3 Series
- RL-based (32B): use the full list
- Distilled (4B/8B/14B): consider removing "but" and "however" to preserve some reasoning flow
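One way to encode the tuning notes above as configuration. This is a sketch: the dictionary keys and the exact per-model adjustments are illustrative, not part of the implementation:

```python
# Hypothetical per-model keyword configuration condensing the notes above.
CORE = ["wait", "alternatively", "hmm", "but", "however", "check",
        "double-check", "maybe", "verify", "again", "oh", "ah"]

MODEL_KEYWORDS = {
    "QwQ-32B": CORE,              # full default list
    "Phi4-Reasoning-Plus": CORE,  # plus phrase-level additions if supported
    "Kimi-VL-A3B": CORE,          # may grow domain-specific visual additions
    # Distilled Qwen3: keep "but"/"however" so the reasoning flow survives
    "Qwen3-8B": [k for k in CORE if k not in ("but", "however")],
}

print(len(MODEL_KEYWORDS["Qwen3-8B"]))  # 10
```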
## Keyword Categories

### Self-Reflection Markers
- `wait`, `hmm`, `oh`, `ah`
- Signal: the model is pausing to reconsider

### Verification Triggers
- `check`, `double-check`, `verify`
- Signal: the model is validating previous work

### Alternative Exploration
- `alternatively`, `alternative`, `another`, `other`
- Signal: the model is exploring different approaches

### Contradiction/Reconsideration
- `but`, `however`, `maybe`
- Signal: the model is reconsidering a previous conclusion

### Transition Markers
- `now`, `again`, `any`
- Signal: the model is shifting focus or repeating

## Benchmark Results by Keyword Removal

| Keywords Removed | AIME 2025 Accuracy | Token Reduction |
|------------------|--------------------|-----------------|
| None (baseline) | 66.67% | 0% |
| wait only | 67.33% | 15% |
| wait + hmm | 67.67% | 22% |
| All 17 keywords | 68.00% | 31% |

## Implementation Notes

### Logit Suppression Value
- Default: `-1e10` (effectively negative infinity)
- Alternative: `-100` (softer suppression, allows rare occurrences)
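The practical difference between the two values, sketched with a bare softmax over an assumed three-token distribution (the suppressed token comes first):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

hard = softmax([-1e10, 0.6, 0.4])   # default: probability underflows to 0.0
soft = softmax([-100.0, 0.6, 0.4])  # softer: tiny but nonzero probability
print(hard[0], soft[0])
```

With `-1e10` the suppressed token can never be sampled; with `-100` it remains technically reachable, which some deployments prefer as a safety valve.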
### Vocabulary Iteration

```python
def build_suppressed_tokens(tokenizer, keywords, excluded=EXCLUDED):
    """Collect the IDs of all vocabulary entries containing a keyword,
    skipping the false positives from the exclusion list."""
    suppressed = set()
    vocab = tokenizer.get_vocab()

    for token_text, token_id in vocab.items():
        text = token_text.lower()
        if any(pattern in text for pattern in excluded):
            continue  # e.g. "butane" contains "but" but is not a reflection marker
        if any(keyword.lower() in text for keyword in keywords):
            suppressed.add(token_id)

    return suppressed
```

### Performance Considerations
- The suppressed-token set is built once at initialization
- Lookup is O(1) per token during generation
- Memory overhead: a few KB for the token ID set