Commit f57263b: feat: Add NoWait skill (#155)
3 files changed, +590 −0 lines
---
name: nowait-reasoning-optimizer
description: Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning, Qwen3, Kimi-VL, QvQ), reducing chain-of-thought token usage by 27-51% while preserving accuracy. Triggers on "optimize reasoning", "reduce thinking tokens", "efficient inference", "suppress reflection tokens", or when working with verbose CoT outputs.
---

# NOWAIT Reasoning Optimizer

Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).

## Overview

NOWAIT is a training-free, inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by **27-51%** without compromising model utility.

## When to Use

- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining

## Supported Models

| Model Series | Type | Token Reduction |
|--------------|------|-----------------|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |

**Important**: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.

## Quick Start

### 1. Basic Implementation

```python
from scripts.nowait_processor import NOWAITLogitProcessor

# Initialize the processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)

# Use during generation
outputs = model.generate(
    inputs,
    logits_processor=[processor],
    max_new_tokens=32768
)
```

### 2. Keywords Suppressed

See `references/keywords.md` for the complete list. Core keywords:

```
wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah
```

## How It Works

1. **Initialize Keywords**: Identify reflection keywords from empirical analysis
2. **Expand to Token Variants**: Map each keyword to all of its token variants in the vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
3. **Suppress During Inference**: Set the logits of reflection tokens to large negative values during decoding

```
Logits (Before)        Logits (After)
Wait   0.8      →      Wait   -inf
First  0.6      →      First   0.6
Hmm    0.5      →      Hmm    -inf
Let    0.4      →      Let     0.4
```
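The three steps above can be sketched framework-agnostically. This is a minimal illustration, not the actual `scripts/nowait_processor.py` API; the function names and the toy vocabulary are assumptions for the example:

```python
# Minimal sketch of steps 1-3 over a plain dict vocabulary; the real processor
# wraps the same logic so it can run per decoding step inside generate().
KEYWORDS = ["wait", "hmm", "alternatively"]  # step 1 (abbreviated list)

def build_suppressed_ids(vocab, keywords):
    """Step 2: expand keywords to every vocabulary entry containing them."""
    lowered = [k.lower() for k in keywords]
    return {
        token_id
        for token_text, token_id in vocab.items()
        if any(k in token_text.lower() for k in lowered)
    }

def suppress(logits, suppressed_ids, value=float("-inf")):
    """Step 3: overwrite reflection-token logits before sampling."""
    return [value if i in suppressed_ids else x for i, x in enumerate(logits)]

# Toy vocabulary mirroring the diagram above
vocab = {"Wait": 0, "First": 1, "Hmm": 2, "Let": 3}
ids = build_suppressed_ids(vocab, KEYWORDS)
print(suppress([0.8, 0.6, 0.5, 0.4], ids))  # [-inf, 0.6, -inf, 0.4]
```

With the reflection logits at `-inf`, sampling can never pick "Wait" or "Hmm", so decoding continues along the non-reflective branch ("First", "Let").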
## Key Findings

### Why It Works

- NOWAIT doesn't eliminate self-reflection entirely; it guides models to skip **unnecessary** "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths

### RL vs Distilled Models

| Model Type | NOWAIT Effect | Recommendation |
|------------|---------------|----------------|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |

Distilled models rely heavily on the CoT structure of their training data; removing reflection tokens disrupts their reasoning patterns.

## Integration Examples

### HuggingFace Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor

model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")

processor = NOWAITLogitProcessor(tokenizer)

response = model.generate(
    tokenizer(prompt, return_tensors="pt").input_ids,
    logits_processor=[processor],
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.7
)
```

### vLLM

```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids

llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())

sampling_params = SamplingParams(
    max_tokens=32768,
    bad_words_ids=bad_words_ids
)
```
## Expected Results

| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|-----------|-----------------|---------------|-----------|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |

## Limitations

- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning

## References

- Paper: arXiv:2506.08343v2
- Complete keyword list: `references/keywords.md`
- Implementation: `scripts/nowait_processor.py`
# NOWAIT Keywords Reference

Complete reference for the reflection keywords used in the NOWAIT technique.

## Primary Keywords (from paper)

These keywords were empirically identified from 32 independent runs of QwQ-32B on AIME 2025, using `\n\n` as a delimiter to extract the most frequent monolingual transition words.

### Core Suppression List

```python
KEYWORDS = [
    "wait",           # Most common reflection trigger
    "alternatively",  # Indicates exploring a different approach
    "hmm",            # Hesitation marker
    "but",            # Contradiction/reconsideration
    "however",        # Contradiction/reconsideration
    "alternative",    # Exploring options
    "another",        # Switching approach
    "check",          # Verification trigger
    "double-check",   # Re-verification
    "oh",             # Realization marker
    "maybe",          # Uncertainty/reconsideration
    "verify",         # Verification trigger
    "other",          # Exploring alternatives
    "again",          # Repetition/re-check
    "now",            # Transition marker
    "ah",             # Realization marker
    "any",            # Exploring possibilities
]
```

## Excluded Patterns

These patterns should NOT be suppressed, as they are false positives:

```python
EXCLUDED = [
    "ohio",       # Contains "oh" but is a proper noun
    "butane",     # Contains "but" but is a chemical
    "button",     # Contains "but" but is a UI element
    "butterfly",  # Contains "but" but is a noun
    "checkout",   # Contains "check" but is a noun/verb
    "checksum",   # Contains "check" but is a technical term
    "another's",  # Possessive form, often necessary
]
```

## Token Expansion

For each keyword, the processor expands to all vocabulary variants:

| Keyword | Expanded Variants |
|---------|-------------------|
| wait | wait, Wait, WAIT, " wait", " Wait", ".wait", ",wait", etc. |
| hmm | hmm, Hmm, HMM, " hmm", "...hmm", etc. |
| alternatively | alternatively, Alternatively, " Alternatively", etc. |
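The fan-out in the table can be illustrated with a small helper. This is hypothetical: the actual processor derives variants by scanning the tokenizer vocabulary rather than generating strings, and the prefix set here is just the handful of leading characters shown above:

```python
# Hypothetical generator for the surface forms in the table: case variants
# combined with common leading characters.
PREFIXES = ["", " ", ".", ","]

def surface_variants(keyword):
    cases = {keyword.lower(), keyword.capitalize(), keyword.upper()}
    return sorted(prefix + form for prefix in PREFIXES for form in cases)

print(surface_variants("wait"))  # 12 variants, e.g. ' Wait', '.wait', 'WAIT'
```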
## Model-Specific Tuning

Different models may benefit from adjusted keyword lists:

### QwQ-32B / DeepSeek-R1
- Use the full default list
- High reduction potential (30%+)

### Phi4-Reasoning-Plus
- Use the full default list
- Consider adding: "let me think", "I wonder"

### Kimi-VL (Multimodal)
- Use the full default list
- Very high reduction (40-60%)
- May need domain-specific additions for visual tasks

### Qwen3 Series
- RL-based (32B): use the full list
- Distilled (4B/8B/14B): consider removing "but" and "however" to preserve some reasoning flow
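One way to encode the tuning notes above as configuration. This is a sketch: the dictionary keys and the exact per-model adjustments are illustrative, not part of the implementation:

```python
# Hypothetical per-model keyword configuration condensing the notes above.
CORE = ["wait", "alternatively", "hmm", "but", "however", "check",
        "double-check", "maybe", "verify", "again", "oh", "ah"]

MODEL_KEYWORDS = {
    "QwQ-32B": CORE,              # full default list
    "Phi4-Reasoning-Plus": CORE,  # plus phrase-level additions if supported
    "Kimi-VL-A3B": CORE,          # may grow domain-specific visual additions
    # Distilled Qwen3: keep "but"/"however" so the reasoning flow survives
    "Qwen3-8B": [k for k in CORE if k not in ("but", "however")],
}

print(len(MODEL_KEYWORDS["Qwen3-8B"]))  # 10
```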
## Keyword Categories

### Self-Reflection Markers
- `wait`, `hmm`, `oh`, `ah`
- Signal: the model is pausing to reconsider

### Verification Triggers
- `check`, `double-check`, `verify`
- Signal: the model is validating previous work

### Alternative Exploration
- `alternatively`, `alternative`, `another`, `other`
- Signal: the model is exploring different approaches

### Contradiction/Reconsideration
- `but`, `however`, `maybe`
- Signal: the model is reconsidering a previous conclusion

### Transition Markers
- `now`, `again`, `any`
- Signal: the model is shifting focus or repeating

## Benchmark Results by Keyword Removal

| Keywords Removed | AIME 2025 Accuracy | Token Reduction |
|------------------|--------------------|-----------------|
| None (baseline) | 66.67% | 0% |
| wait only | 67.33% | 15% |
| wait + hmm | 67.67% | 22% |
| All 17 keywords | 68.00% | 31% |

## Implementation Notes

### Logit Suppression Value
- Default: `-1e10` (effectively negative infinity)
- Alternative: `-100` (softer suppression, allows rare occurrences)
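The practical difference between the two values, sketched with a bare softmax over an assumed three-token distribution (the suppressed token comes first):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

hard = softmax([-1e10, 0.6, 0.4])   # default: probability underflows to 0.0
soft = softmax([-100.0, 0.6, 0.4])  # softer: tiny but nonzero probability
print(hard[0], soft[0])
```

With `-1e10` the suppressed token can never be sampled; with `-100` it remains technically reachable, which some deployments prefer as a safety valve.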
### Vocabulary Iteration

```python
def build_suppressed_tokens(tokenizer, keywords, excluded=EXCLUDED):
    """Collect the IDs of all vocabulary entries containing a keyword,
    skipping the false positives from the exclusion list."""
    suppressed = set()
    vocab = tokenizer.get_vocab()

    for token_text, token_id in vocab.items():
        text = token_text.lower()
        if any(pattern in text for pattern in excluded):
            continue  # e.g. "butane" contains "but" but is not a reflection marker
        if any(keyword.lower() in text for keyword in keywords):
            suppressed.add(token_id)

    return suppressed
```

### Performance Considerations
- The suppressed-token set is built once at initialization
- Lookup is O(1) per token during generation
- Memory overhead: a few KB for the token ID set