Commit 04d2460

feat: Add token count tracking (#238)

Authored by Raphael Mitsch (rmitsch).

* feat: Add token count tracking.
* chore: Update AGENTS.md. Cleanup.

Co-authored-by: Raphael Mitsch <raphael@climatiq.com>

1 parent 959cdbe, commit 04d2460

32 files changed: +478 -91 lines

AGENTS.md

Lines changed: 13 additions & 6 deletions

@@ -306,14 +306,20 @@ Enforced via CI pipeline:
 
 ## Observability & Serialization
 
-- **Logging:** Loguru integrated; logs task execution and model wrapper calls
+- **Logging:** `loguru` integrated; logs task execution and model wrapper calls.
+- **Raw Model Outputs:** Captured in `doc.meta[task_id]['raw']` as a list of raw responses per chunk when `include_meta=True` (default).
+- **Token Usage Tracking:**
+    - Tracked across the entire pipeline and aggregated in `doc.meta['usage']`.
+    - Also available per task in `doc.meta[task_id]['usage']`.
+    - Includes `input_tokens` and `output_tokens`.
+    - Uses native metadata for DSPy/LangChain and approximate estimation for other backends.
 - **Pipeline persistence:**
   ```python
-  pipe.dump("pipeline.yml")  # Save config
-  loaded = Pipeline.load("pipeline.yml", task_kwargs)  # Reload with model kwargs
+  pipe.dump("pipeline.yml")  # Save config.
+  loaded = Pipeline.load("pipeline.yml", task_kwargs)  # Reload with model kwargs.
   ```
-- **Document persistence:** Use pickle (models not serialized)
-- **Config format:** YAML-compatible via `sieves.serialization.Config`
+- **Document persistence:** Use pickle (models not serialized).
+- **Config format:** YAML-compatible via `sieves.serialization.Config`.
 
 ---
 
@@ -428,7 +434,8 @@ Then run: `uv run pytest sieves/tests/test_my_feature.py -v`
 
 Key changes that affect development (last ~2-3 months):
 
-1. **Information Extraction Single/Multi Mode** - Added `mode` parameter to `InformationExtraction` task for single vs multi entity extraction.
+1. **Token Counting and Raw Output Observability** - Implemented comprehensive token usage tracking (input/output) and raw model response capturing in `doc.meta`. Usage is aggregated per-task and per-document.
+2. **Information Extraction Single/Multi Mode** - Added `mode` parameter to `InformationExtraction` task for single vs multi entity extraction.
 2. **GliNERBridge Refactoring** - Consolidated NER logic into `GliNERBridge`, removing dedicated `GlinerNER` class.
 3. **Documentation Enhancements** - Standardized documentation with usage snippets (tested) and library links across all tasks and model wrappers.
 4. **All Model wrappers as Core Dependencies** (#210) - Outlines, DSPy, LangChain, Transformers, and GLiNER2 are now included in base installation
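The per-task and per-document aggregation described in the AGENTS.md hunk above can be sketched as follows. This is a hypothetical illustration of the aggregation semantics only, not the actual `sieves` implementation; `aggregate_usage` is an invented name.

```python
def aggregate_usage(task_usages: list[dict]) -> dict:
    """Sum per-task token usage dicts into a document-level total.

    Missing or None counts (e.g. from DSPy cache hits) are treated as 0.
    """
    total = {"input_tokens": 0, "output_tokens": 0}
    for usage in task_usages:
        for key in total:
            total[key] += usage.get(key) or 0
    return total

# Two tasks' usage entries roll up into one doc-level total.
print(aggregate_usage([
    {"input_tokens": 556, "output_tokens": 32},
    {"input_tokens": 120, "output_tokens": None},
]))  # -> {'input_tokens': 676, 'output_tokens': 32}
```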

README.md

Lines changed: 9 additions & 3 deletions

@@ -96,10 +96,16 @@ pipeline = Pipeline(task)
 # Define documents to analyze.
 doc = Doc(text="The new telescope captures images of distant galaxies.")
 
-# Run pipeline a print results.
+# Run pipeline and print results.
 results = list(pipeline([doc]))
-print(results[0].results)
-# This produces: {'Classification': [('science', 1.0), ('politics', 0.0)]}
+docs = list(pipeline([doc]))
+# The `results` field contains the structured task output.
+print(docs[0].results)  # {'Classification': [('science', 1.0), ('politics', 0.0)]}
+# The `meta` field contains more information helpful for observability and debugging, such as raw model output and token count information.
+print(docs[0].meta)  # {'Classification': {
+#     'raw': ['{ "science": 1.0, "politics": 0 }'],
+#     'usage': {'input_tokens': 2, 'output_tokens': 2, 'chunks': [{'input_tokens': 2, 'output_tokens': 2}]}}, 'usage': {'input_tokens': 2, 'output_tokens': 2}
+# }
 ```
 
 **3. Advanced: End-to-end document AI with a hosted LLM**

docs/doc.md

Lines changed: 5 additions & 1 deletion

@@ -1,13 +1,17 @@
 # Doc
 
-The `Doc` class is the fundamental unit of data in Sieves. It encapsulates the text to be processed, its associated metadata, and the results generated by various tasks in a pipeline.
+The `Doc` class is the fundamental unit of data in `sieves`. It encapsulates the text to be processed, its associated metadata, and the results generated by various tasks in a pipeline.
 
 ## Usage
 
 ```python
 --8<-- "sieves/tests/docs/test_doc_usage.py:doc-usage"
 ```
 
+## Metadata and Observability
+
+The `meta` field stores detailed execution traces, including raw model outputs and token usage statistics. This is particularly useful for debugging and cost monitoring. For a deep dive into how to use these features, see the [Observability and Usage Tracking guide](guides/observability.md).
+
 ---
 
 ::: sieves.data.doc

docs/guides/custom_tasks.md

Lines changed: 1 addition & 1 deletion

@@ -156,7 +156,7 @@ def consolidate(self, results: Sequence[TaskResult], docs_offsets: list[tuple[in
 **Why separate methods?**
 
 - Documents may exceed model context limits (e.g., 100-page PDFs vs 8K token limit)
-- Sieves automatically splits long documents into chunks for processing
+- `sieves` automatically splits long documents into chunks for processing
 - `integrate()` handles per-chunk results (stores immediately, no processing)
 - `consolidate()` aggregates chunks back into per-document results (averaging, voting, etc.)
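The consolidation step described in this hunk (averaging per-chunk results back into one per-document result) can be sketched roughly like this. It is a simplified stand-in, not a real `consolidate()` implementation; the function name is hypothetical.

```python
def consolidate_scores(chunk_scores: list[dict]) -> dict:
    """Average per-chunk label scores into a single per-document score dict."""
    totals: dict = {}
    for scores in chunk_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(chunk_scores) for label, total in totals.items()}

# Two chunks of one long document, consolidated by averaging.
print(consolidate_scores([
    {"science": 1.0, "politics": 0.0},
    {"science": 0.5, "politics": 0.5},
]))  # -> {'science': 0.75, 'politics': 0.25}
```

Voting would follow the same shape, replacing the average with an argmax over per-chunk winners.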

docs/guides/models.md

Lines changed: 2 additions & 2 deletions

@@ -1,10 +1,10 @@
 # Model Setup
 
-This guide explains how to set up models for use with sieves across different frameworks and providers.
+This guide explains how to set up models for use with `sieves` across different frameworks and providers.
 
 ## Overview
 
-sieves supports multiple a bunch of language model frameworks - each allowing different usage modes, pros and cons, and
+`sieves` supports a number of language model frameworks - each allowing different usage modes, pros and cons, and
 supporting different use cases.
 
 This table attempts to capture essential properties of each supported framework, including a very coarse categorization

docs/guides/observability.md

Lines changed: 100 additions & 0 deletions

@@ -0,0 +1,100 @@
+# Observability and Usage Tracking
+
+`sieves` provides built-in tools for monitoring your Document AI pipelines. By enabling metadata collection, you can inspect raw model responses and track token consumption for both local and remote models.
+
+## The `meta` Field
+
+Every `Doc` object in `sieves` contains a `meta` dictionary. When `include_meta=True` (which is the default for predictive tasks), this dictionary is populated with detailed execution traces.
+
+### Raw Model Outputs
+
+`sieves` captures the "raw" output from the underlying language model before it is parsed into your final structured format. This is invaluable for debugging prompt failures or investigating unexpected model behavior.
+
+The raw outputs are stored in `doc.meta[task_id]['raw']`. Since documents can be split into multiple chunks, this field contains a list of raw responses—one for each chunk.
+
+#### Example: Inspecting Raw Output
+
+```python
+from sieves.tasks import Classification
+
+# The include_meta flag is True by default.
+task = Classification(labels=["science", "politics"], model=model)
+results = list(task(docs))
+
+# Inspect raw model responses for the first document.
+print(results[0].meta['Classification']['raw'])
+```
+
+**Example Result for DSPy:**
+```python
+[
+    {
+        'prompt': None,
+        'messages': [...],
+        'response': ModelResponse(...),
+        'usage': {'prompt_tokens': 556, 'completion_tokens': 32, ...}
+    }
+]
+```
+
+**Example Result for Outlines (JSON mode):**
+```python
+['{"science": 0.95, "politics": 0.05}']
+```
+
+---
+
+## Token Usage Tracking
+
+`sieves` automatically tracks input and output tokens across your pipeline. Token data is aggregated at three levels:
+
+1. **Per Chunk**: Detailed usage for every individual model call.
+2. **Per Task**: Aggregated usage for a specific task within a document.
+3. **Per Document**: Running total of tokens consumed by a document across all tasks.
+
+### Accessing Usage Data
+
+Usage statistics are stored under the `usage` key in the metadata.
+
+* **Task-specific usage**: `doc.meta[task_id]['usage']`
+* **Total document usage**: `doc.meta['usage']`
+
+#### Example Usage Structure
+
+```python
+# The total tokens consumed by this document across the entire pipeline.
+total_usage = doc.meta['usage']
+print(f"Total Input: {total_usage['input_tokens']}, Total Output: {total_usage['output_tokens']}")
+
+# The detailed usage for a specific classification task.
+task_meta = doc.meta['Classification']
+print(f"Task Input: {task_meta['usage']['input_tokens']}")
+
+# The per-chunk usage for the classification task.
+for i, chunk_usage in enumerate(task_meta['usage']['chunks']):
+    print(f"Chunk {i}: {chunk_usage['input_tokens']} in, {chunk_usage['output_tokens']} out")
+```
+
+---
+
+## Native vs. Approximate Counting
+
+`sieves` uses a multi-tiered approach to ensure you always have token data, even when model frameworks don't provide it natively.
+
+### Native Tracking (DSPy & LangChain)
+
+For backends like **DSPy** and **LangChain**, `sieves` extracts token counts directly from the model provider's metadata (e.g., OpenAI or Anthropic response headers). This is the most accurate form of tracking.
+
+!!! note "DSPy Caching"
+    DSPy's internal caching may return 0 or `None` for tokens if a result is retrieved from the local cache rather than the remote API.
+
+### Approximate Estimation (Outlines, HuggingFace, GliNER)
+
+For local models or frameworks that don't expose native counts, `sieves` uses the model's own **tokenizer** to estimate usage:
+
+1. **Input Tokens**: Counted by encoding the fully rendered prompt string.
+2. **Output Tokens**: Counted by encoding the raw generated output string.
+
+If a local tokenizer is not available (e.g., when using a remote API via Outlines without a local weight clone), `sieves` will attempt to fall back to `tiktoken` (for OpenAI-compatible models) or return `None`.
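The approximate-estimation strategy in the new guide above can be sketched as follows. The helper and its name are hypothetical; any callable that maps text to a token sequence (a Hugging Face tokenizer's `encode`, or `tiktoken`'s `encode`) can be plugged in. A naive whitespace split stands in here so the sketch is self-contained.

```python
def estimate_usage(prompt: str, raw_output: str, encode) -> dict:
    """Estimate token usage by encoding the rendered prompt and raw output.

    `encode` is any callable mapping a string to a sequence of tokens,
    e.g. a Hugging Face tokenizer's `.encode` or a tiktoken encoding's `.encode`.
    """
    return {
        "input_tokens": len(encode(prompt)),
        "output_tokens": len(encode(raw_output)),
    }

# A naive whitespace "tokenizer" stands in for a real one here.
print(estimate_usage("Classify this text: the sky is blue.", '{"science": 1.0}', str.split))
# -> {'input_tokens': 7, 'output_tokens': 2}
```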

docs/guides/optimization.md

Lines changed: 4 additions & 4 deletions

@@ -202,7 +202,7 @@ Optimizer(
 
 ## Learning More About Optimization
 
-Sieves optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.vercel.app/api/optimizers/MIPROv2). For in-depth guidance on optimization techniques, training data quality, and interpreting results, we recommend exploring these external resources:
+`sieves` optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.vercel.app/api/optimizers/MIPROv2). For in-depth guidance on optimization techniques, training data quality, and interpreting results, we recommend exploring these external resources:
 
 ### Understanding MIPROv2
 
@@ -217,13 +217,13 @@ Sieves optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.ver
 - ⚙️ **Hyperparameter Tuning** - Adjusting `num_trials`, `num_candidates`, and other optimizer settings for better results
 - 🎯 **Evaluation Metrics** - Choosing the right metrics for your task (see Evaluation Metrics section above)
 
-### Sieves-Specific Integration
+### `sieves`-Specific Integration
 
-The main differences when using optimization in Sieves:
+The main differences when using optimization in `sieves`:
 
 - **Simplified API**: Use `task.optimize(optimizer)` instead of calling DSPy optimizers directly
 - **Automatic integration**: Optimized prompts and few-shot examples are automatically integrated into the task
 - **Task compatibility**: Works with all `PredictiveTask` subclasses (Classification, NER, InformationExtraction, etc.)
 - **Full parameter access**: All DSPy optimizer parameters are available via the `Optimizer` class constructor
 
-For questions specific to Sieves optimization integration, see the [Troubleshooting](#troubleshooting) section above or consult the [task-specific documentation](../tasks/predictive/classification.md) for evaluation metrics.
+For questions specific to `sieves` optimization integration, see the [Troubleshooting](#troubleshooting) section above or consult the [task-specific documentation](../tasks/predictive/classification.md) for evaluation metrics.

docs/guides/serialization.md

Lines changed: 19 additions & 19 deletions

@@ -117,11 +117,11 @@ The configuration file contains:
 **Solution**: Check the YAML file to see which parameters are marked as placeholders:
 
 ```python
-# Read the config to see what placeholders exist
+# Read the config to see what placeholders exist.
 import yaml
 with open("pipeline.yml", "r") as f:
     config = yaml.safe_load(f)
-print(config)  # Look for "is_placeholder: true" entries
+print(config)  # Look for "is_placeholder: true" entries.
 ```
 
 Provide `init_params` for each task that has placeholders:
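The "look for `is_placeholder: true` entries" advice in the hunk above can be automated with a small helper. This is a hypothetical utility for illustration (`find_placeholders` is not part of `sieves`); it assumes the config is an already-parsed dict such as the `yaml.safe_load` result.

```python
def find_placeholders(config):
    """Recursively collect paths of entries marked `is_placeholder: true`."""
    found = []

    def walk(node, path):
        if isinstance(node, dict):
            if node.get("is_placeholder") is True:
                found.append(path)
            for key, value in node.items():
                walk(value, path + [key])
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, path + [i])

    walk(config, [])
    return found

# A parsed pipeline config where one task's model was not serialized.
config = {"tasks": [{"task_id": "my_classifier", "model": {"is_placeholder": True}}]}
print(find_placeholders(config))  # -> [['tasks', 0, 'model']]
```

Each returned path points at a parameter you must supply via `init_params` when loading.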
@@ -130,17 +130,17 @@ Provide `init_params` for each task that has placeholders:
 loaded_pipeline = Pipeline.load(
     "pipeline.yml",
     [
-        {"model": your_model},  # Task 0 placeholders
-        {"tokenizer": your_tokenizer},  # Task 1 placeholders
+        {"model": your_model},  # Task 0 placeholders.
+        {"tokenizer": your_tokenizer},  # Task 1 placeholders.
     ]
 )
 ```
 
 #### Version compatibility warnings
 
-**Symptom**: Warning about sieves version mismatch when loading pipelines.
+**Symptom**: Warning about `sieves` version mismatch when loading pipelines.
 
-**Cause**: The pipeline was saved with a different version of sieves than you're currently using.
+**Cause**: The pipeline was saved with a different version of `sieves` than you're currently using.
 
 **Impact**:
 
@@ -149,10 +149,10 @@ loaded_pipeline = Pipeline.load(
 
 **Solution**:
 ```bash
-# Install the version that was used to create the pipeline
-pip install sieves==0.11.1  # Match the version in the YAML
+# Install the version that was used to create the pipeline.
+pip install sieves==0.11.1  # Match the version in the YAML.
 
-# Or: Update the pipeline by re-saving it with the current version
+# Or: Update the pipeline by re-saving it with the current version.
 pipeline.dump("pipeline_updated.yml")
 ```
 
@@ -165,30 +165,30 @@ pipeline.dump("pipeline_updated.yml")
 **Solution**: Mark these as placeholders by ensuring they're provided during pipeline creation, then supply them again during load:
 
 ```python
-# When creating the pipeline
+# When creating the pipeline.
 custom_task = MyCustomTask(complex_object=my_object)
 pipeline = Pipeline([custom_task])
-pipeline.dump("pipeline.yml")  # complex_object becomes a placeholder
+pipeline.dump("pipeline.yml")  # complex_object becomes a placeholder.
 
-# When loading
+# When loading.
 loaded = Pipeline.load("pipeline.yml", [{"complex_object": my_object}])
 ```
 
 #### Model weights not loading
 
 **Symptom**: Loaded pipeline doesn't have model weights.
 
-**Cause**: sieves doesn't save model weights in configuration files (they're too large).
+**Cause**: `sieves` doesn't save model weights in configuration files (they're too large).
 
 **Solution**: Always provide fresh model instances in `init_params`:
 
 ```python
-# Load the model separately (weights will be downloaded/loaded)
+# Load the model separately (weights will be downloaded/loaded).
 model = outlines.models.transformers(
     "HuggingFaceTB/SmolLM-135M-Instruct"
 )
 
-# Then load the pipeline with the model
+# Then load the pipeline with the model.
 loaded = Pipeline.load("pipeline.yml", [{"model": model}])
 ```
 
@@ -201,14 +201,14 @@ loaded = Pipeline.load("pipeline.yml", [{"model": model}])
 **Solution**: Specify explicit task IDs when creating tasks:
 
 ```python
-# When creating
+# When creating.
 classifier = tasks.Classification(
     labels=["science", "politics"],
     model=model,
-    task_id="my_classifier"  # Explicit ID
+    task_id="my_classifier"  # Explicit ID.
 )
 
-# The results will always be in doc.results["my_classifier"]
+# The results will always be in doc.results["my_classifier"].
 ```
 
 ### Best Practices
 
@@ -217,7 +217,7 @@ classifier = tasks.Classification(
 2. **Document init_params**: Add comments explaining what placeholders need
 3. **Test load immediately**: After saving, try loading to catch serialization issues
 4. **Separate model loading**: Keep model initialization code separate from pipeline config
-5. **Use version pinning**: Pin sieves version in requirements.txt for reproducibility
+5. **Use version pinning**: Pin `sieves` version in requirements.txt for reproducibility
 
 ## Related Guides

docs/model_wrappers/dspy.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # DSPy
 
-[DSPy](https://dspy.ai/) is a framework for programming with language models. Sieves integrates with DSPy's `dspy.LM` class.
+[DSPy](https://dspy.ai/) is a framework for programming with language models. `sieves` integrates with DSPy's `dspy.LM` class.
 
 ## Usage

docs/model_wrappers/huggingface.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # Hugging Face
 
-Sieves supports [Hugging Face](https://huggingface.co/) `pipelines` for zero-shot classification.
+`sieves` supports [Hugging Face](https://huggingface.co/) `pipelines` for zero-shot classification.
 
 ## Usage
