Commit 04d2460

feat: Add token count tracking (#238)

Authored by Raphael Mitsch (rmitsch).

* feat: Add token count tracking.
* chore: Update AGENTS.md. Cleanup.

Co-authored-by: Raphael Mitsch <raphael@climatiq.com>

1 parent 959cdbe, commit 04d2460

32 files changed: +478 -91 lines

AGENTS.md

Lines changed: 13 additions & 6 deletions

@@ -306,14 +306,20 @@ Enforced via CI pipeline:
 
 ## Observability & Serialization
 
-- **Logging:** Loguru integrated; logs task execution and model wrapper calls
+- **Logging:** `loguru` integrated; logs task execution and model wrapper calls.
+- **Raw Model Outputs:** Captured in `doc.meta[task_id]['raw']` as a list of raw responses per chunk when `include_meta=True` (default).
+- **Token Usage Tracking:**
+    - Tracked across the entire pipeline and aggregated in `doc.meta['usage']`.
+    - Also available per task in `doc.meta[task_id]['usage']`.
+    - Includes `input_tokens` and `output_tokens`.
+    - Uses native metadata for DSPy/LangChain and approximate estimation for other backends.
 - **Pipeline persistence:**
   ```python
-  pipe.dump("pipeline.yml")  # Save config
-  loaded = Pipeline.load("pipeline.yml", task_kwargs)  # Reload with model kwargs
+  pipe.dump("pipeline.yml")  # Save config.
+  loaded = Pipeline.load("pipeline.yml", task_kwargs)  # Reload with model kwargs.
   ```
-- **Document persistence:** Use pickle (models not serialized)
-- **Config format:** YAML-compatible via `sieves.serialization.Config`
+- **Document persistence:** Use pickle (models not serialized).
+- **Config format:** YAML-compatible via `sieves.serialization.Config`.
 
 ---
 
@@ -428,7 +434,8 @@ Then run: `uv run pytest sieves/tests/test_my_feature.py -v`
 
 Key changes that affect development (last ~2-3 months):
 
-1. **Information Extraction Single/Multi Mode** - Added `mode` parameter to `InformationExtraction` task for single vs multi entity extraction.
+1. **Token Counting and Raw Output Observability** - Implemented comprehensive token usage tracking (input/output) and raw model response capturing in `doc.meta`. Usage is aggregated per-task and per-document.
+2. **Information Extraction Single/Multi Mode** - Added `mode` parameter to `InformationExtraction` task for single vs multi entity extraction.
 2. **GliNERBridge Refactoring** - Consolidated NER logic into `GliNERBridge`, removing dedicated `GlinerNER` class.
 3. **Documentation Enhancements** - Standardized documentation with usage snippets (tested) and library links across all tasks and model wrappers.
 4. **All Model wrappers as Core Dependencies** (#210) - Outlines, DSPy, LangChain, Transformers, and GLiNER2 are now included in base installation
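The per-task and per-document aggregation described in the AGENTS.md hunk above can be sketched as follows. This is a hypothetical illustration of the aggregation semantics only, not the actual `sieves` implementation; `aggregate_usage` is an invented name.

```python
def aggregate_usage(task_usages: list[dict]) -> dict:
    """Sum per-task token usage dicts into a document-level total.

    Missing or None counts (e.g. from DSPy cache hits) are treated as 0.
    """
    total = {"input_tokens": 0, "output_tokens": 0}
    for usage in task_usages:
        for key in total:
            total[key] += usage.get(key) or 0
    return total

# Two tasks' usage entries roll up into one doc-level total.
print(aggregate_usage([
    {"input_tokens": 556, "output_tokens": 32},
    {"input_tokens": 120, "output_tokens": None},
]))  # -> {'input_tokens': 676, 'output_tokens': 32}
```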

README.md

Lines changed: 9 additions & 3 deletions

@@ -96,10 +96,16 @@ pipeline = Pipeline(task)
 # Define documents to analyze.
 doc = Doc(text="The new telescope captures images of distant galaxies.")
 
-# Run pipeline a print results.
+# Run pipeline and print results.
 results = list(pipeline([doc]))
-print(results[0].results)
-# This produces: {'Classification': [('science', 1.0), ('politics', 0.0)]}
+docs = list(pipeline([doc]))
+# The `results` field contains the structured task output.
+print(docs[0].results)  # {'Classification': [('science', 1.0), ('politics', 0.0)]}
+# The `meta` field contains more information helpful for observability and debugging, such as raw model output and token count information.
+print(docs[0].meta)  # {'Classification': {
+#     'raw': ['{ "science": 1.0, "politics": 0 }'],
+#     'usage': {'input_tokens': 2, 'output_tokens': 2, 'chunks': [{'input_tokens': 2, 'output_tokens': 2}]}}, 'usage': {'input_tokens': 2, 'output_tokens': 2}
+# }
 ```
 
 **3. Advanced: End-to-end document AI with a hosted LLM**

docs/doc.md

Lines changed: 5 additions & 1 deletion

@@ -1,13 +1,17 @@
 # Doc
 
-The `Doc` class is the fundamental unit of data in Sieves. It encapsulates the text to be processed, its associated metadata, and the results generated by various tasks in a pipeline.
+The `Doc` class is the fundamental unit of data in `sieves`. It encapsulates the text to be processed, its associated metadata, and the results generated by various tasks in a pipeline.
 
 ## Usage
 
 ```python
 --8<-- "sieves/tests/docs/test_doc_usage.py:doc-usage"
 ```
 
+## Metadata and Observability
+
+The `meta` field stores detailed execution traces, including raw model outputs and token usage statistics. This is particularly useful for debugging and cost monitoring. For a deep dive into how to use these features, see the [Observability and Usage Tracking guide](guides/observability.md).
+
 ---
 
 ::: sieves.data.doc

docs/guides/custom_tasks.md

Lines changed: 1 addition & 1 deletion

@@ -156,7 +156,7 @@ def consolidate(self, results: Sequence[TaskResult], docs_offsets: list[tuple[in
 **Why separate methods?**
 
 - Documents may exceed model context limits (e.g., 100-page PDFs vs 8K token limit)
-- Sieves automatically splits long documents into chunks for processing
+- `sieves` automatically splits long documents into chunks for processing
 - `integrate()` handles per-chunk results (stores immediately, no processing)
 - `consolidate()` aggregates chunks back into per-document results (averaging, voting, etc.)
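The consolidation step described in this hunk (averaging per-chunk results back into one per-document result) can be sketched roughly like this. It is a simplified stand-in, not a real `consolidate()` implementation; the function name is hypothetical.

```python
def consolidate_scores(chunk_scores: list[dict]) -> dict:
    """Average per-chunk label scores into a single per-document score dict."""
    totals: dict = {}
    for scores in chunk_scores:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return {label: total / len(chunk_scores) for label, total in totals.items()}

# Two chunks of one long document, consolidated by averaging.
print(consolidate_scores([
    {"science": 1.0, "politics": 0.0},
    {"science": 0.5, "politics": 0.5},
]))  # -> {'science': 0.75, 'politics': 0.25}
```

Voting would follow the same shape, replacing the average with an argmax over per-chunk winners.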

docs/guides/models.md

Lines changed: 2 additions & 2 deletions

@@ -1,10 +1,10 @@
 # Model Setup
 
-This guide explains how to set up models for use with sieves across different frameworks and providers.
+This guide explains how to set up models for use with `sieves` across different frameworks and providers.
 
 ## Overview
 
-sieves supports multiple a bunch of language model frameworks - each allowing different usage modes, pros and cons, and
+`sieves` supports a number of language model frameworks - each allowing different usage modes, pros and cons, and
 supporting different use cases.
 
 This table attempts to capture essential properties of each supported framework, including a very coarse categorization

docs/guides/observability.md

Lines changed: 100 additions & 0 deletions

@@ -0,0 +1,100 @@
+# Observability and Usage Tracking
+
+`sieves` provides built-in tools for monitoring your Document AI pipelines. By enabling metadata collection, you can inspect raw model responses and track token consumption for both local and remote models.
+
+## The `meta` Field
+
+Every `Doc` object in `sieves` contains a `meta` dictionary. When `include_meta=True` (which is the default for predictive tasks), this dictionary is populated with detailed execution traces.
+
+### Raw Model Outputs
+
+`sieves` captures the "raw" output from the underlying language model before it is parsed into your final structured format. This is invaluable for debugging prompt failures or investigating unexpected model behavior.
+
+The raw outputs are stored in `doc.meta[task_id]['raw']`. Since documents can be split into multiple chunks, this field contains a list of raw responses—one for each chunk.
+
+#### Example: Inspecting Raw Output
+
+```python
+from sieves.tasks import Classification
+
+# The include_meta flag is True by default.
+task = Classification(labels=["science", "politics"], model=model)
+results = list(task(docs))
+
+# Inspect raw model responses for the first document.
+print(results[0].meta['Classification']['raw'])
+```
+
+**Example Result for DSPy:**
+```python
+[
+    {
+        'prompt': None,
+        'messages': [...],
+        'response': ModelResponse(...),
+        'usage': {'prompt_tokens': 556, 'completion_tokens': 32, ...}
+    }
+]
+```
+
+**Example Result for Outlines (JSON mode):**
+```python
+['{"science": 0.95, "politics": 0.05}']
+```
+
+---
+
+## Token Usage Tracking
+
+`sieves` automatically tracks input and output tokens across your pipeline. Token data is aggregated at three levels:
+
+1. **Per Chunk**: Detailed usage for every individual model call.
+2. **Per Task**: Aggregated usage for a specific task within a document.
+3. **Per Document**: Running total of tokens consumed by a document across all tasks.
+
+### Accessing Usage Data
+
+Usage statistics are stored under the `usage` key in the metadata.
+
+* **Task-specific usage**: `doc.meta[task_id]['usage']`
+* **Total document usage**: `doc.meta['usage']`
+
+#### Example Usage Structure
+
+```python
+# The total tokens consumed by this document across the entire pipeline.
+total_usage = doc.meta['usage']
+print(f"Total Input: {total_usage['input_tokens']}, Total Output: {total_usage['output_tokens']}")
+
+# The detailed usage for a specific classification task.
+task_meta = doc.meta['Classification']
+print(f"Task Input: {task_meta['usage']['input_tokens']}")
+
+# The per-chunk usage for the classification task.
+for i, chunk_usage in enumerate(task_meta['usage']['chunks']):
+    print(f"Chunk {i}: {chunk_usage['input_tokens']} in, {chunk_usage['output_tokens']} out")
+```
+
+---
+
+## Native vs. Approximate Counting
+
+`sieves` uses a multi-tiered approach to ensure you always have token data, even when model frameworks don't provide it natively.
+
+### Native Tracking (DSPy & LangChain)
+
+For backends like **DSPy** and **LangChain**, `sieves` extracts token counts directly from the model provider's metadata (e.g., OpenAI or Anthropic response headers). This is the most accurate form of tracking.
+
+!!! note "DSPy Caching"
+    DSPy's internal caching may return 0 or `None` for tokens if a result is retrieved from the local cache rather than the remote API.
+
+### Approximate Estimation (Outlines, HuggingFace, GliNER)
+
+For local models or frameworks that don't expose native counts, `sieves` uses the model's own **tokenizer** to estimate usage:
+
+1. **Input Tokens**: Counted by encoding the fully rendered prompt string.
+2. **Output Tokens**: Counted by encoding the raw generated output string.
+
+If a local tokenizer is not available (e.g., when using a remote API via Outlines without a local weight clone), `sieves` will attempt to fall back to `tiktoken` (for OpenAI-compatible models) or return `None`.
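The approximate-estimation strategy in the new guide above can be sketched as follows. The helper and its name are hypothetical; any callable that maps text to a token sequence (a Hugging Face tokenizer's `encode`, or `tiktoken`'s `encode`) can be plugged in. A naive whitespace split stands in here so the sketch is self-contained.

```python
def estimate_usage(prompt: str, raw_output: str, encode) -> dict:
    """Estimate token usage by encoding the rendered prompt and raw output.

    `encode` is any callable mapping a string to a sequence of tokens,
    e.g. a Hugging Face tokenizer's `.encode` or a tiktoken encoding's `.encode`.
    """
    return {
        "input_tokens": len(encode(prompt)),
        "output_tokens": len(encode(raw_output)),
    }

# A naive whitespace "tokenizer" stands in for a real one here.
print(estimate_usage("Classify this text: the sky is blue.", '{"science": 1.0}', str.split))
# -> {'input_tokens': 7, 'output_tokens': 2}
```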

docs/guides/optimization.md

Lines changed: 4 additions & 4 deletions

@@ -202,7 +202,7 @@ Optimizer(
 
 ## Learning More About Optimization
 
-Sieves optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.vercel.app/api/optimizers/MIPROv2). For in-depth guidance on optimization techniques, training data quality, and interpreting results, we recommend exploring these external resources:
+`sieves` optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.vercel.app/api/optimizers/MIPROv2). For in-depth guidance on optimization techniques, training data quality, and interpreting results, we recommend exploring these external resources:
 
 ### Understanding MIPROv2
 
@@ -217,13 +217,13 @@ Sieves optimization is built on [DSPy's MIPROv2 optimizer](https://dspy-docs.ver
 - ⚙️ **Hyperparameter Tuning** - Adjusting `num_trials`, `num_candidates`, and other optimizer settings for better results
 - 🎯 **Evaluation Metrics** - Choosing the right metrics for your task (see Evaluation Metrics section above)
 
-### Sieves-Specific Integration
+### `sieves`-Specific Integration
 
-The main differences when using optimization in Sieves:
+The main differences when using optimization in `sieves`:
 
 - **Simplified API**: Use `task.optimize(optimizer)` instead of calling DSPy optimizers directly
 - **Automatic integration**: Optimized prompts and few-shot examples are automatically integrated into the task
 - **Task compatibility**: Works with all `PredictiveTask` subclasses (Classification, NER, InformationExtraction, etc.)
 - **Full parameter access**: All DSPy optimizer parameters are available via the `Optimizer` class constructor
 
-For questions specific to Sieves optimization integration, see the [Troubleshooting](#troubleshooting) section above or consult the [task-specific documentation](../tasks/predictive/classification.md) for evaluation metrics.
+For questions specific to `sieves` optimization integration, see the [Troubleshooting](#troubleshooting) section above or consult the [task-specific documentation](../tasks/predictive/classification.md) for evaluation metrics.

docs/guides/serialization.md

Lines changed: 19 additions & 19 deletions

@@ -117,11 +117,11 @@ The configuration file contains:
 **Solution**: Check the YAML file to see which parameters are marked as placeholders:
 
 ```python
-# Read the config to see what placeholders exist
+# Read the config to see what placeholders exist.
 import yaml
 with open("pipeline.yml", "r") as f:
     config = yaml.safe_load(f)
-print(config)  # Look for "is_placeholder: true" entries
+print(config)  # Look for "is_placeholder: true" entries.
 ```
 
 Provide `init_params` for each task that has placeholders:
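The "look for `is_placeholder: true` entries" advice in the hunk above can be automated with a small helper. This is a hypothetical utility for illustration (`find_placeholders` is not part of `sieves`); it assumes the config is an already-parsed dict such as the `yaml.safe_load` result.

```python
def find_placeholders(config):
    """Recursively collect paths of entries marked `is_placeholder: true`."""
    found = []

    def walk(node, path):
        if isinstance(node, dict):
            if node.get("is_placeholder") is True:
                found.append(path)
            for key, value in node.items():
                walk(value, path + [key])
        elif isinstance(node, list):
            for i, value in enumerate(node):
                walk(value, path + [i])

    walk(config, [])
    return found

# A parsed pipeline config where one task's model was not serialized.
config = {"tasks": [{"task_id": "my_classifier", "model": {"is_placeholder": True}}]}
print(find_placeholders(config))  # -> [['tasks', 0, 'model']]
```

Each returned path points at a parameter you must supply via `init_params` when loading.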
@@ -130,17 +130,17 @@ Provide `init_params` for each task that has placeholders:
 loaded_pipeline = Pipeline.load(
     "pipeline.yml",
     [
-        {"model": your_model},  # Task 0 placeholders
-        {"tokenizer": your_tokenizer},  # Task 1 placeholders
+        {"model": your_model},  # Task 0 placeholders.
+        {"tokenizer": your_tokenizer},  # Task 1 placeholders.
     ]
 )
 ```
 
 #### Version compatibility warnings
 
-**Symptom**: Warning about sieves version mismatch when loading pipelines.
+**Symptom**: Warning about `sieves` version mismatch when loading pipelines.
 
-**Cause**: The pipeline was saved with a different version of sieves than you're currently using.
+**Cause**: The pipeline was saved with a different version of `sieves` than you're currently using.
 
 **Impact**:
 
@@ -149,10 +149,10 @@ loaded_pipeline = Pipeline.load(
 
 **Solution**:
 ```bash
-# Install the version that was used to create the pipeline
-pip install sieves==0.11.1  # Match the version in the YAML
+# Install the version that was used to create the pipeline.
+pip install sieves==0.11.1  # Match the version in the YAML.
 
-# Or: Update the pipeline by re-saving it with the current version
+# Or: Update the pipeline by re-saving it with the current version.
 pipeline.dump("pipeline_updated.yml")
 ```
 
@@ -165,30 +165,30 @@ pipeline.dump("pipeline_updated.yml")
 **Solution**: Mark these as placeholders by ensuring they're provided during pipeline creation, then supply them again during load:
 
 ```python
-# When creating the pipeline
+# When creating the pipeline.
 custom_task = MyCustomTask(complex_object=my_object)
 pipeline = Pipeline([custom_task])
-pipeline.dump("pipeline.yml")  # complex_object becomes a placeholder
+pipeline.dump("pipeline.yml")  # complex_object becomes a placeholder.
 
-# When loading
+# When loading.
 loaded = Pipeline.load("pipeline.yml", [{"complex_object": my_object}])
 ```
 
 #### Model weights not loading
 
 **Symptom**: Loaded pipeline doesn't have model weights.
 
-**Cause**: sieves doesn't save model weights in configuration files (they're too large).
+**Cause**: `sieves` doesn't save model weights in configuration files (they're too large).
 
 **Solution**: Always provide fresh model instances in `init_params`:
 
 ```python
-# Load the model separately (weights will be downloaded/loaded)
+# Load the model separately (weights will be downloaded/loaded).
 model = outlines.models.transformers(
     "HuggingFaceTB/SmolLM-135M-Instruct"
 )
 
-# Then load the pipeline with the model
+# Then load the pipeline with the model.
 loaded = Pipeline.load("pipeline.yml", [{"model": model}])
 ```
 
@@ -201,14 +201,14 @@ loaded = Pipeline.load("pipeline.yml", [{"model": model}])
 **Solution**: Specify explicit task IDs when creating tasks:
 
 ```python
-# When creating
+# When creating.
 classifier = tasks.Classification(
     labels=["science", "politics"],
     model=model,
-    task_id="my_classifier"  # Explicit ID
+    task_id="my_classifier"  # Explicit ID.
 )
 
-# The results will always be in doc.results["my_classifier"]
+# The results will always be in doc.results["my_classifier"].
 ```
 
 ### Best Practices
 
@@ -217,7 +217,7 @@ classifier = tasks.Classification(
 2. **Document init_params**: Add comments explaining what placeholders need
 3. **Test load immediately**: After saving, try loading to catch serialization issues
 4. **Separate model loading**: Keep model initialization code separate from pipeline config
-5. **Use version pinning**: Pin sieves version in requirements.txt for reproducibility
+5. **Use version pinning**: Pin `sieves` version in requirements.txt for reproducibility
 
 ## Related Guides

docs/model_wrappers/dspy.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # DSPy
 
-[DSPy](https://dspy.ai/) is a framework for programming with language models. Sieves integrates with DSPy's `dspy.LM` class.
+[DSPy](https://dspy.ai/) is a framework for programming with language models. `sieves` integrates with DSPy's `dspy.LM` class.
 
 ## Usage

docs/model_wrappers/huggingface.md

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # Hugging Face
 
-Sieves supports [Hugging Face](https://huggingface.co/) `pipelines` for zero-shot classification.
+`sieves` supports [Hugging Face](https://huggingface.co/) `pipelines` for zero-shot classification.
 
 ## Usage
