@@ -40,24 +40,26 @@ docs/ # MkDocs documentation
4040 sieves/tests/ # Comprehensive test suite
4141 pyproject.toml # Dependencies, metadata, tool config
4242 mkdocs.yml # Documentation build config
43- CLAUDE.md # This file (Claude Code guidelines)
43+ AGENTS.md # This file (Claude Code guidelines)
4444 ```
4545
4646 ---
4747
4848 ## Installation & Setup
4949
50- **Python requirement:** 3.12+
50+ **Python requirement:** 3.12 (exact version required)
5151
5252 ### Using `uv` (preferred)
5353
5454 ```bash
55- uv sync # Base installation
55+ uv sync # Base installation (includes all engines)
5656 uv sync --extra distill # Add distillation (SetFit, Model2Vec)
57- uv sync --extra ingestion # Add document parsing (Docling, Marker)
57+ uv sync --extra ingestion # Add document parsing (Docling, Marker, NLTK)
5858 uv sync --all-extras # Everything (includes test tools)
5959 ```
6060
61+ **Note:** As of recent updates, all engines (Outlines, DSPy, LangChain, Transformers, GLiNER2) are now core dependencies included in the base installation.
62+
6163 ### Using pip (editable)
6264
6365 ```bash
@@ -147,9 +149,10 @@ uv run python -c "import sieves; print(sieves.__name__)"
147149 ### Core Abstractions
148150
149151 1. **Doc** (`sieves.data.Doc`)
150- - Container for text, URI, chunks, and processing results
151- - Auto-chunks text on initialization
152+ - Container for text, URI, chunks, images, and processing results
153+ - Auto-chunks text on initialization using Chonkie
152154 - `results` dict stores task outputs keyed by task ID
155+ - Supports image inputs via PIL (stored in `images` field)
153156
154157 2. **Pipeline** (`sieves.pipeline.Pipeline`)
155158 - Orchestrates sequential task execution
@@ -161,11 +164,14 @@ uv run python -c "import sieves; print(sieves.__name__)"
161164 - Base class for all processing steps
162165 - Subclasses: `PredictiveTask`, `Ingestion`, `Chunking`, `Optimization`
163166 - Defines `__call__(docs)` for processing
167+ - Supports conditional execution via `condition` parameter
168+ - Configurable batching via `batch_size` parameter
164169
165170 4. **Engine** (`sieves.engines.core.Engine`)
166171 - Generic interface to structured generation frameworks
167- - Implementations: DSPy, Outlines (default), LangChain, Transformers, GLiNER
172+ - Implementations: DSPy (v3), Outlines (default), LangChain, Transformers, GLiNER2
168173 - Each engine implements `build_executable()` to compile prompts
174+ - All engines are now core dependencies (no longer optional)
169175
170176 5. **Bridge** (`sieves.tasks.predictive.bridges.Bridge`)
171177 - Connects tasks to engines
@@ -174,7 +180,8 @@ uv run python -c "import sieves; print(sieves.__name__)"
174180
175181 6. **GenerationSettings** (`sieves.engines.types.GenerationSettings`)
176182 - Configures structured generation behavior
177- - Fields: `init_kwargs`, `inference_kwargs`, `strict_mode`, etc.
183+ - Fields: `init_kwargs`, `inference_kwargs`, `inference_mode`, `strict_mode`, batch settings
184+ - `strict_mode=True`: raises on inference failure; `False`: yields None for failed docs
178185
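The `condition`, `batch_size`, and `strict_mode` behaviors listed in this hunk can be illustrated with a small stand-alone sketch. It does not import or reproduce the sieves API: `SketchDoc`, `batched`, and `run_task` are hypothetical names, and passing skipped documents through unchanged is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


@dataclass
class SketchDoc:
    """Hypothetical stand-in for a document; not sieves.data.Doc."""
    text: str
    results: Dict[str, Any] = field(default_factory=dict)


def batched(docs: Iterable[SketchDoc], batch_size: int) -> Iterator[List[SketchDoc]]:
    """Yield lists of up to batch_size docs; -1 puts everything in one batch."""
    it = iter(docs)
    if batch_size == -1:
        batch = list(it)
        if batch:
            yield batch
        return
    if batch_size < 1:
        raise ValueError("batch_size must be -1 or a positive integer")
    while batch := list(islice(it, batch_size)):
        yield batch


def run_task(
    task_id: str,
    docs: Iterable[SketchDoc],
    infer: Callable[[str], Any],
    *,
    condition: Optional[Callable[[SketchDoc], bool]] = None,
    batch_size: int = -1,
    strict_mode: bool = True,
) -> List[SketchDoc]:
    """Apply infer to each doc that passes condition, batch by batch."""
    processed: List[SketchDoc] = []
    for batch in batched(docs, batch_size):
        for doc in batch:
            if condition is not None and not condition(doc):
                # Assumption: skipped docs pass through without a result entry.
                processed.append(doc)
                continue
            try:
                doc.results[task_id] = infer(doc.text)
            except Exception:
                if strict_mode:
                    raise  # strict: surface the inference failure
                doc.results[task_id] = None  # non-strict: record None instead
            processed.append(doc)
    return processed
```

With `strict_mode=False`, a failing document ends up with `None` under the task ID while the rest of the batch is still processed; with `strict_mode=True`, the first failure propagates to the caller.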
179186 ### Data Flow
180187
@@ -196,13 +203,13 @@ Return docs with populated results
196203
197204 ### Supported Engines
198205
199- | Engine | Type | Notes |
200- | --- | --- | --- |
201- | **Outlines** | Structured generation | Default; JSON schema constrained |
202- | **DSPy** | Modular prompting | Few-shot, optimizer support (MIPROv2) |
203- | **LangChain** | LLM wrapper | Chat models, tool calling |
204- | **Transformers** | Direct inference | Zero-shot classification |
205- | **GLiNER** | Specialized | Domain-specific NER |
206+ | Engine | Type | Inference Modes | Notes |
207+ | --- | --- | --- | --- |
208+ | **Outlines** | Structured generation | text, choice, regex, json | Default; JSON schema constrained |
209+ | **DSPy** (v3) | Modular prompting | predict, chain_of_thought, react, module | Few-shot, optimizer support (MIPROv2) |
210+ | **LangChain** | LLM wrapper | structured | Chat models, tool calling |
211+ | **Transformers** | Direct inference | zero_shot_classification | HuggingFace zero-shot classification pipeline |
212+ | **GLiNER2** | Specialized NER | (specialized) | Domain-specific NER, zero-shot entity recognition |
206213
207214 ---
208215
@@ -258,6 +265,8 @@ Enforced via CI pipeline:
258265 - Create under `sieves/tasks/preprocessing/<type_>/`
259266 - Subclass `Task` or specialized base (e.g., `Chunking`)
260267 - Examples: custom chunkers, PDF parsers, text normalizers
268+ - **Built-in chunking**: Uses Chonkie framework (token-based) or NaiveChunker (interval-based)
269+ - **Built-in ingestion**: Docling (default) and Marker converters for PDF/DOCX parsing
261270
262271 ### Few-Shot Examples
263272
@@ -272,19 +281,22 @@ Enforced via CI pipeline:
272281
273282 ### Model Distillation
274283
275- - Call `task.to_hf_dataset(docs, threshold=...)` to export results
284+ - Call `task.to_hf_dataset(docs, threshold=...)` to export results to HuggingFace dataset format
276285 - Use `task.distill(dataset, framework="setfit", ...)` to train smaller model
277286 - Supported frameworks: SetFit, Model2Vec
287+ - Available for classification, NER, and other predictive tasks
278288
279289 ---
280290
281291 ## Caching & Performance
282292
283- - **Document-level caching:** Pipeline hashes documents by text or URI; cache stores results
293+ - **Document-level caching:** Pipeline hashes documents by `hash(doc.text or doc.uri)`; cache stores results
284294 - **Disable when needed:** `Pipeline(use_cache=False)`
285- - **Batch processing:** Configure `_batch_size` in `GenerationSettings` (−1 = batch all)
295+ - **Batch processing:** Configure `batch_size` in task initialization or GenerationSettings (−1 = batch all)
286296 - **Streaming:** Tasks accept `Iterable[Doc]` for lazy evaluation on large corpora
297+ - **Conditional execution:** Use `condition` parameter on tasks to filter documents: `task(docs, condition=lambda d: len(d.text) > 100)`
287298 - **Observability:** Loguru logging during execution; access cache stats via pipeline
299+ - **Progress bars:** Configurable via task parameters (can be disabled)
288300
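As a rough illustration of the document-level caching bullet, the sketch below keys a cache on `hash(doc.text or doc.uri)` and yields documents lazily. It is not the sieves implementation: `SketchDoc`, `CachingRunner`, the `"sketch"` result key, and the `hits`/`misses` counters are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


@dataclass
class SketchDoc:
    """Hypothetical stand-in for a document; not sieves.data.Doc."""
    text: Optional[str] = None
    uri: Optional[str] = None
    results: Dict[str, Any] = field(default_factory=dict)


class CachingRunner:
    """Runs `process` once per distinct document key when caching is enabled."""

    def __init__(self, process: Callable[[str], Any], use_cache: bool = True) -> None:
        self._process = process
        self._use_cache = use_cache
        self._cache: Dict[int, Any] = {}
        self.hits = 0
        self.misses = 0

    def __call__(self, docs: Iterable[SketchDoc]) -> Iterator[SketchDoc]:
        # Generator: documents are processed lazily, one at a time.
        for doc in docs:
            # Text takes precedence; the URI is used only when text is empty or None.
            # Note: Python salts str hashes per process, so keys are not stable across runs.
            key = hash(doc.text or doc.uri)
            if self._use_cache and key in self._cache:
                self.hits += 1
                doc.results["sketch"] = self._cache[key]
            else:
                self.misses += 1
                result = self._process(doc.text or doc.uri or "")
                doc.results["sketch"] = result
                if self._use_cache:
                    self._cache[key] = result
            yield doc
```

Because the key is built from `text or uri`, two documents with identical text share one cache entry even if their URIs differ, which is the trade-off implied by hashing on text first.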
289301 ---
290302
@@ -307,9 +319,10 @@ Enforced via CI pipeline:
307319
308320 - Adhere to typing and lint rules; run mypy/ruff/black before proposing changes
309321 - Keep patches minimal and focused; avoid unrelated refactors
310- - Respect optional dependencies; gate engine-specific imports behind extras
322+ - Respect optional dependencies; gate ingestion/distillation imports behind extras (engines are now core)
311323 - Update docs (`docs/`) if you add public features
312324 - Write tests for new functionality
325+ - Consider conditional execution and error handling (`strict_mode`) for robust pipelines
313326
314327 ### Don't
315328
@@ -344,9 +357,11 @@ Before proposing changes, ensure:
344357 ## Known Constraints & Limitations
345358
346359 - Some engines do not support batching or few-shotting uniformly; bridge logic handles compatibility
347- - Optional extras gate heavy dependencies (transformers, Docling, SetFit, etc.)
360+ - Optional extras gate heavy dependencies (Docling, Marker for ingestion; SetFit, Model2Vec for distillation)
361+ - **All engines** (Outlines, DSPy, LangChain, Transformers, GLiNER2) are now **core dependencies**
348362 - Serialization excludes complex third-party objects (models, converters); must pass at load time
349363 - Ingestion tasks may require system packages (Tesseract for OCR, etc.)
364+ - Python 3.12 exact version required (not 3.12+)
350365
351366---
352367
@@ -403,4 +418,20 @@ Then run: `uv run pytest sieves/tests/test_my_feature.py -v`
403418
404419 ---
405420
421+ ## Recent Major Updates
422+
423+ Key changes that affect development (last ~2-3 months):
424+
425+ 1. **All Engines as Core Dependencies** (#210) - Outlines, DSPy, LangChain, Transformers, and GLiNER2 are now included in base installation
426+ 2. **DSPy v3 Migration** (#192) - Upgraded to DSPy v3 (breaking API changes from v2)
427+ 3. **GLiNER2 Migration** (#202) - Migrated from GLiNER v1 to GLiNER2 for improved NER performance
428+ 4. **GenerationSettings Refactoring** (#194) - `inference_mode` moved into GenerationSettings (simplified task init)
429+ 5. **Conditional Task Execution** (#195) - Added `condition` parameter for filtering docs during execution
430+ 6. **Non-strict Execution Support** (#196) - Better error handling; `strict_mode=False` allows graceful failures
431+ 7. **Standardized Output Fields** (#206) - Normalized descriptive/ID attribute naming across tasks
432+ 8. **Chonkie Integration** - Token-based chunking framework now primary chunking backend
433+ 9. **Optional Progress Bars** (#197) - Progress display now configurable per task
434+
435+ ---
436+
406437 For questions or updates to these guidelines, refer to maintainers or GitHub issues.