Commit d6258fb

rmitsch and Raphael Mitsch authored
docs: Revise docs (#225)
* docs: Update docs. Link code shown in docs with tests.
* docs: Decompose large code snippets into several smaller ones.
* docs: Improve guides.
* docs: Improvements phase 2.
* docs: Update docs.
* test: Update docs tests.
* docs: Update readme.
* fix: Fix serialization example.
* docs: Update readme.
* docs: Update readme.

---------

Co-authored-by: Raphael Mitsch <raphael@climatiq.com>
1 parent a1c1d54 commit d6258fb

31 files changed

+4233 -1227 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -43,5 +43,5 @@ repos:
 hooks:
 - id: ty-check
 name: Type checking
-entry: bash -c 'uvx ty check --config-file ty.toml'
+entry: bash -c 'uvx ty==0.0.1a29 check --config-file ty.toml'
 language: system

AGENTS.md

Lines changed: 51 additions & 20 deletions
@@ -40,24 +40,26 @@ docs/ # MkDocs documentation
 sieves/tests/ # Comprehensive test suite
 pyproject.toml # Dependencies, metadata, tool config
 mkdocs.yml # Documentation build config
-CLAUDE.md # This file (Claude Code guidelines)
+AGENTS.md # This file (Claude Code guidelines)
 ```
 
 ---
 
 ## Installation & Setup
 
-**Python requirement:** 3.12+
+**Python requirement:** 3.12 (exact version required)
 
 ### Using `uv` (preferred)
 
 ```bash
-uv sync # Base installation
+uv sync # Base installation (includes all engines)
 uv sync --extra distill # Add distillation (SetFit, Model2Vec)
-uv sync --extra ingestion # Add document parsing (Docling, Marker)
+uv sync --extra ingestion # Add document parsing (Docling, Marker, NLTK)
 uv sync --all-extras # Everything (includes test tools)
 ```
 
+**Note:** As of recent updates, all engines (Outlines, DSPy, LangChain, Transformers, GLiNER2) are now core dependencies included in the base installation.
+
 ### Using pip (editable)
 
 ```bash
@@ -147,9 +149,10 @@ uv run python -c "import sieves; print(sieves.__name__)"
 ### Core Abstractions
 
 1. **Doc** (`sieves.data.Doc`)
-- Container for text, URI, chunks, and processing results
-- Auto-chunks text on initialization
+- Container for text, URI, chunks, images, and processing results
+- Auto-chunks text on initialization using Chonkie
 - `results` dict stores task outputs keyed by task ID
+- Supports image inputs via PIL (stored in `images` field)
 
 2. **Pipeline** (`sieves.pipeline.Pipeline`)
 - Orchestrates sequential task execution
@@ -161,11 +164,14 @@ uv run python -c "import sieves; print(sieves.__name__)"
 - Base class for all processing steps
 - Subclasses: `PredictiveTask`, `Ingestion`, `Chunking`, `Optimization`
 - Defines `__call__(docs)` for processing
+- Supports conditional execution via `condition` parameter
+- Configurable batching via `batch_size` parameter
 
 4. **Engine** (`sieves.engines.core.Engine`)
 - Generic interface to structured generation frameworks
-- Implementations: DSPy, Outlines (default), LangChain, Transformers, GLiNER
+- Implementations: DSPy (v3), Outlines (default), LangChain, Transformers, GLiNER2
 - Each engine implements `build_executable()` to compile prompts
+- All engines are now core dependencies (no longer optional)
 
 5. **Bridge** (`sieves.tasks.predictive.bridges.Bridge`)
 - Connects tasks to engines
@@ -174,7 +180,8 @@ uv run python -c "import sieves; print(sieves.__name__)"
 
 6. **GenerationSettings** (`sieves.engines.types.GenerationSettings`)
 - Configures structured generation behavior
-- Fields: `init_kwargs`, `inference_kwargs`, `strict_mode`, etc.
+- Fields: `init_kwargs`, `inference_kwargs`, `inference_mode`, `strict_mode`, batch settings
+- `strict_mode=True`: raises on inference failure; `False`: yields None for failed docs
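Review note on the `strict_mode` bullet above: the raise-versus-`None` contract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the sieves implementation; `infer_all` and its arguments are hypothetical names introduced here, and only the described behaviour (raise when strict, `None` for failed items otherwise) comes from the diff.

```python
from collections.abc import Callable, Iterable
from typing import Any


def infer_all(
    items: Iterable[Any],
    infer: Callable[[Any], Any],
    strict_mode: bool = True,
) -> list[Any]:
    """Run infer() on every item, handling failures according to strict_mode.

    Hypothetical helper for illustration; not part of the sieves API.
    """
    results: list[Any] = []
    for item in items:
        try:
            results.append(infer(item))
        except Exception:
            if strict_mode:
                # Strict: the first failure aborts the whole run.
                raise
            # Non-strict: record None for the failed item and keep going.
            results.append(None)
    return results


# int("x") raises ValueError, so the second item fails.
lenient = infer_all(["1", "x", "3"], int, strict_mode=False)
print(lenient)  # [1, None, 3]
```

With `strict_mode=True` the same call would propagate the `ValueError` from the second item, so callers of a non-strict run need to be prepared for `None` entries in the output.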
 
 ### Data Flow
 
@@ -196,13 +203,13 @@ Return docs with populated results
 
 ### Supported Engines
 
-| Engine | Type | Notes |
-|---|---|---|
-| **Outlines** | Structured generation | Default; JSON schema constrained |
-| **DSPy** | Modular prompting | Few-shot, optimizer support (MIPROv2) |
-| **LangChain** | LLM wrapper | Chat models, tool calling |
-| **Transformers** | Direct inference | Zero-shot classification |
-| **GLiNER** | Specialized | Domain-specific NER |
+| Engine | Type | Inference Modes | Notes |
+|---|---|---|---|
+| **Outlines** | Structured generation | text, choice, regex, json | Default; JSON schema constrained |
+| **DSPy** (v3) | Modular prompting | predict, chain_of_thought, react, module | Few-shot, optimizer support (MIPROv2) |
+| **LangChain** | LLM wrapper | structured | Chat models, tool calling |
+| **Transformers** | Direct inference | zero_shot_classification | HuggingFace zero-shot classification pipeline |
+| **GLiNER2** | Specialized NER | (specialized) | Domain-specific NER, zero-shot entity recognition |
 
 ---
 
@@ -258,6 +265,8 @@ Enforced via CI pipeline:
 - Create under `sieves/tasks/preprocessing/<type_>/`
 - Subclass `Task` or specialized base (e.g., `Chunking`)
 - Examples: custom chunkers, PDF parsers, text normalizers
+- **Built-in chunking**: Uses Chonkie framework (token-based) or NaiveChunker (interval-based)
+- **Built-in ingestion**: Docling (default) and Marker converters for PDF/DOCX parsing
 
 ### Few-Shot Examples
 
@@ -272,19 +281,22 @@ Enforced via CI pipeline:
 
 ### Model Distillation
 
-- Call `task.to_hf_dataset(docs, threshold=...)` to export results
+- Call `task.to_hf_dataset(docs, threshold=...)` to export results to HuggingFace dataset format
 - Use `task.distill(dataset, framework="setfit", ...)` to train smaller model
 - Supported frameworks: SetFit, Model2Vec
+- Available for classification, NER, and other predictive tasks
 
 ---
 
 ## Caching & Performance
 
-- **Document-level caching:** Pipeline hashes documents by text or URI; cache stores results
+- **Document-level caching:** Pipeline hashes documents by `hash(doc.text or doc.uri)`; cache stores results
 - **Disable when needed:** `Pipeline(use_cache=False)`
-- **Batch processing:** Configure `_batch_size` in `GenerationSettings` (−1 = batch all)
+- **Batch processing:** Configure `batch_size` in task initialization or GenerationSettings (−1 = batch all)
 - **Streaming:** Tasks accept `Iterable[Doc]` for lazy evaluation on large corpora
+- **Conditional execution:** Use `condition` parameter on tasks to filter documents: `task(docs, condition=lambda d: len(d.text) > 100)`
 - **Observability:** Loguru logging during execution; access cache stats via pipeline
+- **Progress bars:** Configurable via task parameters (can be disabled)
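Review note on the document-level caching bullet above: the quoted key expression, `hash(doc.text or doc.uri)`, has two consequences worth seeing in code. The sketch below is an assumption-laden illustration, not the pipeline's cache: `Doc` is a stand-in dataclass, and `cache_key` and `ResultCache` are hypothetical names. It shows that documents with identical text share one cache entry regardless of URI, and that a document without text is keyed by its URI.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for sieves.data.Doc; only text and uri are modelled."""

    text: Optional[str] = None
    uri: Optional[str] = None


def cache_key(doc: Doc) -> int:
    # Mirrors the expression quoted in the diff: empty or missing text
    # falls back to the URI.
    return hash(doc.text or doc.uri)


class ResultCache:
    """Tiny in-memory cache keyed by cache_key(); counts hits and misses."""

    def __init__(self) -> None:
        self._store: dict[int, Any] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, doc: Doc, compute: Callable[[Doc], Any]) -> Any:
        key = cache_key(doc)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(doc)
        self._store[key] = value
        return value


cache = ResultCache()
docs = [Doc(text="same text"), Doc(text="same text", uri="a.pdf"), Doc(uri="b.pdf")]
lengths = [cache.get_or_compute(d, lambda d: len(d.text or d.uri)) for d in docs]
print(lengths, cache.hits, cache.misses)  # [9, 9, 5] 1 2
```

One caveat that follows from Python itself rather than from the diff: `str` hashes are salted per process unless `PYTHONHASHSEED` is fixed, so keys built this way are only comparable within a single interpreter run.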
 
 ---
 
@@ -307,9 +319,10 @@ Enforced via CI pipeline:
 
 - Adhere to typing and lint rules; run mypy/ruff/black before proposing changes
 - Keep patches minimal and focused; avoid unrelated refactors
-- Respect optional dependencies; gate engine-specific imports behind extras
+- Respect optional dependencies; gate ingestion/distillation imports behind extras (engines are now core)
 - Update docs (`docs/`) if you add public features
 - Write tests for new functionality
+- Consider conditional execution and error handling (`strict_mode`) for robust pipelines
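Review note on the conditional-execution guidance above: a rough sketch of how a `condition` filter and a `batch_size` setting (with −1 meaning "batch all") could interact on a lazily consumed iterable. Everything named here (`Doc`, `batched`, `run_task`, `predict`) is hypothetical and only illustrates the behaviour the diff describes. In particular, passing skipped documents through unchanged is an assumption, not confirmed sieves behaviour.

```python
from collections.abc import Callable, Iterable, Iterator
from dataclasses import dataclass, field
from itertools import islice
from typing import Any, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for sieves.data.Doc."""

    text: str
    results: dict[str, Any] = field(default_factory=dict)


def batched(docs: Iterable[Doc], batch_size: int) -> Iterator[list[Doc]]:
    """Yield lists of at most batch_size docs; -1 puts everything in one batch."""
    if batch_size != -1 and batch_size < 1:
        raise ValueError("batch_size must be -1 or a positive integer")
    it = iter(docs)
    if batch_size == -1:
        yield list(it)
        return
    while batch := list(islice(it, batch_size)):
        yield batch


def run_task(
    docs: Iterable[Doc],
    task_id: str,
    predict: Callable[[list[Doc]], list[Any]],
    condition: Optional[Callable[[Doc], bool]] = None,
    batch_size: int = -1,
) -> Iterator[Doc]:
    """Lazily process docs batch by batch, skipping docs that fail condition."""
    for batch in batched(docs, batch_size):
        selected = [d for d in batch if condition is None or condition(d)]
        if selected:
            for doc, output in zip(selected, predict(selected)):
                doc.results[task_id] = output
        # Assumption: skipped docs are passed through unchanged, in input order.
        yield from batch


docs = [Doc("short"), Doc("x" * 200), Doc("y" * 150)]
processed = list(
    run_task(
        docs,
        task_id="length",
        predict=lambda batch: [len(d.text) for d in batch],
        condition=lambda d: len(d.text) > 100,
        batch_size=2,
    )
)
print([d.results for d in processed])  # [{}, {'length': 200}, {'length': 150}]
```

Because `run_task` is a generator, nothing is processed until the caller iterates, which matches the streaming behaviour described for `Iterable[Doc]` inputs.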
 
 ### Don't
 
@@ -344,9 +357,11 @@ Before proposing changes, ensure:
 ## Known Constraints & Limitations
 
 - Some engines do not support batching or few-shotting uniformly; bridge logic handles compatibility
-- Optional extras gate heavy dependencies (transformers, Docling, SetFit, etc.)
+- Optional extras gate heavy dependencies (Docling, Marker for ingestion; SetFit, Model2Vec for distillation)
+- **All engines** (Outlines, DSPy, LangChain, Transformers, GLiNER2) are now **core dependencies**
 - Serialization excludes complex third-party objects (models, converters); must pass at load time
 - Ingestion tasks may require system packages (Tesseract for OCR, etc.)
+- Python 3.12 exact version required (not 3.12+)
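Review note on the "optional extras gate heavy dependencies" bullet above: one common way to gate such an import is to check for the module at the point of use and fail with a hint naming the extra. The helper below is a generic sketch, not code from the repository; `require_extra` is a hypothetical name, and it is only meant for top-level module names. The `uv sync --extra …` hint reuses the command shown in the installation section of the diff.

```python
import importlib
import importlib.util
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import a top-level module, or fail with a hint naming the extra to install.

    Hypothetical helper for illustration; not part of the sieves codebase.
    """
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed; install it with: uv sync --extra {extra}"
        )
    return importlib.import_module(module_name)


# A stdlib module stands in for an optional dependency so the sketch runs anywhere.
json_module = require_extra("json", extra="ingestion")
print(json_module.dumps([1]))  # [1]
```

Calling such a helper inside the function or class that needs the dependency, rather than at module import time, keeps the base package importable when the extra is not installed.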
 
 ---
 
@@ -403,4 +418,20 @@ Then run: `uv run pytest sieves/tests/test_my_feature.py -v`
 
 ---
 
+## Recent Major Updates
+
+Key changes that affect development (last ~2-3 months):
+
+1. **All Engines as Core Dependencies** (#210) - Outlines, DSPy, LangChain, Transformers, and GLiNER2 are now included in base installation
+2. **DSPy v3 Migration** (#192) - Upgraded to DSPy v3 (breaking API changes from v2)
+3. **GliNER2 Migration** (#202) - Migrated from GliNER v1 to GLiNER2 for improved NER performance
+4. **GenerationSettings Refactoring** (#194) - `inference_mode` moved into GenerationSettings (simplified task init)
+5. **Conditional Task Execution** (#195) - Added `condition` parameter for filtering docs during execution
+6. **Non-strict Execution Support** (#196) - Better error handling; `strict_mode=False` allows graceful failures
+7. **Standardized Output Fields** (#206) - Normalized descriptive/ID attribute naming across tasks
+8. **Chonkie Integration** - Token-based chunking framework now primary chunking backend
+9. **Optional Progress Bars** (#197) - Progress display now configurable per task
+
+---
+
 For questions or updates to these guidelines, refer to maintainers or GitHub issues.
