docs/pipeline.md (80 additions, 0 deletions)
@@ -40,4 +40,84 @@ Note: Ingestion libraries (e.g., `docling`) are optional and not installed by de
pip install "sieves[ingestion]"
```

## Conditional Task Execution

Tasks support optional conditional execution via the `condition` parameter. This allows you to skip processing certain documents based on custom logic, without materializing all documents upfront.

### Basic Usage

Pass a callable of type `Callable[[Doc], bool]` to any task to conditionally process documents:

```python
from sieves import Pipeline, tasks, Doc

docs = [
    Doc(text="short"),
    Doc(text="this is a much longer document that will be processed"),
    Doc(text="med"),
]

# Define a condition function
def is_long(doc: Doc) -> bool:
    return len(doc.text or "") > 20

# Create a task with a condition
# (`model` is assumed to have been created earlier)
task = tasks.Classification(
    labels=["science", "politics"],
    model=model,
    condition=is_long,
)

# Run pipeline
pipe = Pipeline([task])
for doc in pipe(docs):
    # doc.results[task.id] will be None for documents that failed the condition
    print(doc.results[task.id])
```

### Key Behaviors

- **Per-document evaluation**: The condition is evaluated for each document individually
- **Lazy evaluation**: Documents are not materialized upfront; passing documents are batched together for efficient processing
- **Result tracking**: Skipped documents have `results[task_id] = None`
- **Order preservation**: Document order is always maintained, regardless of which documents are skipped
- **No-op when None**: If `condition=None`, all documents are processed
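
Because skipped documents carry `None` as their result, downstream code usually has to tell the two cases apart. The following is a minimal sketch of one way to do that. `StubDoc` and `split_by_outcome` are hypothetical names introduced only for this illustration and are not part of `sieves`; the stub mimics just the `text` and `results` attributes so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StubDoc:
    """Hypothetical stand-in for a document; mimics only `text` and `results`."""
    text: str
    results: dict[str, Any] = field(default_factory=dict)


def split_by_outcome(docs, task_id):
    """Split docs into (processed, skipped) based on whether the stored result is None."""
    processed, skipped = [], []
    for doc in docs:
        # .get() treats a missing key the same way as an explicit None.
        if doc.results.get(task_id) is None:
            skipped.append(doc)
        else:
            processed.append(doc)
    return processed, skipped


# Simulated pipeline output: the second document was skipped by its condition.
docs_out = [
    StubDoc("a long document", {"clf": {"science": 0.9}}),
    StubDoc("short", {"clf": None}),
]
processed, skipped = split_by_outcome(docs_out, "clf")
```

With real documents, the same check would presumably be applied to `doc.results[task.id]`, as in the Basic Usage example.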

### Multiple Tasks with Different Conditions

Different tasks in a pipeline can have different conditions:

```python
from sieves import Pipeline, tasks, Doc

docs = [
    Doc(text="short"),
    Doc(text="this is a much longer document"),
    Doc(text="medium text here"),
]

# Task 1: Process only documents longer than 10 characters
# ...
```
All tasks support optional conditional execution through the `condition` parameter. This feature allows you to skip processing certain documents based on custom criteria without materializing all documents upfront.

### Overview

The `condition` parameter accepts an optional callable with signature `Callable[[Doc], bool]`:

```python
def condition(doc: Doc) -> bool:
    # Return True to process the document
    # Return False to skip it
    return True
```
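
Since a condition is just a callable from a document to a `bool`, several small conditions can be combined into one before being passed to a task. The sketch below shows one possible way. `all_of` and `StubDoc` are hypothetical helpers written for this example, not `sieves` APIs; the stub only provides the `text` and `meta` attributes that the conditions read.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StubDoc:
    """Hypothetical stand-in for a document; mimics only `text` and `meta`."""
    text: Optional[str] = None
    meta: dict[str, Any] = field(default_factory=dict)


Condition = Callable[[StubDoc], bool]


def all_of(*conditions: Condition) -> Condition:
    """Return a condition that passes only if every given condition passes."""
    def combined(doc: StubDoc) -> bool:
        return all(cond(doc) for cond in conditions)
    return combined


def has_text(doc: StubDoc) -> bool:
    return bool(doc.text)


def is_long(doc: StubDoc) -> bool:
    return len(doc.text or "") > 20


# A single callable that could be passed as `condition=...`.
should_process = all_of(has_text, is_long)
```

Keeping each check as its own small function makes the conditions easy to test in isolation, without running a pipeline.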

### Implementation Details

When a task is executed with a condition:

1. **Per-Document Evaluation**: Each document is evaluated against the condition individually
2. **Lazy Batching**: Only documents that pass the condition are batched together and sent to the task's `_call()` method
3. **Order Preservation**: Documents are returned in their original order, even if some were skipped
4. **Result Storage**: Skipped documents have `results[task_id] = None`
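
The four steps above can be sketched with plain Python iterators. This is an illustration under stated assumptions, not the actual `sieves` implementation: `run_conditionally`, `process_batch`, and `batch_size` are hypothetical names, and the sketch simplifies by chunking the input first and filtering within each chunk, so a batch may contain fewer than `batch_size` items.

```python
from itertools import islice


def run_conditionally(items, process_batch, condition=None, batch_size=2):
    """Lazily yield (item, result) pairs in input order; result is None for skipped items."""
    iterator = iter(items)
    while True:
        # Pull one chunk at a time so the full input is never materialized.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Step 1: evaluate the condition per item and record the passing indices.
        passing = [
            i for i, item in enumerate(chunk)
            if condition is None or condition(item)
        ]
        # Step 2: send only the passing items to the batch function.
        outputs = process_batch([chunk[i] for i in passing]) if passing else []
        # Steps 3 and 4: place outputs back by index; skipped slots stay None.
        results = [None] * len(chunk)
        for i, out in zip(passing, outputs):
            results[i] = out
        yield from zip(chunk, results)


texts = ["short", "this is a much longer document", "med", "another fairly long document"]
pairs = list(
    run_conditionally(
        texts,
        process_batch=lambda batch: [t.upper() for t in batch],
        condition=lambda t: len(t) > 20,
    )
)
```

Here strings stand in for documents and upper-casing stands in for the task. Tracking positions by index is what lets skipped items keep their place in the output.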

### Examples

#### Skip Documents by Size

```python
from sieves import tasks, Pipeline, Doc

# Only process documents longer than 100 characters
# (`model` is assumed to have been created earlier)
task = tasks.Classification(
    labels=["positive", "negative"],
    model=model,
    condition=lambda doc: len(doc.text or "") > 100,
)

pipe = Pipeline([task])
docs = [Doc(text="short"), Doc(text="a very long document " * 10)]
results = list(pipe(docs))

# First doc: results[task.id] is None (skipped)
# Second doc: results[task.id] contains classification results
```

#### Skip Documents Based on Metadata

```python
# Only process documents from specific sources
def should_process(doc: Doc) -> bool:
    return doc.meta.get("source") in ["source_a", "source_b"]
```

- **No Materialization**: Documents are processed using iterators; passing documents are batched together without materializing the entire document collection upfront
- **Index-Based Tracking**: The implementation uses document indices for efficient filtering and reordering
- **All Engines Supported**: Conditional execution works with all supported engines (DSPy, LangChain, Outlines, HuggingFace, GLiNER, etc.)
- **Serialization**: Non-callable condition values (like `None`) serialize naturally; callable conditions are serialized as placeholders