Commit 7071ade

fix: improve caching and don't raise error for bad gather configs (#373)
* merge
* fix: improve caching and don't raise error for bad gather configs
1 parent ea60f38 · commit 7071ade

File tree

- docetl/operations/gather.py
- docetl/operations/map.py
- docetl/runner.py
- docs/concepts/operators.md
- docs/operators/parallel-map.md

5 files changed: +99 −0 lines changed

docetl/operations/gather.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -295,6 +295,11 @@ def render_hierarchy_headers(
 
     # Find the largest/highest level in the current chunk
    current_chunk_headers = current_chunk.get(doc_header_key, [])
+
+    # If there are no headers in the current chunk, return an empty string
+    if not current_chunk_headers:
+        return ""
+
    highest_level = float("inf")  # Initialize with positive infinity
    for header_info in current_chunk_headers:
        try:
```
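For context, a minimal standalone sketch of what this guard changes; the chunk and header shapes below are assumptions for illustration, not the exact structures used in `gather.py`:

```python
# Standalone sketch of the guard's effect. A chunk with no tracked headers now
# contributes an empty header prefix instead of falling through to the level scan.
def headers_prefix(current_chunk: dict, doc_header_key: str) -> str:
    current_chunk_headers = current_chunk.get(doc_header_key, [])
    # New guard: no headers in this chunk means no hierarchy to render.
    if not current_chunk_headers:
        return ""
    # Otherwise find the highest (numerically smallest) header level, as the
    # surrounding function does with its float("inf") scan.
    highest_level = min(h.get("level", float("inf")) for h in current_chunk_headers)
    return f"highest header level: {highest_level}"

# A chunk produced by a misconfigured gather op simply yields no prefix:
print(headers_prefix({"text": "..."}, "headers"))              # -> ""
print(headers_prefix({"headers": [{"level": 2}]}, "headers"))  # -> "highest header level: 2"
```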

docetl/operations/map.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -702,6 +702,7 @@ def process_prompt(item, prompt_config):
        tools=prompt_config.get("tools", None),
        timeout_seconds=self.config.get("timeout", 120),
        max_retries_per_timeout=self.config.get("max_retries_per_timeout", 2),
+       gleaning_config=prompt_config.get("gleaning", None),
        bypass_cache=self.config.get("bypass_cache", False),
        litellm_completion_kwargs=self.config.get(
            "litellm_completion_kwargs", {}
```

docetl/runner.py

Lines changed: 28 additions & 0 deletions
```diff
@@ -626,6 +626,34 @@ def _save_checkpoint(
         with open(checkpoint_path, "w") as f:
             json.dump(data, f)
 
+        # Update the intermediate config file with the hash for this step/operation
+        # so that future runs can validate and reuse this checkpoint.
+        if self.intermediate_dir:
+            intermediate_config_path = os.path.join(
+                self.intermediate_dir, ".docetl_intermediate_config.json"
+            )
+
+            # Initialize or load existing intermediate configuration
+            if os.path.exists(intermediate_config_path):
+                try:
+                    with open(intermediate_config_path, "r") as cfg_file:
+                        intermediate_config: Dict[str, Dict[str, str]] = json.load(cfg_file)
+                except json.JSONDecodeError:
+                    # If the file is corrupted, start fresh to avoid crashes
+                    intermediate_config = {}
+            else:
+                intermediate_config = {}
+
+            # Ensure nested dict structure exists
+            step_dict = intermediate_config.setdefault(step_name, {})
+
+            # Write (or overwrite) the hash for the current operation
+            step_dict[operation_name] = self.step_op_hashes[step_name][operation_name]
+
+            # Persist the updated configuration
+            with open(intermediate_config_path, "w") as cfg_file:
+                json.dump(intermediate_config, cfg_file, indent=2)
+
         self.console.log(
             f"[green]✓ [italic]Intermediate saved for operation '{operation_name}' in step '{step_name}' at {checkpoint_path}[/italic][/green]"
         )
```
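The diff only shows the write path. A plausible sketch of the read path a future run could use, assuming the same `step_op_hashes` layout; the helper name is hypothetical, not DocETL's actual API:

```python
import json
import os
from typing import Dict


def checkpoint_is_valid(
    intermediate_dir: str,
    step_name: str,
    operation_name: str,
    step_op_hashes: Dict[str, Dict[str, str]],
) -> bool:
    """Hypothetical reader: reuse a checkpoint only when the hash recorded at
    save time matches the hash computed for the current pipeline config."""
    config_path = os.path.join(intermediate_dir, ".docetl_intermediate_config.json")
    if not os.path.exists(config_path):
        return False
    try:
        with open(config_path, "r") as f:
            saved: Dict[str, Dict[str, str]] = json.load(f)
    except json.JSONDecodeError:
        # Mirror the write path: a corrupted file is treated as no checkpoint.
        return False
    expected = step_op_hashes.get(step_name, {}).get(operation_name)
    recorded = saved.get(step_name, {}).get(operation_name)
    return expected is not None and expected == recorded
```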

docs/concepts/operators.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -155,7 +155,11 @@ To enable gleaning, specify:
 
 - `validation_prompt`: Instructions for the LLM to evaluate and improve the output.
 - `num_rounds`: The maximum number of refinement iterations.
+<<<<<<< HEAD
 - `model` (optional): The model to use for the LLM executing the validation prompt. Defaults to the model specified for this operation. **Note that if the validator LLM determines the output needs to be improved, the final output will be generated by the model specified for this operation.**
+=======
+- `model` (optional): The model to use for the LLM executing the validation prompt. Defaults to the model specified for that operation.
+>>>>>>> 070110d (docs: improve gleaning description)
 
 Example:
 
@@ -193,7 +197,11 @@ Example map operation (with a different model for the validation prompt):
 
 !!! tip "Choosing a Different Model for Validation"
 
+<<<<<<< HEAD
     In the example above, the `gpt-4o` model is used to generate the main outputs, while the `gpt-4o-mini` model is used only for the validation and refinement steps. This means the more powerful (and expensive) model produces the final output, but a less expensive model handles the iterative validation, helping to reduce costs without sacrificing output quality.
+=======
+    You may want to use a different model for the validation prompt. For example, you can use a more powerful (and expensive) model for generating outputs, but a cheaper model for validation—especially if the validation only checks a single aspect. This approach helps reduce costs while still ensuring quality, since the final output is always produced by the more capable model.
+>>>>>>> 070110d (docs: improve gleaning description)
 
 ### How Gleaning Works
 
```
docs/operators/parallel-map.md

Lines changed: 57 additions & 0 deletions
````diff
@@ -26,6 +26,7 @@ Each prompt configuration in the `prompts` list should contain:
 - `prompt`: The prompt template to use for the transformation
 - `output_keys`: List of keys that this prompt will generate
 - `model` (optional): The language model to use for this specific prompt
+- `gleaning` (optional): Advanced validation settings for this prompt (see Per-Prompt Gleaning section below)
 
 ### Optional Parameters
 
@@ -59,6 +60,10 @@ Here's an example of a parallel map operation that processes job applications by
       prompt: "Given the following resume: '{{ input.resume }}', list the top 5 relevant skills for a software engineering position."
       output_keys:
         - skills
+      gleaning:
+        num_rounds: 1
+        validation_prompt: |
+          Confirm the skills list contains **exactly** 5 distinct skills and each skill is one or two words long.
       model: gpt-4o-mini
     - name: calculate_experience
       prompt: "Based on the work history in this resume: '{{ input.resume }}', calculate the total years of relevant experience for a software engineering role."
@@ -79,6 +84,57 @@ Here's an example of a parallel map operation that processes job applications by
 
 This Parallel Map operation processes job applications by concurrently extracting skills, calculating experience, and evaluating cultural fit.
 
+## Advanced Validation: Per-Prompt Gleaning
+
+Each prompt in a Parallel Map operation can include its own `gleaning` configuration. Gleaning works exactly as described in the [operators overview](../concepts/operators.md#advanced-validation-gleaning) but is **scoped to the individual LLM call** for that prompt. This allows you to tailor validation logic—and even the model used—to the specific transformation being performed.
+
+The structure of the `gleaning` block is identical:
+
+```yaml
+gleaning:
+  num_rounds: 1 # maximum refinement iterations
+  validation_prompt: | # judge prompt appended to the chat thread
+    Ensure the extracted skills list contains at least 5 distinct items.
+  model: gpt-4o-mini # (optional) model for the validator LLM
+```
+
+### 📄 Example with Per-Prompt Gleaning
+
+```yaml
+- name: process_job_application
+  type: parallel_map
+  prompts:
+    - name: extract_skills
+      prompt: "Given the following resume: '{{ input.resume }}', list the top 5 relevant skills for a software engineering position."
+      output_keys:
+        - skills
+      gleaning:
+        num_rounds: 1
+        validation_prompt: |
+          Confirm the skills list contains **exactly** 5 distinct skills and each skill is one or two words long.
+      model: gpt-4o-mini
+    - name: calculate_experience
+      prompt: "Based on the work history in this resume: '{{ input.resume }}', calculate the total years of relevant experience for a software engineering role."
+      output_keys:
+        - years_experience
+      gleaning:
+        num_rounds: 2
+        validation_prompt: |
+          Verify that the years of experience is a non-negative number and round to one decimal place if necessary.
+    - name: evaluate_cultural_fit
+      prompt: "Analyze the following cover letter: '{{ input.cover_letter }}'. Rate the candidate's potential cultural fit on a scale of 1-10, where 10 is the highest."
+      output_keys:
+        - cultural_fit_score
+      model: gpt-4o-mini
+  output:
+    schema:
+      skills: list[string]
+      years_experience: float
+      cultural_fit_score: integer
+```
+
+In this configuration, only the `extract_skills` and `calculate_experience` prompts use gleaning. Each prompt's validator runs **immediately after** its own LLM call and before the overall outputs are merged.
+
 ## Advantages
 
 1. **Concurrency**: Multiple transformations are applied simultaneously, potentially reducing overall processing time.
@@ -91,3 +147,4 @@ This Parallel Map operation processes job applications by concurrently extractin
 1. **Independent Transformations**: Ensure that the prompts in a Parallel Map operation are truly independent of each other to maximize the benefits of concurrent execution.
 2. **Balanced Prompts**: Try to design prompts that have similar complexity and execution times to optimize overall performance.
 3. **Output Schema Alignment**: Ensure that the output schema correctly captures all the fields generated by the individual prompts.
+4. **Lightweight Validators**: When using per-prompt gleaning, keep validation prompts concise so that the cost and latency overhead stays manageable.
````
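To make the "validator runs immediately after its own LLM call" note concrete, here is a rough control-flow sketch; `call_llm` and `run_gleaning` are hypothetical stand-ins, not DocETL's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_map(item, prompt_configs, call_llm, run_gleaning):
    """Assumed control flow for a parallel map: each prompt is executed (and,
    if configured, gleaned) independently; outputs are merged only at the end."""

    def run_one(cfg):
        output = call_llm(cfg["prompt"], item, model=cfg.get("model"))
        gleaning = cfg.get("gleaning")
        if gleaning:
            # The validator sees only this prompt's output, before any merging
            # with sibling prompt outputs.
            output = run_gleaning(output, gleaning)
        return {key: output[key] for key in cfg["output_keys"]}

    merged = dict(item)
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(run_one, prompt_configs):
            merged.update(partial)
    return merged
```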
