Commit 7071ade

fix: improve caching and don't raise error for bad gather configs (#373)
* merge
* fix: improve caching and don't raise error for bad gather configs
1 parent ea60f38 · commit 7071ade

File tree

- docetl/operations/gather.py
- docetl/operations/map.py
- docetl/runner.py
- docs/concepts/operators.md
- docs/operators/parallel-map.md

5 files changed: +99 −0 lines changed

docetl/operations/gather.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -295,6 +295,11 @@ def render_hierarchy_headers(
 
     # Find the largest/highest level in the current chunk
    current_chunk_headers = current_chunk.get(doc_header_key, [])
+
+    # If there are no headers in the current chunk, return an empty string
+    if not current_chunk_headers:
+        return ""
+
    highest_level = float("inf")  # Initialize with positive infinity
    for header_info in current_chunk_headers:
        try:
```
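For context, a minimal standalone sketch of what this guard changes; the chunk and header shapes below are assumptions for illustration, not the exact structures used in `gather.py`:

```python
# Standalone sketch of the guard's effect. A chunk with no tracked headers now
# contributes an empty header prefix instead of falling through to the level scan.
def headers_prefix(current_chunk: dict, doc_header_key: str) -> str:
    current_chunk_headers = current_chunk.get(doc_header_key, [])
    # New guard: no headers in this chunk means no hierarchy to render.
    if not current_chunk_headers:
        return ""
    # Otherwise find the highest (numerically smallest) header level, as the
    # surrounding function does with its float("inf") scan.
    highest_level = min(h.get("level", float("inf")) for h in current_chunk_headers)
    return f"highest header level: {highest_level}"

# A chunk produced by a misconfigured gather op simply yields no prefix:
print(headers_prefix({"text": "..."}, "headers"))              # -> ""
print(headers_prefix({"headers": [{"level": 2}]}, "headers"))  # -> "highest header level: 2"
```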

docetl/operations/map.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -702,6 +702,7 @@ def process_prompt(item, prompt_config):
        tools=prompt_config.get("tools", None),
        timeout_seconds=self.config.get("timeout", 120),
        max_retries_per_timeout=self.config.get("max_retries_per_timeout", 2),
+       gleaning_config=prompt_config.get("gleaning", None),
        bypass_cache=self.config.get("bypass_cache", False),
        litellm_completion_kwargs=self.config.get(
            "litellm_completion_kwargs", {}
```

docetl/runner.py

Lines changed: 28 additions & 0 deletions
```diff
@@ -626,6 +626,34 @@ def _save_checkpoint(
         with open(checkpoint_path, "w") as f:
             json.dump(data, f)
 
+        # Update the intermediate config file with the hash for this step/operation
+        # so that future runs can validate and reuse this checkpoint.
+        if self.intermediate_dir:
+            intermediate_config_path = os.path.join(
+                self.intermediate_dir, ".docetl_intermediate_config.json"
+            )
+
+            # Initialize or load existing intermediate configuration
+            if os.path.exists(intermediate_config_path):
+                try:
+                    with open(intermediate_config_path, "r") as cfg_file:
+                        intermediate_config: Dict[str, Dict[str, str]] = json.load(cfg_file)
+                except json.JSONDecodeError:
+                    # If the file is corrupted, start fresh to avoid crashes
+                    intermediate_config = {}
+            else:
+                intermediate_config = {}
+
+            # Ensure nested dict structure exists
+            step_dict = intermediate_config.setdefault(step_name, {})
+
+            # Write (or overwrite) the hash for the current operation
+            step_dict[operation_name] = self.step_op_hashes[step_name][operation_name]
+
+            # Persist the updated configuration
+            with open(intermediate_config_path, "w") as cfg_file:
+                json.dump(intermediate_config, cfg_file, indent=2)
+
         self.console.log(
             f"[green]✓ [italic]Intermediate saved for operation '{operation_name}' in step '{step_name}' at {checkpoint_path}[/italic][/green]"
         )
```
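The diff only shows the write path. A plausible sketch of the read path a future run could use, assuming the same `step_op_hashes` layout; the helper name is hypothetical, not DocETL's actual API:

```python
import json
import os
from typing import Dict


def checkpoint_is_valid(
    intermediate_dir: str,
    step_name: str,
    operation_name: str,
    step_op_hashes: Dict[str, Dict[str, str]],
) -> bool:
    """Hypothetical reader: reuse a checkpoint only when the hash recorded at
    save time matches the hash computed for the current pipeline config."""
    config_path = os.path.join(intermediate_dir, ".docetl_intermediate_config.json")
    if not os.path.exists(config_path):
        return False
    try:
        with open(config_path, "r") as f:
            saved: Dict[str, Dict[str, str]] = json.load(f)
    except json.JSONDecodeError:
        # Mirror the write path: a corrupted file is treated as no checkpoint.
        return False
    expected = step_op_hashes.get(step_name, {}).get(operation_name)
    recorded = saved.get(step_name, {}).get(operation_name)
    return expected is not None and expected == recorded
```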

docs/concepts/operators.md

Lines changed: 8 additions & 0 deletions
```diff
@@ -155,7 +155,11 @@ To enable gleaning, specify:
 
 - `validation_prompt`: Instructions for the LLM to evaluate and improve the output.
 - `num_rounds`: The maximum number of refinement iterations.
+<<<<<<< HEAD
 - `model` (optional): The model to use for the LLM executing the validation prompt. Defaults to the model specified for this operation. **Note that if the validator LLM determines the output needs to be improved, the final output will be generated by the model specified for this operation.**
+=======
+- `model` (optional): The model to use for the LLM executing the validation prompt. Defaults to the model specified for that operation.
+>>>>>>> 070110d (docs: improve gleaning description)
 
 Example:
 
@@ -193,7 +197,11 @@ Example map operation (with a different model for the validation prompt):
 
 !!! tip "Choosing a Different Model for Validation"
 
+<<<<<<< HEAD
     In the example above, the `gpt-4o` model is used to generate the main outputs, while the `gpt-4o-mini` model is used only for the validation and refinement steps. This means the more powerful (and expensive) model produces the final output, but a less expensive model handles the iterative validation, helping to reduce costs without sacrificing output quality.
+=======
+    You may want to use a different model for the validation prompt. For example, you can use a more powerful (and expensive) model for generating outputs, but a cheaper model for validation—especially if the validation only checks a single aspect. This approach helps reduce costs while still ensuring quality, since the final output is always produced by the more capable model.
+>>>>>>> 070110d (docs: improve gleaning description)
 
 ### How Gleaning Works
 
```
docs/operators/parallel-map.md

Lines changed: 57 additions & 0 deletions
````diff
@@ -26,6 +26,7 @@ Each prompt configuration in the `prompts` list should contain:
 - `prompt`: The prompt template to use for the transformation
 - `output_keys`: List of keys that this prompt will generate
 - `model` (optional): The language model to use for this specific prompt
+- `gleaning` (optional): Advanced validation settings for this prompt (see Per-Prompt Gleaning section below)
 
 ### Optional Parameters
 
@@ -59,6 +60,10 @@ Here's an example of a parallel map operation that processes job applications by
       prompt: "Given the following resume: '{{ input.resume }}', list the top 5 relevant skills for a software engineering position."
       output_keys:
         - skills
+      gleaning:
+        num_rounds: 1
+        validation_prompt: |
+          Confirm the skills list contains **exactly** 5 distinct skills and each skill is one or two words long.
       model: gpt-4o-mini
     - name: calculate_experience
       prompt: "Based on the work history in this resume: '{{ input.resume }}', calculate the total years of relevant experience for a software engineering role."
@@ -79,6 +84,57 @@ Here's an example of a parallel map operation that processes job applications by
 
 This Parallel Map operation processes job applications by concurrently extracting skills, calculating experience, and evaluating cultural fit.
 
+## Advanced Validation: Per-Prompt Gleaning
+
+Each prompt in a Parallel Map operation can include its own `gleaning` configuration. Gleaning works exactly as described in the [operators overview](../concepts/operators.md#advanced-validation-gleaning) but is **scoped to the individual LLM call** for that prompt. This allows you to tailor validation logic—and even the model used—to the specific transformation being performed.
+
+The structure of the `gleaning` block is identical:
+
+```yaml
+gleaning:
+  num_rounds: 1 # maximum refinement iterations
+  validation_prompt: | # judge prompt appended to the chat thread
+    Ensure the extracted skills list contains at least 5 distinct items.
+  model: gpt-4o-mini # (optional) model for the validator LLM
+```
+
+### 📄 Example with Per-Prompt Gleaning
+
+```yaml
+- name: process_job_application
+  type: parallel_map
+  prompts:
+    - name: extract_skills
+      prompt: "Given the following resume: '{{ input.resume }}', list the top 5 relevant skills for a software engineering position."
+      output_keys:
+        - skills
+      gleaning:
+        num_rounds: 1
+        validation_prompt: |
+          Confirm the skills list contains **exactly** 5 distinct skills and each skill is one or two words long.
+      model: gpt-4o-mini
+    - name: calculate_experience
+      prompt: "Based on the work history in this resume: '{{ input.resume }}', calculate the total years of relevant experience for a software engineering role."
+      output_keys:
+        - years_experience
+      gleaning:
+        num_rounds: 2
+        validation_prompt: |
+          Verify that the years of experience is a non-negative number and round to one decimal place if necessary.
+    - name: evaluate_cultural_fit
+      prompt: "Analyze the following cover letter: '{{ input.cover_letter }}'. Rate the candidate's potential cultural fit on a scale of 1-10, where 10 is the highest."
+      output_keys:
+        - cultural_fit_score
+      model: gpt-4o-mini
+  output:
+    schema:
+      skills: list[string]
+      years_experience: float
+      cultural_fit_score: integer
+```
+
+In this configuration, only the `extract_skills` and `calculate_experience` prompts use gleaning. Each prompt's validator runs **immediately after** its own LLM call and before the overall outputs are merged.
+
 ## Advantages
 
 1. **Concurrency**: Multiple transformations are applied simultaneously, potentially reducing overall processing time.
@@ -91,3 +147,4 @@ This Parallel Map operation processes job applications by concurrently extractin
 1. **Independent Transformations**: Ensure that the prompts in a Parallel Map operation are truly independent of each other to maximize the benefits of concurrent execution.
 2. **Balanced Prompts**: Try to design prompts that have similar complexity and execution times to optimize overall performance.
 3. **Output Schema Alignment**: Ensure that the output schema correctly captures all the fields generated by the individual prompts.
+4. **Lightweight Validators**: When using per-prompt gleaning, keep validation prompts concise so that the cost and latency overhead stays manageable.
````
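To make the "validator runs immediately after its own LLM call" note concrete, here is a rough control-flow sketch; `call_llm` and `run_gleaning` are hypothetical stand-ins, not DocETL's actual internals:

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_map(item, prompt_configs, call_llm, run_gleaning):
    """Assumed control flow for a parallel map: each prompt is executed (and,
    if configured, gleaned) independently; outputs are merged only at the end."""

    def run_one(cfg):
        output = call_llm(cfg["prompt"], item, model=cfg.get("model"))
        gleaning = cfg.get("gleaning")
        if gleaning:
            # The validator sees only this prompt's output, before any merging
            # with sibling prompt outputs.
            output = run_gleaning(output, gleaning)
        return {key: output[key] for key in cfg["output_keys"]}

    merged = dict(item)
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(run_one, prompt_configs):
            merged.update(partial)
    return merged
```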
