
[WIP] Multi-branch executor #916

Open
yxdyc wants to merge 5 commits into main from dev/branch_exec

Conversation

@yxdyc
Collaborator

@yxdyc yxdyc commented Feb 13, 2026

Multi-branch execution from a common checkpoint: executor_type: branch with common_process and per-branch process / export_path.

  • supports thread (default) and ray backends
  • optional run_common_on_ray for running the common stage on Ray
  • uses object store when possible
  • minimal serializable configs
  • fail_fast + retries
  • docs in BranchExecution.md
  • future TODOs in code.
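
The bullets above can be made concrete with a config sketch. This is a hypothetical minimal example of the described shape (the branch layout and all paths/operator choices are placeholders; only the keys quoted in this PR — `executor_type`, `common_process`, `process`, `export_path`, `backend`, `run_common_on_ray`, `fail_fast`, `retries` — come from the description):

```yaml
# Hypothetical branch-executor config; exact schema may differ from the PR.
executor_type: branch
dataset_path: path/to/input.jsonl   # placeholder
backend: thread                     # or: ray
run_common_on_ray: false
fail_fast: true
retries: 0

common_process:                     # shared steps, run once
  - language_id_score_filter:
      lang: en

branches:                           # per-branch steps and outputs
  clean:
    export_path: path/to/clean.jsonl   # placeholder
    process:
      - text_length_filter:
          min_len: 10
  dedup:
    export_path: path/to/dedup.jsonl   # placeholder
    process:
      - document_simhash_deduplicator: {}
```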

yxdyc and others added 5 commits February 11, 2026 17:54
- Add BranchExecutor for parallel branch execution from common checkpoint
- Support common_process and branches configuration
- Auto-deduplicate common processing steps
- Each branch has its own export_path
- Update factory and config to support branch executor type
…mo configs

- Remove branch_mvp / branch_mvp_ray / branch_mvp_ray_pure yamls (configs live in docs + dj-hub)
- Consolidate MVP examples in BranchExecution.md; point to dj-hub for full configs
- Add Future TODOs in branch_executor.py (DAG mixin, Ray resources, validation, tests)
- Keep demos/branch_execution_example.yaml as single reference

Co-authored-by: Cursor <cursoragent@cursor.com>
…te demos/branch_execution_example.yaml

Co-authored-by: Cursor <cursoragent@cursor.com>
@yxdyc yxdyc added the `enhancement` (New feature or request) and `dj:core` (issues/PRs about the core functions of Data-Juicer) labels Feb 13, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yxdyc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a powerful new multi-branch execution capability, allowing users to define a common data processing pipeline followed by multiple divergent branches that operate on the common output. This design promotes efficiency by executing shared steps only once and enables parallel exploration of different processing paths, with support for both local threading and distributed Ray backends. It also includes robust error handling with fail-fast and retry mechanisms, along with detailed documentation to guide usage.

Highlights

  • New Multi-Branch Executor: Introduced a new branch executor type, enabling multi-branch data processing from a common checkpoint, with support for both thread and ray backends.
  • Flexible Execution Options: Implemented features such as fail_fast behavior, per-branch retries, and configurable parallel execution for branches, enhancing robustness and control.
  • Enhanced Documentation: Added comprehensive documentation in BranchExecution.md detailing the new multi-branch execution feature, including configuration, use cases, and implementation specifics.
  • Refactored Executor Initialization: Refactored the main process_data.py script to use an ExecutorFactory for dynamic instantiation of executor types, improving modularity and extensibility.
  • Filter Operator Updates: Integrated CombinedLogicalFilter and LogicalFilter into the filter operator registry, expanding filtering capabilities.
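
The fail-fast and per-branch retry semantics from the second highlight can be sketched in plain Python. This is an illustration of the behavior on the default `thread` backend, not the PR's actual implementation; the function name and signature are assumptions:

```python
import concurrent.futures

def run_branches(branches, fail_fast=True, retries=0, max_workers=4):
    """Run branch callables in a thread pool (illustrative sketch).

    Each branch is retried up to `retries` times; with fail_fast=True,
    the first failure cancels branches that have not started yet.
    """
    results, errors = {}, {}

    def with_retries(fn):
        for attempt in range(retries + 1):
            try:
                return fn()
            except Exception:
                if attempt == retries:
                    raise  # exhausted retries: surface the error

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(with_retries, fn): name for name, fn in branches.items()}
        for fut in concurrent.futures.as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                errors[name] = exc
                if fail_fast:
                    for pending in futures:
                        pending.cancel()
                    break
    return results, errors
```

With `fail_fast=False`, all branches run to completion and failures are collected in `errors` instead of aborting the job.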


Changelog
  • data_juicer/config/config.py
    • Added 'branch' as a new choice for the 'executor_type' configuration option.
  • data_juicer/core/executor/branch_executor.py
    • Added a new BranchExecutor class to manage multi-branch data processing workflows.
    • Implemented logic for executing a common process and then running multiple branches in parallel or sequentially.
    • Included support for 'thread' and 'ray' backends, with an option to run the common process on Ray.
    • Integrated event logging and a minimal PipelineDAG for monitoring branch execution.
    • Provided mechanisms for 'fail_fast' execution and per-branch retries to enhance robustness.
    • Handled serialization of configurations for Ray tasks to ensure proper distributed execution.
  • data_juicer/core/executor/factory.py
    • Registered the new BranchExecutor within the ExecutorFactory for dynamic instantiation based on configuration.
  • data_juicer/ops/filter/__init__.py
    • Imported and exposed CombinedLogicalFilter and LogicalFilter modules to the filter operator collection.
  • docs/BranchExecution.md
    • Added a new documentation file detailing the branch execution feature, including its overview, configuration, use cases, benefits, execution modes (thread/Ray), and implementation details.
  • docs/Operators.md
    • Updated the total count of filter operators in the operator table.
    • Added an entry for combined_logical_filter in the operator table with its description.
  • tools/process_data.py
    • Refactored the executor initialization logic to use the ExecutorFactory instead of a series of conditional statements, improving code organization.
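
The factory refactor described for tools/process_data.py can be illustrated with a registry sketch. The class and method names below are assumptions for illustration; the real `ExecutorFactory` API in Data-Juicer may differ:

```python
# Hypothetical registry-based executor factory, replacing a chain of
# if/elif checks on executor_type with a lookup table.
class ExecutorFactory:
    _registry = {}

    @classmethod
    def register(cls, name):
        def deco(executor_cls):
            cls._registry[name] = executor_cls
            return executor_cls
        return deco

    @classmethod
    def create(cls, cfg):
        executor_type = cfg.get("executor_type", "default")
        if executor_type not in cls._registry:
            raise ValueError(f"Unknown executor_type: {executor_type}")
        return cls._registry[executor_type](cfg)

@ExecutorFactory.register("branch")
class BranchExecutor:
    """Stand-in for the PR's BranchExecutor."""
    def __init__(self, cfg):
        self.cfg = cfg
```

New executor types then only need to register themselves; the entry script stays unchanged.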

@yxdyc yxdyc linked an issue Feb 13, 2026 that may be closed by this pull request

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new BranchExecutor for running multi-branch data processing pipelines from a common checkpoint. This is a significant and well-implemented feature, even in its WIP state, that supports both thread and ray backends, includes fail_fast and retry mechanisms, and uses minimal serializable configs for the Ray backend to reduce serialization overhead. The code is well-structured, and the addition of BranchExecution.md provides clear documentation.

My review includes a few minor suggestions for improving code clarity and efficiency, such as avoiding redundant calculations and imports. I also have a suggestion to improve the readability of the new documentation file.

Overall, this is a great addition to the project.

```python
            ref = ray.put(dataset)
            return {"ref": ref}
        except Exception:
            import os
```

Severity: medium

The os module is already imported at the top of the file. This redundant import inside an except block should be removed for code clarity.
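
The pattern the excerpt above belongs to — put the common dataset in the object store when possible, fall back to disk otherwise — can be sketched backend-agnostically. This is an illustration, not the PR's actual helpers; `put`/`get` stand in for `ray.put`/`ray.get` on the Ray backend:

```python
# Hypothetical sketch: share a dataset via an object store when one is
# available, else pickle it to a temp file and hand out the path.
import os
import pickle
import tempfile

def share_dataset(dataset, put=None):
    """Return {"ref": ...} for object-store sharing or {"path": ...} for disk."""
    if put is not None:
        try:
            return {"ref": put(dataset)}  # e.g. ray.put on the Ray backend
        except Exception:
            pass  # object store unavailable: fall back to disk below
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(dataset, f)
    return {"path": path}

def load_dataset(handle, get=None):
    """Resolve a handle produced by share_dataset back into the dataset."""
    if "ref" in handle:
        return get(handle["ref"])  # e.g. ray.get
    with open(handle["path"], "rb") as f:
        return pickle.load(f)
```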

Comment on lines +218 to +224
```python
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {time.time() - run_start:.2f}s",
        duration=time.time() - run_start,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

Severity: medium

The duration time.time() - run_start is calculated twice. It's more efficient and readable to calculate it once and store it in a variable.

Suggested change

```python
# before
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {time.time() - run_start:.2f}s",
        duration=time.time() - run_start,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

```python
# after
if hasattr(self, "_log_event"):
    job_duration = time.time() - run_start
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {job_duration:.2f}s",
        duration=job_duration,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

Comment on lines +500 to +506
```python
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {time.time() - start:.2f}s",
        operation_name=branch_name,
        duration=time.time() - start,
    )
```

Severity: medium

The duration time.time() - start is calculated twice. It's more efficient and readable to calculate it once and store it in a variable.

Suggested change

```python
# before
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {time.time() - start:.2f}s",
        operation_name=branch_name,
        duration=time.time() - start,
    )
```

```python
# after
if hasattr(self, "_log_event"):
    duration = time.time() - start
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {duration:.2f}s",
        operation_name=branch_name,
        duration=duration,
    )
```

```yaml
retries: 0  # per-branch retry count (default: 0)
```

- With `backend: ray`, each branch runs as a Ray remote task; common dataset is shared via object store when possible, else via disk. Branch configs are minimal. When `run_common_on_ray: true`, the common process also runs as one Ray task (dataset_path and work_dir must be worker-accessible).

Severity: medium

This line is quite long and contains multiple distinct points. For better readability, consider breaking it down into a nested list.

Suggested change

```markdown
<!-- before -->
- With `backend: ray`, each branch runs as a Ray remote task; common dataset is shared via object store when possible, else via disk. Branch configs are minimal. When `run_common_on_ray: true`, the common process also runs as one Ray task (dataset_path and work_dir must be worker-accessible).
```

```markdown
<!-- after -->
- With `backend: ray`:
  - Each branch runs as a Ray remote task.
  - The common dataset is shared via the object store when possible, otherwise via disk.
  - Branch configs are minimal to reduce serialization overhead.
  - When `run_common_on_ray: true`, the common process also runs as a Ray task, which requires `dataset_path` and `work_dir` to be accessible by all workers.
```



Development

Successfully merging this pull request may close these issues.

Multi-Branch Execution (DAG feature enhancement)
