
[WIP] Multi-branch executor #916

Open
yxdyc wants to merge 5 commits into main from dev/branch_exec

Conversation

@yxdyc
Collaborator

@yxdyc yxdyc commented Feb 13, 2026

Multi-branch execution from a common checkpoint: executor_type: branch with common_process and per-branch process / export_path.

  • supports thread (default) and ray backends
  • optional run_common_on_ray for running the common stage on Ray
  • uses object store when possible
  • minimal serializable configs
  • fail_fast + retries
  • docs in BranchExecution.md
  • future TODOs in code.
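
The bullets above can be made concrete with a config sketch. This is a hypothetical minimal example of the described shape (the branch layout and all paths/operator choices are placeholders; only the keys quoted in this PR — `executor_type`, `common_process`, `process`, `export_path`, `backend`, `run_common_on_ray`, `fail_fast`, `retries` — come from the description):

```yaml
# Hypothetical branch-executor config; exact schema may differ from the PR.
executor_type: branch
dataset_path: path/to/input.jsonl   # placeholder
backend: thread                     # or: ray
run_common_on_ray: false
fail_fast: true
retries: 0

common_process:                     # shared steps, run once
  - language_id_score_filter:
      lang: en

branches:                           # per-branch steps and outputs
  clean:
    export_path: path/to/clean.jsonl   # placeholder
    process:
      - text_length_filter:
          min_len: 10
  dedup:
    export_path: path/to/dedup.jsonl   # placeholder
    process:
      - document_simhash_deduplicator: {}
```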

yxdyc and others added 5 commits February 11, 2026 17:54
- Add BranchExecutor for parallel branch execution from common checkpoint
- Support common_process and branches configuration
- Auto-deduplicate common processing steps
- Each branch has its own export_path
- Update factory and config to support branch executor type
…mo configs

- Remove branch_mvp / branch_mvp_ray / branch_mvp_ray_pure yamls (configs live in docs + dj-hub)
- Consolidate MVP examples in BranchExecution.md; point to dj-hub for full configs
- Add Future TODOs in branch_executor.py (DAG mixin, Ray resources, validation, tests)
- Keep demos/branch_execution_example.yaml as single reference

Co-authored-by: Cursor <cursoragent@cursor.com>
…te demos/branch_execution_example.yaml

Co-authored-by: Cursor <cursoragent@cursor.com>
@yxdyc yxdyc added the `enhancement` (New feature or request) and `dj:core` (issues/PRs about the core functions of Data-Juicer) labels Feb 13, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yxdyc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a powerful new multi-branch execution capability, allowing users to define a common data processing pipeline followed by multiple divergent branches that operate on the common output. This design promotes efficiency by executing shared steps only once and enables parallel exploration of different processing paths, with support for both local threading and distributed Ray backends. It also includes robust error handling with fail-fast and retry mechanisms, along with detailed documentation to guide usage.

Highlights

  • New Multi-Branch Executor: Introduced a new branch executor type, enabling multi-branch data processing from a common checkpoint, with support for both thread and ray backends.
  • Flexible Execution Options: Implemented features such as fail_fast behavior, per-branch retries, and configurable parallel execution for branches, enhancing robustness and control.
  • Enhanced Documentation: Added comprehensive documentation in BranchExecution.md detailing the new multi-branch execution feature, including configuration, use cases, and implementation specifics.
  • Refactored Executor Initialization: Refactored the main process_data.py script to use an ExecutorFactory for dynamic instantiation of executor types, improving modularity and extensibility.
  • Filter Operator Updates: Integrated CombinedLogicalFilter and LogicalFilter into the filter operator registry, expanding filtering capabilities.
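
The fail-fast and per-branch retry semantics from the second highlight can be sketched in plain Python. This is an illustration of the behavior on the default `thread` backend, not the PR's actual implementation; the function name and signature are assumptions:

```python
import concurrent.futures

def run_branches(branches, fail_fast=True, retries=0, max_workers=4):
    """Run branch callables in a thread pool (illustrative sketch).

    Each branch is retried up to `retries` times; with fail_fast=True,
    the first failure cancels branches that have not started yet.
    """
    results, errors = {}, {}

    def with_retries(fn):
        for attempt in range(retries + 1):
            try:
                return fn()
            except Exception:
                if attempt == retries:
                    raise  # exhausted retries: surface the error

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(with_retries, fn): name for name, fn in branches.items()}
        for fut in concurrent.futures.as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                errors[name] = exc
                if fail_fast:
                    for pending in futures:
                        pending.cancel()
                    break
    return results, errors
```

With `fail_fast=False`, all branches run to completion and failures are collected in `errors` instead of aborting the job.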


Changelog
  • data_juicer/config/config.py
    • Added 'branch' as a new choice for the 'executor_type' configuration option.
  • data_juicer/core/executor/branch_executor.py
    • Added a new BranchExecutor class to manage multi-branch data processing workflows.
    • Implemented logic for executing a common process and then running multiple branches in parallel or sequentially.
    • Included support for 'thread' and 'ray' backends, with an option to run the common process on Ray.
    • Integrated event logging and a minimal PipelineDAG for monitoring branch execution.
    • Provided mechanisms for 'fail_fast' execution and per-branch retries to enhance robustness.
    • Handled serialization of configurations for Ray tasks to ensure proper distributed execution.
  • data_juicer/core/executor/factory.py
    • Registered the new BranchExecutor within the ExecutorFactory for dynamic instantiation based on configuration.
  • data_juicer/ops/filter/__init__.py
    • Imported and exposed CombinedLogicalFilter and LogicalFilter modules to the filter operator collection.
  • docs/BranchExecution.md
    • Added a new documentation file detailing the branch execution feature, including its overview, configuration, use cases, benefits, execution modes (thread/Ray), and implementation details.
  • docs/Operators.md
    • Updated the total count of filter operators in the operator table.
    • Added an entry for combined_logical_filter in the operator table with its description.
  • tools/process_data.py
    • Refactored the executor initialization logic to use the ExecutorFactory instead of a series of conditional statements, improving code organization.
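
The factory refactor described for tools/process_data.py can be illustrated with a registry sketch. The class and method names below are assumptions for illustration; the real `ExecutorFactory` API in Data-Juicer may differ:

```python
# Hypothetical registry-based executor factory, replacing a chain of
# if/elif checks on executor_type with a lookup table.
class ExecutorFactory:
    _registry = {}

    @classmethod
    def register(cls, name):
        def deco(executor_cls):
            cls._registry[name] = executor_cls
            return executor_cls
        return deco

    @classmethod
    def create(cls, cfg):
        executor_type = cfg.get("executor_type", "default")
        if executor_type not in cls._registry:
            raise ValueError(f"Unknown executor_type: {executor_type}")
        return cls._registry[executor_type](cfg)

@ExecutorFactory.register("branch")
class BranchExecutor:
    """Stand-in for the PR's BranchExecutor."""
    def __init__(self, cfg):
        self.cfg = cfg
```

New executor types then only need to register themselves; the entry script stays unchanged.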

@yxdyc yxdyc linked an issue Feb 13, 2026 that may be closed by this pull request

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new BranchExecutor for running multi-branch data processing pipelines from a common checkpoint. This is a significant and well-implemented feature, even in its WIP state, that supports both thread and ray backends, includes fail_fast and retry mechanisms, and uses minimal serializable configs for the Ray backend to reduce serialization overhead. The code is well-structured, and the addition of BranchExecution.md provides clear documentation.

My review includes a few minor suggestions for improving code clarity and efficiency, such as avoiding redundant calculations and imports. I also have a suggestion to improve the readability of the new documentation file.

Overall, this is a great addition to the project.

```python
            ref = ray.put(dataset)
            return {"ref": ref}
        except Exception:
            import os
```

Severity: medium

The os module is already imported at the top of the file. This redundant import inside an except block should be removed for code clarity.
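
The pattern the excerpt above belongs to — put the common dataset in the object store when possible, fall back to disk otherwise — can be sketched backend-agnostically. This is an illustration, not the PR's actual helpers; `put`/`get` stand in for `ray.put`/`ray.get` on the Ray backend:

```python
# Hypothetical sketch: share a dataset via an object store when one is
# available, else pickle it to a temp file and hand out the path.
import os
import pickle
import tempfile

def share_dataset(dataset, put=None):
    """Return {"ref": ...} for object-store sharing or {"path": ...} for disk."""
    if put is not None:
        try:
            return {"ref": put(dataset)}  # e.g. ray.put on the Ray backend
        except Exception:
            pass  # object store unavailable: fall back to disk below
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(dataset, f)
    return {"path": path}

def load_dataset(handle, get=None):
    """Resolve a handle produced by share_dataset back into the dataset."""
    if "ref" in handle:
        return get(handle["ref"])  # e.g. ray.get
    with open(handle["path"], "rb") as f:
        return pickle.load(f)
```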

Comment on lines +218 to +224
```python
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {time.time() - run_start:.2f}s",
        duration=time.time() - run_start,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

Severity: medium

The duration time.time() - run_start is calculated twice. It's more efficient and readable to calculate it once and store it in a variable.

Suggested change

```python
# before
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {time.time() - run_start:.2f}s",
        duration=time.time() - run_start,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

```python
# after
if hasattr(self, "_log_event"):
    job_duration = time.time() - run_start
    self._log_event(
        EventType.JOB_COMPLETE,
        f"Branch job completed in {job_duration:.2f}s",
        duration=job_duration,
        metadata={"completed": list(branch_results.keys()), "failed": list(errors.keys())},
    )
```

Comment on lines +500 to +506
```python
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {time.time() - start:.2f}s",
        operation_name=branch_name,
        duration=time.time() - start,
    )
```

Severity: medium

The duration time.time() - start is calculated twice. It's more efficient and readable to calculate it once and store it in a variable.

Suggested change

```python
# before
if hasattr(self, "_log_event"):
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {time.time() - start:.2f}s",
        operation_name=branch_name,
        duration=time.time() - start,
    )
```

```python
# after
if hasattr(self, "_log_event"):
    duration = time.time() - start
    self._log_event(
        EventType.DAG_NODE_COMPLETE,
        f"Branch {branch_name} completed in {duration:.2f}s",
        operation_name=branch_name,
        duration=duration,
    )
```

```yaml
retries: 0  # per-branch retry count (default: 0)
```

- With `backend: ray`, each branch runs as a Ray remote task; common dataset is shared via object store when possible, else via disk. Branch configs are minimal. When `run_common_on_ray: true`, the common process also runs as one Ray task (dataset_path and work_dir must be worker-accessible).

Severity: medium

This line is quite long and contains multiple distinct points. For better readability, consider breaking it down into a nested list.

Suggested change

```markdown
<!-- before -->
- With `backend: ray`, each branch runs as a Ray remote task; common dataset is shared via object store when possible, else via disk. Branch configs are minimal. When `run_common_on_ray: true`, the common process also runs as one Ray task (dataset_path and work_dir must be worker-accessible).
```

```markdown
<!-- after -->
- With `backend: ray`:
  - Each branch runs as a Ray remote task.
  - The common dataset is shared via the object store when possible, otherwise via disk.
  - Branch configs are minimal to reduce serialization overhead.
  - When `run_common_on_ray: true`, the common process also runs as a Ray task, which requires `dataset_path` and `work_dir` to be accessible by all workers.
```



Development

Successfully merging this pull request may close these issues.

Multi-Branch Execution (DAG feature enhancement)
