Description
Describe the bug
When running a pipeline locally (e.g., executor="none") that involves a JobGroup or multiple bundled tasks in a single step, the execution fails with an AssertionError: no app_id collisions expected.
This occurs because the local scheduler in nemo_run (via torchx) does not correctly handle the dryrun_info for multiple executables within a single group. Instead of maintaining unique information for each executable, it appears to overwrite the dry-run information during iteration, causing the underlying scheduler to detect an App ID collision when it attempts to schedule the tasks.
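The suspected mechanism can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual nemo_run or torchx code: `DryRunInfo`, `ToyScheduler`, `schedule_group_buggy`, and `schedule_group_fixed` are hypothetical names invented for illustration. It assumes the App ID is fixed when the dry-run record is created, so reusing one record for several executables submits the same ID twice.

```python
import uuid
from dataclasses import dataclass


@dataclass
class DryRunInfo:
    """Hypothetical stand-in for a scheduler's dry-run record."""
    app_id: str
    command: str


class ToyScheduler:
    """Hypothetical scheduler that rejects duplicate app IDs."""

    def __init__(self):
        self._apps = {}

    def dryrun(self, name, command):
        # The app ID is generated once, at dry-run time.
        app_id = f"{name}-{uuid.uuid4().hex[:8]}"
        return DryRunInfo(app_id=app_id, command=command)

    def schedule(self, info):
        assert info.app_id not in self._apps, "no app_id collisions expected"
        self._apps[info.app_id] = info.command
        return info.app_id


def schedule_group_buggy(scheduler, name, commands):
    # One dry-run record is created and then overwritten for each command,
    # so every submission carries the same app_id.
    info = scheduler.dryrun(name, commands[0])
    app_ids = []
    for command in commands:
        info.command = command
        app_ids.append(scheduler.schedule(info))
    return app_ids


def schedule_group_fixed(scheduler, name, commands):
    # A fresh dry-run record (and therefore a fresh app_id) per command.
    return [scheduler.schedule(scheduler.dryrun(name, c)) for c in commands]


if __name__ == "__main__":
    commands = ['echo "Hello Task 1"', 'echo "Hello Task 2"']
    print(schedule_group_fixed(ToyScheduler(), "my_job_group", commands))
    try:
        schedule_group_buggy(ToyScheduler(), "my_job_group", commands)
    except AssertionError as err:
        print("collision:", err)
```

In this model the second submission in `schedule_group_buggy` fails with the same assertion message as the real traceback, while `schedule_group_fixed` produces one distinct ID per task.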
Steps/Code to reproduce bug
- Define a simple experiment with multiple tasks bundled together.
- Configure the run to use the local executor.
- Execute the experiment.
Here is a minimal example snippet:
import nemo_run as run

# Define two simple tasks
task1 = run.Script(inline='echo "Hello Task 1"')
task2 = run.Script(inline='echo "Hello Task 2"')

# Create an experiment
with run.Experiment("local_collision_test") as exp:
    # Add tasks as a bundle/group (this creates a JobGroup internally)
    exp.add([task1, task2], name="my_job_group")

    # Run locally
    # This triggers the AssertionError
    exp.run()

Expected behavior
The local executor should be able to accept a list of tasks (a JobGroup), generate unique App IDs for each task, and execute them sequentially or in parallel without crashing due to ID collisions.
Additional context
Traceback:
  File "/.../site-packages/nemo_run/run/torchx_backend/schedulers/local.py", line 106, in schedule
    app_id = super().schedule(dryrun_info=dryrun_info)
  File "/.../site-packages/torchx/schedulers/local_scheduler.py", line 791, in schedule
    app_id not in self._apps
AssertionError: no app_id collisions expected since uuid4 suffix is used