@agahkarakuzu
Contributor

Need

  • In some projects, the output of one notebook serves as the input for another. The default parallel execution can break such dependency chains.
  • When multiple executable resources are present, parallel execution can overwhelm available computational resources.

Proposed solution

This PR introduces batching and sequencing logic to manage execution order based on the toc definition in myst.yml. The implementation allows users to control both concurrency and dependency ordering of executable documents.

Adding the documentation snippet for clarity:


How to manage the order of execution?

Implicit TOC

If no table of contents (toc) is defined in your myst.yml, all executable sources are run in parallel by default.

Explicit TOC

Managing concurrency without dependency order

By default, executable files are processed concurrently in batches of 5.

You can change this by passing the --execute-concurrency <n> option to your build command, where <n> is the number of executable documents that run simultaneously.
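The batching behavior described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `runInBatches` is a hypothetical helper, and it assumes each executable document is represented by a function returning a promise.

```typescript
// Hypothetical helper (not part of mystmd): run async tasks in consecutive
// batches, so that at most `size` tasks are in flight at any time.
async function runInBatches<T>(
  tasks: Array<() => Promise<T>>,
  size: number,
): Promise<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error(`Batch size must be a positive integer, got ${size}`);
  }
  const results: T[] = [];
  for (let start = 0; start < tasks.length; start += size) {
    const batch = tasks.slice(start, start + size);
    // All tasks in a batch start together; the next batch waits for all of them.
    const settled = await Promise.all(batch.map((task) => task()));
    results.push(...settled);
  }
  return results;
}
```

With 12 documents and a batch size of 5, this would run three batches of 5, 5, and 2 documents. One consequence of this simple design is that a slow document holds back the rest of its batch; a sliding-window limiter would avoid that at the cost of more code.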

Defining a specific execution order

To define a sequential execution order, use the execution_order field within the toc element. For example:

  toc:
  - file: paper.md
  - file: evidence/figure_1.ipynb
    execution_order: 0
  - file: evidence/figure_2.ipynb
    execution_order: 1
  - file: evidence/figure_3.ipynb
    execution_order: 1

In this example, figure_2.ipynb and figure_3.ipynb will both wait for figure_1.ipynb to finish before being executed concurrently.
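As a rough sketch of how such a toc could be turned into sequential stages: the `TocEntry` type and `groupByExecutionOrder` helper below are hypothetical names, not the implementation in this PR. Entries without `execution_order` are collected separately, because the text above does not state how they are scheduled.

```typescript
// Assumed shape of a toc entry, for illustration only.
interface TocEntry {
  file: string;
  execution_order?: number;
}

// Hypothetical helper: group files by `execution_order`, lowest value first.
// Files that share a value end up in the same stage and may run concurrently.
function groupByExecutionOrder(toc: TocEntry[]): {
  ordered: string[][];
  unordered: string[];
} {
  const groups = new Map<number, string[]>();
  const unordered: string[] = [];
  for (const entry of toc) {
    if (entry.execution_order === undefined) {
      unordered.push(entry.file);
      continue;
    }
    const group = groups.get(entry.execution_order) ?? [];
    group.push(entry.file);
    groups.set(entry.execution_order, group);
  }
  const ordered = Array.from(groups.entries())
    .sort(([a], [b]) => a - b)
    .map(([, files]) => files);
  return { ordered, unordered };
}

const toc: TocEntry[] = [
  { file: "paper.md" },
  { file: "evidence/figure_1.ipynb", execution_order: 0 },
  { file: "evidence/figure_2.ipynb", execution_order: 1 },
  { file: "evidence/figure_3.ipynb", execution_order: 1 },
];
const plan = groupByExecutionOrder(toc);
// plan.ordered   → [["evidence/figure_1.ipynb"],
//                   ["evidence/figure_2.ipynb", "evidence/figure_3.ipynb"]]
// plan.unordered → ["paper.md"]
```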

Warning

If a notebook that other notebooks depend on fails during execution, the build process will continue by default. To stop the build whenever an error occurs (including for notebooks without dependencies) pass the --strict flag to your build command.
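The difference between the default behavior and --strict can be illustrated with a small sketch. The `Task` type and `runStages` function are hypothetical; the `strict` parameter only mirrors the flag described above and is not the actual CLI wiring.

```typescript
// Assumed representation of one executable document.
interface Task {
  name: string;
  run: () => Promise<void>;
}

// Hypothetical sketch: run stages in order and collect the names of failed tasks.
// With `strict` set, stop after the first stage that contains a failure.
async function runStages(stages: Task[][], strict: boolean): Promise<string[]> {
  const failed: string[] = [];
  for (const stage of stages) {
    const outcomes = await Promise.all(
      stage.map(async (task) => {
        try {
          await task.run();
          return null;
        } catch {
          return task.name; // remember the failure instead of rejecting
        }
      }),
    );
    for (const name of outcomes) {
      if (name !== null) failed.push(name);
    }
    if (strict && failed.length > 0) {
      throw new Error(`Execution failed: ${failed.join(", ")}`);
    }
  }
  return failed;
}
```

In the non-strict case, later stages still run even though something they depend on has failed, which is the situation the warning above describes.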


Tests

I tested this locally across several builds, and it appears to function correctly. The current implementation provides a straightforward mechanism for managing execution order. It is not a full-fledged workflow-style approach, but I believe it is still a meaningful improvement for build control.

@fwkoch and I are not sure whether toc is the right place to define this logic, but it serves as a reasonable starting point for now.

Relates to: #1794, #2055

@changeset-bot

changeset-bot bot commented Nov 8, 2025

⚠️ No Changeset found

Latest commit: 21af2c8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 8, 2025
@agoose77
Contributor

agoose77 commented Nov 8, 2025

@agahkarakuzu it's funny that you looked at this — I was just thinking about this over the JupyterCon flights.

I like the concurrency limit, and the idea of a hard-coded ordering. Here are some of my unstructured thoughts:

Relative vs Absolute Ordering

When I was thinking about this, I considered the idea of relative ordering using before and after in addition to absolute ordering (at the ToC level). I wonder whether it makes sense to implement ordering in a relative sense, because at the notebook level we are often thinking "this needs to execute after XXX", rather than "this needs to execute fifth in the order".

There are pros/cons to both approaches, and I'm easily convinced either way.
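To make the relative-ordering idea concrete, here is one way before and after declarations could be reduced to a single direction. Everything in this sketch is hypothetical: the `PageOrdering` shape and the `normalizeToAfter` helper are illustrative names, not an existing or proposed mystmd API.

```typescript
// Assumed per-page ordering hints, for illustration only.
interface PageOrdering {
  before?: string[];
  after?: string[];
}

// Hypothetical helper: rewrite every "A before B" as "B after A",
// so the scheduler only has to reason about `after` relationships.
function normalizeToAfter(
  pages: Record<string, PageOrdering>,
): Record<string, string[]> {
  const after = new Map<string, Set<string>>();
  const depsOf = (name: string): Set<string> => {
    let deps = after.get(name);
    if (!deps) {
      deps = new Set<string>();
      after.set(name, deps);
    }
    return deps;
  };
  for (const name of Object.keys(pages)) {
    const spec = pages[name];
    depsOf(name); // every declared page appears in the result
    for (const dep of spec.after ?? []) {
      depsOf(dep); // make sure the dependency is a known node as well
      depsOf(name).add(dep);
    }
    for (const target of spec.before ?? []) {
      depsOf(target).add(name);
    }
  }
  const result: Record<string, string[]> = {};
  after.forEach((deps, name) => {
    result[name] = Array.from(deps).sort();
  });
  return result;
}
```

For example, if load.ipynb declares `before: ["plot.ipynb"]` and plot.ipynb declares `after: ["clean.ipynb"]`, the normalized result lists both clean.ipynb and load.ipynb as prerequisites of plot.ipynb.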

Render vs Execution Ordering

As it stands, this PR changes the rendering order. This is a problem for projects that use continuous numbering, which will break if the execution order disagrees with the toc order.

I think we need to introduce a higher level abstraction for execution (an ExecutionOrchestrator) that exposes an async locking interface. This would let us define the execution strategy (such as batching, and batch order), but keep it invisible in the execution transform. Something like

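async function executionTransform(mdast, vfile, frontmatter, session) {
   // Wait until the orchestrator signals that it is this document's turn
   await session.executionOrchestrator.wantsExecution(frontmatter.path);
   // ...execute the document here once the promise resolves
}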

where this promise is resolved by the orchestrator once the previous notebooks have executed.
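A minimal sketch of what such an orchestrator might look like, assuming a strictly sequential strategy. Only the `wantsExecution` name comes from the snippet above; the constructor argument and the `finished` method are assumptions added so that the example is self-contained.

```typescript
// Hypothetical orchestrator: each path may start once the path before it
// in `order` has reported that it is finished.
class ExecutionOrchestrator {
  private readonly turn = new Map<string, Promise<void>>();
  private readonly release = new Map<string, () => void>();

  constructor(order: string[]) {
    let previous: Promise<void> = Promise.resolve();
    for (const path of order) {
      // This path's turn arrives when the previous path is released.
      this.turn.set(path, previous);
      previous = new Promise<void>((resolve) => {
        this.release.set(path, resolve);
      });
    }
  }

  // Resolves when it is this path's turn; unknown paths may run immediately.
  wantsExecution(path: string): Promise<void> {
    return this.turn.get(path) ?? Promise.resolve();
  }

  // Assumed counterpart to wantsExecution: signal that `path` has executed.
  finished(path: string): void {
    this.release.get(path)?.();
  }
}
```

A transform would await `wantsExecution(path)` before executing and call `finished(path)` afterwards, ideally in a `finally` block so that a failure does not stall the documents queued behind it. Other strategies, such as batching, could sit behind the same two calls without the transform needing to change.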

I'd love to work with you on this, and welcome any other thoughts about things like ordering from the team!

@fwkoch fwkoch added enhancement New feature or request and removed documentation Improvements or additions to documentation labels Nov 8, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 8, 2025
@fwkoch
Member

fwkoch commented Nov 8, 2025

@agoose77 - regarding your second point, I'm less worried about the ordering - this concurrency management only applies to the first processMdast phase of project processing. This function operates on individual files and requires no shared state or assumptions about ordering. Enumeration happens after this (in selectPageReferenceStates, specifically order is maintained by this map; https://github.com/jupyter-book/mystmd/blob/main/packages/myst-cli/src/process/site.ts#L340).

Now, I haven't actually tested this - we should probably have some end-to-end tests that combine (1) batched execution and (2) page enumeration.
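The reasoning above, that enumeration is unaffected by when each file finishes, can be illustrated with a small standalone sketch. This is not the mystmd pipeline: `processPage` and `processAndEnumerate` are hypothetical stand-ins, and the delays simulate notebooks of different durations. It relies only on `Promise.all` returning results in input order regardless of completion order.

```typescript
// Assumed page description; `delayMs` simulates how long execution takes.
interface Page {
  file: string;
  delayMs: number;
}

// Stand-in for per-file processing that needs no shared ordering state.
async function processPage(page: Page, finished: string[]): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, page.delayMs));
  finished.push(page.file); // records the order in which files actually finish
  return page.file;
}

// Process all pages concurrently, then enumerate them in their original order.
async function processAndEnumerate(
  pages: Page[],
): Promise<{ finished: string[]; numbered: string[] }> {
  const finished: string[] = [];
  const processed = await Promise.all(
    pages.map((page) => processPage(page, finished)),
  );
  // Numbering is derived from the input order, not from completion order.
  const numbered = processed.map((file, index) => `${index + 1}. ${file}`);
  return { finished, numbered };
}
```

An end-to-end test along these lines would check that a slow first page still receives number 1 even though it finishes last.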

(@agoose77 - I think your code snippet is how we would need to do this once we streamline into a single "better" processing pipeline, as we prototyped in #1699)

@fwkoch
Member

fwkoch commented Nov 8, 2025

Regarding how we define this ordering - that question feels fuzzier and more flexible.

Where should it go (toc vs. page vs. elsewhere in the project frontmatter) and how should it be defined (number vs. before/after vs. something else...)? Currently it's in the toc, which is nice and centralized. But not all projects necessarily have an explicit toc. We could also define it on the individual pages - this makes sense for named before/after dependencies (we probably just need after I think?). Spreading ordering-by-number across different pages feels not-so-good.

We probably don't want to allow multiple ways to do this - then we get into resolving conflicts... 😕

My vague preference is that individual pages keep track of their execution_dependencies and we combine that with --execute-concurrency, with no changes to the toc. But my opinions are not very strong.
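One way to picture this combination is sketched below: per-page dependency lists are resolved into waves, and each wave is then split according to the concurrency limit. The input shape and both helper names (`executionWaves`, `limitWaves`) are hypothetical; nothing here reflects an agreed design.

```typescript
// Hypothetical helper: group pages into waves, where every page in a wave
// depends only on pages from earlier waves.
function executionWaves(deps: Record<string, string[]>): string[][] {
  const remaining = new Map<string, string[]>();
  for (const page of Object.keys(deps)) remaining.set(page, deps[page]);
  const done = new Set<string>();
  const waves: string[][] = [];
  while (remaining.size > 0) {
    const ready: string[] = [];
    remaining.forEach((pending, page) => {
      if (pending.every((dep) => done.has(dep))) ready.push(page);
    });
    if (ready.length === 0) {
      // Nothing can make progress: a cycle, or a dependency that is not a known page.
      throw new Error("Cyclic or unknown execution dependency");
    }
    ready.sort();
    for (const page of ready) {
      remaining.delete(page);
      done.add(page);
    }
    waves.push(ready);
  }
  return waves;
}

// Hypothetical helper: split each wave so that no batch exceeds `limit` pages.
function limitWaves(waves: string[][], limit: number): string[][] {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error(`Limit must be a positive integer, got ${limit}`);
  }
  const limited: string[][] = [];
  for (const wave of waves) {
    for (let i = 0; i < wave.length; i += limit) {
      limited.push(wave.slice(i, i + limit));
    }
  }
  return limited;
}
```

Because dependencies live on the pages themselves, this approach would also work for projects without an explicit toc.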

@agoose77
Contributor

agoose77 commented Nov 9, 2025

@agoose77 - regarding your second point, I'm less worried about the ordering - this concurrency management only applies to the first processMdast phase of project processing. This function operates on individual files and requires no shared state or assumptions about ordering. Enumeration happens after this (in selectPageReferenceStates, specifically order is maintained by this map; https://github.com/jupyter-book/mystmd/blob/main/packages/myst-cli/src/process/site.ts#L340).

Now, I haven't actually tested this - we should probably have some end-to-end tests that combine (1) batched execution and (2) page enumeration.

(@agoose77 - I think your code snippet is how we would need to do this once we streamline into a single "better" processing pipeline, as we prototyped in #1699)

Ah, of course. I didn't test that yesterday when I was musing about this!

@agoose77 agoose77 removed the documentation Improvements or additions to documentation label Nov 9, 2025
@agahkarakuzu
Contributor Author

@fwkoch @agoose77 thank you so much for looking into this! Sorry for the late response, I have been navigating the insane air travel conditions in the States.

Currently it's in the toc, which is nice and centralized. But not all projects necessarily have an explicit toc.

Indeed, it does not take effect when TOC is implicit.

Switching to a DAG-based model with explicit dependencies (defined as notebook or page names instead of numerical values) would bring more sophistication. Maybe p-graph would be a good option for that, moving a bit closer to the orchestration approach.
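For illustration, a DAG-based runner can be sketched without any library. This hand-rolled version is not p-graph and does not use its API; the `Job` type and `runDag` function are hypothetical. Each job starts as soon as its own dependencies have finished, rather than waiting for a whole batch.

```typescript
// Assumed job description: names of prerequisite jobs plus the work itself.
interface Job {
  after: string[];
  run: () => Promise<void>;
}

// Hypothetical runner: start every job once all jobs named in `after` are done.
function runDag(jobs: Record<string, Job>): Promise<void> {
  const started = new Map<string, Promise<void>>();
  const visiting = new Set<string>();

  const start = (name: string): Promise<void> => {
    const existing = started.get(name);
    if (existing) return existing; // each job runs at most once
    const job = jobs[name];
    if (!job) return Promise.reject(new Error(`Unknown dependency: ${name}`));
    if (visiting.has(name)) {
      return Promise.reject(new Error(`Dependency cycle at ${name}`));
    }
    visiting.add(name);
    const promise = Promise.all(job.after.map((dep) => start(dep))).then(() =>
      job.run(),
    );
    visiting.delete(name);
    started.set(name, promise);
    return promise;
  };

  return Promise.all(Object.keys(jobs).map((name) => start(name))).then(
    () => undefined,
  );
}
```

Note that in this sketch a failed job rejects everything that depends on it, which differs from the continue-by-default behavior described in the PR documentation. It also has no concurrency limit, so it would need to be combined with one in practice.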

Then again, I was not really sure about defining these at the page level from the get-go. If you see a better middle ground between impeccable orchestration and this simple implementation, I'm happy to take a stab at it.

@bsipocz
Contributor

bsipocz commented Nov 10, 2025

Could we separate out --execute-concurrency, for which I don't see any objections or questions, from execution_order?

(Having a control over the parallelism would solve #1831 and potential resource limit hits, etc.)

@agahkarakuzu
Contributor Author

@bsipocz by separating out, do you mean addressing it in another PR?

@agoose77
Contributor

@agahkarakuzu yes, I believe so.

Let's do that, if it's not too much work. It's a bit more effort overall, but it makes the change easier to review and avoids blocking the useful fix!

@agahkarakuzu
Contributor Author

@agoose77 no worries, sent it separately at #2428.

@bsipocz
Contributor

bsipocz commented Nov 12, 2025

@bsipocz by separating out, do you mean addressing it in another PR?

I'm sorry for getting back to this just now, but Angus has answered with exactly what I meant. Overall, it's very helpful to separate new features, fixes, etc. into separate PRs: you can expect a much quicker turnaround, as such contributions are cleaner and easier to review, and they don't block each other from getting in when only some parts are under discussion.
