Compute node setup #12

Merged

emlys merged 222 commits into main from feature/compute-note-playbook, Feb 5, 2026

Conversation


@emlys emlys commented Jan 5, 2026

I'm using GitHub Actions as a sort of dev environment, since it can easily spin up a slurm cluster for testing, hence all the "debugging" commits. Let's squash this PR when it gets merged.

Sorry this ended up being a bit of a catch-all PR - there's a lot of miscellaneous improvements in here, all in the direction of making the API functional for real clients. Happy to go over it on a call if that's helpful!

pygeoapi server updates

  • Moved all pygeoapi stuff into the invest_subprocesses subdirectory
  • Datastacks are now referenced by a URL to a datastack archive, rather than a local path. This is necessary for practical use of the API - clients will include a datastack URL in their request.
  • Job info, including job status, start, and end time, can now be queried. This data is pulled from slurm using sacct and scontrol.
  • Additional job metadata is now stored in the slurm job database (as a JSON string in the job comment field). While this won't scale well to lengthy metadata, for our current needs, it's handy because we don't need a second database to track jobs. Slurm serves as the single source of truth for all job info.
  • A separate thread monitors each job running in async mode, and performs post-processing and uploads the workspace to GCP when the job finishes. Previously this didn't happen in async mode.
  • Add tests for executing processes in both sync and async modes.
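The "metadata as a JSON string in the job comment field" idea above could look roughly like the round-trip helpers below. This is only a sketch: the function names and example fields are illustrative, not the actual implementation in this PR.

```python
import json

def encode_job_metadata(metadata):
    """Serialize job metadata to a compact string for `sbatch --comment`."""
    return json.dumps(metadata, separators=(',', ':'))

def decode_job_metadata(comment):
    """Recover metadata from the comment string reported by sacct/scontrol."""
    return json.loads(comment) if comment else {}

# The comment would be attached at submission time, e.g.
#   sbatch --comment='{"model_id":"carbon"}' job.sh
# and read back later with something like
#   sacct -j <job_id> --format=Comment%200 --parsable2 --noheader
meta = {'model_id': 'carbon', 'datastack_url': 'https://example.com/stack.tgz'}
assert decode_job_metadata(encode_job_metadata(meta)) == meta
```

As the description notes, this keeps slurm as the single source of truth, at the cost of not scaling to lengthy metadata.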

infrastructure updates

  • Add a startup script that installs dependencies; it runs on each compute instance when the instance is dynamically spun up.
  • Prepare to set up an API Gateway. This relies on using an OpenAPI yml that describes all the endpoints. Unfortunately, the openapi.yml generated by pygeoapi has several issues and doesn't work out of the box. bundled-openapi.yml includes some fixes as well as bundling in external specifications. I hope to find a cleaner solution for this eventually. Next step will be defining the API Gateway with terraform.
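One low-tech way to produce bundled-openapi.yml is to load the generated spec, apply the patches programmatically, and write the result back out. The function below is purely a hypothetical sketch (the real fixes and field names differ), operating on the spec as a plain dict so the yaml load/dump step is left out:

```python
def apply_gateway_fixes(spec):
    """Return a patched copy of an OpenAPI spec dict.

    Hypothetical example fix only: ensure info.version exists, since the
    gateway tooling may require it. The actual edits in bundled-openapi.yml
    are different.
    """
    fixed = dict(spec)
    info = dict(fixed.get('info', {}))
    info.setdefault('version', '0.1.0')
    fixed['info'] = info
    return fixed

# In practice this would sit between yaml.safe_load(open('openapi.yml'))
# and yaml.safe_dump(..., open('bundled-openapi.yml', 'w')).
```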

@emlys emlys marked this pull request as ready for review January 24, 2026 00:35
@emlys emlys requested a review from phargogh January 24, 2026 00:35
@emlys emlys self-assigned this Jan 24, 2026

@phargogh phargogh left a comment


Thanks @emlys ! This looks very good to me. Pretty much everything I commented on is minor; the biggest question in my mind is the longevity of the polling strategy for checking slurm job status (while also not wanting to overcomplicate things, of course!).

else:
LOGGER.debug('Job and post processing completed.')

return {
Member

Is the job result dict's structure defined by pygeoapi? Or is the schema one that we define?

Member Author

The elements of this dict are required by pygeoapi. This pygeoapi function expects all these keys to exist. I don't think it's perfect for our use case (for instance, the updated property doesn't really apply), but I filled out all the fields to get it working.

Comment on lines +457 to +459
monitor_thread = threading.Thread(
target=self.monitor_job_status,
args=(job_id, workspace_dir, processor.process_output))
Member

At least for testing I'm sure this polling thread is a-ok! But just to confirm my understanding, are we likely to need a different status-checking approach in the near future? I'm just thinking that if we use enough database connections (perhaps even with worker nodes consuming db connections), we might get into a state where we can't handle the volume of traffic. I suppose that'll be a good problem to have when we get there!

Member

Related, I just found out about strigger (docs), in case that helps!

Member Author

Oh that's interesting. I'm not sure yet what the maximum number of database connections is, it seems like it would depend on the specific DB configuration that the cluster toolkit is creating for us, which I'm not familiar with.

I guess the number of concurrent jobs is limited by slurm, but it's possible users would request more and we'd get a backlog of pending jobs. strigger looks useful, the callback design could certainly be an improvement over the monitoring thread! I'll make a separate issue for this.
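For reference, a strigger-based callback (replacing the polling thread) might be registered along these lines. This is only a sketch under the assumption that strigger is on the PATH; the callback script path is a placeholder.

```python
import subprocess

def finish_trigger_command(job_id, callback_script):
    """Build the strigger invocation that fires when the job finishes."""
    return ['strigger', '--set', f'--jobid={job_id}', '--fini',
            f'--program={callback_script}']

def register_finish_trigger(job_id, callback_script):
    """Ask slurmctld to run the callback program on job completion.

    The post-processing and workspace-upload logic would live in
    callback_script rather than in a monitoring thread.
    """
    subprocess.run(finish_trigger_command(job_id, callback_script), check=True)
```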

Member Author

#13

tuple of extracted json datastack path and model id
"""
# Download the datastack from the given URL and
response = requests.get(datastack_url)
Member

In case the datastack is large, might I suggest chunking and streaming the file? requests has a stream=True parameter, and if you pair that with iter_content() you can write out the file chunk by chunk without exhausting system memory.

Member Author

Smart! thanks for the suggestion
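The streaming suggestion above could be sketched as follows; download_datastack is a hypothetical stand-in for the existing requests.get call, and the chunk size is arbitrary.

```python
import requests

CHUNK_SIZE = 8192  # bytes per chunk; tune as needed

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path without buffering them all."""
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)

def download_datastack(datastack_url, dest_path):
    """Stream a (possibly large) datastack archive to disk."""
    with requests.get(datastack_url, stream=True) as response:
        response.raise_for_status()
        write_chunks(response.iter_content(chunk_size=CHUNK_SIZE), dest_path)
```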

Member

I confess I didn't review this file closely, but given your description of the file, hopefully that's ok? Please let me know if you'd like me to take a close look at this!

Member Author

Yep, that's fine!


- name: Install package
run: cd invest_processes && pip install .
run: pip install git+https://github.com/emlys/invest@debug-compute && cd invest_processes && pip install .
Member

In case we find it useful to use a fork-and-pull model, would it be worth using git+https://github.com/{{ github.repository }}@debug-compute?

Member Author

Oop, I think I can remove this now. The regular natcap.invest from PyPI should work.

@emlys emlys requested a review from phargogh February 4, 2026 01:26

emlys commented Feb 5, 2026

I think I addressed all your comments so I'll go ahead and merge!

@emlys emlys merged commit 94919bb into main Feb 5, 2026
2 checks passed