Compute node setup #12

Merged

emlys merged 222 commits into main from feature/compute-note-playbook, Feb 5, 2026

Conversation


@emlys emlys commented Jan 5, 2026

I'm using GitHub Actions as a sort of dev environment, since it can easily spin up a slurm cluster for testing, hence all the "debugging" commits. Let's squash this PR when it gets merged.

Sorry this ended up being a bit of a catch-all PR - there's a lot of miscellaneous improvements in here, all in the direction of making the API functional for real clients. Happy to go over it on a call if that's helpful!

pygeoapi server updates

  • Moved all pygeoapi stuff into the invest_subprocesses subdirectory
  • Datastacks are now referenced by a URL to a datastack archive, rather than a local path. This is necessary for practical use of the API - clients will include a datastack URL in their request.
  • Job info, including job status, start, and end time, can now be queried. This data is pulled from slurm using sacct and scontrol.
  • Additional job metadata is now stored in the slurm job database (as a JSON string in the job comment field). While this won't scale well to lengthy metadata, for our current needs, it's handy because we don't need a second database to track jobs. Slurm serves as the single source of truth for all job info.
  • A separate thread monitors each job running in async mode, and performs post-processing and uploads the workspace to GCP when the job finishes. Previously this didn't happen in async mode.
  • Add tests for executing processes in both sync and async modes.
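The "metadata as a JSON string in the job comment field" idea above could look roughly like the round-trip helpers below. This is only a sketch: the function names and example fields are illustrative, not the actual implementation in this PR.

```python
import json

def encode_job_metadata(metadata):
    """Serialize job metadata to a compact string for `sbatch --comment`."""
    return json.dumps(metadata, separators=(',', ':'))

def decode_job_metadata(comment):
    """Recover metadata from the comment string reported by sacct/scontrol."""
    return json.loads(comment) if comment else {}

# The comment would be attached at submission time, e.g.
#   sbatch --comment='{"model_id":"carbon"}' job.sh
# and read back later with something like
#   sacct -j <job_id> --format=Comment%200 --parsable2 --noheader
meta = {'model_id': 'carbon', 'datastack_url': 'https://example.com/stack.tgz'}
assert decode_job_metadata(encode_job_metadata(meta)) == meta
```

As the description notes, this keeps slurm as the single source of truth, at the cost of not scaling to lengthy metadata.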

infrastructure updates

  • Add a startup script that installs dependencies; it runs on each compute instance when the instance is dynamically spun up.
  • Prepare to set up an API Gateway. This relies on using an OpenAPI yml that describes all the endpoints. Unfortunately, the openapi.yml generated by pygeoapi has several issues and doesn't work out of the box. bundled-openapi.yml includes some fixes as well as bundling in external specifications. I hope to find a cleaner solution for this eventually. Next step will be defining the API Gateway with terraform.
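One low-tech way to produce bundled-openapi.yml is to load the generated spec, apply the patches programmatically, and write the result back out. The function below is purely a hypothetical sketch (the real fixes and field names differ), operating on the spec as a plain dict so the yaml load/dump step is left out:

```python
def apply_gateway_fixes(spec):
    """Return a patched copy of an OpenAPI spec dict.

    Hypothetical example fix only: ensure info.version exists, since the
    gateway tooling may require it. The actual edits in bundled-openapi.yml
    are different.
    """
    fixed = dict(spec)
    info = dict(fixed.get('info', {}))
    info.setdefault('version', '0.1.0')
    fixed['info'] = info
    return fixed

# In practice this would sit between yaml.safe_load(open('openapi.yml'))
# and yaml.safe_dump(..., open('bundled-openapi.yml', 'w')).
```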

@emlys emlys marked this pull request as ready for review January 24, 2026 00:35
@emlys emlys requested a review from phargogh January 24, 2026 00:35
@emlys emlys self-assigned this Jan 24, 2026

@phargogh phargogh left a comment


Thanks @emlys ! This looks very good to me. Pretty much everything I commented on is minor; the biggest question in my mind is the longevity of the polling strategy for checking slurm job status (while also not wanting to overcomplicate things, of course!).

else:
LOGGER.debug('Job and post processing completed.')

return {
Member

Is the job result dict's structure defined by pygeoapi? Or is the schema one that we define?

Member Author

The elements of this dict are required by pygeoapi. This pygeoapi function expects all these keys to exist. I don't think it's perfect for our use case (for instance, the updated property doesn't really apply), but I filled out all the fields to get it working.

Comment on lines +457 to +459
monitor_thread = threading.Thread(
target=self.monitor_job_status,
args=(job_id, workspace_dir, processor.process_output))
Member

At least for testing I'm sure this polling thread is a-ok! But just to confirm my understanding, are we likely to need a different status-checking approach in the near future? I'm just thinking that if we use enough database connections (perhaps even with worker nodes consuming db connections), we might get into a state where we can't handle the volume of traffic. I suppose that'll be a good problem to have when we get there!

Member

Related, I just found out about strigger (docs), in case that helps!

Member Author

Oh that's interesting. I'm not sure yet what the maximum number of database connections is, it seems like it would depend on the specific DB configuration that the cluster toolkit is creating for us, which I'm not familiar with.

I guess the number of concurrent jobs is limited by slurm, but it's possible users would request more and we'd get a backlog of pending jobs. strigger looks useful, the callback design could certainly be an improvement over the monitoring thread! I'll make a separate issue for this.
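For reference, a strigger-based callback (replacing the polling thread) might be registered along these lines. This is only a sketch under the assumption that strigger is on the PATH; the callback script path is a placeholder.

```python
import subprocess

def finish_trigger_command(job_id, callback_script):
    """Build the strigger invocation that fires when the job finishes."""
    return ['strigger', '--set', f'--jobid={job_id}', '--fini',
            f'--program={callback_script}']

def register_finish_trigger(job_id, callback_script):
    """Ask slurmctld to run the callback program on job completion.

    The post-processing and workspace-upload logic would live in
    callback_script rather than in a monitoring thread.
    """
    subprocess.run(finish_trigger_command(job_id, callback_script), check=True)
```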

Member Author

#13

tuple of extracted json datastack path and model id
"""
# Download the datastack from the given URL and
response = requests.get(datastack_url)
Member

In case the datastack is large, might I suggest chunking and streaming the file? requests has a stream=True parameter, and if you pair that with iter_content() you can write out the file chunk by chunk without exhausting system memory.

Member Author

Smart! thanks for the suggestion
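The streaming suggestion above could be sketched as follows; download_datastack is a hypothetical stand-in for the existing requests.get call, and the chunk size is arbitrary.

```python
import requests

CHUNK_SIZE = 8192  # bytes per chunk; tune as needed

def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path without buffering them all."""
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)

def download_datastack(datastack_url, dest_path):
    """Stream a (possibly large) datastack archive to disk."""
    with requests.get(datastack_url, stream=True) as response:
        response.raise_for_status()
        write_chunks(response.iter_content(chunk_size=CHUNK_SIZE), dest_path)
```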

Member

I confess I didn't review this file closely, but given your description of the file, hopefully that's ok? Please let me know if you'd like me to take a close look at this!

Member Author

Yep, that's fine!


- name: Install package
run: cd invest_processes && pip install .
run: pip install git+https://github.com/emlys/invest@debug-compute && cd invest_processes && pip install .
Member

In case we find it useful to use a fork-and-pull model, would it be worth using git+https://github.com/{{ github.repository }}@debug-compute?

Member Author

Oop, I think I can remove this now. The regular natcap.invest from PyPI should work.

@emlys emlys requested a review from phargogh February 4, 2026 01:26

emlys commented Feb 5, 2026

I think I addressed all your comments so I'll go ahead and merge!

@emlys emlys merged commit 94919bb into main Feb 5, 2026
2 checks passed