mx-moth (Contributor) commented Nov 12, 2025

When opening particularly large datasets emsarray could crash with a memory error. The code would attempt to construct a numpy array containing the coordinates for all polygons in the dataset, then construct all the polygons in one batch. Because of shapely internals, this coordinate array would be copied, briefly doubling memory usage before the original array was garbage collected.
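
For reference, a minimal sketch of the all-at-once pattern described above, assuming shapely 2.x. The array sizes and variable names here are illustrative only, not the actual emsarray code:

```python
import numpy as np
import shapely

# Illustrative only: materialise the corner coordinates of every polygon in
# the dataset, then build all polygons in a single call. shapely copies the
# coordinates into its own buffers, so the original array and the copy are
# briefly alive at the same time and peak memory usage roughly doubles.
n_cells = 1_000_000
unit_square = np.array([(0., 0.), (1., 0.), (1., 1.), (0., 1.), (0., 0.)])
all_coords = np.broadcast_to(unit_square, (n_cells, 5, 2)).copy()  # ~80 MB
all_polygons = shapely.polygons(all_coords)  # peak usage ~ 2x all_coords
```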

This PR contains a number of improvements to this situation:

  • A regression test has been added for each convention that checks the peak memory usage when constructing polygons for a large dataset.
  • All conventions preallocate some numpy arrays to construct point data in. By reusing this memory we can be sure that memory usage stays predictably low.
  • CFGrid1D, CFGrid2D, and ArakawaC grids all construct polygons row by row. This effectively batches the polygons into smaller groups (a sketch of this approach follows the list).
  • UGrid batches polygon construction into runs of no more than 10,000 polygons at a time.
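
A minimal sketch of the row-by-row idea, assuming a rectangular grid described by 1D coordinate bounds; the function and argument names (build_grid_polygons, x_bounds, y_bounds) are hypothetical, not the emsarray internals:

```python
import numpy as np
import shapely

def build_grid_polygons(x_bounds: np.ndarray, y_bounds: np.ndarray) -> np.ndarray:
    """Construct one polygon per cell, one row at a time.

    x_bounds has shape (nx, 2) and y_bounds has shape (ny, 2), giving the
    west/east and south/north edges of each column and row of cells.
    """
    nx, ny = x_bounds.shape[0], y_bounds.shape[0]
    polygons = np.empty(nx * ny, dtype=object)

    # Preallocate a single row's worth of ring coordinates and reuse it for
    # every row. The temporary copy shapely makes is then only row-sized,
    # so peak memory no longer scales with the whole dataset.
    row = np.empty((nx, 5, 2), dtype=np.float64)
    row[:, 0, 0] = x_bounds[:, 0]   # west
    row[:, 1, 0] = x_bounds[:, 1]   # east
    row[:, 2, 0] = x_bounds[:, 1]   # east
    row[:, 3, 0] = x_bounds[:, 0]   # west
    row[:, 4, 0] = x_bounds[:, 0]   # close the ring

    for j in range(ny):
        row[:, [0, 1, 4], 1] = y_bounds[j, 0]   # south edge
        row[:, [2, 3], 1] = y_bounds[j, 1]      # north edge
        polygons[j * nx:(j + 1) * nx] = shapely.polygons(row)
    return polygons
```

The UGrid case uses the same principle, except that faces are processed in fixed-size batches of at most 10,000 rather than one grid row at a time.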

The memory figures in this PR are taken from runs on my laptop. We should test on multiple systems to ensure that these figures are representative across different environments.

Tested on:

  • Linux x64, Python 3.14, numpy 2.3.4, shapely 2.1.2

Please run the following and copy the output to a comment, so we can compare memory usage:

$ pytest -vvrP --log-level info -k memory_usage

Please also run the full test suite to verify that everything passes. No need to include the output of this:

$ pytest -vv

mx-moth force-pushed the track-numpy-memory-usage branch from bc6fb8a to e4d3e9a on November 12, 2025 05:51
mx-moth (Contributor, Author) commented Nov 12, 2025

$ pytest -vvrP --log-level info -k memory_usage
============================================================ test session starts =============================================================
platform linux -- Python 3.14.0, pytest-9.0.0, pluggy-1.6.0 -- /home/hea211/projects/emsarray/emsarray/.conda/bin/python3.14
cachedir: .pytest_cache
Matplotlib: 3.10.7
Freetype: 2.6.1
rootdir: /home/hea211/projects/emsarray/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: cov-7.0.0, mpl-0.17.0
collected 409 items / 405 deselected / 4 selected

tests/conventions/test_cfgrid1d.py::test_make_polygon_memory_usage PASSED                                                              [ 25%]
tests/conventions/test_cfgrid2d.py::test_make_polygon_memory_usage PASSED                                                              [ 50%]
tests/conventions/test_shoc_standard.py::test_make_polygons_memory_usage PASSED                                                        [ 75%]
tests/conventions/test_ugrid.py::test_make_polygons_memory_usage PASSED                                                                [100%]

=================================================================== PASSES ===================================================================
_______________________________________________________ test_make_polygon_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_cfgrid1d:test_cfgrid1d.py:517 current memory usage: 128057886, peak memory usage: 132058438
_______________________________________________________ test_make_polygon_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_cfgrid2d:test_cfgrid2d.py:513 current memory usage: 192004310, peak memory usage: 292312287
______________________________________________________ test_make_polygons_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_shoc_standard:test_shoc_standard.py:674 current memory usage: 128003699, peak memory usage: 132004251
______________________________________________________ test_make_polygons_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_ugrid:test_ugrid.py:998 current memory usage: 60483162, peak memory usage: 77852479
===================================================== 4 passed, 405 deselected in 27.34s =====================================================

mx-moth (Contributor, Author) commented Nov 13, 2025

Linux x64 VM, Python 3.13 from conda, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1

$ COLUMNS=80 pytest -vvk memory_usage --log-level info -rP
============================= test session starts ==============================
platform linux -- Python 3.13.9, pytest-9.0.1, pluggy-1.6.0 -- /srv/emsarray/.conda/bin/python3.13
cachedir: .pytest_cache
Matplotlib: 3.10.7
Freetype: 2.6.1
rootdir: /srv/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: mpl-0.17.0, cov-7.0.0
collected 408 items / 404 deselected / 4 selected

tests/conventions/test_cfgrid1d.py::test_make_polygon_memory_usage PASSED [ 25%]
tests/conventions/test_cfgrid2d.py::test_make_polygon_memory_usage PASSED [ 50%]
tests/conventions/test_shoc_standard.py::test_make_polygons_memory_usage PASSED [ 75%]
tests/conventions/test_ugrid.py::test_make_polygons_memory_usage PASSED  [100%]

==================================== PASSES ====================================
________________________ test_make_polygon_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_cfgrid1d:test_cfgrid1d.py:517 current memory usage: 128068082, peak memory usage: 132068658
________________________ test_make_polygon_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_cfgrid2d:test_cfgrid2d.py:513 current memory usage: 192064830, peak memory usage: 292313623
_______________________ test_make_polygons_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_shoc_standard:test_shoc_standard.py:674 current memory usage: 128013731, peak memory usage: 132014307
_______________________ test_make_polygons_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_ugrid:test_ugrid.py:998 current memory usage: 60493654, peak memory usage: 77863035
====================== 4 passed, 404 deselected in 56.17s ======================

mx-moth (Contributor, Author) commented Nov 13, 2025

It also passes in whatever environment the GitHub runners use.

david-sh-csiro (Collaborator) commented:

WSL2 Windows 11, Python 3.13.1 from python, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1
[screenshot of pytest memory usage output]

mx-moth (Contributor, Author) commented Nov 13, 2025

Alright, I am pretty convinced that this works across multiple environments! This is ready for review.

david-sh-csiro (Collaborator) commented:

Windows 11, Python 3.14 from conda, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1
[screenshot of pytest memory usage output]

It seems Windows uses slightly more memory and we were right on the limit. I am unsure what aspects introduce the variance, or whether there is anything we can do about it.

david-sh-csiro (Collaborator) left a review comment:

Tested locally with MoVE. Triangulation of a 1 km resolution ugrid national dataset took around 37 seconds, which might be slightly slower than previous results. Otherwise it works as advertised. Memory usage was stable between 26% and 30% throughout the triangulation.

mx-moth (Contributor, Author) commented Nov 17, 2025

This PR shouldn't have affected triangulation much. That is a separate process that happens after constructing the polygons. Anything that was possible before should still be possible now, but hopefully we can now open even larger datasets.

Constructing the polygons might be marginally slower if you have more polygons in your dataset than the batch size, but not noticeably slower outside of benchmarks.

* origin/main:
  Update Python version in release automation workflow
  Bump version of Sphinx tools
  Bump minimum Python to 3.12, add 3.14 support
mx-moth merged commit d3ec547 into main on November 17, 2025. 15 checks passed.
mx-moth deleted the track-numpy-memory-usage branch on November 17, 2025 05:22.