mx-moth (Contributor) commented Nov 12, 2025

When opening particularly large datasets emsarray could crash with a memory error. The code would attempt to construct a numpy array containing the coordinates for all polygons in the dataset, then construct all the polygons in one batch. Because of shapely internals, this coordinate array would be copied, briefly doubling memory usage before the original array was garbage collected.
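
For reference, a minimal sketch of the all-at-once pattern described above, assuming shapely 2.x. The array sizes and variable names here are illustrative only, not the actual emsarray code:

```python
import numpy as np
import shapely

# Illustrative only: materialise the corner coordinates of every polygon in
# the dataset, then build all polygons in a single call. shapely copies the
# coordinates into its own buffers, so the original array and the copy are
# briefly alive at the same time and peak memory usage roughly doubles.
n_cells = 1_000_000
unit_square = np.array([(0., 0.), (1., 0.), (1., 1.), (0., 1.), (0., 0.)])
all_coords = np.broadcast_to(unit_square, (n_cells, 5, 2)).copy()  # ~80 MB
all_polygons = shapely.polygons(all_coords)  # peak usage ~ 2x all_coords
```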

This PR contains a number of improvements to this situation:

  • A regression test has been added for each convention that checks the peak memory usage when constructing polygons for a large dataset.
  • All conventions preallocate some numpy arrays to construct point data in. By reusing this memory we can be sure that memory usage stays predictably low.
  • CFGrid1D, CFGrid2D, and ArakawaC grids all construct polygons row by row. This effectively batches the polygons into smaller groups (a sketch of this approach follows the list).
  • UGrid batches polygon construction into runs of no more than 10,000 polygons at a time.
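
A minimal sketch of the row-by-row idea, assuming a rectangular grid described by 1D coordinate bounds; the function and argument names (build_grid_polygons, x_bounds, y_bounds) are hypothetical, not the emsarray internals:

```python
import numpy as np
import shapely

def build_grid_polygons(x_bounds: np.ndarray, y_bounds: np.ndarray) -> np.ndarray:
    """Construct one polygon per cell, one row at a time.

    x_bounds has shape (nx, 2) and y_bounds has shape (ny, 2), giving the
    west/east and south/north edges of each column and row of cells.
    """
    nx, ny = x_bounds.shape[0], y_bounds.shape[0]
    polygons = np.empty(nx * ny, dtype=object)

    # Preallocate a single row's worth of ring coordinates and reuse it for
    # every row. The temporary copy shapely makes is then only row-sized,
    # so peak memory no longer scales with the whole dataset.
    row = np.empty((nx, 5, 2), dtype=np.float64)
    row[:, 0, 0] = x_bounds[:, 0]   # west
    row[:, 1, 0] = x_bounds[:, 1]   # east
    row[:, 2, 0] = x_bounds[:, 1]   # east
    row[:, 3, 0] = x_bounds[:, 0]   # west
    row[:, 4, 0] = x_bounds[:, 0]   # close the ring

    for j in range(ny):
        row[:, [0, 1, 4], 1] = y_bounds[j, 0]   # south edge
        row[:, [2, 3], 1] = y_bounds[j, 1]      # north edge
        polygons[j * nx:(j + 1) * nx] = shapely.polygons(row)
    return polygons
```

The UGrid case uses the same principle, except that faces are processed in fixed-size batches of at most 10,000 rather than one grid row at a time.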

The memory figures in this PR are taken from runs on my laptop. We should test on multiple systems to ensure that these figures are representative across different environments.

Tested on:

  • Linux x64, Python 3.14, numpy 2.3.4, shapely 2.1.2

Please run the following and copy the output to a comment, so we can compare memory usage:

$ pytest -vvrP --log-level info -k memory_usage

Please also run the full test suite to verify that everything passes. No need to include the output of this:

$ pytest -vv

mx-moth force-pushed the track-numpy-memory-usage branch from bc6fb8a to e4d3e9a on November 12, 2025 05:51
mx-moth (Contributor, Author) commented Nov 12, 2025

$ pytest -vvrP --log-level info -k memory_usage
============================================================ test session starts =============================================================
platform linux -- Python 3.14.0, pytest-9.0.0, pluggy-1.6.0 -- /home/hea211/projects/emsarray/emsarray/.conda/bin/python3.14
cachedir: .pytest_cache
Matplotlib: 3.10.7
Freetype: 2.6.1
rootdir: /home/hea211/projects/emsarray/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: cov-7.0.0, mpl-0.17.0
collected 409 items / 405 deselected / 4 selected

tests/conventions/test_cfgrid1d.py::test_make_polygon_memory_usage PASSED                                                              [ 25%]
tests/conventions/test_cfgrid2d.py::test_make_polygon_memory_usage PASSED                                                              [ 50%]
tests/conventions/test_shoc_standard.py::test_make_polygons_memory_usage PASSED                                                        [ 75%]
tests/conventions/test_ugrid.py::test_make_polygons_memory_usage PASSED                                                                [100%]

=================================================================== PASSES ===================================================================
_______________________________________________________ test_make_polygon_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_cfgrid1d:test_cfgrid1d.py:517 current memory usage: 128057886, peak memory usage: 132058438
_______________________________________________________ test_make_polygon_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_cfgrid2d:test_cfgrid2d.py:513 current memory usage: 192004310, peak memory usage: 292312287
______________________________________________________ test_make_polygons_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_shoc_standard:test_shoc_standard.py:674 current memory usage: 128003699, peak memory usage: 132004251
______________________________________________________ test_make_polygons_memory_usage _______________________________________________________
------------------------------------------------------------- Captured log call --------------------------------------------------------------
INFO     tests.conventions.test_ugrid:test_ugrid.py:998 current memory usage: 60483162, peak memory usage: 77852479
===================================================== 4 passed, 405 deselected in 27.34s =====================================================

mx-moth (Contributor, Author) commented Nov 13, 2025

Linux x64 VM, Python 3.13 from conda, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1

$ COLUMNS=80 pytest -vvk memory_usage --log-level info -rP
============================= test session starts ==============================
platform linux -- Python 3.13.9, pytest-9.0.1, pluggy-1.6.0 -- /srv/emsarray/.conda/bin/python3.13
cachedir: .pytest_cache
Matplotlib: 3.10.7
Freetype: 2.6.1
rootdir: /srv/emsarray
configfile: pyproject.toml
testpaths: tests
plugins: mpl-0.17.0, cov-7.0.0
collected 408 items / 404 deselected / 4 selected

tests/conventions/test_cfgrid1d.py::test_make_polygon_memory_usage PASSED [ 25%]
tests/conventions/test_cfgrid2d.py::test_make_polygon_memory_usage PASSED [ 50%]
tests/conventions/test_shoc_standard.py::test_make_polygons_memory_usage PASSED [ 75%]
tests/conventions/test_ugrid.py::test_make_polygons_memory_usage PASSED  [100%]

==================================== PASSES ====================================
________________________ test_make_polygon_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_cfgrid1d:test_cfgrid1d.py:517 current memory usage: 128068082, peak memory usage: 132068658
________________________ test_make_polygon_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_cfgrid2d:test_cfgrid2d.py:513 current memory usage: 192064830, peak memory usage: 292313623
_______________________ test_make_polygons_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_shoc_standard:test_shoc_standard.py:674 current memory usage: 128013731, peak memory usage: 132014307
_______________________ test_make_polygons_memory_usage ________________________
------------------------------ Captured log call -------------------------------
INFO     tests.conventions.test_ugrid:test_ugrid.py:998 current memory usage: 60493654, peak memory usage: 77863035
====================== 4 passed, 404 deselected in 56.17s ======================

mx-moth (Contributor, Author) commented Nov 13, 2025

It also passes in whatever environment the GitHub runners use.

david-sh-csiro (Collaborator) commented:

WSL2 Windows 11, Python 3.13.1 from python, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1
[screenshot of pytest memory usage output]

mx-moth (Contributor, Author) commented Nov 13, 2025

Alright, I am pretty convinced that this works across multiple environments! This is ready for review.

david-sh-csiro (Collaborator) commented:

Windows 11, Python 3.14 from conda, numpy 2.3.4, shapely 2.1.2, xarray 2025.10.1
[screenshot of pytest memory usage output]

It seems Windows uses slightly more memory and we were right on the limit. I am unsure what aspects introduce the variance, or whether there is anything we can do about it.

david-sh-csiro (Collaborator) left a review comment:

Tested locally with MoVE. Triangulation of a 1 km resolution ugrid national dataset took around 37 seconds, which might be slightly slower than previous results. Otherwise it works as advertised. Memory usage was stable between 26% and 30% throughout the triangulation.

mx-moth (Contributor, Author) commented Nov 17, 2025

This PR shouldn't have affected triangulation much. That is a separate process that happens after constructing the polygons. Anything that was possible before should still be possible now, but hopefully we can now open even larger datasets.

Constructing the polygons might be marginally slower if you have more polygons in your dataset than the batch size, but not noticeably slower outside of benchmarks.

* origin/main:
  Update Python version in release automation workflow
  Bump version of Sphinx tools
  Bump minimum Python to 3.12, add 3.14 support
mx-moth merged commit d3ec547 into main on November 17, 2025. 15 checks passed.
mx-moth deleted the track-numpy-memory-usage branch on November 17, 2025 05:22.