
Fix: Address PerformanceWarnings thrown when running on a larger data set#63

Open
wong-hl wants to merge 4 commits intomasterfrom
fix-performance-warning

Conversation


@wong-hl wong-hl commented Oct 14, 2025

When run on larger datasets, a PerformanceWarning is often thrown about the number of chunks increasing by a factor of 320. This occurs when a blockwise operation takes place and the method invokes dask.array.unify_chunks(), which happens because the two arrays being operated on have different chunks.
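As an illustration of how mismatched chunks inflate the chunk count (the array sizes here are made up for the sketch, not from the actual pipeline):

```python
import dask.array as da

# Two arrays whose chunk boundaries don't line up (illustrative sizes).
a = da.ones(1000, chunks=100)  # 10 chunks
b = da.ones(1000, chunks=7)    # 143 chunks

# An elementwise (blockwise) op forces dask.array.unify_chunks(), which
# splits both inputs at the union of their chunk boundaries, so the
# result has more chunks than either input had.
c = a + b
print(c.numblocks)  # more blocks than either a or b
```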

In my code, this tended to occur in expressions of the form array_to_index_into[indices], where the two arrays have different chunks. The first occurrence was when I attempted to add the turbulence diagnostic data to the data frame. As the turbulence diagnostics live in a much larger array, its chunks are larger than those of the sparser observational data. This was the more complex case, as the number of partitions needs to line up with the original data frame once the turbulence diagnostic values are obtained. The second occurrence was during the computation of the ROC, where only the bucketed values are selected. I don't know why the second case still results in an increase in the number of chunks large enough to warrant a PerformanceWarning, as both arrays have the same chunk sizes even after putting in the fix.

The implemented solution revolves around manually ensuring that the chunks are approximately in agreement. This involved rechunking the smaller array to have the same chunk size as the larger array.
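A minimal sketch of that fix (the names and sizes are hypothetical, not the actual rojak code): rechunk the finely chunked indexer to the coarser chunk size before indexing, so unify_chunks() has nothing to split.

```python
import dask.array as da

# Coarsely chunked source array (stand-in for the turbulence diagnostics).
values = da.arange(1_000, chunks=250)
# Finely chunked indexer (stand-in for the dataframe-backed indices).
indices = da.arange(0, 1_000, 10, chunks=25)

# Rechunk the finer array to the coarser chunk size before indexing,
# so the blockwise machinery doesn't split chunks to unify them.
indices = indices.rechunk(values.chunks[0][0])

closest = values[indices]
```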

… flattened turbulence diagnostic array

By using warnings.filterwarnings('error', category=Warning, module='dask'), I was able to get the code to break where the PerformanceWarning was thrown. It showed that the number of chunks increased by 320 when indexing into the turbulence diagnostic array to get the closest value. This is because the chunking of the indexer array follows that of the data frame, while the turbulence diagnostic array follows the chunking of the met data. As the dataframe data is sparser, its chunks are smaller.
This has not been completely eliminated, but the increase in the number of partitions has been reduced from 20+ times to 12 times.
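The debugging trick above can be reproduced like this (the warn_explicit call is only a stand-in for the warning dask itself would emit; it is not part of the pipeline code):

```python
import warnings

# Promote any warning raised from a dask module to an error, so the
# traceback points at the exact operation that triggered it.
warnings.filterwarnings("error", category=Warning, module="dask")

# Stand-in for a warning emitted from inside dask's own modules:
try:
    warnings.warn_explicit(
        "Increasing number of chunks by factor of 320",
        UserWarning,
        filename="slicing.py",
        lineno=1,
        module="dask.array.slicing",
    )
except UserWarning as err:
    print(f"filter escalated the warning: {err}")
```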

codecov-commenter commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.64%. Comparing base (019a9c0) to head (a295039).

Files with missing lines Patch % Lines
src/rojak/turbulence/verification.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
- Coverage   77.56%   76.64%   -0.93%     
==========================================
  Files          24       24              
  Lines        3124     3125       +1     
  Branches      356      356              
==========================================
- Hits         2423     2395      -28     
- Misses        608      639      +31     
+ Partials       93       91       -2     

☔ View full report in Codecov by Sentry.


wong-hl commented Oct 17, 2025

When these changes are run with a specific spatial domain, the code works perfectly and runs very fast. However, when run on a different domain, it fails with lost dependencies. As I don't know whether this is a dask problem (see dask/dask#12055 as a potentially similar case) or a problem with my implementation, I'm going to leave this PR open and stale for now.


wong-hl commented Nov 4, 2025

Turns out it might not be due to that dask issue. My current hypothesis is that the partitioning is tied to the partitioning of the parquet files. For some reason, repartitioning them on the fly causes the dependencies to be lost.

To test whether:

  1. decreasing the number of partitions in the parquet files would decrease the number of loaded partitions in the resulting dataframe, and
  2. the larger partitions result in a PerformanceWarning with a smaller factor increase and don't cause the code to mysteriously fail with a lost-dependencies error,

I implemented #85. This loads the parquet files in a directory, repartitions it and saves the repartitioned dataframe. From running the analysis on the HPC with the repartitioned files, the factor increase in chunks in the PerformanceWarning has decreased from 1000+ to 80.

The original setup had each hour of every day as a single partition. The final setup had an entire month's worth of data split into 20 partitions.
