
Fix: Address PerformanceWarnings thrown when running on a larger data set#63

Open
wong-hl wants to merge 4 commits intomasterfrom
fix-performance-warning

Conversation


@wong-hl wong-hl commented Oct 14, 2025

When run on larger datasets, a PerformanceWarning is often thrown about the number of chunks increasing by a factor of 320. This occurs when a blockwise operation takes place and the method invokes dask.array.unify_chunks(), which happens because the two arrays being operated on have different chunks.
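As an illustration of how mismatched chunks inflate the chunk count (the array sizes here are made up for the sketch, not from the actual pipeline):

```python
import dask.array as da

# Two arrays whose chunk boundaries don't line up (illustrative sizes).
a = da.ones(1000, chunks=100)  # 10 chunks
b = da.ones(1000, chunks=7)    # 143 chunks

# An elementwise (blockwise) op forces dask.array.unify_chunks(), which
# splits both inputs at the union of their chunk boundaries, so the
# result has more chunks than either input had.
c = a + b
print(c.numblocks)  # more blocks than either a or b
```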

In my code, this tended to occur in expressions of the form array_to_index_into[indices], where the two arrays have different chunks. The first occurrence was when I attempted to add the turbulence diagnostic data to the data frame. As the turbulence diagnostics live in a much larger array, its chunks are larger than those of the sparser observational data. This was the more complex case, as the number of partitions needs to line up with the original data frame once the turbulence diagnostic values are obtained. The second occurrence was during the computation of the ROC, where only the bucketed values are selected. I don't know why the second case still results in an increase in the number of chunks large enough to warrant a PerformanceWarning, as both arrays have the same chunk sizes even after putting in the fix.

The implemented solution revolves around manually ensuring that the chunks are approximately in agreement. This involved rechunking the smaller array to have the same chunk size as the larger array.
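A minimal sketch of that fix (the names and sizes are hypothetical, not the actual rojak code): rechunk the finely chunked indexer to the coarser chunk size before indexing, so unify_chunks() has nothing to split.

```python
import dask.array as da

# Coarsely chunked source array (stand-in for the turbulence diagnostics).
values = da.arange(1_000, chunks=250)
# Finely chunked indexer (stand-in for the dataframe-backed indices).
indices = da.arange(0, 1_000, 10, chunks=25)

# Rechunk the finer array to the coarser chunk size before indexing,
# so the blockwise machinery doesn't split chunks to unify them.
indices = indices.rechunk(values.chunks[0][0])

closest = values[indices]
```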

… flattened turbulence diagnostic array

By using warnings.filterwarnings('error', category=Warning, module='dask'), I was able to get the code to break where the PerformanceWarning was thrown. It showed that the number of chunks increased by 320 when indexing into the turbulence diagnostic array to get the closest value. This is because the chunking of the indexer array follows that of the data frame, while the turbulence diagnostic array follows the chunking of the met data. As the dataframe data is sparser, its chunks are smaller.
This has not been completely eliminated, but the increase in the number of partitions has been reduced from 20+ times to 12 times.
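The debugging trick above can be reproduced like this (the warn_explicit call is only a stand-in for the warning dask itself would emit; it is not part of the pipeline code):

```python
import warnings

# Promote any warning raised from a dask module to an error, so the
# traceback points at the exact operation that triggered it.
warnings.filterwarnings("error", category=Warning, module="dask")

# Stand-in for a warning emitted from inside dask's own modules:
try:
    warnings.warn_explicit(
        "Increasing number of chunks by factor of 320",
        UserWarning,
        filename="slicing.py",
        lineno=1,
        module="dask.array.slicing",
    )
except UserWarning as err:
    print(f"filter escalated the warning: {err}")
```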

codecov-commenter commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.64%. Comparing base (019a9c0) to head (a295039).

Files with missing lines Patch % Lines
src/rojak/turbulence/verification.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
- Coverage   77.56%   76.64%   -0.93%     
==========================================
  Files          24       24              
  Lines        3124     3125       +1     
  Branches      356      356              
==========================================
- Hits         2423     2395      -28     
- Misses        608      639      +31     
+ Partials       93       91       -2     

☔ View full report in Codecov by Sentry.


wong-hl commented Oct 17, 2025

When these changes are run with a specific spatial domain, the code works perfectly and runs very fast. However, when run on a different domain, it fails with lost dependencies. As I don't know whether this is a dask problem (see dask/dask#12055 as a potentially similar case) or a problem with my implementation, I'm going to leave this PR open and stale for now.


wong-hl commented Nov 4, 2025

Turns out it might not be due to that dask issue. My current hypothesis is that the partitioning is tied to the partitioning of the parquet files. For some reason, repartitioning them on the fly causes the dependencies to be lost.

To test whether:

  1. decreasing the number of partitions in the parquet files would decrease the number of loaded partitions in the resulting dataframe, and
  2. the larger partitions result in a PerformanceWarning with a smaller factor increase and don't cause the code to mysteriously fail with a lost-dependencies error,

I implemented #85. This loads the parquet files in a directory, repartitions it and saves the repartitioned dataframe. From running the analysis on the HPC with the repartitioned files, the factor increase in chunks in the PerformanceWarning has decreased from 1000+ to 80.

The original setup had each hour of every day as a single partition. The final setup had an entire month's worth of data split into 20 partitions.
