Fix: Address PerformanceWarnings thrown when running on a larger data set#63
… flattened turbulence diagnostic array
By using `warnings.filterwarnings('error', category=Warning, module='dask')`, I was able to make the code raise at the point where the `PerformanceWarning` was thrown. It showed that the number of chunks increased by a factor of 320 when indexing into the turbulence diagnostic array to get the closest value. This is because the indexer array follows the chunking of the data frame, while the turbulence diagnostic array follows the chunking of the met data; since the data frame data is sparser, its chunks are smaller.
This has not been completely eliminated, but the growth in the number of partitions has been reduced from 20+ times to 12 times.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master      #63      +/-   ##
==========================================
- Coverage   77.56%   76.64%   -0.93%
==========================================
  Files          24       24
  Lines        3124     3125       +1
  Branches      356      356
==========================================
- Hits         2423     2395      -28
- Misses        608      639      +31
+ Partials       93       91       -2
```
When these changes are run with a specific spatial domain, the code works perfectly and runs very fast. However, when run on a different domain, it will fail.
It turns out it might not be due to that. To test this, I implemented #85, which loads the parquet files in a directory, repartitions them, and saves the repartitioned data. The original setup had each hour of every day as a single partition; the final setup had an entire month's worth of data split into 20 partitions.
When run on larger datasets, a `PerformanceWarning` is often thrown about the number of chunks increasing by a factor of 320. This occurs when a blockwise operation takes place and `dask.array.unify_chunks()` is invoked by the method. This is invoked because two arrays are being operated on and they have different chunks.

In my code, the case where this tended to occur was `array_to_index_into[indices]`, where the two arrays have different chunks. The first occurrence was when I attempted to add the turbulence diagnostic data to the data frame. As the turbulence diagnostics are in a much larger array, its chunks are larger than those of the sparser observational data. This was a more complex case, as the number of partitions needs to line up with the original data frame once the turbulence diagnostic values are obtained. The second occurrence was during the computation of the ROC, where only the bucketed values are selected. I don't know why the second case still results in an increase in the number of chunks that warrants a `PerformanceWarning`, as both arrays have the same chunk sizes even after putting in the fix.

The implemented solution revolves around manually ensuring that the chunks were approximately in agreement, by rechunking the smaller-chunked array to have chunks the same size as those of the larger array.
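The alignment idea can be sketched as follows. The array names and shapes are hypothetical; the point is that a blockwise operation on mismatched chunk layouts calls `unify_chunks()`, which falls back to the finest common chunking and multiplies the block count, while rechunking the finely chunked array to the other array's layout first avoids the blowup:

```python
import dask.array as da

big = da.ones((1000,), chunks=250)   # 4 chunks, like the met-data array
small = da.ones((1000,), chunks=10)  # 100 chunks, like the sparse frame data

# Operating on mismatched layouts invokes unify_chunks(), which adopts the
# finest common chunking: the result has 100 blocks, not 4.
mismatched = big + small
print(mismatched.numblocks)  # -> (100,)

# Rechunk the finely chunked array to the larger array's layout first, so
# the blockwise operation maps block-for-block with no chunk explosion.
aligned = small.rechunk(big.chunks)
result = big + aligned
print(result.numblocks)  # -> (4,)
```

`rechunk` is itself a (cheap, metadata-plus-concatenation) graph rewrite, so paying for it once up front is generally much cheaper than letting every downstream blockwise operation inherit the exploded chunk count.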