forked from Climate-Smart-Public-Health/era5_sandbox
Diurnal Aggregation Algorithm #21
Open: TinasheMTapera wants to merge 22 commits into `main` from `develop`.
…se resolves "Add CC-BY License to the ERA5 dataset" #20
…ublish datasets to the Harvard Dataverse. Also a first attempt at nbdev with Quarto.
…tration of how to use `pytask` to manage data processing tasks in a Pythonic way, using decorators and type hints to define tasks and their dependencies.
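For reviewers new to `pytask`, a minimal sketch of that pattern is below. The file names and the task itself are hypothetical, not the repo's actual tasks; it only illustrates how `pytask` infers dependencies and products from type hints.

```python
from pathlib import Path
from typing import Annotated

from pytask import Product


def task_convert_units(
    raw: Path = Path("bld/era5_raw.nc"),                        # hypothetical input
    out: Annotated[Path, Product] = Path("bld/era5_clean.nc"),  # hypothetical output
) -> None:
    # pytask collects this function because of the `task_` prefix; `raw` is
    # inferred as a dependency and `out` as a product from the annotations.
    out.write_bytes(raw.read_bytes())  # placeholder transformation
```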
- Tested out pytask for building pipelines
- Used the pytask data catalog to create sets of tasks as parameters to functions using namedtuples
- Used the pytask data catalog to manage the parallelization of tasks
- Created a pytask logger to log the progress of tasks
- Implemented the download step of querying the ERA5 dataset in pytask
- Began implementation of the aggregation step in pytask:
  - Used the astral library to find the time of sunrise and sunset for each data point in a query (see the sketch after this list)
  - Assigned a diurnal class to each data point based on the time of day
  - Aggregation of data points by date and diurnal class in progress
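As a rough sketch of the astral step, the classification of a single UTC timestamp at a single grid cell might look like the following. The coordinates and the threshold logic (sunrise-to-sunset counts as day) are illustrative assumptions, not necessarily the repo's exact rule.

```python
from datetime import datetime, timezone

from astral import Observer
from astral.sun import sun


def diurnal_class(ts: datetime, lat: float, lon: float) -> str:
    """Label a UTC timestamp as 'day' or 'night' for a given grid cell."""
    observer = Observer(latitude=lat, longitude=lon)
    s = sun(observer, date=ts.date(), tzinfo=timezone.utc)
    return "day" if s["sunrise"] <= ts <= s["sunset"] else "night"


# e.g. noon UTC over Boston falls between sunrise and sunset
print(diurnal_class(datetime(2024, 6, 1, 12, tzinfo=timezone.utc), 42.36, -71.06))
```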
- Adopt Quarto for documentation and notebooks, making use of [this nbdev PR](AnswerDotAI/nbdev#1521) that allows full `.qmd`-driven packages
- Convert all `.ipynb` files to `.qmd` format
- Use `nbdev_docs` to generate the documentation website
- Adopt a logger that solves #3
This commit includes significant updates to the ERA5 data processing pipeline, focusing on using and demonstrating `pytask` as our workflow management tool. Key changes include:
- Deleted obsolete log files for various datasets from 2015, 2017, 2019, 2021, and 2024.
- Removed unnecessary Hydra configuration files and logs from the 2025-03-17 run.
- Updated the SLURM batch script to reduce maximum runtime from 18 hours to 6 hours.
- Added the pytask `config.py` to introduce a demo data catalog and adjust the data catalog structure.
- Introduced the query object in `task_download.py` to handle data queries more effectively.
- Added `task_aggregate.py` with a modified function to convert netCDF to GeoTIFF (a generic sketch follows this list).
- Refactored `task_download.py` to improve query handling and logging.
- Cleaned up imports and improved code organization across multiple modules.
- Updated documentation comments to reflect recent changes and maintain clarity.
- Added nbdev Quarto website documentation files.
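The netCDF-to-GeoTIFF conversion is not shown in this summary; a common way to do it is with `rioxarray`, roughly as below. The file path, variable name (`t2m`), and dimension names are assumptions about the ERA5 output, not the repo's actual function.

```python
import rioxarray  # noqa: F401  (registers the .rio accessor on xarray objects)
import xarray as xr

# hypothetical ERA5 file and variable; ERA5 uses a regular lat/lon grid (WGS84)
ds = xr.open_dataset("bld/era5_2m_temperature.nc")
da = (
    ds["t2m"]
    .rio.set_spatial_dims(x_dim="longitude", y_dim="latitude")
    .rio.write_crs("EPSG:4326")
)

# GeoTIFF has no time axis, so write one time slice (or loop, one band each)
da.isel(time=0).rio.to_raster("bld/era5_2m_temperature.tif")
```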
…tatypes by solar date
…each xarray is classified and resampled, but we still need to convert to raster and then aggregate by polygon... not clear how to do this yet
… use a DataFrame for diurnal classification. WIP: continue trying to figure out how to rasterize xarray data so that it works with the `polygon_to_raster_cells` function. (One standard alternative is sketched below.)
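For the "aggregate by polygon" step that the last two commits leave open, one conventional approach is zonal statistics over a GeoTIFF, e.g. with the `rasterstats` package rather than `polygon_to_raster_cells`. This is a swapped-in technique, not the repo's method, and the input paths are hypothetical.

```python
import geopandas as gpd
from rasterstats import zonal_stats

# hypothetical inputs: healthshed polygons plus a GeoTIFF produced upstream
healthsheds = gpd.read_file("data/healthsheds.geojson")
stats = zonal_stats(healthsheds, "bld/era5_t2m_day.tif", stats=["mean", "count"])

# attach the per-polygon mean back onto the GeoDataFrame
healthsheds["t2m_day_mean"] = [s["mean"] for s in stats]
```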
First, find the classification of each point using sun position; then create two copies of the dataset with NaNs in the masked values; then resample by day. Importantly, the time must be set to the local time zone for the resampling by day to work correctly.
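A minimal sketch of that mask-then-resample step, assuming a boolean `is_day` mask with the same shape as the data (e.g. from the astral classification above). For simplicity it shifts the whole array by a single longitude-derived offset; the pipeline itself works per grid cell. Names are illustrative.

```python
import pandas as pd
import xarray as xr


def daily_diurnal_mean(da: xr.DataArray, is_day: xr.DataArray, lon: float) -> xr.DataArray:
    # Mask first: values outside the diurnal class become NaN.
    masked = da.where(is_day)
    # Approximate local solar time: ~1 hour per 15 degrees of longitude,
    # so daily bins line up with the local calendar date.
    offset = pd.Timedelta(hours=round(lon / 15.0))
    masked = masked.assign_coords(time=masked["time"] + offset)
    return masked.resample(time="1D").mean()  # NaNs are skipped by default


# one copy per diurnal class:
# day_means = daily_diurnal_mean(ds["t2m"], is_day, lon=-71.0)
# night_means = daily_diurnal_mean(ds["t2m"], ~is_day, lon=-71.0)
```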
Parameterization now looks good: pandas builds a dataframe of all parameter combinations, filters out the ones that don't apply, and combines the rest into a single dataframe that can be iterated over in the task function.
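A sketch of that jobs-dataframe construction; the parameter axes and the filtering rule here are made-up placeholders, not the pipeline's real parameters.

```python
import itertools

import pandas as pd

# hypothetical parameter axes
years = [2015, 2017, 2019, 2021, 2024]
variables = ["2m_temperature", "total_precipitation"]
classes = ["day", "night"]

jobs = pd.DataFrame(
    list(itertools.product(years, variables, classes)),
    columns=["year", "variable", "diurnal_class"],
)

# drop combinations that don't apply (illustrative rule only)
jobs = jobs[~((jobs["variable"] == "total_precipitation") & (jobs["year"] < 2017))]
jobs = jobs.reset_index(drop=True)
```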
…kes a row from the jobs dataframe as input, which makes it easier to manage parameters (see the sketch after this list).
- The algorithm splits the data into day and night based on local time, which is determined from the longitude of the grid cell.
- Remaining steps: change the query to use the new jobs dataframe and update the notebook to reflect these changes; run and test the entire workflow to ensure everything works as expected; merge the aggregations into a single file per calendar month.
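One row-per-task wiring that matches this description uses pytask's `@task` decorator in a loop over the jobs dataframe built above. The task body and output naming are hypothetical.

```python
from pathlib import Path
from typing import Annotated

from pytask import Product, task

# `jobs` is the parameter dataframe built above
for idx, row in jobs.iterrows():
    out = Path(f"bld/{row.variable}_{row.year}_{row.diurnal_class}.nc")

    @task(id=f"{row.variable}-{row.year}-{row.diurnal_class}")
    def task_aggregate(row=row, produces: Annotated[Path, Product] = out) -> None:
        # hypothetical body: query and aggregate according to this row's
        # parameters; default args bind `row` and `out` per iteration.
        produces.write_text(str(row.to_dict()))
```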
- Separate the qmd and ipynb files for notes and processing to test pipeline integrity
- Refactored `config.py` to enhance the data catalog structure and improve query handling; the data catalog now uses dataframes to manage jobs
- Updated `download.py` to improve the download process and added checks for existing files (see the sketch after this list)
- Improved `pytask_logger.py` for better logging setup
- Enhanced `task_aggregate.py` to optimize aggregation tasks and ensure proper output handling
- Updated `task_data_preparation.py` to improve task definitions and exports
- Refined `task_download.py` to include checks for existing downloads and improve logging
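The existing-file check might look roughly like this around the CDS API client; the function name and the `dataset`/`request` arguments are placeholders, and the repo's actual check may differ.

```python
from pathlib import Path

import cdsapi


def download_if_missing(target: Path, dataset: str, request: dict) -> Path:
    # Skip the CDS request entirely when the file already exists on disk.
    if target.exists():
        return target
    client = cdsapi.Client()
    client.retrieve(dataset, request, str(target))
    return target
```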
- Updated Jupyter Notebook metadata to enable execution of all cells.
- Added a new core module for internal functions and testing, including utilities for path expansion, dynamic function importing, and directory structure creation.
- Implemented a Google Drive authentication class for fetching healthshed files.
- Created a `ClimateDataFileHandler` class to manage different file types from the Climate Data Store (CDS).
- Added a `testAPI` function to validate API connections and configurations.
- Updated the aggregation module to use a specific example file for testing.
- Refactored various notebooks to improve clarity and execution flow.
- Removed unnecessary execution flags from multiple notebooks.
- Enhanced the `task_aggregate.py` script to include raster calculations and aggregation to healthsheds.
This PR creates a `pytask` pipeline for diurnal aggregation, which lets us aggregate data to night and day values based on the position of the sun. It is separate from the original Snakemake + Hydra pipeline, but the repo contains both.

Review Instructions
To review this PR, please first clone and install the package in a clean conda environment:

```sh
conda create -n NAME python=3.12
conda activate NAME
pip install -e .
```
Then, symlink the data (in `pytask`, the `bld` folder is what they use for `data`).

Then, open the docs website to read the notebooks explaining the functionality. You can do this by right-clicking `_docs/index.html` in VSCode and clicking "Show Preview". Alternatively, you can run the notebook code in the `notes` folder (they are identical).

Notebooks to review:
Next, you can test out `pytask` in your terminal. Due to the large number of tasks, this can take up to 10 minutes to run.

Then, you can delete a file from `bld` and submit a pipeline job to run it. There is already an sbatch script set up for this to run in parallel:

Improvements that could be made:
Closes #16, #3, #17, #19, #20