snakemake #14
base: main
Changes from all commits
**README.md**
@@ -1,9 +1,66 @@
# drb_gridmet_tools

Repository with functions to aggregate [gridmet climate raster data](https://www.climatologylab.org/gridmet.html) from pixel grid to vector aoi (hru polygon grid, polylines). This repository relies heavily on the [grd2shp_xagg](https://github.com/rmcd-mscb/grd2shp_xagg) library, by [rmcd-mscb](https://github.com/rmcd-mscb), which in turn relies on [xagg](https://github.com/ks905383/xagg).
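As a rough illustration of what that stack does, here is a minimal, hypothetical sketch of the xagg pattern (the input file names are placeholders, not files in this repo):

```python
# Sketch of the pixel-to-polygon idea behind grd2shp_xagg / xagg:
# compute raster-cell/polygon overlap weights once, then reuse them
# to take area-weighted averages of any variable onto the polygons.
import geopandas as gpd
import xagg as xa
import xarray as xr

ds = xr.open_dataset("gridmet_tmmx.nc")    # placeholder gridMET slice
gdf = gpd.read_file("hru_polygons.gpkg")   # placeholder AOI polygons

weightmap = xa.pixel_overlaps(ds, gdf)     # cell/polygon overlap weights
aggregated = xa.aggregate(ds, weightmap)   # area-weighted means per polygon
```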
|
|
# Accessing the re-gridded files
## On Caldera
On Caldera, re-gridded file paths are structured like this: `/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/drb-gridmet/{fabric}/{run_date}/drb_climate_{run_date}.nc`. For example: `/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc`.

## On S3
As of 6/2022, the results of the workflow are, by default, also stored on S3 in the `drb-gridmet` bucket. For example, `s3://drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc`.
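A minimal sketch of reading one of these published files straight from S3 (this assumes the `s3fs` and `h5netcdf` packages are installed and that your environment holds AWS credentials with read access to the bucket):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()  # credentials are picked up from the environment
with fs.open("drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc") as f:
    ds = xr.open_dataset(f, engine="h5netcdf").load()
print(ds)
```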
|
|
# Running re-gridding for the Delaware River Basin via Snakemake
The re-gridding process for the Delaware River Basin has been run on USGS's Tallgrass via Singularity, for both the National Hydrologic Model (NHM) fabric and the National Hydrography Dataset (NHD) fabric. It should also be possible to run via Docker, but this has not yet been tried. Instructions for running and modifying the pipeline are below.

_Parallelization in Snakemake_
The Snakemake workflow parallelizes the re-gridding of the 8 gridMET variables. As long as you provide at least 8 cores, these tasks will run in parallel.
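To make the fan-out concrete, here is a sketch of the `expand()` pattern the Snakefile (shown later in this diff) uses to generate one re-gridding target per variable; the fabric, date, and prefix values are illustrative:

```python
from snakemake.io import expand

targets = expand(
    "drb-gridmet/{fabric_id}/{d}/{prefix}_var_{variable}_climate_{d}.nc",
    fabric_id="nhm", d="2022_06_14", prefix="drb",
    variable=['tmmx', 'tmmn', 'pr', 'srad', 'vs', 'rmax', 'rmin', 'sph'])
print(len(targets))  # 8 targets -> 8 independent jobs, one per gridMET variable
```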
|
|
_S3_
By default, the results of the workflow are stored both locally and on S3 in the `drb-gridmet` bucket. If you would like to store the resulting files only locally, change the `use_S3` option in the config file to `False`.
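This is handled with Snakemake's S3 remote provider; the pattern below is taken from the Snakefile in this diff, where `keep_local=True` keeps the local copies alongside the S3 uploads (the example path is illustrative):

```python
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

S3 = S3RemoteProvider(keep_local=True)  # upload to S3 but keep the local file
final_files = [S3.remote(f)
               for f in ["drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc"]]
```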
|
|
NOTE: Although this workflow was first run on DRB catchments, it should be able to run on any set of polygons in the conterminous US. This, however, has not been tested.
|
|
## Run via Singularity
To run the workflow with Singularity, you can either use the image already in Caldera via Tallgrass, or pull the image into a new directory.

### Option A. Use the existing image and cloned repo
1. Move to the correct directory:

```
cd /caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/
```

2. Decide on which fabric you want to use. You will use a different config file depending on which fabric you use: `config_nhm.yml` for the NHM fabric and `config_nhd.yml` for the NHD fabric.
3. [Optional] Edit the options in the config file.
4. Run the workflow. You can run the workflow on Tallgrass either in batch mode or interactively. Substitute your HPC account (may not be 'iidd') and desired config file (may not be `config_nhd.yml`).
|
|
You can run the workflow via `sbatch`:
```
sbatch -A iidd slurm/launch_snakemake.slurm config_nhd.yml
```
It may be helpful instead to run it interactively:

```
salloc -N 1 -n 8 -t 10:00:00 -p cpu -A iidd
module load singularity
singularity exec gridmet-agg_v0.3.sif /opt/conda/bin/snakemake -j --configfile config_nhd.yml
```
|
> **Collaborator:** I think this is a typo right?
> (Suggested change)

**Comment on lines +45 to +47**

> **Collaborator:** When I run: from the
>
> **Member:** There isn't write access to all the files / directories (see below) so everyone but Jeff will get this error. We'll have to ask HPC folks to give group permissions in this folder because I'm pretty sure Jeff doesn't have access to USGS HPC anymore.
>
> **Member:** I'll ask in #tallgrass-help channel
>
> **Collaborator:** that makes sense 👍
>
> **Collaborator:** did we get any response on this?
>
> **Member:** Yes (see here), they changed who owns the files and permissions should be updated.
### Option B. Executing in a different directory
1. Clone the repo and move to the `drb_gridmet_tools` directory:
```
git clone git@github.com:USGS-R/drb_gridmet_tools.git
cd drb_gridmet_tools
```
2. Pull down the Docker image into a Singularity file:
```
singularity pull docker://jsadler2/gridmet-agg:v0.3
```
3. Do Steps 2-4 from Option A.
**Comment on lines +53 to +61**

> **Collaborator:** Ran this in my own user dir in tallgrass (as well as on a different project dir) in and
>
> **Member:** I think this refers to the S3 credentials - could you try re-running after setting
>
> **Collaborator:** sure, in that case I run locally?
>
> **Member:** I think that'd help to run locally.
# Selected gridMET variables:

tmmx:
* Description: Daily Maximum Temperature (2m)\

@@ -38,8 +95,4 @@ sph:
* Units: kg/kg
|
|
## Running re-gridding for the Delaware River Basin

`gridmet_split_script.py` processes the gridmet raster dataset values to the scale of the input multi-polygon shapefile.

`gridmet_aggregation_PRMS.py` processes the output of `gridmet_split_script.py` and aggregates to the PRMS_segid scale by calculating an area-weighted average.
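Since the internals of `gridmet_aggregation_PRMS.py` are not shown in this diff, here is a hand-rolled sketch of the area-weighted average it describes, with toy numbers and hypothetical column values:

```python
# Area-weighted average of an HRU-scale variable onto a PRMS segment.
import pandas as pd

df = pd.DataFrame({
    "PRMS_segid": ["seg1", "seg1"],
    "time": ["2022-06-14", "2022-06-14"],
    "tmmx": [301.2, 299.8],          # variable values on two HRUs
    "hru_area_m2": [3.0e6, 1.0e6]})  # the larger HRU gets 3x the weight

def area_weighted(group, col, wgt):
    return (group[col] * group[wgt]).sum() / group[wgt].sum()

agg = df.groupby(["PRMS_segid", "time"]).apply(area_weighted,
                                               col="tmmx", wgt="hru_area_m2")
print(agg)  # seg1, 2022-06-14: (301.2*3e6 + 299.8*1e6) / 4e6 = 300.85
```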

**Snakefile**
@@ -0,0 +1,167 @@
import os
from datetime import datetime

import geopandas as gpd
import requests
import xarray as xr
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider

from gridmet_split_script import get_gridmet_datasets, create_weightmap, g2shp_regridding
from gridmet_aggregation_PRMS import ncdf_to_gdf, gridmet_prms_area_avg_agg


todays_date = datetime.today().strftime('%Y_%m_%d')

final_files = []

nc_file_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.nc"
final_nc_file_path = nc_file_path.format(fabric_id=config['fabric_id'],
                                         todays_date=todays_date,
                                         run_prefix=config['run_prefix'])

seg_file_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}_segments.csv"
final_seg_file_path = seg_file_path.format(fabric_id=config['fabric_id'],
                                           todays_date=todays_date,
                                           run_prefix=config['run_prefix'])

zarr_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.zarr.ind"
final_zarr_path = zarr_path.format(fabric_id=config['fabric_id'],
                                   todays_date=todays_date,
                                   run_prefix=config['run_prefix'])


# segment-scale aggregation is only defined for the NHM fabric
if config['fabric_id'] == 'nhm':
    final_files.append(final_seg_file_path)

if config['make_zarr']:
    final_files.append(final_zarr_path)

if config['use_S3']:
    # keep_local=True uploads outputs to S3 while also keeping the local copies
    S3 = S3RemoteProvider(keep_local=True)
    final_files = [S3.remote(f) for f in final_files]
    nc_file_path = S3.remote(nc_file_path)
    seg_file_path = S3.remote(seg_file_path)


rule all:
    input:
        final_files


rule make_weight_map:
    output:
        "drb-gridmet/{fabric_id}/grd2shp_weights.pickle"
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        # getting just one date and one variable to make the weight map;
        # the same weight map applies to all dates and all variables
        data_dict = get_gridmet_datasets(variable="tmmn",
                                         start_date="2001-01-01",
                                         end_date="2001-01-02",
                                         polygon_for_bbox=gdf)
        create_weightmap(xarray_dict=data_dict,
                         polygon=gdf,
                         output_data_folder=os.path.split(output[0])[0])


rule aggregate_gridmet_to_polygons_one_var:
    input:
        "drb-gridmet/{fabric_id}/grd2shp_weights.pickle"
    output:
        "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_var_{variable}_climate_{todays_date}.nc"
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        data_dict = get_gridmet_datasets(variable=wildcards.variable,
                                         start_date=config.get('start_date', "1979-01-01"),
                                         end_date=config.get('end_date', todays_date.replace("_", "-")),
                                         polygon_for_bbox=gdf)
        g2shp_regridding(xarray_dict=data_dict,
                         polygon=gdf,
                         weightmap_file=input[0],
                         g2s_file_prefix=f'{wildcards.run_prefix}_var_{wildcards.variable}_',
                         output_data_folder=os.path.split(output[0])[0],
                         g2s_time_var='day',
                         g2s_lat_var='lat',
                         g2s_lon_var='lon')


rule gather_gridmets:
    input:
        expand("drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_var_{variable}_climate_{todays_date}.nc",
               fabric_id=config['fabric_id'],
               todays_date=todays_date,
               run_prefix=config['run_prefix'],
               variable=config['data_vars'])
    output:
        nc_file_path
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        ds_list = [xr.open_dataset(nc_file) for nc_file in input]
        ds_combined = xr.merge(ds_list)
        # index the combined dataset by the fabric's id column instead of geomid
        ds_combined = ds_combined.assign_coords({config["id_col"]: ("geomid", gdf[config["id_col"]])}).swap_dims({"geomid": config["id_col"]})
        ds_combined = ds_combined.drop("geomid")
        ds_combined.to_netcdf(output[0])


rule aggregate_gridmet_polygons_to_flowlines:
    input:
        "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.nc"
    output:
        seg_file_path
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        gridmet_drb_gdf = ncdf_to_gdf(ncdf_path=input[0],
                                      shp=gdf,
                                      left_on=config["id_col"],
                                      right_on=config["id_col"])
        df_agg = gridmet_prms_area_avg_agg(gridmet_drb_gdf,
                                           groupby_cols=['PRMS_segid', "time"],
                                           val_colnames=config['data_vars'],
                                           wgt_col='hru_area_m2',
                                           output_path=output[0])


checkpoint write_zarr:
    input:
        "drb-gridmet/{filename}.nc"
    output:
        directory("/tmp/{filename}.zarr")
    run:
        ds = xr.open_dataset(input[0])
        # one chunk across all of time, 100 features per chunk along the id dim
        ds = ds.chunk({"time": len(ds.time), config["id_col"]: 100})
        ds.to_zarr(output[0])


def get_zarr_files(wildcards):
    # list every file the zarr checkpoint wrote to /tmp, mapped to its
    # corresponding drb-gridmet destination path
    zarr_files = []
    for path, dirs, files in os.walk(f"/tmp/{wildcards.filename}.zarr"):
        for file in files:
            scratch_path = os.path.join(path, file)
            drb_path = scratch_path.replace("/tmp", "drb-gridmet")
            zarr_files.append(drb_path)
    return zarr_files


rule write_zarr_ind:
    input:
        "/tmp/{filename}.zarr",
        get_zarr_files
    output:
        "drb-gridmet/{filename}.zarr.ind"
    shell:
        "touch {output[0]}"


if config['use_S3']:
    rule copy_from_scratch_to_s3:
        input:
            "/tmp/{filename}.zarr/{zarr_file}"
        output:
            S3.remote("drb-gridmet/{filename}.zarr/{zarr_file}")
        shell:
            "cp {input[0]} {output[0]}"

else:
    rule copy_from_scratch_to_s3:
        input:
            "/tmp/{filename}.zarr/{zarr_file}"
        output:
            "drb-gridmet/{filename}.zarr/{zarr_file}"
        shell:
            "cp {input[0]} {output[0]}"
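Before committing cluster time, the job graph can be checked with a dry run. A minimal sketch using Snakemake's Python API (this assumes a pre-v8 Snakemake, whose API exposes `snakemake.snakemake()`; the equivalent CLI call is `snakemake -n -j 8 --configfile config_nhm.yml`):

```python
# Dry-run sketch: build and print the job plan without executing anything.
import snakemake

ok = snakemake.snakemake(
    "Snakefile",
    configfiles=["config_nhm.yml"],  # or config_nhd.yml
    cores=8,                         # enough cores to fan out all 8 variables
    dryrun=True)
print("plan OK" if ok else "plan failed")
```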

**config_nhd.yml**
@@ -0,0 +1,23 @@
# probably don't change these
catchment_file_path: "https://github.com/USGS-R/drb-network-prep/blob/main/1_fetch/out/NHDPlusv2_catchments.gpkg?raw=true"

fabric_id: 'nhd'

id_col: "COMID"

# feel free to change these
data_vars: ['tmmx', 'tmmn', 'pr', 'srad', 'vs', 'rmax', 'rmin', 'sph']

use_S3: True

make_zarr: True

run_prefix: "drb"

# if not specified, data will be processed from 1979-01-01
#start_date: "2022-05-09"

# if not specified, data will be processed to today's date
#end_date: "2022-06-13"
|
|
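The commented-out date keys work because the Snakefile reads them with `config.get()` and falls back to the full record; a sketch of that fallback, mirroring the Snakefile above:

```python
from datetime import datetime

config = {}  # stand-in for this YAML with both date keys left commented out
todays_date = datetime.today().strftime('%Y_%m_%d')
start = config.get('start_date', "1979-01-01")               # full gridMET record
end = config.get('end_date', todays_date.replace("_", "-"))  # through today
print(start, end)
```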

**config_nhm.yml**
@@ -0,0 +1,21 @@
# probably don't change these
catchment_file_path: "https://github.com/USGS-R/drb-network-prep/blob/940073e8d77c911b6fb9dc4e3657aeab1162a158/2_process/out/GFv1_catchments_edited.gpkg?raw=true"

fabric_id: 'nhm'

id_col: "PRMS_segid"

# feel free to change these
data_vars: ['tmmx', 'tmmn', 'pr', 'srad', 'vs', 'rmax', 'rmin', 'sph']

use_S3: True

make_zarr: True

run_prefix: "drb"

# leave blank if you want to run from 1979-01-01
#start_date: "2022-05-09"

# leave blank if you want to run through current date
#end_date: "2022-06-13"

> I like this set up and the Snakefile is clear! I did not get a chance to say this earlier, but thanks for setting this up like this.