This repository was archived by the owner on Jun 2, 2023. It is now read-only.
65 changes: 59 additions & 6 deletions README.md
@@ -1,9 +1,66 @@
# drb_gridmet_tools

Repository with functions to aggregate [gridmet climate raster data](https://www.climatologylab.org/gridmet.html) from pixel grid to vector aoi (hru polygon grid, polylines). This repository heavily relies on [grd2shp_xagg](https://github.com/rmcd-mscb/grd2shp_xagg) library, by [rmcd-mscb](https://github.com/rmcd-mscb)
Repository with functions to aggregate [gridmet climate raster data](https://www.climatologylab.org/gridmet.html) from pixel grid to vector AOI (HRU polygon grid, polylines). This repository relies heavily on the [grd2shp_xagg](https://github.com/rmcd-mscb/grd2shp_xagg) library, by [rmcd-mscb](https://github.com/rmcd-mscb), which in turn relies on [xagg](https://github.com/ks905383/xagg).

# Accessing the re-gridded files
## On Caldera
On Caldera, re-gridded file paths are structured like this: `/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/drb-gridmet/{fabric}/{run_date}/drb_climate_{run_date}.nc`. For example: `/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc`.
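
For orientation, a concrete path can be assembled from that template with plain string formatting. This is an illustrative sketch only; the `fabric` and `run_date` values are example placeholders:

```
# Illustrative only: fill in the Caldera path template for one run
fabric = "nhm"            # or "nhd"
run_date = "2022_06_14"   # formatted YYYY_MM_DD
path = (
    "/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/"
    f"drb-gridmet/{fabric}/{run_date}/drb_climate_{run_date}.nc"
)
```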

## Selected gridMET variables:
## On S3
As of 6/2022, the results of the workflow are, by default, also stored on S3 in the `drb-gridmet` bucket. For example, `s3://drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc`.
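
One hedged way to open such a file straight from S3 is sketched below. It assumes AWS credentials are configured and that the `s3fs` and `h5netcdf` packages are available; neither is a stated requirement of this repository:

```
import s3fs
import xarray as xr

# Illustrative sketch: stream the netCDF from the drb-gridmet bucket
fs = s3fs.S3FileSystem()
with fs.open("drb-gridmet/nhm/2022_06_14/drb_climate_2022_06_14.nc") as f:
    # load() reads the data into memory before the file handle closes
    ds = xr.open_dataset(f, engine="h5netcdf").load()
```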

# Running re-gridding for the Delaware River Basin via Snakemake
The re-gridding process for the Delaware River Basin has been run on USGS's Tallgrass via Singularity, for both the National Hydrologic Model (NHM) fabric and the National Hydrography Dataset (NHD) fabric. It should be runnable via Docker as well as Singularity, but this has not been tried yet. Instructions for running and modifying the pipeline are below.

_Parallelization in Snakemake_
The Snakemake workflow parallelizes the re-gridding of the eight gridMET variables. As long as you provide at least eight cores, these tasks will run in parallel.
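
For example, a bare invocation with eight cores might look like the following (illustrative; it assumes `snakemake` is available outside the container, whereas the Singularity instructions below wrap the same call):

```
snakemake -j 8 --configfile config_nhm.yml
```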
**Collaborator:** I like this set up and the Snakefile is clear! I did not get a chance to say this earlier, but thanks for setting this up like this.


_S3_
By default, the results of the workflow are stored locally and on S3 in the `drb-gridmet` bucket. If you would like to only store the resulting files locally, change the `use_S3` option in the config file to `False`.

NOTE: Although this workflow was first run on DRB catchments, it should be runnable on any set of polygons in the conterminous US. This, however, has not been tested.

## Run via Singularity
To run the workflow with Singularity, you can either use the image already on Caldera via Tallgrass, or pull the image into a new directory.

### Option A. Use the existing image and cloned repo
1. Move to the correct directory:

```
cd /caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/
```

2. Decide which fabric you want to use. You will use a different config file depending on the fabric: `config_nhm.yml` for the NHM fabric and `config_nhd.yml` for the NHD fabric.
3. [Optional] Edit the options in the config file.
4. Run the workflow.
You can run the workflow on Tallgrass either in batch mode or interactively. Substitute your HPC account (which may not be `iidd`) and desired config file (which may not be `config_nhd.yml`).

You can run the workflow via `sbatch`:
```
sbatch -A iidd slurm/launch_snakemake.slurm config_nhd.yml
```
Alternatively, it may be helpful to run it interactively:

```
salloc -N 1 -n 8 -t 10:00:00 -p cpu -A iidd
module load singularity
singularity exec gridmet-agg_v0.3.sif /opt/conda/bin/snakemake -j --configfile config_nhd.yml
```
**Collaborator:** I think this is a typo, right?

Suggested change:

```
- singularity exec-agg_v0.3.sif /opt/conda/bin/snakemake -j --configfile config_nhd.yml
+ singularity exec gridmet-agg_v0.3.sif /opt/conda/bin/snakemake -j --configfile config_nhd.yml
```

**Collaborator** (on lines +45 to +47): When I run:

```
salloc -N 1 -n 8 -t 10:00:00 -p cpu -A iidd
module load singularity
singularity exec gridmet-agg_v0.3.sif /opt/conda/bin/snakemake -j --configfile config_nhd.yml
```

from `pump/gridmet/drb_gridmet_tools/` (following Option A), I get a permission error:

```
Permission denied: '/caldera/projects/usgs/water/impd/pump/gridmet/drb_gridmet_tools/.snakemake/log/2022-09-20T203254.660009.snakemake.log'
```

**Member:** There isn't write access to all the files / directories (see below), so everyone but Jeff will get this error. We'll have to ask the HPC folks to give group permissions in this folder because I'm pretty sure Jeff doesn't have access to USGS HPC anymore.

```
[jzwart@tg-login1 drb_gridmet_tools] ls -la
total 995356
drwxr-sr-x 11 jsadler pump       4096 Jun 17 12:24 .
drwxr-sr-x  3 jsadler pump       4096 Jan 28  2022 ..
drwxr-sr-x  3 jsadler pump       4096 Jun 17 12:22 archive
-rw-r--r--  1 jsadler pump        509 Jun 15 13:43 config_nhd.yml
-rw-r--r--  1 jsadler pump        536 Jun 15 16:04 config_nhm.yml
drwxr-sr-x  4 jsadler pump       4096 Jun 14 11:09 drb-gridmet
-rw-r--r--  1 jsadler pump        536 Apr  8 11:42 environment.yml
drwxr-sr-x  8 jsadler pump       4096 Jun 17 12:24 .git
-rw-r--r--  1 jsadler pump        123 Jan 28  2022 .gitignore
-rw-r--r--  1 jsadler pump       3079 Jun 15 12:34 gridmet_aggregation_PRMS.py
-rwxrwxr-x  1 jsadler pump 1019154432 Jun 15 13:40 gridmet-agg_v0.3.sif
-rw-r--r--  1 jsadler pump       7329 Jun 14 15:51 gridmet_split_script.py
drwxr-sr-x  2 jsadler pump       4096 Apr 21 15:57 .ipynb_checkpoints
drwxr-sr-x  2 jsadler pump       4096 Jun 15 12:34 __pycache__
-rw-r--r--  1 jsadler pump       4081 Jun 15 16:06 README.md
drwxr-sr-x  5 jsadler pump       4096 Jun 15 13:46 scratch
drwxr-sr-x  2 jsadler pump       4096 Jun 17 12:23 slurm
drwxr-sr-x  2 jsadler pump       4096 Jun 17 12:23 slurm_out
-rw-r--r--  1 jsadler pump       6400 Jun 15 16:02 Snakefile
drwxr-sr-x 11 jsadler pump       4096 Jun  9 16:57 .snakemake
```

**Member:** I'll ask in the #tallgrass-help channel.

**Collaborator:** That makes sense 👍

**Collaborator:** Did we get any response on this?

**Member:** Yes (see here), they changed who owns the files and permissions should be updated.

```
[jzwart@tg-login2 drb_gridmet_tools] ls -la
total 995356
drwxrwsr-x 11 jzwart pump       4096 Jun 17 12:24 .
drwxrwsr-x  3 jzwart pump       4096 Jan 28  2022 ..
drwxrwsr-x  3 jzwart pump       4096 Jun 17 12:22 archive
-rw-rwxr--  1 jzwart pump        509 Jun 15 13:43 config_nhd.yml
-rw-rwxr--  1 jzwart pump        536 Jun 15 16:04 config_nhm.yml
drwxrwsr-x  4 jzwart pump       4096 Jun 14 11:09 drb-gridmet
-rw-rwxr--  1 jzwart pump        536 Apr  8 11:42 environment.yml
drwxrwsr-x  8 jzwart pump       4096 Jun 17 12:24 .git
-rw-rwxr--  1 jzwart pump        123 Jan 28  2022 .gitignore
-rw-rwxr--  1 jzwart pump       3079 Jun 15 12:34 gridmet_aggregation_PRMS.py
-rwxrwxr-x  1 jzwart pump 1019154432 Jun 15 13:40 gridmet-agg_v0.3.sif
-rw-rwxr--  1 jzwart pump       7329 Jun 14 15:51 gridmet_split_script.py
drwxrwsr-x  2 jzwart pump       4096 Apr 21 15:57 .ipynb_checkpoints
drwxrwsr-x  2 jzwart pump       4096 Jun 15 12:34 __pycache__
-rw-rwxr--  1 jzwart pump       4081 Jun 15 16:06 README.md
drwxrwsr-x  5 jzwart pump       4096 Jun 15 13:46 scratch
drwxrwsr-x  2 jzwart pump       4096 Jun 17 12:23 slurm
drwxrwsr-x  2 jzwart pump       4096 Jun 17 12:23 slurm_out
-rw-rwxr--  1 jzwart pump       6400 Jun 15 16:02 Snakefile
drwxrwsr-x 11 jzwart pump       4096 Jun  9 16:57 .snakemake
```

### Option B. Executing in a different directory
1. Clone the repo and move to the `drb_gridmet_tools` directory
```
git clone git@github.com:USGS-R/drb_gridmet_tools.git
cd drb_gridmet_tools
```
2. Pull down the Docker image into a Singularity image file:
```
singularity pull docker://jsadler2/gridmet-agg:v0.3
```
3. Do Steps 2-4 from Option A.

**Collaborator** (on lines +53 to +61): Ran this in my own user dir on Tallgrass (as well as in a different project dir) in `impd/pump/`, and in both cases got the following errors:

```
ERROR 1: PROJ: proj_create_from_database: Open of /opt/conda/share/proj failed
Building DAG of jobs...
```

and

```
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```

**Member:** I think this refers to the S3 credentials. Could you try re-running after setting `use_S3: False` in the `config_XXX.yml`?

**Collaborator:** Sure. In that case, should I run it locally?

**Member:** I think that'd help to run locally.


# Selected gridMET variables:

tmmx:
* Description: Daily Maximum Temperature (2m)\
@@ -38,8 +95,4 @@ sph:
* Units: kg/kg


## Running re-gridding for the Delaware River Basin

`gridmet_split_script.py` processes the gridmet raster dataset values to the scale of the input multi-polygon shapefile.

`gridmet_aggregation_PRMS.py` processes the output of `gridmet_split_script.py` and aggregates to the PRMS_segid scale, calculating an area-weighted average.
167 changes: 167 additions & 0 deletions Snakefile
@@ -0,0 +1,167 @@
import os
import geopandas as gpd
import xarray as xr
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
from gridmet_split_script import get_gridmet_datasets, create_weightmap, g2shp_regridding
from gridmet_aggregation_PRMS import ncdf_to_gdf, gridmet_prms_area_avg_agg
import requests
from datetime import datetime

todays_date = datetime.today().strftime('%Y_%m_%d')

final_files = []

nc_file_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.nc"
final_nc_file_path = nc_file_path.format(fabric_id=config['fabric_id'],
                                         todays_date=todays_date,
                                         run_prefix=config['run_prefix'])

seg_file_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}_segments.csv"
final_seg_file_path = seg_file_path.format(fabric_id=config['fabric_id'],
                                           todays_date=todays_date,
                                           run_prefix=config['run_prefix'])

zarr_path = "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.zarr.ind"
final_zarr_path = zarr_path.format(fabric_id=config['fabric_id'],
                                   todays_date=todays_date,
                                   run_prefix=config['run_prefix'])


if config['fabric_id'] == 'nhm':
    final_files.append(final_seg_file_path)

if config['make_zarr']:
    final_files.append(final_zarr_path)

if config['use_S3']:
    S3 = S3RemoteProvider(keep_local=True)
    final_files = [S3.remote(f) for f in final_files]
    nc_file_path = S3.remote(nc_file_path)
    seg_file_path = S3.remote(seg_file_path)



rule all:
    input:
        final_files


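# Build the pixel-to-polygon weight map for the chosen fabric. It is computed
# once and reused by every per-variable regridding job below.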
rule make_weight_map:
    output:
        "drb-gridmet/{fabric_id}/grd2shp_weights.pickle"
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        # getting just one date and one variable to make the weight map.
        # the same weight map applies to all dates and all variables
        data_dict = get_gridmet_datasets(variable="tmmn",
                                         start_date="2001-01-01",
                                         end_date="2001-01-02",
                                         polygon_for_bbox=gdf)
        create_weightmap(xarray_dict=data_dict,
                         polygon=gdf,
                         output_data_folder=os.path.split(output[0])[0])


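# Regrid one gridMET variable onto the catchment polygons using the precomputed
# weight map; Snakemake runs one instance of this rule per variable.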
rule aggregate_gridmet_to_polygons_one_var:
    input:
        "drb-gridmet/{fabric_id}/grd2shp_weights.pickle"
    output:
        "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_var_{variable}_climate_{todays_date}.nc"
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        data_dict = get_gridmet_datasets(variable=wildcards.variable,
                                         start_date=config.get('start_date', "1979-01-01"),
                                         end_date=config.get('end_date', todays_date.replace("_", "-")),
                                         polygon_for_bbox=gdf)
        g2shp_regridding(xarray_dict=data_dict,
                         polygon=gdf,
                         weightmap_file=input[0],
                         g2s_file_prefix=f'{wildcards.run_prefix}_var_{wildcards.variable}_',
                         output_data_folder=os.path.split(output[0])[0],
                         g2s_time_var='day',
                         g2s_lat_var='lat',
                         g2s_lon_var='lon')


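# Merge the per-variable netCDF files into a single dataset and re-index it by
# the fabric's id column (e.g. PRMS_segid or COMID) instead of the generic geomid.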
rule gather_gridmets:
    input:
        expand("drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_var_{variable}_climate_{todays_date}.nc",
               fabric_id=config['fabric_id'],
               todays_date=todays_date,
               run_prefix=config['run_prefix'],
               variable=config['data_vars'])
    output:
        nc_file_path
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        ds_list = [xr.open_dataset(nc_file) for nc_file in input]
        ds_combined = xr.merge(ds_list)
        ds_combined = ds_combined.assign_coords(
            {config["id_col"]: ("geomid", gdf[config["id_col"]])}
        ).swap_dims({"geomid": config["id_col"]})
        ds_combined = ds_combined.drop("geomid")
        ds_combined.to_netcdf(output[0])


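# Aggregate the polygon-scale climate values up to PRMS flowline segments with
# an area-weighted average; this output is only requested for the NHM fabric.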
rule aggregate_gridmet_polygons_to_flowlines:
    input:
        "drb-gridmet/{fabric_id}/{todays_date}/{run_prefix}_climate_{todays_date}.nc"
    output:
        seg_file_path
    run:
        gdf = gpd.read_file(config['catchment_file_path'])
        gridmet_drb_gdf = ncdf_to_gdf(ncdf_path=input[0],
                                      shp=gdf,
                                      left_on=config["id_col"],
                                      right_on=config["id_col"])
        df_agg = gridmet_prms_area_avg_agg(gridmet_drb_gdf,
                                           groupby_cols=['PRMS_segid', "time"],
                                           val_colnames=config['data_vars'],
                                           wgt_col='hru_area_m2',
                                           output_path=output[0])

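# Rechunk the combined netCDF and write it to a zarr store in /tmp scratch space.
# This is a checkpoint because the set of zarr member files is only known at runtime.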
checkpoint write_zarr:
    input:
        "drb-gridmet/{filename}.nc"
    output:
        directory("/tmp/{filename}.zarr")
    run:
        ds = xr.open_dataset(input[0])
        ds = ds.chunk({"time": len(ds.time), config["id_col"]: 100})
        ds.to_zarr(output[0])


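# Enumerate the member files of the scratch zarr store, rewriting their /tmp
# prefixes to drb-gridmet so each file becomes a copy target.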
def get_zarr_files(wildcards):
    zarr_files = []
    for path, currentDirectory, files in os.walk(f"/tmp/{wildcards.filename}.zarr"):
        for file in files:
            scratch_path = os.path.join(path, file)
            drb_path = scratch_path.replace("/tmp", "drb-gridmet")
            zarr_files.append(drb_path)
    return zarr_files


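# Touch an indicator file once every zarr member file has been copied.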
rule write_zarr_ind:
    input:
        "/tmp/{filename}.zarr",
        get_zarr_files
    output:
        "drb-gridmet/{filename}.zarr.ind"
    shell:
        "touch {output[0]}"

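# Copy each zarr member file out of scratch: to S3 when use_S3 is set,
# otherwise into the local drb-gridmet directory.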
if config['use_S3']:
    rule copy_from_scratch_to_s3:
        input:
            "/tmp/{filename}.zarr/{zarr_file}"
        output:
            S3.remote("drb-gridmet/{filename}.zarr/{zarr_file}")
        shell:
            "cp {input[0]} {output[0]}"

else:
    rule copy_from_scratch_to_s3:
        input:
            "/tmp/{filename}.zarr/{zarr_file}"
        output:
            "drb-gridmet/{filename}.zarr/{zarr_file}"
        shell:
            "cp {input[0]} {output[0]}"

23 changes: 23 additions & 0 deletions config_nhd.yml
@@ -0,0 +1,23 @@
# probably don't change these
catchment_file_path: "https://github.com/USGS-R/drb-network-prep/blob/main/1_fetch/out/NHDPlusv2_catchments.gpkg?raw=true"

fabric_id: 'nhd'


id_col: "COMID"

# feel free to change these
data_vars: ['tmmx', 'tmmn', 'pr', 'srad', 'vs','rmax','rmin','sph']

use_S3: True

make_zarr: True

run_prefix: "drb"

# if not specified, data will be processed from 1979-01-01
#start_date: "2022-05-09"

# if not specified, data will be processed to today's date
#end_date: "2022-06-13"

21 changes: 21 additions & 0 deletions config_nhm.yml
@@ -0,0 +1,21 @@
# probably don't change these
catchment_file_path: "https://github.com/USGS-R/drb-network-prep/blob/940073e8d77c911b6fb9dc4e3657aeab1162a158/2_process/out/GFv1_catchments_edited.gpkg?raw=true"

fabric_id: 'nhm'

id_col: "PRMS_segid"

# feel free to change these
data_vars: ['tmmx', 'tmmn', 'pr', 'srad', 'vs','rmax','rmin','sph']

use_S3: True

make_zarr: True

run_prefix: "drb"

# leave blank if you want to run from 1979-01-01
#start_date: "2022-05-09"

# leave blank if you want to run through current date
#end_date: "2022-06-13"
Binary file removed data/.DS_Store
Binary file removed data/GFv1_catchments_edited.gpkg.zip
Binary file removed data/PRMS_catchments_4326.zip
Binary file removed data/nhru_01.zip
Binary file removed data/nhru_02.zip
28 changes: 2 additions & 26 deletions gridmet_aggregation_PRMS.py
@@ -5,7 +5,7 @@
import time

# ncdf_to_gdf() converts the ncdf to a dataset and merges it with shapefile information (geometry + area)
def ncdf_to_gdf(ncdf_path, shp, left_on = 'geomid', right_on_index = True, gpkg_layer = None):
def ncdf_to_gdf(ncdf_path, shp, left_on, right_on, gpkg_layer = None):

"""
:param str ncdf_path: path to regridded ncdf file (output of g2shp_regridding())
@@ -27,7 +27,7 @@ def ncdf_to_gdf(ncdf_path, shp, left_on = 'geomid', right_on_index = True, gpkg_
print('shp must be path to geospatial file or a geodataframe')

## Merge ncdf w/ shapefile (the shpfile has area info) & convert to GeoDataFrame
gridmet_drb_df = xr_mapped_df.merge(gdf, how ='left', left_on = left_on, right_index = right_on_index)
gridmet_drb_df = xr_mapped_df.merge(gdf, how ='left', left_on = left_on, right_on = right_on)
gridmet_drb_gdf = gpd.GeoDataFrame(gridmet_drb_df)

return gridmet_drb_gdf
@@ -76,27 +76,3 @@ def gridmet_prms_area_avg_agg(df, groupby_cols, val_colnames, wgt_col, output_pa

return df_final

# Define variables and run
if __name__ =='__main__':

## Variable definitions
gdf_prms_path_edited = 'https://github.com/USGS-R/drb-network-prep/blob/940073e8d77c911b6fb9dc4e3657aeab1162a158/2_process/out/GFv1_catchments_edited.gpkg?raw=true'
gdf = gpd.read_file(gdf_prms_path_edited, layer='GFv1_catchments_edited')
gridmet_ncdf = './data/t_climate_2022_03_31.nc'
data_vars_shrt_all = ['tmmx', 'tmmn', 'pr', 'srad', 'vs', 'rmax', 'rmin', 'sph']

## Create dataframe and merge with shapefile information
gridmet_drb_gdf = ncdf_to_gdf(ncdf_path=gridmet_ncdf,
shp = gdf,
left_on = 'geomid',
right_on_index = True)

## run aggregation on PRMS_segid and time
df_agg = gridmet_prms_area_avg_agg(gridmet_drb_gdf,
groupby_cols = ['PRMS_segid',"time"],
val_colnames = data_vars_shrt_all,
wgt_col='hru_area_m2',
output_path= None)

## Uncomment to run
# df_agg.reset_index().to_csv('../drb-inland-salinity-ml/1_fetch/in/grdmet_drb_agg_032321.csv', index = False)