Code to produce spatial aggregations of PM2.5 component estimates as generated by the Atmospheric Composition Analysis Group. The spatial aggregation is performed for satellite PM2.5 components, from grid/raster (NetCDF) to polygons (shp).
This pipeline processes 8 PM2.5 components: NO3, SO4, Sea Salt (SS), NH4, Dust, Black Carbon (BC), Organic Matter (OM), and Organic Matter with H2O (OM_H2O).
The Atmospheric Composition Analysis Group uses a combination of satellite images, monitors and simulation to generate estimates of PM2.5 and its chemical components. Estimates are stored in NetCDF files and made publicly available. There are several versions of the estimates.
The version V5.NA.05.02 consists of mean PM2.5 component concentrations (μg/m³) available at:
- Temporal frequency: Annual and monthly
- Grid resolution: High resolution for North America
- Geographic region: North America only
- Components: NO3, SO4, Sea Salt (SS), NH4, Dust, Black Carbon (BC), Organic Matter (OM), and Organic Matter with H2O (OM_H2O)
In this repository, we specifically aggregate the V5.NA.05.02 component files for North America, processing all 8 components simultaneously. The temporal frequency can be modified via configuration parameters.
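To make the raster-to-polygon idea concrete, here is a minimal sketch in plain Python. It assumes the simplest possible rule, an unweighted mean of the grid cells whose centers fall inside the polygon. The repository's actual aggregation method is not described in this document and may differ (for example, by weighting cells by overlap area). The function names `point_in_polygon` and `aggregate_to_polygon` and the toy data are hypothetical.

```python
from statistics import mean


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; the last vertex connects to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def aggregate_to_polygon(lons, lats, grid, polygon):
    """Mean of the grid cells whose centers fall inside the polygon.

    grid[i][j] is the value at (lons[j], lats[i]); None marks a missing cell.
    Returns None when no valid cell center falls inside the polygon.
    """
    values = [
        grid[i][j]
        for i, lat in enumerate(lats)
        for j, lon in enumerate(lons)
        if grid[i][j] is not None and point_in_polygon(lon, lat, polygon)
    ]
    return mean(values) if values else None


# Toy 3x3 grid of concentrations (μg/m³) and a square covering four cell centers.
lons = [0.5, 1.5, 2.5]
lats = [0.5, 1.5, 2.5]
grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(aggregate_to_polygon(lons, lats, grid, square))  # mean of 1, 2, 4, 5
```

Repeating the same call once per component grid would give one aggregated value per polygon and component.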
The file name convention varies by component, for example:
- NO3: V5NA05.02.HybridNO3-NO3.NorthAmerica.yyyyjjj-yyyyjjj.nc
- SO4: V5NA05.02.HybridSO4-SO4.NorthAmerica.yyyyjjj-yyyyjjj.nc
- BC: V5NA05.02.HybridBC-BC.NorthAmerica.yyyyjjj-yyyyjjj.nc
Where yyyy represents the year and jjj represents the Julian day.
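The naming convention can be sketched as a small helper. This is an illustration only: the function name `component_filename` is hypothetical, the token table covers just the three components shown in the examples above, and the assumption that an annual file spans January 1 to December 31 is not confirmed by this document.

```python
from datetime import date

# Only these three tokens appear in the examples above. Tokens for the other
# components are not shown there, so they are deliberately left out.
KNOWN_TOKENS = {"no3": "NO3", "so4": "SO4", "bc": "BC"}


def component_filename(component, start, end):
    """Build a file name following the yyyyjjj-yyyyjjj pattern.

    start and end are datetime.date objects; %j is the zero-padded day of year.
    """
    token = KNOWN_TOKENS[component]
    span = f"{start:%Y%j}-{end:%Y%j}"
    return f"V5NA05.02.Hybrid{token}-{token}.NorthAmerica.{span}.nc"


# Assumed annual span for 2020 (a leap year, so December 31 is day 366).
print(component_filename("no3", date(2020, 1, 1), date(2020, 12, 31)))
```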
Aaron van Donkelaar, Melanie S. Hammer, Liam Bindle, Michael Brauer, Jeffery R. Brook, Michael J. Garay, N. Christina Hsu, Olga V. Kalashnikova, Ralph A. Kahn, Colin Lee, Robert C. Levy, Alexei Lyapustin, Andrew M. Sayer and Randall V. Martin (2021). Monthly Global Estimates of Fine Particulate Matter and Their Uncertainty. Environmental Science & Technology, 2021, doi:10.1021/acs.est.1c05309.
The output parquet files contain PM2.5 component concentrations aggregated to geographic polygons. Each file includes:
- Spatial identifier: county or zcta (geographic unit identifier)
- year: Year of the data
- month: Month of the data (monthly files only)
- Component concentrations (μg/m³):
  - no3: Nitrate component
  - so4: Sulfate component
  - ss: Sea salt component
  - nh4: Ammonium component
  - dust: Dust component
  - bc: Black carbon component
  - om: Organic matter component
  - om_h2o: Organic matter with water component
Output files are in Parquet format for efficient storage and processing.
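As a quick sanity check on an output file, the column layout described above can be written down as a small helper. This is a sketch under assumptions: the identifier column is taken to be named exactly `county` or `zcta`, the column order is illustrative, and the helper names `expected_columns` and `missing_columns` are hypothetical.

```python
COMPONENT_COLUMNS = ["no3", "so4", "ss", "nh4", "dust", "bc", "om", "om_h2o"]


def expected_columns(polygon_name, temporal_freq):
    """Columns an output file is expected to contain (order is illustrative)."""
    if polygon_name not in ("county", "zcta"):
        raise ValueError(f"unknown polygon_name: {polygon_name!r}")
    if temporal_freq not in ("yearly", "monthly"):
        raise ValueError(f"unknown temporal_freq: {temporal_freq!r}")
    # The month column is only present in monthly files.
    time_columns = ["year"] + (["month"] if temporal_freq == "monthly" else [])
    return [polygon_name] + time_columns + COMPONENT_COLUMNS


def missing_columns(actual, polygon_name, temporal_freq):
    """Expected columns that are absent from the actual column list."""
    present = set(actual)
    return [c for c in expected_columns(polygon_name, temporal_freq) if c not in present]


print(expected_columns("zcta", "monthly"))
print(missing_columns(["zcta", "year", "no3"], "zcta", "yearly"))
```

With pandas installed, the `actual` list could come from `list(pandas.read_parquet(path).columns)` for a given output file.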
The configuration structure within the /conf folder allows you to modify the input parameters for the following steps:
- create directory paths: utils/create_dir_paths.py
- download components: src/download_components.py
- download shapefiles: src/download_shapefile.py
- aggregate components: src/aggregate_all_components.py
The main parameters are:
- temporal_freq: Determines whether the original annual (yearly) or monthly data will be aggregated. Options are: yearly and monthly.
- polygon_name: Determines into which polygons the component grids will be aggregated. Options are: zcta and county.
- components: List of PM2.5 components to process. Current components: no3, so4, ss, nh4, dust, bc, om, om_h2o.
- shapefile_year: Years of shapefiles to download for polygon boundaries.
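The allowed values for these parameters can be checked before launching a run. The following is a minimal sketch, assuming the configuration has already been loaded into a flat dictionary keyed by the parameter names; that layout and the function name `validate_config` are assumptions, not the repository's actual code.

```python
VALID_TEMPORAL_FREQ = {"yearly", "monthly"}
VALID_POLYGON_NAMES = {"zcta", "county"}
VALID_COMPONENTS = {"no3", "so4", "ss", "nh4", "dust", "bc", "om", "om_h2o"}


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    errors = []
    if cfg.get("temporal_freq") not in VALID_TEMPORAL_FREQ:
        errors.append(f"temporal_freq must be one of {sorted(VALID_TEMPORAL_FREQ)}")
    if cfg.get("polygon_name") not in VALID_POLYGON_NAMES:
        errors.append(f"polygon_name must be one of {sorted(VALID_POLYGON_NAMES)}")
    components = cfg.get("components") or []
    if not components:
        errors.append("components must list at least one component")
    unknown = sorted(set(components) - VALID_COMPONENTS)
    if unknown:
        errors.append(f"unknown components: {unknown}")
    return errors


good = {"temporal_freq": "yearly", "polygon_name": "zcta", "components": ["no3", "so4"]}
bad = {"temporal_freq": "daily", "polygon_name": "zcta", "components": ["no3", "pm10"]}
print(validate_config(good))
print(validate_config(bad))
```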
The configuration files are:
- conf/snakemake.yaml: Main pipeline configuration
- conf/satellite_component/us_components.yaml: Component-specific URLs and file patterns
- conf/shapefiles/shapefiles.yaml: Shapefile sources and parameters
Clone the repository and create a conda environment.
git clone https://github.com/<user>/<repo>
cd <repo>
conda env create -f requirements.yml
conda activate <env_name>  # environment name as found in requirements.yml
It is also possible to use mamba.
mamba env create -f requirements.yml
mamba activate <env_name>
Create the directory paths by running
python utils/create_dir_paths.py
You can run the pipeline steps manually or run the snakemake pipeline described in the Snakefile.
Run pipeline steps manually
python src/download_shapefile.py polygon_name=zcta shapefile_year=2020
python src/download_components.py component=no3 ++temporal_freq=yearly
export PYTHONPATH=.
python src/aggregate_all_components.py polygon_name=zcta ++temporal_freq=yearly ++year=2020
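The manual steps above show the download for a single component (no3). A hedged sketch of how the full sequence could be generated for every component follows. It only builds and prints the command lines and does not run them. The helper name `manual_commands` is hypothetical, and calling the download script once per component is an assumption based on the `component=no3` argument shown above.

```python
import shlex

COMPONENTS = ["no3", "so4", "ss", "nh4", "dust", "bc", "om", "om_h2o"]


def manual_commands(polygon_name, temporal_freq, year, shapefile_year, components=COMPONENTS):
    """Build the argument lists for the manual steps, in execution order."""
    commands = [
        ["python", "src/download_shapefile.py",
         f"polygon_name={polygon_name}", f"shapefile_year={shapefile_year}"],
    ]
    # Assumption: one download call per component.
    for component in components:
        commands.append(
            ["python", "src/download_components.py",
             f"component={component}", f"++temporal_freq={temporal_freq}"]
        )
    commands.append(
        ["python", "src/aggregate_all_components.py",
         f"polygon_name={polygon_name}", f"++temporal_freq={temporal_freq}", f"++year={year}"]
    )
    return commands


for cmd in manual_commands("zcta", "yearly", 2020, 2020):
    print(shlex.join(cmd))
```

To execute the commands rather than print them, each list could be passed to `subprocess.run(cmd, check=True, env={**os.environ, "PYTHONPATH": "."})`, which mirrors the `export PYTHONPATH=.` line above.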
Run the snakemake pipeline
The pipeline processes all configured components simultaneously:
snakemake --cores 4
For SLURM environments, use the provided batch script:
sbatch snakefile.sbatch
Modify the configuration in conf/snakemake.yaml to change polygon_name, temporal_freq, and components as needed.
Note: The Docker configuration may need updates to reflect the new component-based pipeline.
Create the folder where you would like to store the output dataset.
mkdir <path>/satellite_pm25_components_raster2polygon
docker pull nsaph/satellite_pm25_components_raster2polygon
docker run -v <path>:/app/data/input/pm25_components__randall/yearly -v <path>/satellite_pm25_components_raster2polygon/:/app/data/output/pm25_components__randall nsaph/satellite_pm25_components_raster2polygon
If you also want to store the raw input and intermediate data, run
docker run -v <path>/satellite_pm25_components_raster2polygon/:/app/data/ nsaph/satellite_pm25_components_raster2polygon
If you want to build your own image, use
docker build -t <image_name> .
For a multi-platform build, use
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image_name>:<tag> . --push