# Input Dataset Ingestion

This guide covers how to ingest and process input datasets for the OCR (Open Climate Risk) project using the unified CLI infrastructure.

## Overview

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

## Quick Start

### Discovery

List all available datasets:

```bash
pixi run ocr ingest-data list-datasets
```

### Processing

Process a dataset (always do a dry run first to preview the operations):

```bash
# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled
```

### Dataset-Specific Options

Different datasets support different processing options:

```bash
# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon
```

## Available Datasets

### Tensor Datasets (Raster/Icechunk)

#### scott-et-al-2024

**USFS Wildfire Risk to Communities (2nd Edition)**

- **RDS ID**: RDS-2020-0016-02
- **Version**: 2024-V2
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2020-0016-2)
- **Resolution**: 30m (EPSG:4326), native 270m (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4, FLEP8, RPS (Risk to Potential Structures), WHP (Wildfire Hazard Potential)

**Pipeline**:

1. Download 8 TIFF files from USFS Box (one per variable)
2. Merge TIFFs into Icechunk store (EPSG:5070, native resolution)
3. Reproject to EPSG:4326 at 30m resolution
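
The reprojection step targets a 30 m grid in a geographic CRS, where cell size is expressed in degrees. As a rough illustration only — the 111,320 m-per-degree figure is an equatorial approximation assumed here, not a value taken from the project code:

```python
# Rough metric-resolution -> degree-spacing conversion for an EPSG:4326
# grid. 111_320 m per degree is an equatorial approximation (assumed);
# the actual pipeline may derive its target grid differently.
METERS_PER_DEGREE = 111_320.0

def meters_to_degrees(resolution_m: float) -> float:
    """Approximate degree spacing for a metric resolution at the equator."""
    return resolution_m / METERS_PER_DEGREE

print(round(meters_to_degrees(30.0), 7))  # ~0.0002695 degrees
```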

**Usage**:

```bash
# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled
```

**Outputs**:

- Raw TIFFs: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/`
- Native Icechunk: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/`
- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/`

---

#### riley-et-al-2025

**USFS Probabilistic Wildfire Risk - 2011 & 2047 Climate Runs**

- **RDS ID**: RDS-2025-0006
- **Version**: 2025
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2025-0006)
- **Resolution**: 30m (EPSG:4326), native 270m (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: Multiple climate scenarios (2011 baseline, 2047 projections)

**Pipeline**:

1. Download TIFF files for both time periods
2. Process and merge into Icechunk stores
3. Reproject to EPSG:4326 at 30m resolution

**Usage**:

```bash
pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled
```

**Outputs**:

- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/`

---

#### dillon-et-al-2023

**USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)**

- **RDS ID**: RDS-2016-0034-3
- **Version**: 2023
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2016-0034-3)
- **Resolution**: 30m (EPSG:4326), native 270m (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: BP, FLP1-6 (Flame Length Probability levels)

**Pipeline**:

1. Download ZIP archive and extract TIFFs
2. Upload TIFFs to S3 and merge into Icechunk
3. Reproject to EPSG:4326 at 30m resolution
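
The extract step can be sketched with the standard library; the member names below are illustrative, not the archive's actual contents:

```python
# Minimal sketch of step 1: pull only the .tif members out of a downloaded
# ZIP archive. Filenames and destination paths are illustrative.
import io
import zipfile
from pathlib import Path

def extract_tiffs(archive_bytes: bytes, dest: Path) -> list[Path]:
    """Extract .tif members from a ZIP archive into dest, skipping the rest."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".tif"):
                zf.extract(name, dest)
                extracted.append(dest / name)
    return extracted
```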

**Usage**:

```bash
pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled
```

**Outputs**:

- Raw TIFFs: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/`
- Native Icechunk: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/`
- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/`

---

### Vector Datasets (GeoParquet)

#### overture-maps

**Overture Maps Building and Address Data for CONUS**

- **Release**: 2025-09-24.0
- **Source**: [Overture Maps Foundation](https://overturemaps.org)
- **Format**: GeoParquet (WKB geometry, zstd compression)
- **Coverage**: CONUS (spatially filtered from global dataset)
- **Data Types**: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

**Pipeline**:

1. Query Overture S3 bucket directly (no download step)
2. Filter by CONUS bounding box using DuckDB
3. Write subsetted data to carbonplan-ocr S3 bucket
4. (Buildings only) Perform a spatial join with US Census blocks to add geographic identifiers
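
The bounding-box filter in step 2 might look something like the query sketched below. The `bbox` struct columns follow Overture's published GeoParquet schema, but the exact query shape and the CONUS bounds used here are assumptions, not the project's actual code:

```python
# Hypothetical sketch of the DuckDB query that subsets a global Overture
# theme to CONUS. The bbox values and query layout are assumptions.
CONUS_BBOX = (-125.0, 24.0, -66.0, 50.0)  # (xmin, ymin, xmax, ymax), assumed

def conus_filter_sql(source: str) -> str:
    """Build a DuckDB query that keeps only rows whose bbox falls in CONUS."""
    xmin, ymin, xmax, ymax = CONUS_BBOX
    return (
        f"SELECT * FROM read_parquet('{source}') "
        f"WHERE bbox.xmin >= {xmin} AND bbox.xmax <= {xmax} "
        f"AND bbox.ymin >= {ymin} AND bbox.ymax <= {ymax}"
    )
```

Filtering on the precomputed `bbox` struct avoids decoding WKB geometry for rows that are obviously outside the region.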

**Region-Tagged Buildings Processing**:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

- Loads census FIPS lookup table for state/county names
- Creates spatial indexes on buildings and census blocks
- Performs bbox-filtered spatial join using `ST_Intersects`
- Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
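
A bbox-prefiltered `ST_Intersects` join of that shape could be sketched as follows; the table names, `GEOID20` column, and bounding-box columns are illustrative, not the project's actual query:

```python
# Hedged sketch of the spatial join described above: a cheap bounding-box
# overlap test first, then the exact ST_Intersects predicate (DuckDB
# spatial extension). All identifiers here are hypothetical.
def region_tag_sql() -> str:
    """Build the building-to-census-block join query."""
    return """
    SELECT b.*, c.GEOID20 AS block_geoid
    FROM buildings AS b
    JOIN blocks AS c
      ON b.bbox.xmin <= c.xmax AND b.bbox.xmax >= c.xmin
     AND b.bbox.ymin <= c.ymax AND b.bbox.ymax >= c.ymin
     AND ST_Intersects(b.geometry, c.geometry)
    """
```

The bbox conditions let the engine discard most candidate pairs before running the comparatively expensive geometry intersection test.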

**Usage**:

```bash
# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled
```

**Outputs**:

- Buildings: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet`
- Addresses: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet`
- Region-Tagged Buildings: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet`

---

#### census-tiger

**US Census TIGER/Line Geographic Boundaries**

- **Vintage**: 2024 (tracts/counties), 2025 (blocks)
- **Source**: [US Census Bureau TIGER/Line](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)
- **Format**: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
- **Coverage**: CONUS + DC (the 48 conterminous states plus the District of Columbia; excludes Alaska & Hawaii)
- **Geography Types**: Blocks, Tracts, Counties

**Pipeline**:

1. Download TIGER/Line shapefiles from Census Bureau (per-state for blocks/tracts)
2. Convert to GeoParquet with spatial metadata
3. Aggregate tract files using DuckDB
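
The per-state download in step 1 can be illustrated with a URL builder. The pattern follows the Census Bureau's published TIGER/Line directory layout; treat it as an assumption rather than the project's actual download logic, and note the FIPS table here is only a subset:

```python
# Illustrative sketch of per-state TIGER/Line tract URLs. The URL pattern
# mirrors the Census Bureau's public layout; the FIPS table is a subset.
TIGER_BASE = "https://www2.census.gov/geo/tiger"

STATE_FIPS = {"California": "06", "Oregon": "41", "Washington": "53"}

def tract_url(state: str, vintage: int = 2024) -> str:
    """Build the TIGER/Line tract shapefile URL for one state."""
    fips = STATE_FIPS[state]
    return f"{TIGER_BASE}/TIGER{vintage}/TRACT/tl_{vintage}_{fips}_tract.zip"
```

This is also where a `--census-subset-states` filter would naturally apply: only the requested states' FIPS codes get turned into download URLs.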

**Usage**:

```bash
# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run
```

**Outputs**:

- Blocks: `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet`
- Tracts (per-state): `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet`
- Tracts (aggregated): `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet`
- Counties: `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet`

## CLI Reference

### Commands

- **`list-datasets`**: Show all available datasets
- **`download <dataset>`**: Download raw source data (tensor datasets only)
- **`process <dataset>`**: Process and upload to S3/Icechunk
- **`run-all <dataset>`**: Complete pipeline (download + process + cleanup)

### Global Options

- **`--dry-run`**: Preview operations without executing (recommended before any real run)
- **`--debug`**: Enable debug logging for troubleshooting

### Tensor Dataset Options

- **`--use-coiled`**: Use Coiled for distributed processing (USFS datasets)

### Vector Dataset Options

#### Overture Maps

- **`--overture-data-type <type>`**: Which data to process
  - `buildings`: Only building geometries
  - `addresses`: Only address points
  - `both`: Both datasets (default)

#### Census TIGER

- **`--census-geography-type <type>`**: Which geography to process
  - `blocks`: Census blocks
  - `tracts`: Census tracts (per-state + aggregated)
  - `counties`: County boundaries
  - `all`: All three types (default)
- **`--census-subset-states <state> [<state> ...]`**: Process only specific states
  - Repeat option for each state: `--census-subset-states California --census-subset-states Oregon`
  - Use full state names (case-sensitive): `California`, `Oregon`, `Washington`, etc.

## Configuration

### Environment Variables

All settings can be overridden via environment variables:

```bash
# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp
```

### Configuration Class

The `InputDatasetConfig` class (Pydantic model) provides:

- Type validation for all settings
- Automatic environment variable loading (prefix: `OCR_INPUT_DATASET_`)
- Default values for all options
- Case-insensitive environment variable names
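
The lookup rule can be illustrated with a stdlib-only analogue. The real class is a Pydantic model; this sketch only mimics the prefix matching and case-insensitivity described above:

```python
# Stdlib analogue of prefixed, case-insensitive settings loading:
# OCR_INPUT_DATASET_S3_BUCKET=x  ->  {"s3_bucket": "x"}.
# This is an illustration of the behaviour, not the project's implementation.
PREFIX = "OCR_INPUT_DATASET_"

def load_settings(environ: dict[str, str]) -> dict[str, str]:
    """Collect prefixed variables into a lowercase settings dict."""
    settings = {}
    for key, value in environ.items():
        upper = key.upper()
        if upper.startswith(PREFIX):
            settings[upper.removeprefix(PREFIX).lower()] = value
    return settings
```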

## Troubleshooting

### Dry Run First

Always test with `--dry-run` before executing:

```bash
pixi run ocr ingest-data run-all <dataset> --dry-run
```

This previews all operations without making changes.