Commit da347e2

migrate vector input datasets to unified ingestion and remove unused datasets (#297)

1 parent 9f2c2e3

File tree

19 files changed (+1431, -782 lines)

.github/workflows/deploy.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -227,6 +227,7 @@ jobs:
     --region-id y8_x3 --region-id y8_x4 --region-id y8_x5 --region-id y8_x6
     --region-id y7_x4 --region-id y7_x5 --region-id y7_x6 --region-id y7_x7 --region-id y7_x8
     --region-id y6_x5 --region-id y6_x6 --region-id y6_x7 --region-id y6_x8 --region-id y6_x9
+    --region-id y5_x7 --region-id y5_x8 --region-id y5_x9
     --region-id y9_x14 --region-id y9_x15 --region-id y9_x16 --region-id y9_x17
     --region-id y8_x14 --region-id y8_x15 --region-id y8_x16 --region-id y8_x17
     --region-id y7_x31 --region-id y7_x32
```
Lines changed: 322 additions & 0 deletions (new file)
# Input Dataset Ingestion

This guide covers how to ingest and process input datasets for the OCR (Open Climate Risk) project using the unified CLI infrastructure.

## Overview

The input dataset infrastructure provides a consistent interface for ingesting both tensor (raster/Icechunk) and vector (GeoParquet) datasets.

## Quick Start

### Discovery

List all available datasets:

```bash
pixi run ocr ingest-data list-datasets
```

### Processing

Process a dataset (always dry-run first to preview):

```bash
# Preview operations (recommended first step)
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run

# Execute the full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled
```
### Dataset-Specific Options

Different datasets support different processing options:

```bash
# Vector datasets: Overture Maps - select data type
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Vector datasets: Census TIGER - select geography and states
pixi run ocr ingest-data process census-tiger \
  --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon
```

## Available Datasets

### Tensor Datasets (Raster/Icechunk)

#### scott-et-al-2024

**USFS Wildfire Risk to Communities (2nd Edition)**

- **RDS ID**: RDS-2020-0016-02
- **Version**: 2024-V2
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2020-0016-2)
- **Resolution**: 30m reprojected (EPSG:4326); 270m native (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: BP (Burn Probability), CRPS (Conditional Risk to Potential Structures), CFL (Conditional Flame Length), Exposure, FLEP4, FLEP8, RPS (Relative Proportion Spread), WHP (Wildfire Hazard Potential)

**Pipeline**:

1. Download 8 TIFF files from USFS Box (one per variable)
2. Merge TIFFs into an Icechunk store (EPSG:5070, native resolution)
3. Reproject to EPSG:4326 at 30m resolution

**Usage**:

```bash
# Full pipeline
pixi run ocr ingest-data run-all scott-et-al-2024 --dry-run
pixi run ocr ingest-data run-all scott-et-al-2024 --use-coiled

# Individual steps
pixi run ocr ingest-data download scott-et-al-2024
pixi run ocr ingest-data process scott-et-al-2024 --use-coiled
```

**Outputs**:

- Raw TIFFs: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02/input_tif/`
- Native Icechunk: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/RDS-2020-0016-02_all_vars_merge_icechunk/`
- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/scott-et-al-2024-30m-4326.icechunk/`

---
#### riley-et-al-2025

**USFS Probabilistic Wildfire Risk - 2011 & 2047 Climate Runs**

- **RDS ID**: RDS-2025-0006
- **Version**: 2025
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2025-0006)
- **Resolution**: 30m reprojected (EPSG:4326); 270m native (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: Multiple climate scenarios (2011 baseline, 2047 projections)

**Pipeline**:

1. Download TIFF files for both time periods
2. Process and merge into Icechunk stores
3. Reproject to EPSG:4326 at 30m resolution

**Usage**:

```bash
pixi run ocr ingest-data run-all riley-et-al-2025 --use-coiled
```

**Outputs**:

- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/riley-et-al-2025-30m-4326.icechunk/`

---
#### dillon-et-al-2023

**USFS Spatial Datasets of Probabilistic Wildfire Risk Components (270m, 3rd Edition)**

- **RDS ID**: RDS-2016-0034-3
- **Version**: 2023
- **Source**: [USFS Research Data Archive](https://www.fs.usda.gov/rds/archive/catalog/RDS-2016-0034-3)
- **Resolution**: 30m reprojected (EPSG:4326); 270m native (EPSG:5070)
- **Coverage**: CONUS
- **Variables**: BP, FLP1-6 (Flame Length Probability levels)

**Pipeline**:

1. Download ZIP archive and extract TIFFs
2. Upload TIFFs to S3 and merge into Icechunk
3. Reproject to EPSG:4326 at 30m resolution

**Usage**:

```bash
pixi run ocr ingest-data run-all dillon-et-al-2023 --use-coiled
```

**Outputs**:

- Raw TIFFs: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/raw-input-tiffs/`
- Native Icechunk: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-270m-5070.icechunk/`
- Reprojected: `s3://carbonplan-ocr/input/fire-risk/tensor/USFS/dillon-et-al-2023/processed-30m-4326.icechunk/`

---
### Vector Datasets (GeoParquet)

#### overture-maps

**Overture Maps Building and Address Data for CONUS**

- **Release**: 2025-09-24.0
- **Source**: [Overture Maps Foundation](https://overturemaps.org)
- **Format**: GeoParquet (WKB geometry, zstd compression)
- **Coverage**: CONUS (spatially filtered from global dataset)
- **Data Types**: Buildings (bbox + geometry), Addresses (full attributes), Region-Tagged Buildings (buildings + census identifiers)

**Pipeline**:

1. Query the Overture S3 bucket directly (no download step)
2. Filter by the CONUS bounding box using DuckDB
3. Write the subsetted data to the carbonplan-ocr S3 bucket
4. (Buildings only) Perform a spatial join with US Census blocks to add geographic identifiers
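The bounding-box filter in step 2 amounts to a simple coordinate predicate. A minimal sketch of the idea, using an approximate CONUS box (the pipeline's exact extent and its DuckDB query are not shown here, so treat these numbers as illustrative):

```python
# Approximate CONUS bounding box (lon/lat, WGS84); the actual filter
# runs inside DuckDB and may use a different extent.
CONUS_BBOX = (-125.0, 24.5, -66.9, 49.5)  # (min_lon, min_lat, max_lon, max_lat)

def in_conus(lon: float, lat: float, bbox=CONUS_BBOX) -> bool:
    """Return True if a point falls inside the (approximate) CONUS box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

print(in_conus(-104.99, 39.74))  # Denver -> True
print(in_conus(-149.90, 61.22))  # Anchorage (Alaska, excluded) -> False
```

In the real pipeline this predicate is pushed down into the DuckDB query so only CONUS rows are ever materialized.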
**Region-Tagged Buildings Processing**:

When buildings are processed, an additional dataset is automatically created that tags each building with census geographic identifiers:

- Loads the census FIPS lookup table for state/county names
- Creates spatial indexes on buildings and census blocks
- Performs a bbox-filtered spatial join using `ST_Intersects`
- Adds identifiers at multiple administrative levels: state, county, tract, block group, and block
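The bbox-filtered join above follows a common pattern: compare cheap bounding boxes first, then run the exact geometry test (`ST_Intersects` in DuckDB) only on surviving candidates. A pure-Python sketch of that idea, with rectangles standing in for building and census-block geometries and made-up GEOID values:

```python
# Sketch of a bbox-prefiltered spatial join. The real pipeline does
# this in DuckDB with spatial indexes and ST_Intersects; here axis-
# aligned rectangles stand in for the actual geometries.
def bboxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def tag_buildings(buildings, blocks):
    """Attach the first overlapping block's GEOID to each building bbox."""
    tagged = []
    for bid, bbox in buildings.items():
        geoid = next(
            (g for g, blk in blocks.items()
             if bboxes_overlap(bbox, blk)),  # exact ST_Intersects goes here
            None,
        )
        tagged.append((bid, geoid))
    return tagged

blocks = {"060750101001000": (0, 0, 10, 10), "060750101001001": (10, 0, 20, 10)}
buildings = {"b1": (2, 2, 3, 3), "b2": (50, 50, 51, 51)}
print(tag_buildings(buildings, blocks))
# -> [('b1', '060750101001000'), ('b2', None)]
```

The prefilter matters because exact geometry intersection is expensive; discarding non-overlapping bboxes first keeps the join tractable at CONUS scale.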
**Usage**:

```bash
# Both buildings and addresses (default)
# Also creates region-tagged buildings automatically
pixi run ocr ingest-data run-all overture-maps

# Only buildings (also creates region-tagged buildings)
pixi run ocr ingest-data process overture-maps --overture-data-type buildings

# Only addresses (no region tagging)
pixi run ocr ingest-data process overture-maps --overture-data-type addresses

# Dry run
pixi run ocr ingest-data run-all overture-maps --dry-run

# Use Coiled for distributed processing
pixi run ocr ingest-data run-all overture-maps --use-coiled
```

**Outputs**:

- Buildings: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-buildings-2025-09-24.0.parquet`
- Addresses: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-addresses-2025-09-24.0.parquet`
- Region-Tagged Buildings: `s3://carbonplan-ocr/input/fire-risk/vector/overture-maps/CONUS-overture-region-tagged-buildings-2025-09-24.0.parquet`

---
#### census-tiger

**US Census TIGER/Line Geographic Boundaries**

- **Vintage**: 2024 (tracts/counties), 2025 (blocks)
- **Source**: [US Census Bureau TIGER/Line](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)
- **Format**: GeoParquet (WKB geometry, zstd compression, schema v1.1.0)
- **Coverage**: CONUS + DC (49 state-level areas; excludes Alaska & Hawaii)
- **Geography Types**: Blocks, Tracts, Counties

**Pipeline**:

1. Download TIGER/Line shapefiles from the Census Bureau (per-state for blocks/tracts)
2. Convert to GeoParquet with spatial metadata
3. Aggregate tract files using DuckDB

**Usage**:

```bash
# All geography types (default)
pixi run ocr ingest-data run-all census-tiger

# Only counties
pixi run ocr ingest-data process census-tiger --census-geography-type counties

# Tracts for specific states
pixi run ocr ingest-data process census-tiger --census-geography-type tracts \
  --census-subset-states California --census-subset-states Oregon

# Dry run
pixi run ocr ingest-data run-all census-tiger --dry-run
```

**Outputs**:

- Blocks: `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/blocks/blocks.parquet`
- Tracts (per-state): `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/FIPS/FIPS_*.parquet`
- Tracts (aggregated): `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/tracts/tracts.parquet`
- Counties: `s3://carbonplan-ocr/input/fire-risk/vector/aggregated_regions/counties/counties.parquet`
## CLI Reference

### Commands

- **`list-datasets`**: Show all available datasets
- **`download <dataset>`**: Download raw source data (tensor datasets only)
- **`process <dataset>`**: Process and upload to S3/Icechunk
- **`run-all <dataset>`**: Complete pipeline (download + process + cleanup)

### Global Options

- **`--dry-run`**: Preview operations without executing (recommended before any real run)
- **`--debug`**: Enable debug logging for troubleshooting

### Tensor Dataset Options

- **`--use-coiled`**: Use Coiled for distributed processing (USFS datasets)

### Vector Dataset Options

#### Overture Maps

- **`--overture-data-type <type>`**: Which data to process
  - `buildings`: Only building geometries
  - `addresses`: Only address points
  - `both`: Both datasets (default)

#### Census TIGER

- **`--census-geography-type <type>`**: Which geography to process
  - `blocks`: Census blocks
  - `tracts`: Census tracts (per-state + aggregated)
  - `counties`: County boundaries
  - `all`: All three types (default)
- **`--census-subset-states <state>`**: Process only specific states
  - Repeat the option for each state: `--census-subset-states California --census-subset-states Oregon`
  - Use full state names (case-sensitive): `California`, `Oregon`, `Washington`, etc.
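The repeat-the-flag convention for `--census-subset-states` is the standard append-style option. A minimal `argparse` sketch of how such a `process` command could be declared; this mirrors the documented options only and is not the project's actual CLI, which may use a different framework:

```python
import argparse

# Illustrative parser mirroring the documented census-tiger options.
parser = argparse.ArgumentParser(prog="ocr ingest-data")
sub = parser.add_subparsers(dest="command")
proc = sub.add_parser("process")
proc.add_argument("dataset")
proc.add_argument("--census-geography-type",
                  choices=["blocks", "tracts", "counties", "all"], default="all")
# action="append" collects one value per repeated flag into a list
proc.add_argument("--census-subset-states", action="append", default=None)
proc.add_argument("--dry-run", action="store_true")

args = parser.parse_args([
    "process", "census-tiger",
    "--census-geography-type", "tracts",
    "--census-subset-states", "California",
    "--census-subset-states", "Oregon",
])
print(args.census_subset_states)  # -> ['California', 'Oregon']
```

The `default=None` (rather than `[]`) makes "no states given" distinguishable from "an empty selection", which is the usual idiom for optional append flags.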
## Configuration

### Environment Variables

All settings can be overridden via environment variables:

```bash
# S3 configuration
export OCR_INPUT_DATASET_S3_BUCKET=my-bucket
export OCR_INPUT_DATASET_S3_REGION=us-east-1
export OCR_INPUT_DATASET_BASE_PREFIX=custom/prefix

# Processing options
export OCR_INPUT_DATASET_CHUNK_SIZE=16384
export OCR_INPUT_DATASET_DEBUG=true

# Temporary storage
export OCR_INPUT_DATASET_TEMP_DIR=/path/to/temp
```
### Configuration Class

The `InputDatasetConfig` class (a Pydantic model) provides:

- Type validation for all settings
- Automatic environment variable loading (prefix: `OCR_INPUT_DATASET_`)
- Default values for all options
- Case-insensitive environment variable names
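The behavior described above can be approximated in a few lines of standard-library Python. This is a rough stand-in for what a Pydantic settings model does (prefix matching, case-insensitivity, defaults, type coercion), not the actual `InputDatasetConfig`; the default values below are illustrative:

```python
import os

# Illustrative defaults; the real model defines its own.
DEFAULTS = {"s3_bucket": "my-bucket", "chunk_size": 16384, "debug": False}
PREFIX = "OCR_INPUT_DATASET_"

def load_config(environ=None):
    """Overlay prefix-matched env vars (any case) onto the defaults."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)
    for key, value in environ.items():
        if key.upper().startswith(PREFIX):
            name = key.upper().removeprefix(PREFIX).lower()
            if name in config:
                default = DEFAULTS[name]
                # Coerce to the default's type (bool check must precede int,
                # since bool is a subclass of int in Python)
                if isinstance(default, bool):
                    config[name] = value.lower() in ("1", "true", "yes")
                elif isinstance(default, int):
                    config[name] = int(value)
                else:
                    config[name] = value
    return config

cfg = load_config({"ocr_input_dataset_chunk_size": "8192"})
print(cfg["chunk_size"])  # -> 8192
```

Pydantic's `BaseSettings` handles the same concerns declaratively (via `env_prefix` and per-field types) plus validation errors on bad values, which is why the project uses a model rather than ad-hoc parsing like this.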
## Troubleshooting

### Dry Run First

Always test with `--dry-run` before executing:

```bash
ocr ingest-data run-all <dataset> --dry-run
```

This previews all operations without making changes.