NCAR/gdex-arco-kerchunk

GDEX ARCO Kerchunk

Purpose

This repository contains tools and examples for creating Kerchunk reference files specifically for GDEX (Geoscience Data Exchange) data holdings. The repository is designed to facilitate cloud-optimized data access to NCAR's GDEX data collections through Kerchunk reference files, enabling efficient analysis of large climate datasets without requiring full data downloads.

Key Features:

  • Create Kerchunk reference files for GDEX NetCDF and GRIB data
  • Generate both individual sidecar files and combined reference files
  • Support for remote data access via HTTPS and OSDF protocols
  • Distributed processing capabilities using Dask for large datasets
  • Integration with GDEX data storage infrastructure at NCAR

Note: Parquet output format is supported for combined references but may have limited functionality compared to JSON format.
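The difference between sidecar and combined references can be sketched with the public Kerchunk API. This is a minimal illustration, not the code in create_kerchunk.py: the function names, the `<file name>.json` naming rule, the `inline_threshold` value, and the choice of `time` as the concatenation dimension are all assumptions made for the example.

```python
import json
from pathlib import Path


def sidecar_path(data_file, output_dir):
    """Hypothetical naming rule: <output_dir>/<data file name>.json."""
    return Path(output_dir) / (Path(data_file).name + ".json")


def make_sidecar(data_file, output_dir, inline_threshold=300):
    """Write one reference file for one NetCDF4/HDF5 file and return its path."""
    # Imported lazily so this module loads even where kerchunk is not installed.
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    with fsspec.open(str(data_file), "rb") as f:
        # translate() scans the HDF5 layout and returns a reference dict.
        refs = SingleHdf5ToZarr(
            f, str(data_file), inline_threshold=inline_threshold
        ).translate()

    out = sidecar_path(data_file, output_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(refs))
    return out


def combine_sidecars(sidecar_files, combined_file, concat_dim="time"):
    """Merge per-file references into one reference file along concat_dim."""
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        [str(p) for p in sidecar_files],
        concat_dims=[concat_dim],  # assumption: files are split along time
        remote_protocol="file",    # the referenced data files are local
    )
    combined = mzz.translate()
    Path(combined_file).write_text(json.dumps(combined))
    return combined
```

A sidecar describes the chunks of a single file, while the combined reference presents many files as one logical dataset, which is why the combine step needs to know the dimension along which the files are joined.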

Main Scripts

create_kerchunk.py

The primary tool for creating Kerchunk reference files from GDEX data holdings.

python src/create_kerchunk.py -h
usage: create_kerchunk.py [-h] --action <combine|sidecar> --directory <directory> [--output_directory <directory>] [--filename <output filename>] 
                          [--extensions <extension> [<extension> ...]] [--variables <variable names> [<variable names> ...]] 
                          [--cluster < PBS / single / local >] [--dry_run] [--make_remote] [--regex <regular expression>] 
                          [--output_format < json / parquet >]

Creates kerchunk sidecar files of an entire directory structure.

optional arguments:
  -h, --help            show this help message and exit
  --action <combine|sidecar>, -a <combine|sidecar>
                        Specify whether to create combined references or create sidecar files.
  --directory <directory>, -d <directory>
                        Directory to scan and create kerchunk reference files.
  --output_directory <directory>, -o <directory>
                        Directory to place output files (default: current directory)
  --filename <output filename>, -f <output filename>
                        Filename for output json.
  --extensions <extension> [<extension> ...], -e <extension> [<extension> ...]
                        Only process files of this extension
  --variables <variable names> [<variable names> ...], -v <variable names> [<variable names> ...]
                        Only gather specific variables. Variable names are case sensitive. Use the special keyword 'ALL' to separate all into individual files.
  --cluster < PBS / single / local >, -c < PBS / single / local >
                        Choose type of dask cluster to use:
                        PBS - PBSCluster (defaults to 5 workers, uses GDEX queue)
                        single - singleThreaded
                        local - localCluster (uses os.ncpus)
  --dry_run, -dr        Do a dry run of processing
  --make_remote, -mr    Additionally make a remote accessible copy of json with GDEX URLs
  --regex <regular expression>, -r <regular expression>
                        Combine references that match the specified regular expression
  --output_format < json / parquet >, -of < json / parquet >
                        Specify the output format for combined references (default: json)
                        Note: Parquet format support is experimental
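The --extensions and --regex options both narrow the set of files that are processed. The sketch below shows one way such a filter could work, using only the standard library. It is an assumption about the behaviour, not the script's actual logic: `select_files` is a hypothetical helper, the directory is walked recursively, and the pattern is allowed to match anywhere in the path.

```python
import re
from pathlib import Path


def select_files(directory, extensions=None, regex=None):
    """Return sorted file paths under directory that pass both filters."""
    # Normalise "nc" and ".nc" to the same suffix form.
    suffixes = {"." + e.lstrip(".") for e in extensions} if extensions else None
    pattern = re.compile(regex) if regex else None

    selected = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        if suffixes is not None and path.suffix not in suffixes:
            continue
        # Assumption: the pattern may match anywhere in the path (re.search).
        if pattern is not None and not pattern.search(str(path)):
            continue
        selected.append(path)
    return selected
```

Running the real script with --dry_run is the reliable way to check which files a given combination of options selects.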

Example Usage

# Create individual sidecar files for a directory
python src/create_kerchunk.py --action sidecar --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output

# Create combined reference file for NetCDF files
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json

# Create combined reference with remote access capability
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename bnd_ocean.194907.json --make_remote

# Dry run to preview processing
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json --dry_run
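Once a JSON reference file exists, it can typically be opened lazily with xarray through the fsspec `reference` filesystem. The following is a sketch under assumptions: `open_reference` and `reference_storage_options` are hypothetical helpers, xarray, zarr, and fsspec must be installed, and the exact keyword arguments may differ between library versions.

```python
def reference_storage_options(reference_file, remote_protocol="file"):
    """Build fsspec options for the 'reference' filesystem."""
    return {"fo": str(reference_file), "remote_protocol": remote_protocol}


def open_reference(reference_file, remote_protocol="file"):
    """Open a Kerchunk JSON reference file lazily as an xarray.Dataset."""
    import xarray as xr  # imported lazily; also needs zarr and fsspec

    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            # Kerchunk references carry no consolidated metadata key.
            "consolidated": False,
            "storage_options": reference_storage_options(
                reference_file, remote_protocol
            ),
        },
    )
```

For a file produced by the commands above, a call such as `open_reference("./output/combined_kerchunk.json")` would read the referenced chunks from local disk. A reference written with --make_remote would instead need `remote_protocol="https"`.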

Additional Tools

convert_ref_file_loc.py

Converts local file paths in Kerchunk reference files to remote HTTPS or OSDF URLs for cloud access.

Supported remote endpoints:

  • https://data.gdex.ucar.edu (primary GDEX data portal)
  • osdf:///ncar/gdex (Open Science Data Federation endpoint)
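The path conversion can be illustrated on the version-1 Kerchunk JSON layout, where each chunk entry under `refs` is a list whose first element is the file path. This is a minimal sketch, not the logic of convert_ref_file_loc.py: `to_remote` is a hypothetical helper, and the mapping from the local prefix to the endpoint root is an assumption about how GDEX lays out its URLs.

```python
LOCAL_PREFIX = "/glade/campaign/collections/gdex/data/"
# Assumption: dataset paths sit directly under the endpoint root.
REMOTE_PREFIX = "https://data.gdex.ucar.edu/"


def to_remote(reference, local_prefix=LOCAL_PREFIX, remote_prefix=REMOTE_PREFIX):
    """Return a copy of a version-1 reference dict with chunk paths rewritten."""
    new_refs = {}
    for key, value in reference["refs"].items():
        # Chunk entries are lists: [path] or [path, offset, length].
        # Metadata and inlined data are plain strings and are kept as-is.
        if (
            isinstance(value, list)
            and value
            and isinstance(value[0], str)
            and value[0].startswith(local_prefix)
        ):
            value = [remote_prefix + value[0][len(local_prefix):], *value[1:]]
        new_refs[key] = value
    # The input dict is left unmodified.
    return {**reference, "refs": new_refs}
```

Passing `remote_prefix="osdf:///ncar/gdex/"` would produce OSDF references in the same way, again assuming the same directory layout below the endpoint.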

create_kerchunk_grib.py

Specialized tool for creating Kerchunk reference files from GRIB format data, with support for parameter ID filtering.

separate_kerchunk.py

Utility for separating combined Kerchunk reference files into individual variable-specific reference files.
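Separation by variable can be sketched directly on the reference dictionary, because every key under `refs` is either root metadata (for example `.zgroup`) or starts with `<variable>/`. The helper below is hypothetical and only illustrates the idea. It assumes the caller names the coordinate arrays that every output should keep.

```python
def split_by_variable(reference, data_vars, shared=()):
    """Return {variable: reference dict} from one combined version-1 reference."""
    refs = reference["refs"]
    # Root-level metadata keys such as ".zgroup" and ".zattrs" contain no "/".
    root = {k: v for k, v in refs.items() if "/" not in k}

    def keys_for(name):
        prefix = name + "/"
        return {k: v for k, v in refs.items() if k.startswith(prefix)}

    # Coordinate arrays (e.g. time, lat, lon) are copied into every output.
    common = dict(root)
    for name in shared:
        common.update(keys_for(name))

    return {
        var: {**reference, "refs": {**common, **keys_for(var)}}
        for var in data_vars
    }
```

Each returned dictionary can be written with `json.dump` to produce one reference file per variable.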

convert_chunks.py

Tool for modifying the chunk sizes of data files.

Repository Structure

├── src/                    # Main source code directory
│   ├── create_kerchunk.py     # Primary Kerchunk creation tool
│   ├── create_kerchunk_grib.py # GRIB-specific Kerchunk tool
│   ├── convert_ref_file_loc.py # Local to remote path converter
│   ├── separate_kerchunk.py   # Reference file separator
│   └── convert_chunks.py      # Chunk size modifier
├── examples/               # Usage examples and batch scripts
└── test/                   # Test scripts and validation notebooks

GDEX Integration

This repository is specifically designed to work with NCAR's GDEX (Geoscience Data Exchange) infrastructure:

  • Data Sources: Processes data from /glade/campaign/collections/gdex/data/
  • Remote Access: Generates reference files compatible with GDEX web services
  • HPC Integration: Configured for NCAR's PBS job scheduler with GDEX queue
  • Protocols: Supports both HTTPS and OSDF data federation protocols

Important Notes

  • Parquet Support: Parquet is available as an output format (--output_format parquet), but support is experimental and may be less complete than the default JSON format
  • PBS Cluster: When using --cluster PBS, jobs are automatically submitted to the GDEX queue
  • Remote References: The --make_remote flag creates additional reference files with GDEX URLs for cloud-native data access

About

This repository is the kerchunking module that the GDEX team used to prepare the ARCO data. It currently supports JSON and Parquet reference file output.
