GDEX ARCO Kerchunk

Purpose

This repository contains tools and examples for creating Kerchunk reference files for GDEX (Geoscience Data Exchange) data holdings. These reference files provide cloud-optimized access to NCAR's GDEX data collections, so large climate datasets can be analyzed efficiently without downloading them in full.

Key Features:

Create Kerchunk reference files for GDEX NetCDF and GRIB data
Generate both individual sidecar files and combined reference files
Support remote data access via the HTTPS and OSDF protocols
Process large datasets in parallel using Dask
Integrate with the GDEX data storage infrastructure at NCAR

Note: Parquet output format is supported for combined references but may have limited functionality compared to JSON format.

Main Scripts

create_kerchunk.py

The primary tool for creating Kerchunk reference files from GDEX data holdings.

python src/create_kerchunk.py -h
usage: create_kerchunk.py [-h] --action <combine|sidecar> --directory <directory> [--output_directory <directory>] [--filename <output filename>] 
                          [--extensions <extension> [<extension> ...]] [--variables <variable names> [<variable names> ...]] 
                          [--cluster < PBS / single / local >] [--dry_run] [--make_remote] [--regex <regular expression>] 
                          [--output_format < json / parquet >]

Creates kerchunk sidecar files of an entire directory structure.

optional arguments:
  -h, --help            show this help message and exit
  --action <combine|sidecar>, -a <combine|sidecar>
                        Specify whether to create combined references or create sidecar files.
  --directory <directory>, -d <directory>
                        Directory to scan and create kerchunk reference files.
  --output_directory <directory>, -o <directory>
                        Directory to place output files (default: current directory)
  --filename <output filename>, -f <output filename>
                        Filename for output json.
  --extensions <extension> [<extension> ...], -e <extension> [<extension> ...]
                        Only process files of this extension
  --variables <variable names> [<variable names> ...], -v <variable names> [<variable names> ...]
                        Only gather specific variables. Variable names are case sensitive. Use the special keyword 'ALL' to separate all into individual files.
  --cluster < PBS / single / local >, -c < PBS / single / local >
                        Choose type of dask cluster to use:
                        PBS - PBSCluster (defaults to 5 workers, uses GDEX queue)
                        single - singleThreaded
                        local - localCluster (uses os.ncpus)
  --dry_run, -dr        Do a dry run of processing
  --make_remote, -mr    Additionally make a remote accessible copy of json with GDEX URLs
  --regex <regular expression>, -r <regular expression>
                        Combine references that match the specified regular expression
  --output_format < json / parquet >, -of < json / parquet >
                        Specify the output format for combined references (default: json)
                        Note: Parquet format support is experimental
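
To make the `--extensions`, `--regex`, and `--dry_run` options more concrete, here is a minimal sketch of how this kind of file selection could work. It is not the script's actual implementation. The function names `select_files` and `dry_run`, the choice to match the regular expression against the file name with `re.search`, and the tolerance for a leading dot in extensions are all assumptions made for illustration.

```python
import os
import re


def select_files(directory, extensions=None, regex=None):
    """Return sorted paths under `directory` that pass both filters.

    Hypothetical helper: the real script may match differently.
    """
    # Normalise extensions so that "nc" and ".nc" behave the same.
    wanted = {"." + ext.lstrip(".") for ext in (extensions or [])}
    pattern = re.compile(regex) if regex else None

    selected = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if wanted and os.path.splitext(name)[1] not in wanted:
                continue
            if pattern and not pattern.search(name):
                continue
            selected.append(os.path.join(root, name))
    return sorted(selected)


def dry_run(paths):
    """Describe what would be processed without reading any data."""
    return "\n".join(f"would process: {p}" for p in paths)
```

Calling `select_files(directory, ["nc"], r"1949")` would keep only NetCDF files whose names contain `1949`; passing the result to `dry_run` gives a preview in the spirit of `--dry_run`.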

Example Usage

# Create individual sidecar files for a directory
python src/create_kerchunk.py --action sidecar --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output

# Create combined reference file for NetCDF files
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json

# Create combined reference with remote access capability
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename bnd_ocean.194907.json --make_remote

# Dry run to preview processing
python src/create_kerchunk.py --action combine --directory /glade/campaign/collections/gdex/data/d640000/bnd_ocean/194907 --output_directory ./output --extensions nc --filename combined_kerchunk.json --dry_run
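
Once a combined reference file exists, it can be opened lazily from Python. The sketch below uses the widely documented fsspec `reference://` pattern with xarray's zarr engine. It assumes `xarray`, `zarr`, and `fsspec` are installed (the exact call may differ between library versions), and that `./output/combined_kerchunk.json` is the file written by the combine example above. The helper names `reference_open_kwargs` and `open_reference` are hypothetical and not part of this repository.

```python
def reference_open_kwargs(ref_path, remote_protocol="file"):
    """Build keyword arguments for xarray.open_dataset (hypothetical helper).

    remote_protocol is the protocol of the data files that the references
    point to: "file" for local paths, "https" for references rewritten to URLs.
    """
    return {
        "engine": "zarr",
        "backend_kwargs": {
            "consolidated": False,
            "storage_options": {
                "fo": ref_path,
                "remote_protocol": remote_protocol,
            },
        },
    }


def open_reference(ref_path, remote_protocol="file"):
    """Open a JSON Kerchunk reference lazily; chunk data is read on demand."""
    import xarray as xr  # imported here so the helper above has no dependencies

    return xr.open_dataset(
        "reference://", **reference_open_kwargs(ref_path, remote_protocol)
    )


# Example (needs xarray, zarr, fsspec, and the file from the combine example):
# ds = open_reference("./output/combined_kerchunk.json")
```

For a reference whose chunk paths have been converted to HTTPS URLs, the same call would be made with `remote_protocol="https"`.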

Additional Tools

convert_ref_file_loc.py

Converts local file paths in Kerchunk reference files to remote HTTPS or OSDF URLs for cloud access.

Supported remote endpoints:

https://data.gdex.ucar.edu (primary GDEX data portal)
osdf:///ncar/gdex (Open Science Data Federation endpoint)
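
The conversion amounts to rewriting the file path stored in each chunk reference. The sketch below illustrates the idea on a version 1 Kerchunk JSON structure, where each entry under `refs` is either an inline string or a list of the form `[path, offset, length]`. It is an illustration, not the script's code: the function name `to_remote`, the variable name `sst`, and the file name `file.nc` are hypothetical, and the assumption that the local prefix `/glade/campaign/collections/gdex/data/` maps directly onto the endpoint root is a guess, so the real URL layout may differ.

```python
import copy

LOCAL_PREFIX = "/glade/campaign/collections/gdex/data/"  # path named in this README
HTTPS_ROOT = "https://data.gdex.ucar.edu/"  # ASSUMED to mirror LOCAL_PREFIX


def to_remote(reference, local_prefix=LOCAL_PREFIX, remote_root=HTTPS_ROOT):
    """Return a copy of `reference` with local chunk paths made remote."""
    out = copy.deepcopy(reference)
    for value in out["refs"].values():
        # Chunk references are lists whose first item is the file path;
        # strings hold inline metadata or data and are left untouched.
        if isinstance(value, list) and value and isinstance(value[0], str):
            if value[0].startswith(local_prefix):
                value[0] = remote_root + value[0][len(local_prefix):]
    return out


# Toy reference with one inline entry and one chunk entry (hypothetical names).
example = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "sst/0.0.0": [LOCAL_PREFIX + "d640000/bnd_ocean/194907/file.nc", 4096, 1024],
    },
}
remote = to_remote(example)
print(remote["refs"]["sst/0.0.0"][0])
```

Passing `remote_root="osdf:///ncar/gdex/"` would produce OSDF-style paths under the same assumption about the layout. The input dictionary is copied rather than modified, so the local reference remains usable.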

create_kerchunk_grib.py

Specialized tool for creating Kerchunk reference files from GRIB format data, with support for parameter ID filtering.

separate_kerchunk.py

Utility for separating combined Kerchunk reference files into individual variable-specific reference files.
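
Conceptually, separating a combined reference means partitioning its keys by the leading variable name. The following sketch shows one way this could be done on a version 1 JSON reference. It is an illustration under stated assumptions rather than the script's implementation: the function name `split_by_variable`, the decision to copy coordinate variables into every output, and the toy variable and file names are all hypothetical.

```python
def split_by_variable(reference, coords=("time", "lat", "lon")):
    """Split a combined reference into one reference dict per data variable.

    Root-level keys (such as ".zgroup" and ".zattrs") and the coordinate
    variables named in `coords` are copied into every output.
    """
    shared, per_variable = {}, {}
    for key, value in reference["refs"].items():
        variable = key.split("/", 1)[0] if "/" in key else None
        if variable is None or variable in coords:
            shared[key] = value
        else:
            per_variable.setdefault(variable, {})[key] = value

    version = reference.get("version", 1)
    return {
        variable: {"version": version, "refs": {**shared, **own}}
        for variable, own in per_variable.items()
    }


# Toy combined reference with one coordinate and two data variables.
combined = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "time/0": ["a.nc", 0, 8],
        "sst/.zarray": "{}",
        "sst/0.0.0": ["a.nc", 8, 100],
        "ice/.zarray": "{}",
        "ice/0.0.0": ["a.nc", 108, 100],
    },
}
parts = split_by_variable(combined)
print(sorted(parts))
```

Each value in `parts` could then be written to its own JSON file with `json.dump`, giving one reference file per variable.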

convert_chunks.py

Tool for modifying the chunk sizes of files.

Repository Structure

├── src/                    # Main source code directory
│   ├── create_kerchunk.py     # Primary Kerchunk creation tool
│   ├── create_kerchunk_grib.py # GRIB-specific Kerchunk tool
│   ├── convert_ref_file_loc.py # Local to remote path converter
│   ├── separate_kerchunk.py   # Reference file separator
│   └── convert_chunks.py      # Chunk size modifier
├── examples/               # Usage examples and batch scripts
└── test/                   # Test scripts and validation notebooks

GDEX Integration

This repository is specifically designed to work with NCAR's GDEX (Geoscience Data Exchange) infrastructure:

Data Sources: Processes data from /glade/campaign/collections/gdex/data/
Remote Access: Generates reference files compatible with GDEX web services
HPC Integration: Configured for NCAR's PBS job scheduler with GDEX queue
Protocols: Supports both HTTPS and OSDF data federation protocols

Important Notes

Parquet Support: Parquet is available as an output format (--output_format parquet), but support is experimental and functionality may be limited compared to the default JSON format.
PBS Cluster: When using --cluster PBS, jobs are automatically submitted to the GDEX queue
Remote References: The --make_remote flag creates additional reference files with GDEX URLs for cloud-native data access
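
The three `--cluster` choices correspond to standard Dask setups. The sketch below shows how such a switch could look. It assumes `dask`, `distributed`, and `dask-jobqueue` are installed. The queue name `"gdex"`, the per-job resources, and the function name `make_client` are placeholders chosen for illustration, not the repository's actual configuration.

```python
import os

CLUSTER_KINDS = ("PBS", "single", "local")


def make_client(kind, workers=5):
    """Return a dask.distributed Client, or None for single-threaded runs."""
    if kind not in CLUSTER_KINDS:
        raise ValueError(
            f"unknown cluster type {kind!r}; expected one of {CLUSTER_KINDS}"
        )

    # Dask is imported lazily so that argument validation works without it.
    if kind == "single":
        import dask

        dask.config.set(scheduler="single-threaded")
        return None

    from dask.distributed import Client, LocalCluster

    if kind == "local":
        cluster = LocalCluster(n_workers=os.cpu_count())
    else:
        from dask_jobqueue import PBSCluster

        # Placeholder queue and resources; a real site often needs an account too.
        cluster = PBSCluster(
            queue="gdex", cores=1, memory="4GiB", walltime="01:00:00"
        )
        cluster.scale(workers)  # this README states a default of 5 workers
    return Client(cluster)
```

Returning `None` for the single-threaded case keeps the calling code simple: Dask computations then run in the current process without a distributed scheduler.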
