ECCO Dataset Production

ECCO Dataset Production is a toolset that supports NASA's Open Science initiative by making ECCO's multidecadal, physically and statistically consistent ocean state estimates available in NetCDF format.

In so doing, it transforms raw MITgcm-generated results into ordered collections of date- and time-stamped files, in native and lon/lat grid formats, for wide use by the broader scientific research community.

ECCO Dataset Production can run either locally or in the cloud; the cloud mode is in regular use by the ECCO group to generate the multi-terabyte datasets available through the Physical Oceanography Distributed Active Archive Center (PO.DAAC) and NASA's Earthdata ESDIS Project.

Project Dependencies

Much of the core computation in ECCO Dataset Production is provided by xmitgcm, ECCOv4-py, and the cloud utilities package from ECCO-ACCESS.

To this, ECCO Dataset Production adds workflow automation, packaging, and utilities suitable for both local (i.e., custom dataset) and cloud-based (i.e., multi-terabyte) production and distribution.

Installation and Usage

ECCO Dataset Production can be pip-installed like any other Python package: clone the repo, cd to the top-level directory, and install:

$ git clone https://github.com/ECCO-GROUP/ECCO-Dataset-Production.git
$ cd ECCO-Dataset-Production
$ pip install .
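
To keep the installation isolated, a standard Python virtual environment works as usual; an editable install (pip install -e .) is a common alternative if you plan to modify the code. The .venv directory name below is just a convention, and editable-install support is an assumption rather than something documented here:

$ python -m venv .venv
$ source .venv/bin/activate
$ pip install .    # or "pip install -e ." for a development (editable) install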

Dockerfiles, Docker Compose files, and automation scripts are also included to support local and AWS-targeted, container-based workflows. See ./docker/README.md for details.
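
As a rough sketch of the container route (the compose file path below is an assumption; ./docker/README.md documents the actual files, services, and build targets):

$ docker compose --file docker/docker-compose.yml build    # path assumed; see ./docker/README.md
$ docker compose --file docker/docker-compose.yml up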

ECCO Dataset Production exposes several command-line scripts. Two of the most important are edp_create_job_task_list, which creates a JSON-formatted, explicit list of the NetCDF files to be produced, and edp_generate_dataproducts, which then reads this task list and generates the resulting files. Command-line help is available via:

$ edp_create_job_task_list --help
$ edp_generate_dataproducts --help
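
In broad strokes, production is a two-step pipeline: first write a task list, then process it. The sketch below is a placeholder only; the option names are deliberately elided because they are defined by each script's --help output and are not reproduced here:

$ edp_create_job_task_list <options naming the inputs and the output task list>    # placeholder; see --help
$ edp_generate_dataproducts <options pointing at that task list>                   # placeholder; see --help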

Test/demonstration examples illustrating dataset production in local and cloud-based modes are in ./demos. To run them, you'll first need to fetch the ECCO-v4-Configurations submodule (unless ECCO-Dataset-Production was originally cloned using the --recurse-submodules option):

$ git submodule init
$ git submodule update
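
Alternatively, the submodule can be picked up at clone time:

$ git clone --recurse-submodules https://github.com/ECCO-GROUP/ECCO-Dataset-Production.git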

./demos/native_latlon_local is a useful "getting started" example that illustrates generating local NetCDF files from local input files, with a discussion of problem setup, input formats, and job submission.

History

Initial dataset production iterations were the work of Ian Fenty, with subsequent prototype AWS Lambda cloud deployment by Ian Fenty and Duncan Bark. The current package is a significant update that includes production tools and scaling for AWS Batch-based cloud deployment, and has been implemented by Ian Fenty and Greg Moore ([email protected]). Release documentation generation tools are the work of Jose Gonzales and Odilon Houndegnonto.

Contributing

Contributions and use case examples are always welcome! Please feel free to fork this repo and issue a pull request or contact the ECCO Group.
