ECCO Dataset Production is a toolset that supports NASA's Open Science initiative by making ECCO's multidecadal, physically- and statistically-consistent ocean state estimates available in NetCDF format.
In so doing, it transforms raw MITgcm-generated results into ordered collections of date- and time-stamped files, in native and lon/lat grid formats, for use by the broader scientific research community.
ECCO Dataset Production can run either locally or in the cloud, the latter mode in regular use by the ECCO group to generate the multi-terabyte datasets available through the Physical Oceanography Distributed Active Archive Center (PO.DAAC) and NASA's Earthdata ESDIS Project.
Much of the core computation in ECCO Dataset Production is provided by xmitgcm, ECCOv4-py, and the cloud utilities package from ECCO-ACCESS.
To this, ECCO Dataset Production adds workflow automation, packaging, and utilities suitable for both local (i.e., custom dataset) and cloud-based (i.e., multi-terabyte) production and distribution.
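As a rough sketch of the kind of operation these underlying libraries provide (and not the package's actual pipeline code), the following Python example uses xmitgcm to read raw MITgcm binary output into an xarray Dataset and write it out as NetCDF; the paths, variable prefix, and output filename are placeholders.

import xmitgcm

# Read raw MITgcm .data/.meta output into an xarray Dataset. The directory
# paths, variable prefix, and iteration selection below are placeholders.
ds = xmitgcm.open_mdsdataset(
    data_dir='/path/to/mitgcm/diags',   # raw MITgcm diagnostic output
    grid_dir='/path/to/mitgcm/grid',    # model grid files
    prefix=['THETA'],                   # variable(s) to read
    iters='all',                        # all available model iterations
    geometry='llc',                     # ECCO's native lat-lon-cap grid
)

# xmitgcm returns an xarray Dataset, which writes directly to NetCDF.
ds.to_netcdf('THETA_native.nc')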
ECCO Dataset Production can be pip-installed like any other Python
package. Just clone the repo, cd to the top-level directory, and
install:
$ git clone https://github.com/ECCO-GROUP/ECCO-Dataset-Production.git
$ cd ECCO-Dataset-Production
$ pip install .
Dockerfiles, Docker Compose files, and automation scripts have also
been included to support local and AWS-targeted, container-based
solutions. See ./docker/README.md for details.
ECCO Dataset Production exposes several command-line scripts. Two of
the most important are edp_create_job_task_list, which creates an
explicit, JSON-formatted list of the NetCDF files to be produced, and
edp_generate_dataproducts, which reads this task list and generates
the resulting files. Command-line help is available via:
$ edp_create_job_task_list --help
$ edp_generate_dataproducts --help
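Because the task list written by edp_create_job_task_list is plain JSON, it can be inspected with standard tools before being handed to edp_generate_dataproducts. The sketch below assumes a previously generated task list named tasks.json whose top-level structure is a JSON array; the contents of each task entry are defined by the package and are not shown here.

import json

# Inspect a previously generated task list (the filename is a placeholder,
# and the top-level JSON array structure is an assumption for illustration).
with open('tasks.json') as f:
    tasks = json.load(f)

print(f'{len(tasks)} dataset production tasks queued')
print(json.dumps(tasks[0], indent=2))   # examine one task entry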
Test/demonstration examples illustrating dataset production in local
and cloud-based modes are in ./demos. To run them, you'll need to
initialize the ECCO-v4-Configurations submodule (unless
ECCO-Dataset-Production was originally cloned with the
--recurse-submodules option):
$ git submodule init
$ git submodule update
./demos/native_latlon_local is a useful "getting started" example
illustrating generation of local NetCDF files from local input files,
with a discussion of problem setup, input formats, and job submission.
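After the demo completes, the resulting NetCDF granules can be spot-checked with xarray; the filename below is a placeholder for whatever the demo writes out.

import xarray as xr

# Open one of the demo's output granules (placeholder filename) and print
# its dimensions, coordinates, data variables, and global attributes.
ds = xr.open_dataset('path/to/demo/output/granule.nc')
print(ds)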
Initial dataset production iterations were the work of Ian Fenty, with subsequent prototype AWS Lambda cloud deployment by Ian Fenty and Duncan Bark. The current package is a significant update that includes production tools and scaling for AWS Batch-based cloud deployment, and has been implemented by Ian Fenty and Greg Moore ([email protected]). Release documentation generation tools are the work of Jose Gonzales and Odilon Houndegnonto.
Contributions and use case examples are always welcome! Please feel free to fork this repo and issue a pull request or contact the ECCO Group.