Improve documentation

rbeucher · rbeucher · commit c1669edb627d · 2025-07-30T14:20:08.000+10:00
diff --git a/.zenodo.json b/.zenodo.json
@@ -0,0 +1,39 @@
+{
+    "creators": [
+        {
+            "orcid": "0000-0003-3891-5444",
+            "affiliation": "ACCESS-NRI",
+            "name": "Beucher, Romain"
+        }
+    ],
+    "contributors": [
+        {
+            "name": "Paola Petrelli",
+            "affiliation": "University of Tasmania",
+            "orcid": "0000-0002-0164-5105",
+            "type": "Version 1 developer"
+        },
+        {
+            "name": "Samuel Green",
+            "affiliation": "ARC Centre of Excellence for the Weather of the 21st Century",
+            "orcid": "0000-0003-1129-4676",
+            "type": "Version 1 developer"
+        },
+        {
+            "name": "Chloe Mackallah",
+            "affiliation": "CSIRO",
+            "orcid": "0000-0003-4989-5530",
+            "type": "APP4 developer"
+        }
+    ],
+
+    "license": "Apache-2.0",
+
+    "title": "ACCESS-MOPPeR",
+
+    "keywords": ["Climate", "Science", "Model Evaluation", "CMOR", "CMIP", "ACCESS", "ACCESS-NRI", "NCI"],
+
+    "communities": [
+        {"identifier": "access-nri"}
+    ]
+}
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
-# ACCESS-MOPPeR v2.1.0a (Alpha Version)
+# ACCESS Model Output Post-Processor (ACCESS-MOPPeR) v2.1.0a (Alpha Version)
 
 ## Overview
-ACCESS-MOPPeR v2.0.0a is a CMORisation tool designed to post-process ACCESS model output. This version represents a significant rewrite of the original MOPPeR, focusing on usability rather than raw performance. It introduces a more flexible and user-friendly Python API that can be integrated into Jupyter notebooks and other workflows.
+ACCESS-MOPPeR v2.1.0a is a CMORisation tool designed to post-process ACCESS model output. This version represents a significant rewrite of the original MOPPeR, focusing on usability rather than raw performance. It introduces a more flexible and user-friendly Python API that can be integrated into Jupyter notebooks and other workflows.
 
 ACCESS-MOPPeR allows for targeted CMORisation of individual variables and is specifically designed to support the ACCESS-ESM1.6 configuration prepared for CMIP7 FastTrack. However, ocean variable support remains limited in this alpha release.
 
@@ -15,23 +15,52 @@ ACCESS-MOPPeR allows for targeted CMORisation of individual variables and is spe
 
 ## Current Limitations
 - **Alpha Version**: Intended for evaluation purposes only; not recommended for data publication.
-- **Limited Ocean Variable Support**: Further development is needed to support ocean-related variables fully.
+
+> **⚠️ Variable Mapping Under Review**
+>
+> We are currently reviewing the mapping of ACCESS variables to their CMIP6 and CMIP7 equivalents. Some variables that require derivation may not be available yet, or their calculation may need further verification.
+> **If you notice any major issues or missing variables, please submit an issue!**
+
 
 ## Background
-ACCESS-MOPPeR builds upon the original APP4 and MOPPeR frameworks, which were initially developed for CMIP5 and later extended for CMIP6. These tools leveraged CMOR3 and CMIP6 data request files to produce CF-compliant datasets aligned with ESGF standards. MOPPeR introduced the **mopdb** tool, allowing users to create custom mappings and CMOR table definitions.
+ACCESS-MOPPeR v2 is a complete rewrite of the original APP4 and MOPPeR frameworks. Unlike previous versions, it does **not** depend on CMOR; instead, it leverages modern Python libraries such as **xarray** and **dask** for efficient processing of NetCDF files. This approach streamlines the workflow, improves flexibility, and enhances integration with contemporary data science tools.
+
+While retaining the core concepts of "custom" and "cmip" modes, ACCESS-MOPPeR v2 unifies these workflows within a single configuration file, focusing on usability and extensibility for current and future CMIP projects.
+
+---
+
+## Installation
+
+
+```sh
+pip install numpy pandas xarray netCDF4 cftime dask pyyaml tqdm requests
+pip install .
+```
+
+---
+
+## Documentation
+
+See the [Getting Started notebook](notebooks/Getting_started.ipynb) and the [docs](docs/) folder for detailed usage and API documentation.
+
+---
+
+## Testing
+
+To run tests:
+
+```sh
+pytest
+```
+
+---
 
-This rewrite retains key features of the original MOPPeR while enhancing usability. The differentiation between "custom" and "cmip" modes remains, but both modes now follow a unified workflow defined in a single configuration file.
+## License
 
-## Usage
-ACCESS-MOPPeR v2.0.0a is best suited for users interested in evaluating outputs from ACCESS-ESM1.6 development releases. Full documentation is not available yet.
-Please refer to the [Getting Started Notebook](https://github.com/ACCESS-NRI/ACCESS-MOPPeR/blob/v2/notebooks/Getting_started.ipynb):
+ACCESS-MOPPeR is licensed under the Apache-2.0 License.
 
-## Future Development
-- **Optimised Multi-CPU Execution**: Parallel processing support will be introduced in later versions.
-- **Enhanced Ocean Variable Support**: Expansion of CMORisation capabilities for ocean-related data.
-- **Expanded CMORisation Standards**: Continued flexibility in defining custom post-processing standards beyond CMIP6.
+---
 
-## Disclaimer
-This is an **alpha release** and should not be used for official data publications. Users should expect potential changes in future versions that may affect workflow compatibility.
+## Contact
 
-For feedback or issues, please submit your contributions via the project's repository or contact the development team.
+Author: Romain Beucher
diff --git a/notebooks/Getting_started.ipynb b/notebooks/Getting_started.ipynb
@@ -5,7 +5,17 @@
    "id": "c042f571-90e9-4160-ae53-bdbc5a165525",
    "metadata": {},
    "source": [
-    "# ACCESS-MOPPeR Getting Started"
+    "# ACCESS-MOPPeR Getting Started\n",
+    "\n",
+    "Welcome to the ACCESS-MOPPeR Getting Started guide!\n",
+    "\n",
+    "This notebook will walk you through the initial setup and basic usage of ACCESS-MOPPeR, a tool designed to post-process ACCESS model output and produce CMIP-compliant datasets. You’ll learn how to configure your environment, prepare your data, and run the CMORisation workflow using both the Python API and Dask for scalable processing.\n",
+    "\n",
+    "By following this guide, you’ll be able to:\n",
+    "- Set up your user configuration\n",
+    "- Select input data files\n",
+    "- Run the CMORisation process for selected variables\n",
+    "- Inspect and save the processed output\n"
    ]
   },
   {
@@ -28,7 +38,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": null,
    "id": "80dbbe95-35ea-43d1-a1a0-cea79082b2eb",
    "metadata": {},
    "outputs": [
@@ -46,13 +56,32 @@
     }
    ],
    "source": [
-    "from access_mopper import ACCESS_ESM_CMORiser\n",
-    "import dask.distributed as dask"
+    "from access_mopper import ACCESS_ESM_CMORiser"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eae38f8c",
+   "metadata": {},
+   "source": [
+    "## Dask support\n",
+    "\n",
+    "ACCESS-MOPPeR supports Dask for parallel processing, which can significantly speed up the CMORisation workflow, especially when working with large datasets. To use Dask with ACCESS-MOPPeR, create a Dask client, which manages the distributed computation. This allows you to take advantage of multiple CPU cores or even a cluster of machines, depending on your setup.\n",
+    "You can configure the Dask client to use a specific number of threads per worker, which can help optimize performance based on your hardware and the size of the datasets you are processing.\n",
+    "\n",
+    "Here's an example of how to set up a Dask client:\n",
+    "\n",
+    "```python\n",
+    "import dask.distributed as dask\n",
+    "\n",
+    "client = dask.Client(threads_per_worker=1)\n",
+    "client\n",
+    "```"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": null,
    "id": "9000d152-d67c-49ad-a648-025a0808cfe8",
    "metadata": {},
    "outputs": [
@@ -734,21 +763,50 @@
     }
    ],
    "source": [
+    "import dask.distributed as dask\n",
+    "\n",
     "client = dask.Client(threads_per_worker = 1)\n",
     "client"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d14ad618",
+   "metadata": {},
+   "source": [
+    "## Data selection\n",
+    "\n",
+    "The `ACCESS_ESM_CMORiser` class (described in detail below) takes as input a list of paths to NetCDF files containing the raw model output variables to be CMORised. The CMORiser does **not** assume any specific folder structure, DRS (Data Reference Syntax), or file naming convention. It is up to the user to ensure that the provided files contain the original variables required for CMORisation.\n",
+    "\n",
+    "This design is intentional: ACCESS-NRI plans to integrate ACCESS-MOPPeR into extended workflows that leverage the [ACCESS-NRI Intake Catalog](https://github.com/ACCESS-NRI/access-nri-intake-catalog) or evaluation frameworks such as [ESMValTool](https://www.esmvaltool.org/) and [ILAMB](https://www.ilamb.org/). By decoupling file selection from the CMORiser, ACCESS-MOPPeR can be flexibly used in a variety of data processing and evaluation pipelines."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": null,
    "id": "f49fd1d4-dcb6-47a8-9d4a-731a7ca1ea0d",
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Here we use NetCDF files from a raw ACCESS-ESM run.\n",
     "import glob\n",
     "files = glob.glob(\"../../Test_data/esm1-6/atmosphere/aiihca.pa-0961*_mon.nc\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d458b955",
+   "metadata": {},
+   "source": [
+    "### Parent experiment information\n",
+    "\n",
+    "In CMIP workflows, providing parent experiment information is required for proper data provenance and traceability. This metadata describes the relationship between your experiment and its parent (for example, a historical run branching from a piControl simulation), and is essential for CMIP data publication and compliance.\n",
+    "\n",
+    "However, for some applications—such as when using ACCESS-MOPPeR to interact with evaluation frameworks like [ESMValTool](https://www.esmvaltool.org/) or [ILAMB](https://www.ilamb.org/)—strict CMIP compliance is not always necessary. In these cases, you may choose to skip providing parent experiment information to simplify the workflow.\n",
+    "\n",
+    "If you choose to skip this step, ACCESS-MOPPeR will issue a warning to let you know that, if you write the output to disk, the resulting file may not be compatible with CMIP requirements for publication. This flexibility allows you to use ACCESS-MOPPeR for rapid evaluation and prototyping, while still supporting full CMIP compliance when needed."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
@@ -769,9 +827,27 @@
     "}"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "68b05b80",
+   "metadata": {
+    "vscode": {
+     "languageId": "markdown"
+    }
+   },
+   "source": [
+    "## Set up the CMORiser for CMORisation\n",
+    "\n",
+    "To begin the CMORisation process, you need to create an instance of the `ACCESS_ESM_CMORiser` class. This class requires several key parameters, including the list of input NetCDF files and metadata describing your experiment.\n",
+    "\n",
+    "A crucial parameter is the `compound_name`, which should be specified using the full CMIP convention: `table.variable` (for example, `Amon.rsds`). This format uniquely identifies the variable, its frequency (e.g., monthly, daily), and the associated CMIP table, ensuring that all requirements for grids and metadata are correctly handled. Using the full compound name helps avoid ambiguity and guarantees that the CMORiser applies the correct standards for each variable.\n",
+    "\n",
+    "You can also provide additional metadata such as `experiment_id`, `source_id`, `variant_label`, and `grid_label` to ensure your output is CMIP-compliant. Optionally, you may include parent experiment information for full provenance tracking."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "id": "0e54cf4e-b707-4128-aa93-23bb9cf684d3",
    "metadata": {},
    "outputs": [],
@@ -784,19 +860,52 @@
     "    variant_label=\"r1i1p1f1\",\n",
     "    grid_label=\"gn\",\n",
     "    activity_id=\"CMIP\",\n",
-    "    parent_info=parent_experiment_config)"
+    "    parent_info=parent_experiment_config  # Optional: omit if parent information is not needed\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "de6be45d",
+   "metadata": {
+    "vscode": {
+     "languageId": "markdown"
+    }
+   },
+   "source": [
+    "## Running the CMORiser\n",
+    "\n",
+    "To start the CMORisation process, simply call the `run()` method on your `cmoriser` instance as shown below. This step may take some time, especially if you are processing a large number of files.\n",
+    "\n",
+    "We recommend using the [dask-labextension](https://github.com/dask/dask-labextension) with JupyterLab to monitor the progress of your computation. The extension provides a convenient dashboard to track task progress and resource usage directly within your notebook interface.\n"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "id": "5e6c9e48-9dc0-42ab-a396-6bcf7b57cb42",
    "metadata": {},
    "outputs": [],
    "source": [
     "cmoriser.run()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "c1fade88",
+   "metadata": {
+    "vscode": {
+     "languageId": "markdown"
+    }
+   },
+   "source": [
+    "### In-memory processing with xarray and Dask\n",
+    "\n",
+    "The CMORisation workflow processes data entirely in memory using `xarray` and Dask. This approach enables efficient parallel computation and flexible data manipulation, but requires that your system has enough memory to hold your dataset.\n",
+    "\n",
+    "Once the CMORisation is complete, you can access the resulting dataset by calling the `to_dataset()` method on your `cmoriser` instance (see below). The returned object is a standard xarray dataset, which means you can slice, analyze, or further process the data using familiar xarray operations."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
@@ -1677,6 +1786,22 @@
     "ds"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f2a97420",
+   "metadata": {
+    "vscode": {
+     "languageId": "markdown"
+    }
+   },
+   "source": [
+    "### Writing the output to a NetCDF file\n",
+    "\n",
+    "To save your CMORised data to disk, use the `write()` method of the `cmoriser` instance. This will create a NetCDF file with all attributes set according to the CMIP Controlled Vocabulary, ensuring compliance with CMIP metadata standards.\n",
+    "\n",
+    "After writing the file, we recommend validating it using [PrePARE](https://github.com/PCMDI/cmor/tree/master/PrePARE), a tool provided by PCMDI to check the conformity of CMIP files. PrePARE will help you identify any issues with metadata or file structure before publication or further analysis."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,