Commit d8d7f02

Categorized configuration settings

1 parent 539870b commit d8d7f02

5 files changed: +190 −156 lines changed

CHANGES.md

Lines changed: 6 additions & 0 deletions
@@ -10,6 +10,12 @@
 [contextlib.closing()](https://docs.python.org/3/library/contextlib.html#contextlib.closing)
 is applicable. Deprecated `SliceSource.dispose()`.

+* Improved the configuration reference: introduced configuration schema categories.
+
+* Introduced configuration setting `extra`, which is an arbitrary configuration that
+  is not validated by default. Intended use is by a `slice_source` that expects an
+  argument named `ctx` and therefore can access the configuration.
+
 ## Version 0.6.0 (from 2024-03-12)

 ### Enhancements
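The intended flow for the new `extra` setting can be sketched as follows. This is a minimal stand-in, not zappend's actual API: the `ctx` object is simulated with a plain `SimpleNamespace`, and the `scale_factor` key and `my_slice_source` function are made-up illustrations.

```python
from types import SimpleNamespace

# Hypothetical configuration: "extra" holds arbitrary, unvalidated settings.
config = {
    "target_dir": "memory://target.zarr",
    "extra": {"scale_factor": 0.5},
}

# Stand-in for the context object a slice_source may receive; the real
# zappend context may differ -- this only illustrates the intended flow.
ctx = SimpleNamespace(config=config)

def my_slice_source(ctx, slice_path):
    # The slice source can read the unvalidated "extra" settings.
    scale = ctx.config["extra"]["scale_factor"]
    return f"{slice_path} scaled by {scale}"

print(my_slice_source(ctx, "slice-1.nc"))  # → slice-1.nc scaled by 0.5
```

Because `extra` is not validated, any keys a slice source needs can be passed through without extending the schema.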

docs/config.md

Lines changed: 57 additions & 66 deletions
@@ -1,31 +1,15 @@
 # Configuration Reference

+In the following all possible configuration settings are described.

-## `version`
+## Target Outline

-Configuration schema version. Allows the schema to evolve while still preserving backwards compatibility.
-Its value is `1`.
-Defaults to `1`.
-
-## `zarr_version`
-
-The Zarr version to be used.
-Its value is `2`.
-Defaults to `2`.
-
-## `fixed_dims`
-
-Type _object_.
-Specifies the fixed dimensions of the target dataset. Keys are dimension names, values are dimension sizes.
-The object's values are of type _integer_.
-
-## `append_dim`
+### `append_dim`

 Type _string_.
 The name of the variadic append dimension.
 Defaults to `"time"`.
-
-## `append_step`
+### `append_step`

 If set, enforces a step size in the append dimension between two slices or just enforces a direction.
 Must be one of the following:
@@ -46,20 +30,22 @@ Must be one of the following:
 A positive or negative numerical delta value.

 Defaults to `null`.
+### `fixed_dims`

-## `included_variables`
+Type _object_.
+Specifies the fixed dimensions of the target dataset. Keys are dimension names, values are dimension sizes.
+The object's values are of type _integer_.
+### `included_variables`

 Type _array_.
 Specifies the names of variables to be included in the target dataset. Defaults to all variables found in the first contributing dataset.
 The items of the array are of type _string_.
-
-## `excluded_variables`
+### `excluded_variables`

 Type _array_.
 Specifies the names of individual variables to be excluded from all contributing datasets.
 The items of the array are of type _string_.
-
-## `variables`
+### `variables`

 Type _object_.
 Defines dimensions, encoding, and attributes for variables in the target dataset. Object property names refer to variable names. The special name `*` refers to all variables, which is useful for defining common values.
@@ -149,13 +135,11 @@ Variable metadata.
 * `attrs`:
   Type _object_.
   Arbitrary variable metadata attributes.
-
-## `attrs`
+### `attrs`

 Type _object_.
 Arbitrary dataset attributes. If `permit_eval` is set to `true`, string values may include Python expressions enclosed in `{{` and `}}` to dynamically compute attribute values; in the expression, the current dataset is named `ds`. Refer to the user guide for more information.
-
-## `attrs_update_mode`
+### `attrs_update_mode`

 The mode used to update target attributes from slice attributes. Independently of this setting, extra attributes configured by the `attrs` setting will finally be used to update the resulting target attributes.
 Must be one of the following:
@@ -173,39 +157,37 @@ Must be one of the following:
 Its value is `"ignore"`.

 Defaults to `"keep"`.
+### `zarr_version`

-## `permit_eval`
-
-Type _boolean_.
-Allow for dynamically computed values in dataset attributes `attrs` using the syntax `{{ expression }}`. Executing arbitrary Python expressions is a security risk, therefore this must be explicitly enabled. Refer to the user guide for more information.
-Defaults to `false`.
+The Zarr version to be used.
+Its value is `2`.
+Defaults to `2`.
+## Data I/O - Target

-## `target_dir`
+### `target_dir`

 Type _string_.
 The URI or local path of the target Zarr dataset. Must specify a directory whose parent directory must exist.
-
-## `target_storage_options`
+### `target_storage_options`

 Type _object_.
 Options for the filesystem given by the URI of `target_dir`.
+### `force_new`

-## `slice_source`
-
-Type _string_.
-The fully qualified name of a class or function that receives a slice item as argument(s) and provides the slice dataset. If a class is given, it must be derived from `zappend.api.SliceSource`. If the function is a context manager, it must yield an `xarray.Dataset`. If a plain function is given, it must return any valid slice item type. Refer to the user guide for more information.
-
-## `slice_engine`
-
-Type _string_.
-The name of the engine to be used for opening contributing datasets. Refer to the `engine` argument of the function `xarray.open_dataset()`.
+Type _boolean_.
+Force creation of a new target dataset. An existing target dataset (and its lock) will be permanently deleted before appending of slice datasets begins. WARNING: the deletion cannot be rolled back.
+Defaults to `false`.
+## Data I/O - Slices

-## `slice_storage_options`
+### `slice_storage_options`

 Type _object_.
 Options for the filesystem given by the protocol of the URIs of contributing datasets.
+### `slice_engine`

-## `slice_polling`
+Type _string_.
+The name of the engine to be used for opening contributing datasets. Refer to the `engine` argument of the function `xarray.open_dataset()`.
+### `slice_polling`

 Defines how to poll for contributing datasets.
 Must be one of the following:
@@ -230,36 +212,52 @@ Must be one of the following:
 Polling timeout in seconds.
 Defaults to `60`.

+### `slice_source`
+
+Type _string_.
+The fully qualified name of a class or function that receives a slice item as argument(s) and provides the slice dataset. If a class is given, it must be derived from `zappend.api.SliceSource`. If the function is a context manager, it must yield an `xarray.Dataset`. If a plain function is given, it must return any valid slice item type. Refer to the user guide for more information.
+### `slice_source_kwargs`

-## `persist_mem_slices`
+Type _object_.
+Extra keyword-arguments passed to a configured `slice_source` together with each slice item.
+### `persist_mem_slices`

 Type _boolean_.
 Persist in-memory slices and reopen from a temporary Zarr before appending them to the target dataset. This can prevent expensive re-computation of dask chunks at the cost of additional i/o.
 Defaults to `false`.
+## Data I/O - Transactions

-## `temp_dir`
+### `temp_dir`

 Type _string_.
 The URI or local path of the directory that will be used to temporarily store rollback information.
-
-## `temp_storage_options`
+### `temp_storage_options`

 Type _object_.
 Options for the filesystem given by the protocol of `temp_dir`.
-
-## `force_new`
+### `disable_rollback`

 Type _boolean_.
-Force creation of a new target dataset. An existing target dataset (and its lock) will be permanently deleted before appending of slice datasets begins. WARNING: the deletion cannot be rolled back.
+Disable rolling back dataset changes on failure. Effectively disables transactional dataset modifications, so use this setting with care.
 Defaults to `false`.
+## Misc.
+
+### `version`

-## `disable_rollback`
+Configuration schema version. Allows the schema to evolve while still preserving backwards compatibility.
+Its value is `1`.
+Defaults to `1`.
+### `dry_run`

 Type _boolean_.
-Disable rolling back dataset changes on failure. Effectively disables transactional dataset modifications, so use this setting with care.
+If `true`, log only what would have been done, but don't apply any changes.
 Defaults to `false`.
+### `permit_eval`

-## `profiling`
+Type _boolean_.
+Allow for dynamically computed values in dataset attributes `attrs` using the syntax `{{ expression }}`. Executing arbitrary Python expressions is a security risk, therefore this must be explicitly enabled. Refer to the user guide for more information.
+Defaults to `false`.
+### `profiling`

 Profiling configuration. Allows for runtime profiling of the processing.
 Must be one of the following:
@@ -307,8 +305,7 @@ Must be one of the following:
 Pattern-match the standard name that is printed.


-
-## `logging`
+### `logging`

 Logging configuration.
 Must be one of the following:
@@ -402,9 +399,3 @@ Must be one of the following:
 The items of the array are of type _string_.


-## `dry_run`
-
-Type _boolean_.
-If `true`, log only what would have been done, but don't apply any changes.
-Defaults to `false`.
-
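For orientation, a configuration touching several of the categories above might look like the following sketch. All values are made up for illustration, and `target.zarr` is a placeholder path; only the setting names come from the reference.

```python
import json

# Illustrative configuration exercising settings from several categories.
config = {
    "version": 1,                            # Misc.: schema version
    "zarr_version": 2,                       # Target Outline: Zarr format version
    "append_dim": "time",                    # Target Outline: append dimension
    "fixed_dims": {"lat": 180, "lon": 360},  # Target Outline: fixed dimension sizes
    "target_dir": "target.zarr",             # Data I/O - Target: placeholder path
    "dry_run": True,                         # Misc.: log only, apply nothing
}

# Per the reference, fixed_dims maps dimension names to integer sizes.
assert all(isinstance(size, int) for size in config["fixed_dims"].values())

print(json.dumps(config, indent=2))
```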
tests/config/test_schema.py

Lines changed: 7 additions & 2 deletions
@@ -11,7 +11,8 @@ class ConfigSchemaTest(unittest.TestCase):
     def test_get_config_schema(self):
         schema = get_config_schema()
         self.assertIn("properties", schema)
-        self.assertIsInstance(schema["properties"], dict)
+        properties = schema["properties"]
+        self.assertIsInstance(properties, dict)
         self.assertEqual(
             {
                 "append_dim",
@@ -41,8 +42,12 @@ def test_get_config_schema(self):
                 "version",
                 "zarr_version",
             },
-            set(schema["properties"].keys()),
+            set(properties.keys()),
         )
+        for k, v in properties.items():
+            self.assertIsInstance(v, dict)
+            self.assertIn("category", v, msg=k)
+            self.assertIn("description", v, msg=k)

     def test_get_config_schema_json(self):
         # Smoke test is sufficient here

zappend/config/markdown.py

Lines changed: 29 additions & 22 deletions
@@ -6,9 +6,29 @@
 from typing import Any


-def schema_to_markdown(schema: dict[str, Any]) -> str:
+def schema_to_markdown(config_schema: dict[str, Any]) -> str:
     lines = []
-    _schema_to_md(schema, [], lines)
+
+    settings = config_schema["properties"]
+    categories = {}
+    for setting_name, setting_schema in settings.items():
+        category_name = setting_schema["category"]
+        if category_name not in categories:
+            categories[category_name] = []
+        categories[category_name].append(setting_name)
+
+    lines.append("# Configuration Reference")
+    lines.append("")
+    lines.append("In the following all possible configuration settings are described.")
+    lines.append("")
+    for category_name, setting_names in categories.items():
+        lines.append(f"## {category_name}")
+        lines.append("")
+        for setting_name in setting_names:
+            lines.append(f"### `{setting_name}`")
+            lines.append("")
+            _schema_to_md(settings[setting_name], [setting_name], lines)
+
     return "\n".join(lines)


@@ -19,10 +39,9 @@ def _schema_to_md(
     sequence_name: str | None = None,
 ):
     undefined = object()
-    is_root = len(path) == 0

     _type = schema.get("type")
-    if _type and not is_root:
+    if _type:
         if isinstance(_type, str):
             _type = [_type]
         value = " | ".join([f"_{name}_" for name in _type])
@@ -31,12 +50,6 @@ def _schema_to_md(
         else:
             lines.append(f"Type {value}.")

-    title = schema.get("title")
-    if title:
-        prefix = "# " if is_root else ""
-        lines.append(prefix + title)
-        lines.append("")
-
     description = schema.get("description")
     if description:
         lines.append(description)
@@ -83,18 +96,12 @@ def _schema_to_md(
     properties = schema.get("properties")
     if properties:
         for name, property_schema in properties.items():
-            if is_root:
-                lines.append("")
-                lines.append(f"## `{name}`")
-                lines.append("")
-                _schema_to_md(property_schema, path + [name], lines)
-            else:
-                lines.append("")
-                lines.append(f" * `{name}`:")
-                sub_lines = []
-                _schema_to_md(property_schema, path + [name], sub_lines)
-                for sub_line in sub_lines:
-                    lines.append(" " + sub_line)
+            lines.append("")
+            lines.append(f" * `{name}`:")
+            sub_lines = []
+            _schema_to_md(property_schema, path + [name], sub_lines)
+            for sub_line in sub_lines:
+                lines.append(" " + sub_line)

     additional_properties = schema.get("additionalProperties")
     if isinstance(additional_properties, dict):
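The grouping step this commit adds to `schema_to_markdown` can be illustrated standalone: settings are bucketed by their `category` field in first-seen order (Python dicts preserve insertion order), then emitted as `##` category headings with `###` setting headings beneath. The toy `settings` dict below is an illustrative stand-in for the real schema's `properties`.

```python
from typing import Any

def group_by_category(settings: dict[str, Any]) -> dict[str, list[str]]:
    # Bucket setting names by their "category", preserving insertion order.
    categories: dict[str, list[str]] = {}
    for setting_name, setting_schema in settings.items():
        categories.setdefault(setting_schema["category"], []).append(setting_name)
    return categories

# Toy schema properties; names mirror real settings, categories the diff's.
settings = {
    "append_dim": {"category": "Target Outline"},
    "target_dir": {"category": "Data I/O - Target"},
    "fixed_dims": {"category": "Target Outline"},
    "dry_run": {"category": "Misc."},
}

lines = []
for category_name, setting_names in group_by_category(settings).items():
    lines.append(f"## {category_name}")
    for setting_name in setting_names:
        lines.append(f"### `{setting_name}`")

print("\n".join(lines))
```

This is why `fixed_dims` now appears under "Target Outline" alongside `append_dim` in the generated reference, even though the two are not adjacent in the schema.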
