
Commit 1713c6e

Merge pull request #127 from OpenEnergyPlatform/feature-126-add-yaml-and-template-based-metadata-creation
Add yaml and template based metadata creation
2 parents 0e52e0f + 32b3b53 commit 1713c6e

24 files changed: +2796 −20 lines

CHANGELOG.rst

Lines changed: 2 additions & 1 deletion

```diff
@@ -4,7 +4,8 @@ Changelog
 
 current
 --------------------
-*
+* Add the creation module and create entry: They implement yaml based metadata creation, provide template feature to keep metadata creation DRY, provide functionality to setup the metadata structure & generate metadata from existing sources like datapackages and csv files, provide functionality to create the full datapackage.json and save it to file [(#127)](https://github.com/rl-institut/super-repo/pull/127)
+
 
 1.1.0 (2025-03-25)
 --------------------
```

docs/create.md

Lines changed: 159 additions & 0 deletions
# OMI “Create” Entry Point

This mini-guide explains how to use the **programmatic entry points** that turn your split YAML metadata (dataset + template + resources) into a single OEMetadata JSON document.

> If you’re looking for how to author the YAML files and how templating works, see the main **Assembly Guide** in the `creation` module directory. This page just shows how to *call* the entry points.

---

## What it does

The functions in `omi.create` wrap the full assembly pipeline:

1. **Discover / load** your YAML parts (dataset, optional template, resources).
2. **Apply the template** to each resource (deep merge; resource wins; keywords/topics/languages concatenate).
3. **Generate & validate** the final OEMetadata JSON using the official schema (via `OEMetadataCreator`).
4. **Write** the result to disk (`build_from_yaml`) or many results to a directory (`build_many_from_yaml`).
---
## API

```python
from omi.create import build_from_yaml, build_many_from_yaml
```

### `build_from_yaml(base_dir, dataset_id, output_file, *, index_file=None) -> None`

Assemble **one** dataset and write `<output_file>` (JSON).

* `base_dir` (`str | Path`): Root that contains:

  * `datasets/<dataset_id>.dataset.yaml`
  * `datasets/<dataset_id>.template.yaml` *(optional)*
  * `resources/<dataset_id>/*.resource.yaml`
* `dataset_id` (`str`): Logical dataset name (e.g. `"powerplants"`).
* `output_file` (`str | Path`): Path to write the generated OEMetadata JSON.
* `index_file` (`str | Path | None`): Optional explicit mapping file (`metadata_index.yaml`). If provided, paths are taken from the index instead of from the naming convention.

### `build_many_from_yaml(base_dir, output_dir, *, dataset_ids=None, index_file=None) -> None`

Assemble **multiple** datasets and write each as `<output_dir>/<dataset_id>.json`.

* `base_dir` (`str | Path`): Same as above.
* `output_dir` (`str | Path`): Destination directory for one JSON file per dataset.
* `dataset_ids` (`list[str] | None`): Limit assembly to specific datasets. If `None`, the function:

  * uses the keys from `index_file` when provided, **else**
  * discovers all `datasets/*.dataset.yaml` in `base_dir`.
* `index_file` (`str | Path | None`): Optional `metadata_index.yaml`.

---

## Quick examples

### One dataset (convention-based discovery)

```python
from omi.create import build_from_yaml

build_from_yaml(
    base_dir="./metadata",
    dataset_id="powerplants",
    output_file="./out/powerplants.json",
)
```

Directory layout:

```bash
metadata/
  datasets/
    powerplants.dataset.yaml
    powerplants.template.yaml   # optional
  resources/
    powerplants/
      *.resource.yaml
```
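Under these conventions, the path resolution can be pictured as follows. `conventional_paths` is a hypothetical helper written for this guide, not part of the `omi` API:

```python
from pathlib import Path


def conventional_paths(base_dir: str, dataset_id: str) -> dict[str, object]:
    """Resolve the split-files layout for one dataset (illustrative sketch)."""
    base = Path(base_dir)
    return {
        "dataset": base / "datasets" / f"{dataset_id}.dataset.yaml",
        "template": base / "datasets" / f"{dataset_id}.template.yaml",  # may not exist
        "resources": sorted((base / "resources" / dataset_id).glob("*.resource.yaml")),
    }


print(conventional_paths("./metadata", "powerplants")["dataset"])
```

With an `index_file`, these derived paths are replaced by whatever the index maps for the dataset id.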
### One dataset (explicit index)

```python
from omi.create import build_from_yaml

build_from_yaml(
    base_dir="./metadata",
    dataset_id="powerplants",
    output_file="./out/powerplants.json",
    index_file="./metadata/metadata_index.yaml",
)
```

### Many datasets (discover all)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
    base_dir="./metadata",
    output_dir="./out",
)
# writes ./out/<dataset_id>.json for each dataset found
```

### Many datasets (index + subset)

```python
from omi.create import build_many_from_yaml

build_many_from_yaml(
    base_dir="./metadata",
    output_dir="./out",
    dataset_ids=["powerplants", "households"],
    index_file="./metadata/metadata_index.yaml",
)
```

---

## Notes & behavior

* Output JSON is written with `indent=2` and **`ensure_ascii=False`** to preserve characters like `©`.
* Validation happens via `OEMetadataCreator` using the official schema provided by `oemetadata` (imported through `omi.base.get_metadata_specification`).
* If a dataset YAML is missing, a `FileNotFoundError` is raised.
* If schema validation fails, you’ll get an exception from `omi.validation`. Catch it where you call the entry point if you want to handle or report errors.
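A defensive call site might look like the sketch below. The wrapper takes the build function as a parameter so the example stays self-contained; the broad `except Exception` stands in for the concrete validation error class your `omi.validation` version raises:

```python
def safe_build(build, base_dir: str, dataset_id: str, output_file: str) -> bool:
    """Call a build entry point, reporting failures instead of crashing the pipeline."""
    try:
        build(base_dir, dataset_id, output_file)
    except FileNotFoundError as err:
        # Raised when the dataset YAML (or another required part) is missing
        print(f"missing dataset YAML for {dataset_id!r}: {err}")
        return False
    except Exception as err:  # placeholder for the omi.validation error class
        print(f"validation failed for {dataset_id!r}: {err}")
        return False
    return True


# Usage with the real entry point would be:
#   from omi.create import build_from_yaml
#   safe_build(build_from_yaml, "./metadata", "powerplants", "./out/powerplants.json")
```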
---

## Using in third-party code like data pipelines

```python
from pathlib import Path
from omi.create import build_from_yaml


def build_oemetadata_callable(**context):
    base = Path("/project/metadata")
    out = Path("/project/metadata/out/powerplants.json")
    build_from_yaml(base, "powerplants", out)
    # optionally push to an Airflow XCom, publish, upload, etc.
```

---

## Testing tips

* For **unit tests** of `omi.create`, patch `omi.create.assemble_metadata_dict` / `assemble_many_metadata` and verify that files are written.
* For **integration tests**, put real example YAMLs under `tests/test_data/create/metadata/` and call `build_from_yaml` end-to-end.
---

## Troubleshooting

* **“Dataset YAML not found”**
  Check that `base_dir/datasets/<dataset_id>.dataset.yaml` exists, or supply the correct `index_file`.

* **Unicode characters appear escaped (`\u00a9`)**
  Ensure you’re not re-writing the JSON elsewhere with `ensure_ascii=True`.

* **Template not applied**
  Confirm your template file name matches `<dataset_id>.template.yaml` (or is correctly referenced from the index), and that the keys you expect to inherit aren’t already set in the resource (resource values win).

pyproject.toml

Lines changed: 3 additions & 0 deletions

```diff
@@ -78,3 +78,6 @@ unfixable = ["UP007", "I001"]
 "*/__init__.py" = [
     "D104",  # Missing docstring in public package
 ]
+
+[omi.scripts]
+omi = "omi.cli:main"
```

src/omi/cli.py

Lines changed: 102 additions & 13 deletions
```diff
@@ -1,29 +1,118 @@
 """
-Module that contains the command line app.
+Command line interface for OMI.
 
-Why does this file exist, and why not put this in __main__?
+This CLI only supports the split-files layout:
+    - datasets/<dataset_id>.dataset.yaml
+    - datasets/<dataset_id>.template.yaml   (optional)
+    - resources/<dataset_id>/*.resource.yaml
+(optionally wired via metadata_index.yaml)
 
-You might be tempted to import things from __main__ later, but that will cause
-problems: the code will get executed twice:
+Usage:
+    omi assemble \
+        --base-dir ./metadata \
+        --dataset-id powerplants \
+        --output-file ./out/powerplants.json \
+        --index-file ./metadata/metadata_index.yaml   # optional
 
-- When you run `python -m omi` python will execute
-  ``__main__.py`` as a script. That means there won't be any
-  ``omi.__main__`` in ``sys.modules``.
-- When you import __main__ it will get executed again (as a module) because
-  there's no ``omi.__main__`` in ``sys.modules``.
-
-Also see (1) from http://click.pocoo.org/5/setuptools/#setuptools-integration
 """
 
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Optional
+
 import click
 
+from omi.creation.creator import OEMetadataCreator
+from omi.creation.init import init_dataset, init_resources_from_files
+from omi.creation.utils import apply_template_to_resources, load_parts
+
 
 @click.group()
 def grp() -> None:
-    """Init click group."""
+    """OMI CLI."""
+
+
+@grp.command("assemble")
+@click.option(
+    "--base-dir",
+    required=True,
+    type=click.Path(file_okay=False, path_type=Path),
+    help="Root directory containing 'datasets/' and 'resources/'.",
+)
+@click.option("--dataset-id", required=True, help="Logical dataset id (e.g. 'powerplants').")
+@click.option(
+    "--output-file",
+    required=True,
+    type=click.Path(dir_okay=False, path_type=Path),
+    help="Path to write the generated OEMetadata JSON.",
+)
+@click.option(
+    "--index-file",
+    default=None,
+    type=click.Path(dir_okay=False, path_type=Path),
+    help="Optional metadata index YAML for explicit mapping.",
+)
+def assemble_cmd(base_dir: Path, dataset_id: str, output_file: Path, index_file: Optional[Path]) -> None:
+    """Assemble OEMetadata from split YAML files and write JSON to OUTPUT_FILE."""
+    # Load pieces
+    version, dataset, resources, template = load_parts(base_dir, dataset_id, index_file=index_file)
+    merged_resources = apply_template_to_resources(resources, template)
+
+    # Build & save with the correct spec version
+    creator = OEMetadataCreator(oem_version=version)
+    creator.save(dataset, merged_resources, output_file, ensure_ascii=False, indent=2)
+
+
+@click.group()
+def init() -> None:
+    """Scaffold OEMetadata split-files layout."""
+
+
+@init.command("dataset")
+@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
+@click.argument("dataset_id")
+@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
+@click.option("--resource", "resources", multiple=True, help="Initial resource names (repeatable).")
+@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
+def init_dataset_cmd(
+    base_dir: Path,
+    dataset_id: str,
+    oem_version: str,
+    resources: tuple[str, ...],
+    *,
+    overwrite: bool,
+) -> None:
+    """Initialize a split-files OEMetadata dataset layout under BASE_DIR."""
+    res = init_dataset(base_dir, dataset_id, oem_version=oem_version, resources=resources, overwrite=overwrite)
+    click.echo(f"dataset: {res.dataset_yaml}")
+    click.echo(f"template: {res.template_yaml}")
+    for p in res.resource_yamls:
+        click.echo(f"resource: {p}")
+
+
+@init.command("resources")
+@click.argument("base_dir", type=click.Path(file_okay=False, path_type=Path))
+@click.argument("dataset_id")
+@click.argument("files", nargs=-1, type=click.Path(exists=True, dir_okay=False, path_type=Path))
+@click.option("--oem-version", default="OEMetadata-2.0", show_default=True)
+@click.option("--overwrite", is_flag=True, help="Overwrite existing files.")
+def init_resources_cmd(
+    base_dir: Path,
+    dataset_id: str,
+    files: tuple[Path, ...],
+    oem_version: str,
+    *,
+    overwrite: bool,
+) -> None:
+    """Create resource YAML files for DATASET_ID from the given FILES."""
+    outs = init_resources_from_files(base_dir, dataset_id, files, oem_version=oem_version, overwrite=overwrite)
+    for p in outs:
+        click.echo(p)
 
 
-cli = click.CommandCollection(sources=[grp])
+# Keep CommandCollection for backwards compatibility with your entry point
+cli = click.CommandCollection(sources=[grp, init])
 
 
 def main() -> None:
```

src/omi/create.py

Lines changed: 75 additions & 0 deletions
```python
"""Entry point for OEMetadata creation (split-files layout only)."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Optional, Union

from omi.creation.assembler import assemble_many_metadata, assemble_metadata_dict


def build_from_yaml(
    base_dir: Union[str, Path],
    dataset_id: str,
    output_file: Union[str, Path],
    *,
    index_file: Optional[Union[str, Path]] = None,
) -> None:
    """
    Assemble one dataset and write the resulting OEMetadata JSON to a file.

    Parameters
    ----------
    base_dir : Union[str, Path]
        Base directory containing the split-files dataset structure.
    dataset_id : str
        The dataset ID to assemble.
    output_file : Union[str, Path]
        Path to write the resulting OEMetadata JSON file.
    index_file : Optional[Union[str, Path]], optional
        Optional path to an index file for resolving cross-dataset references,
        by default None.
    """
    md = assemble_metadata_dict(base_dir, dataset_id, index_file=index_file)
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    Path(output_file).write_text(json.dumps(md, indent=2, ensure_ascii=False), encoding="utf-8")


def build_many_from_yaml(
    base_dir: Union[str, Path],
    output_dir: Union[str, Path],
    *,
    dataset_ids: Optional[list[str]] = None,
    index_file: Optional[Union[str, Path]] = None,
) -> None:
    """
    Assemble multiple datasets and write each as <dataset_id>.json to output_dir.

    Parameters
    ----------
    base_dir : Union[str, Path]
        Base directory containing the split-files dataset structure.
    output_dir : Union[str, Path]
        Directory to write the resulting OEMetadata JSON files.
    dataset_ids : Optional[list[str]], optional
        Optional list of dataset IDs to assemble. If None, all datasets found
        in base_dir will be assembled, by default None.
    index_file : Optional[Union[str, Path]], optional
        Optional path to an index file for resolving cross-dataset references,
        by default None.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    results = assemble_many_metadata(
        base_dir,
        dataset_ids=dataset_ids,
        index_file=index_file,
        as_dict=True,  # keep it as a mapping id -> metadata
    )
    for ds_id, md in results.items():
        (out_dir / f"{ds_id}.json").write_text(
            json.dumps(md, indent=2, ensure_ascii=False),
            encoding="utf-8",
        )
```
