A tool to automatically generate Croissant metadata for datasets, starting with those hosted on PhysioNet.
Status: Alpha - Development
It is highly recommended to use a virtual environment to manage dependencies.
- Clone the repository:
  git clone https://github.com/MIT-LCP/croissant-maker.git
  cd croissant-maker
- Create and activate a virtual environment:
  # Create a venv
  python3 -m venv .venv
  # Activate the venv
  source .venv/bin/activate
- Install dependencies (make sure the venv is active):
  pip install -e '.[test]'
  This installs the package in editable mode along with testing requirements inside your virtual environment.
After installation, you can use the croissant-maker CLI:
croissant-maker --help
croissant-maker --input /path/to/dataset --creator "Your Name" --output my-metadata.jsonld

You can override default metadata fields:
croissant-maker --input /path/to/dataset \
--name "My Dataset" \
--description "A machine learning dataset" \
--creator "John Doe,john@example.com,https://john.com" \
--creator "Jane Smith,jane@example.com" \
--license "MIT" \
--citation "Doe et al. (2024). My Dataset."

| Flag | Description | Example | Required |
|---|---|---|---|
| --input, -i | Dataset directory | --input /data/my-dataset | Yes |
| --creator | Creator info (repeat for multiple) | --creator "Name,email,url" | Yes |
| --output, -o | Output file | --output metadata.jsonld | |
| --name | Dataset name | --name "MIMIC-IV Demo" | |
| --description | Dataset description | --description "Medical records" | |
| --license | License (SPDX ID or URL) | --license "MIT" | |
| --citation | Citation text | --citation "Author (2024)..." | |
| --url | Dataset homepage | --url "https://example.com" | |
| --dataset-version | Version | --dataset-version "1.0.0" | |
| --date-published | Publication date | --date-published "2023-12-15" | |
| --no-validate | Skip validation | --no-validate | |
| --count-csv-rows | Count exact row numbers for CSV files (slow for large datasets) | --count-csv-rows | |
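
The --creator examples above appear in three shapes: name only, name plus email, and name plus email plus URL. As a rough illustration of how such a comma-separated value could be split into labeled fields, here is a minimal sketch. The function name parse_creator and the returned dictionary keys are hypothetical and chosen for this example; this is not necessarily how croissant-maker parses the flag internally.

```python
from typing import Dict


def parse_creator(value: str) -> Dict[str, str]:
    """Split a "Name,email,url" string into labeled fields (hypothetical helper)."""
    parts = [part.strip() for part in value.split(",")]
    if not parts[0]:
        raise ValueError("creator name must not be empty")
    if len(parts) > 3:
        raise ValueError(
            f"expected at most 3 comma-separated fields, got {len(parts)}"
        )
    fields = ("name", "email", "url")
    # Pair each part with its field name; absent or empty fields are omitted.
    return {field: part for field, part in zip(fields, parts) if part}


if __name__ == "__main__":
    for raw in (
        "John Doe,john@example.com,https://john.com",
        "Jane Smith,jane@example.com",
        "Your Name",
    ):
        print(parse_creator(raw))
```

A simple split like this assumes that names do not themselves contain commas; a value such as "Doe, John" would need quoting or a different delimiter.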
Validation checks that the file can be loaded by mlcroissant and conforms to the basic structure of the specification.
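
To give a feel for what a basic structural check involves, the sketch below uses only the Python standard library to confirm that a file parses as JSON and carries a few top-level keys. The function name basic_structure_problems and the key list are assumptions made for this example; the list is deliberately smaller than what the Croissant specification requires, and this is not the check that croissant-maker or mlcroissant performs.

```python
import json
from pathlib import Path
from typing import List

# Assumed minimal key set for illustration only; the Croissant
# specification requires more properties than these.
REQUIRED_KEYS = ("@context", "@type", "name")


def basic_structure_problems(path: str) -> List[str]:
    """Return a list of problems found in a JSON-LD file (empty if none)."""
    try:
        doc = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot load file: {exc}"]
    if not isinstance(doc, dict):
        return ["top-level JSON value must be an object"]
    # Report every missing key rather than stopping at the first one.
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in doc]
```

An empty list means the file passed this rough pre-check only. The tool's own validation of an existing file is run from the command line: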
croissant-maker validate my-metadata.jsonld

# Run all tests
pytest -v
# Run specific test
pytest tests/test_cli.py::test_creator_formats -v

This project uses pre-commit with Ruff to automatically lint and format Python code, ensuring PEP 8 compliance and consistency before commits are made. Basic configuration file checks are also included.
Setup (run once after cloning and installing dev dependencies):
# (Ensure dev dependencies are installed: pip install -e '.[dev]')
pre-commit install

MIT License - see LICENSE file.