feat(cli): add Kaggle dataset integration and Croissant metadata parsing #11

closestfriend · 2026-01-09T06:47:52Z

Description

Add Kaggle dataset integration and Croissant (ML Commons) metadata parsing to streamline dataset-to-TOON workflows. This enables users to download Kaggle datasets and convert them to TOON format in a single command.

Features

New CLI flags:

--kaggle - Treat input as Kaggle dataset slug
--croissant - Parse input as Croissant JSON-LD metadata
--file / -f - Select specific file from multi-file datasets
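
For reviewers who want to picture how these flags fit together, here is a minimal argparse sketch. It is not the parser from this PR: `build_parser` and the `input` positional name are hypothetical, and only the three flags listed above are modeled.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the three flags listed above."""
    parser = argparse.ArgumentParser(prog="toon")
    # Positional input: a file path, a Kaggle slug, or a Croissant JSON-LD file.
    parser.add_argument("input")
    parser.add_argument("--kaggle", action="store_true",
                        help="Treat input as Kaggle dataset slug")
    parser.add_argument("--croissant", action="store_true",
                        help="Parse input as Croissant JSON-LD metadata")
    parser.add_argument("--file", "-f", dest="file", default=None,
                        help="Select specific file from multi-file datasets")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(
    ["username/dataset-name", "--kaggle", "-f", "data.csv"]
)
print(args.input, args.kaggle, args.croissant, args.file)
```

Whether `--kaggle` and `--croissant` should be mutually exclusive is left open here; the sketch accepts both.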

Usage examples:

# Download Kaggle dataset and convert to TOON
toon username/dataset-name --kaggle --stats

# Select specific file from dataset
toon username/dataset-name --kaggle --file data.csv

# Parse Croissant metadata to see schema
toon metadata.json --croissant
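
Since download_dataset() is described as going through the kaggle CLI, the `--kaggle` path can be pictured roughly as below. This is a sketch under assumptions, not the code in this PR: `build_download_command` and `fetch` are hypothetical names, and the command line uses the kaggle CLI's `datasets download` subcommand with its `-d`, `-p`, and `--unzip` options.

```python
import shutil
import subprocess
from pathlib import Path


def build_download_command(slug: str, dest: Path) -> list[str]:
    """Build the kaggle CLI invocation for one dataset (hypothetical helper)."""
    return ["kaggle", "datasets", "download", "-d", slug, "-p", str(dest), "--unzip"]


def fetch(slug: str, dest: Path) -> list[Path]:
    """Download a dataset into dest and return the files found there."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found on PATH")
    dest.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with a non-zero code.
    subprocess.run(build_download_command(slug, dest), check=True)
    return sorted(p for p in dest.rglob("*") if p.is_file())


command = build_download_command("username/dataset-name", Path("out"))
print(" ".join(command))
```

Only the command construction runs here; `fetch` needs the kaggle CLI and valid credentials.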

New Python API:

from toon import download_dataset, find_best_csv, parse_croissant, csv_to_records

# Download and process Kaggle dataset
files = download_dataset("username/dataset-name")
csv_file = find_best_csv(files)

# Parse Croissant metadata (`metadata` is the Croissant JSON-LD input, loaded beforehand)
info = parse_croissant(metadata)
print(info['schema'])
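
To make the `info['schema']` lookup concrete, here is a small sketch of pulling field names and types out of a Croissant document. It is an illustration, not parse_croissant() itself: `extract_schema` is a hypothetical name, the return shape is assumed, and it only reads the `recordSet`, `field`, `name`, and `dataType` keys from the Croissant format.

```python
from typing import Any


def extract_schema(metadata: dict[str, Any]) -> dict[str, Any]:
    """Collect field names and data types per record set (assumed output shape)."""
    schema: dict[str, list[dict[str, Any]]] = {}
    for record_set in metadata.get("recordSet", []):
        fields = []
        for field in record_set.get("field", []):
            data_type = field.get("dataType")
            # dataType may be a single string or a list of strings.
            if isinstance(data_type, list):
                data_type = ", ".join(data_type)
            fields.append({"name": field.get("name"), "type": data_type})
        schema[record_set.get("name", "")] = fields
    return {"name": metadata.get("name"), "schema": schema}


# Tiny hand-written Croissant-style document, for illustration only.
example = {
    "@type": "sc:Dataset",
    "name": "demo",
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "rows",
            "field": [
                {"@type": "cr:Field", "name": "age", "dataType": "sc:Integer"},
                {"@type": "cr:Field", "name": "city", "dataType": "sc:Text"},
            ],
        }
    ],
}

info = extract_schema(example)
print(info["schema"])
```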

Implementation

New module toon/kaggle.py provides:

download_dataset() - Download Kaggle datasets via kaggle CLI
find_best_csv() - Heuristic selection of main data file
csv_to_records() - CSV to list[dict] conversion
parse_croissant() - Extract schema from Croissant JSON-LD
croissant_to_summary() - Generate human-readable summaries
is_kaggle_slug() - Detect Kaggle dataset slug format
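
The helpers listed above can be pictured with the following sketch. These are stand-ins with hypothetical names (`looks_like_kaggle_slug`, `pick_largest_csv`, `read_csv_records`), not the functions in toon/kaggle.py. The slug pattern and the "largest CSV wins" heuristic are assumptions made for illustration; the real heuristic may differ.

```python
import csv
import re
from pathlib import Path

# Assumed slug shape: owner/dataset, using letters, digits, underscores, hyphens.
_SLUG_RE = re.compile(r"[A-Za-z0-9_-]+/[A-Za-z0-9_-]+")


def looks_like_kaggle_slug(text: str) -> bool:
    """True if text matches the assumed slug shape and is not an existing path."""
    return _SLUG_RE.fullmatch(text) is not None and not Path(text).exists()


def pick_largest_csv(files):
    """Return the largest .csv file by size, or None if there is no CSV."""
    csvs = [Path(f) for f in files if Path(f).suffix.lower() == ".csv"]
    return max(csvs, key=lambda p: p.stat().st_size, default=None)


def read_csv_records(path) -> list[dict[str, str]]:
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Excluding existing paths in the slug check is one way to avoid treating a local `dir/file` argument as a dataset; it is a design guess, not confirmed behavior.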

All imports are optional: the module degrades gracefully if the kaggle package is not installed.
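
One common way to get this kind of graceful degradation is to probe for the dependency without importing it and fail only when the feature is actually requested. The sketch below shows that pattern; `is_installed`, `kaggle_support_available`, and `require_kaggle` are hypothetical names and are not taken from this PR.

```python
import importlib.util
import shutil


def is_installed(package: str) -> bool:
    """Check whether a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package) is not None


def kaggle_support_available() -> bool:
    """True if the kaggle package or the kaggle executable is present."""
    return is_installed("kaggle") or shutil.which("kaggle") is not None


def require_kaggle() -> None:
    """Raise a clear error only when a Kaggle feature is actually used."""
    if not kaggle_support_available():
        raise RuntimeError(
            "Kaggle support requires the kaggle package: pip install kaggle"
        )
```

With this shape, plain TOON conversion never touches the Kaggle dependency, and only the `--kaggle` code path would call `require_kaggle()`.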

Type of Change

New feature (non-breaking change which adds functionality)

Testing

All tests pass
Added 12 new tests for Kaggle integration
Tested manually with real Kaggle datasets

Checklist

Code follows the project's style guidelines
Self-review completed
Documentation updated (CLI help, docstrings)
No new warnings or errors introduced

Add new --kaggle and --croissant CLI flags for streamlined dataset workflows:

- `toon username/dataset --kaggle` downloads and converts Kaggle datasets to TOON
- `toon metadata.json --croissant` parses ML Commons Croissant metadata
- `--file` flag to select specific files from multi-file datasets
- Auto-detection of Kaggle slugs (username/dataset-name format)

New module toon/kaggle.py provides:

- download_dataset(): Download Kaggle datasets via kaggle CLI
- find_best_csv(): Heuristic selection of main data file
- csv_to_records(): CSV to list[dict] conversion
- parse_croissant(): Extract schema from Croissant JSON-LD
- croissant_to_summary(): Generate human-readable dataset summaries

All functions are optional imports - gracefully degrades if kaggle package is not installed.

Includes comprehensive test suite (12 tests, 100% pass).
