Releases: trl-lab/SQaLe-Library
v0.1.3
Improvements
- Smarter deduplication when `limit` is set: instead of deduplicating the entire dataset upfront with `drop_duplicates`, the extractor now iterates lazily and stops as soon as enough unique schemas have been collected. This avoids loading and processing the full dataset when only a subset is needed.
- Progress bar for schema gathering: when `--limit` is provided, a dedicated `Gathering schemas` progress bar now tracks collection progress up to the limit, giving feedback during what can otherwise be a silent wait on large datasets.
- Informative message for full-dataset runs: when no limit is set, a message is printed before deduplication to set expectations ("No limit set — deduplicating the full dataset, this might take a bit...").
v0.1.2
v0.1.0
v0.1.0 — Initial Release
SQaLe is a Python utility for deserializing the SQaLe dataset into populated SQLite databases.
Features
- Load the SQaLe dataset directly from HuggingFace (trl-lab/SQaLe_2) or from local .parquet/.arrow files
- Deduplicate by schema ID and materialize each unique schema as a .db file
- Populate tables with the synthetic row data from the dataset
- CLI entry point: `sqale-extract --output ./dbs --limit 100`
- Importable Python API: `from sqale import deserialize_sqale`
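To make the "materialize and populate" step concrete, here is a minimal sketch that uses only the standard-library `sqlite3` module. It assumes a schema is available as DDL text and its rows as parameterized INSERT statements. The helper `materialize_schema`, the file name `schema_0001.db`, and the example table are all hypothetical; the release notes do not say how SQaLe stores or names these internally.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Iterable, Sequence, Tuple, Union


def materialize_schema(
    db_path: Union[str, Path],
    ddl: str,
    inserts: Iterable[Tuple[str, Sequence]],
) -> Path:
    """Create a SQLite file from DDL text, then run parameterized INSERTs (hypothetical helper)."""
    db_path = Path(db_path)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(ddl)  # create the tables
        for sql, params in inserts:
            conn.execute(sql, params)  # populate them row by row
        conn.commit()
    finally:
        conn.close()
    return db_path


# Demonstration in a temporary directory with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    db_file = materialize_schema(
        Path(tmp) / "dbs" / "schema_0001.db",
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);",
        [
            ("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Ada")),
            ("INSERT INTO users (id, name) VALUES (?, ?)", (2, "Linus")),
        ],
    )
    conn = sqlite3.connect(db_file)
    row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
```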
Requirements
Python ≥ 3.9
pandas, tqdm, pyarrow, datasets
Usage
pip install SQaLe
sqale-extract --output ./dbs
from sqale import deserialize_sqale
results = deserialize_sqale("trl-lab/SQaLe_2", output_dir="./dbs", limit=100)
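After a run, one way to sanity-check the output is to open each generated .db file and list its tables. The sketch below relies only on the standard library and assumes the files end up directly under the output directory with a `.db` extension. The helper name `list_tables` is made up for this example, and the structure of `results` is not documented here, so it is not used.

```python
import sqlite3
from pathlib import Path
from typing import List, Union


def list_tables(db_path: Union[str, Path]) -> List[str]:
    """Return the user-defined table names in a SQLite file, sorted alphabetically."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    out_dir = Path("./dbs")  # same directory passed as --output / output_dir
    if out_dir.is_dir():
        for db_file in sorted(out_dir.glob("*.db")):
            print(db_file.name, list_tables(db_file))
```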