Releases: trl-lab/SQaLe-Library

v0.1.3

09 Mar 21:58

Improvements

  • Smarter deduplication when limit is set — instead of deduplicating the entire dataset upfront with drop_duplicates, the extractor now iterates lazily and stops as soon as enough unique schemas have been collected. This avoids loading and processing the full dataset when only a subset is needed.
  • Progress bar for schema gathering — when --limit is provided, a dedicated Gathering schemas progress bar now tracks collection progress up to the limit, giving feedback during what can otherwise be a silent wait on large datasets.
  • Informative message for full-dataset runs — when no limit is set, a message is printed before deduplication to set expectations ("No limit set — deduplicating the full dataset, this might take a bit...").
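
The lazy strategy from the first item above can be sketched roughly as follows. This is a minimal illustration and not the library's actual code: the function name gather_unique_schemas, the schema_id key, and the dict-shaped rows are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Optional


def gather_unique_schemas(
    rows: Iterable[Dict],
    key: str = "schema_id",  # hypothetical column name, not confirmed by the source
    limit: Optional[int] = None,
) -> List[Dict]:
    """Keep the first row seen for each distinct key, stopping once `limit` is reached."""
    if limit is not None and limit <= 0:
        return []
    seen = set()
    unique: List[Dict] = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # duplicate schema, skip it
        seen.add(value)
        unique.append(row)  # a progress bar could be advanced here
        if limit is not None and len(unique) >= limit:
            break  # early exit: the rest of the iterable is never consumed
    return unique


def demo_rows(consumed: List[int]):
    """Generator standing in for a large dataset; records which rows were read."""
    for i in range(1_000_000):
        consumed.append(i)
        yield {"schema_id": i // 3, "table": f"t{i}"}


consumed: List[int] = []
first = gather_unique_schemas(demo_rows(consumed), limit=2)
print(len(first), "unique schemas after reading", len(consumed), "rows")
```

Because the input is consumed as an iterator, the loop reads only as many rows as it needs to find the requested number of unique schemas, which is the saving the release note describes compared with deduplicating everything upfront.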

v0.1.2

09 Mar 16:43

v0.1.1

Bug Fix

sqale-extract no longer requires --input to be specified. It now defaults to trl-lab/SQaLe_2, so running sqale-extract with no arguments works as expected.
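
For readers unfamiliar with how an optional flag gets a default, here is a hedged sketch using argparse. Only the flag names --input, --output, and --limit and the default value trl-lab/SQaLe_2 come from these notes; the parser layout, the argument types, and the build_parser helper are assumptions, and the real CLI may be implemented differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags mentioned in the release notes."""
    parser = argparse.ArgumentParser(prog="sqale-extract")
    # An option with a default is no longer required on the command line.
    parser.add_argument("--input", default="trl-lab/SQaLe_2",
                        help="dataset ID or local file (default: %(default)s)")
    # Types and defaults below are assumptions for illustration only.
    parser.add_argument("--output", default=None, help="output directory")
    parser.add_argument("--limit", type=int, default=None,
                        help="maximum number of unique schemas")
    return parser


args = build_parser().parse_args([])  # simulate running with no arguments
override = build_parser().parse_args(["--input", "local.parquet", "--limit", "5"])
print(args.input, override.input, override.limit)
```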

v0.1.0

09 Mar 16:33

v0.1.0 — Initial Release

SQaLe is a Python utility for deserializing the SQaLe dataset into populated SQLite databases.

Features

  • Load the SQaLe dataset directly from HuggingFace (trl-lab/SQaLe_2) or from local .parquet/.arrow files
  • Deduplicate by schema ID and materialize each unique schema as a .db file
  • Populate tables with the synthetic row data from the dataset
CLI entry point:

sqale-extract --output ./dbs --limit 100

Importable Python API:

from sqale import deserialize_sqale
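
To make "materialize each unique schema as a .db file" concrete, the sketch below writes one SQLite file from a schema's DDL and row data using only the standard library. It is an assumption-laden illustration: the materialize_schema helper, its parameters, and the shape of the inputs (a DDL string plus a mapping from table name to row tuples) are hypothetical and do not describe the dataset's real columns or the library's internals.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def materialize_schema(
    schema_id: str,
    ddl: str,
    rows_by_table: Dict[str, List[Tuple]],
    output_dir: str,
) -> Path:
    """Create <output_dir>/<schema_id>.db, run the DDL, then insert the rows."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    db_path = out / f"{schema_id}.db"
    if db_path.exists():
        db_path.unlink()  # start from a clean file on reruns
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(ddl)  # may contain several CREATE TABLE statements
        for table, rows in rows_by_table.items():
            if not rows:
                continue
            placeholders = ", ".join("?" for _ in rows[0])
            # Table names are interpolated, so they must come from trusted input.
            conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        conn.commit()
    finally:
        conn.close()
    return db_path


with tempfile.TemporaryDirectory() as tmp:
    path = materialize_schema(
        "demo_schema",
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);",
        {"users": [(1, "ada"), (2, "grace")]},
        tmp,
    )
    check = sqlite3.connect(path)
    count = check.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    check.close()
print(path.name, count)
```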

Requirements

Python ≥ 3.9
pandas, tqdm, pyarrow, datasets

Usage

pip install SQaLe
sqale-extract --output ./dbs

from sqale import deserialize_sqale

results = deserialize_sqale("trl-lab/SQaLe_2", output_dir="./dbs", limit=100)