Releases: Lightning-AI/litData

Weekly Release 0.2.60

28 Jan 14:26
016caed

What's Changed

  • fixed r2 refetch interval by @vlad-heidi in #777
  • Fix StreamingDataset len after drop_last update by @MagellaX in #778
  • chore(deps): update sphinx requirement from <7.0,>=6.0 to >=6.0,<9.0 by @dependabot[bot] in #763
  • chore(deps): bump pytest-rerunfailures from 16.0.1 to 16.1 by @dependabot[bot] in #764
  • chore(deps): bump the gha-updates group across 1 directory with 3 updates by @dependabot[bot] in #774
  • [pre-commit.ci] pre-commit suggestions by @pre-commit-ci[bot] in #779
  • fix: lint errors (UP007, UP045, UP006 & UP035) by @bhimrazy in #754
  • chore(deps): update coverage requirement from ==7.10.* to ==7.12.* by @dependabot[bot] in #762
  • chore: add & simplify concurrency setting to CI testing workflow by @bhimrazy in #780
  • Fix ParallelStreamingDataset with resume=True not resuming after loading a state dict when breaking early by @philgzl in #771
  • Bump SDK by @tchaton in #783
  • chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.6 to 4.8.0 in the gha-updates group by @dependabot[bot] in #782
  • feat(litdata): Better support for filestore & co by @tchaton in #785
  • chore(litdata): Pre-release version bump 0.2.60 by @tchaton in #786

Full Changelog: v0.2.59...v0.2.60

LitData v0.2.59

13 Dec 00:24
5913181

Lightning AI ⚡ is excited to announce the release of LitData v0.2.59

Changes

Added
  • add CHANGELOG.md to track project updates by @deependujha in #733
  • feat: add support to disable external version checks by @sanggusti in #737
  • feat: Add Python 3.14 zstd builtin support by @bhimrazy in #749
  • feat: add align_chunking option to preserve deterministic chunk boundaries across workers by @deependujha in #768
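
The PR title suggests align_chunking is a flag on optimize(); a minimal sketch under that assumption (the flag name comes from #768, but its exact placement and semantics here are assumptions, and the transform and output directory are hypothetical):

from litdata import optimize

def double(n):
    # hypothetical per-sample transform
    yield n * 2

if __name__ == "__main__":
    optimize(
        fn=double,
        inputs=list(range(1_000)),
        output_dir="aligned_output",  # hypothetical local output directory
        chunk_bytes="64MB",
        num_workers=4,
        # assumption: align_chunking preserves deterministic chunk
        # boundaries regardless of how inputs are split across workers
        align_chunking=True,
    )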
Changed
  • pin: torchaudio to >=2.7.0,<2.9 by @deependujha in #738
  • ref(test): remove torchaudio dependency and update audio processing to just use soundfile by @bhimrazy in #739
Fixed
  • fix(ci): failing link checks by @bhimrazy in #748
  • fix: ZstdError handling for Python <3.14 & >=3.14 compatibility by @bhimrazy in #767 (see the sketch after this list)
  • Fix ParallelStreamingDataset with resume=True not resuming after the second epoch when breaking early by @philgzl in #761
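
Python 3.14 promotes zstd into the standard library (PEP 784), so the exception class lives in a different module depending on the interpreter version. A minimal sketch of the general compatibility pattern, not necessarily LitData's exact internals (the zstandard fallback is an assumption):

import sys

if sys.version_info >= (3, 14):
    # Python 3.14+ ships zstd in the standard library (PEP 784)
    from compression.zstd import ZstdError, decompress
else:
    # assumption: older interpreters fall back to the zstandard package
    import zstandard
    from zstandard import ZstdError

    def decompress(data: bytes) -> bytes:
        return zstandard.ZstdDecompressor().decompress(data)

def safe_decompress(payload: bytes) -> bytes | None:
    # Return the decompressed bytes, or None if the payload is not valid zstd.
    try:
        return decompress(payload)
    except ZstdError:
        return None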
Chores
  • chore(deps): update transformers requirement from <4.53.0 to <4.57.0 by @dependabot[bot] in #723
  • chore(deps): bump lightning-sdk from 2025.8.1 to 2025.9.30 by @dependabot[bot] in #724
  • chore(deps): bump pytest-cov from 6.2.1 to 7.0.0 by @dependabot[bot] in #725
  • chore(deps): bump astral-sh/setup-uv from 6 to 7 in the gha-updates group by @dependabot[bot] in #735
  • chore(deps): update transformers requirement from <4.57.0 to <4.58.0 by @dependabot[bot] in #746
  • chore(deps): bump pytest-rerunfailures from 15.1 to 16.0.1 by @dependabot[bot] in #745
  • chore(deps): bump actions/download-artifact from 5 to 6 in the gha-updates group by @dependabot[bot] in #741
  • docs: add anchor links to feature sections in README for easy referencing by @VijayVignesh1 in #743
  • chore(ci): add Python 3.14 to the testing matrix by @bhimrazy in #747
  • chore: drop support for Python 3.9 (EOL) by @bhimrazy in #751
  • chore(deps): bump JamesIves/github-pages-deploy-action from 4.7.3 to 4.7.4 in the gha-updates group by @dependabot[bot] in #750

Full Changelog: v0.2.58...v0.2.59

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Thank you ❤️ and we hope you'll keep them coming!

Release 0.2.58

07 Oct 12:17
dad316e

Full Changelog: v0.2.57...v0.2.58

Release 0.2.57

06 Oct 20:31
695a314

Full Changelog: v0.2.56...v0.2.57

v0.2.56

23 Sep 02:49
df92bf8

Full Changelog: v0.2.55...v0.2.56

LitData v0.2.55

19 Sep 15:47
f990376

Lightning AI ⚡ is excited to announce the release of LitData v0.2.55

Highlights

[Fixed] Writing compressed data to a lightning_storage folder

This release focuses on fixing errors when writing compressed output data to a lightning_storage folder. Previously, a code snippet like the following would break.

from litdata import StreamingDataset, StreamingDataLoader, optimize
import time

def should_keep(data):
    if data % 2 == 0:
        yield data


if __name__ == "__main__":
    output_dir = "/teamspace/lightning_storage/my-folder-1/output"
    optimize(
        fn=should_keep,
        inputs=list(range(500)),
        output_dir=output_dir,
        chunk_bytes="64MB",
        num_workers=4,
        compression="zstd", # Previously, this would cause an error
    )
    time.sleep(20)  # wait briefly so the freshly written output is available before streaming
    dataset = StreamingDataset(output_dir)
    dataloader = StreamingDataLoader(dataset, batch_size=32, num_workers=4)
    for _ in dataloader:
        # process code here
        pass

Changes

Fixed
  • Fix errors when using compression and r2 in optimize() by @pwgardipee in #715
Chores
  • chore(ci): Add step to minimize uv cache in CI workflow by @bhimrazy in #713

Full Changelog: v0.2.54...v0.2.55

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@pwgardipee @bhimrazy

Thank you ❤️ and we hope you'll keep them coming!

LitData v0.2.54

10 Sep 14:02
b50d428

Lightning AI ⚡ is excited to announce the release of LitData v0.2.54

Highlights

Lightning AI Storage - Direct download

Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories get a significant performance increase because uploads and downloads happen directly against the bucket that backs the folder. LitData already supports existing folder types such as S3 and GCS folders; this release adds support for the recently launched lightning_storage folders.

For example, in the code below, data is downloaded directly from the my-bucket-1 Lightning Storage bucket.

from litdata import StreamingDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"

    dataset = StreamingDataset(data_dir)

    for sample in dataset:
        print(sample)

References to any of the following directories will work similarly:

  1. /teamspace/lightning_storage/...
  2. /teamspace/s3_connections/...
  3. /teamspace/gcs_connections/...
  4. /teamspace/s3_folders/...
  5. /teamspace/gcs_folders/...

Full Changelog: v0.2.53...v0.2.54

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@pwgardipee

Thank you ❤️ and we hope you'll keep them coming!

LitData v0.2.53

09 Sep 14:42
8a8e651

Lightning AI ⚡ is excited to announce the release of LitData v0.2.53

Highlights

Lightning AI Storage - Direct download and upload

Lightning Studios have special directories for data connections that are available to an entire teamspace. LitData functions that reference those directories get a significant performance increase because uploads and downloads happen directly against the bucket that backs the folder. LitData already supports existing folder types such as S3 and GCS folders; this release adds support for the recently launched lightning_storage folders.

For example, output artifacts from this code will be uploaded directly to the my-data-1 Lightning Storage bucket.

from litdata import optimize

def should_keep(data):
    if data % 2 == 0:
        yield data

if __name__ == "__main__":
    optimize(
        fn=should_keep,
        inputs=list(range(1000)),
        output_dir="/teamspace/lightning_storage/my-data-1/output",
        chunk_bytes="64MB",
        num_workers=1
    )

Similarly, in the code below, data is downloaded directly from the my-bucket-1 Lightning Storage bucket.

from litdata import StreamingRawDataset

if __name__ == "__main__":
    data_dir = "/teamspace/lightning_storage/my-bucket-1/data"

    raw_dataset = StreamingRawDataset(data_dir)

    data = list(raw_dataset)
    print(data)

References to any of the following directories will work similarly:

  1. /teamspace/lightning_storage/...
  2. /teamspace/s3_connections/...
  3. /teamspace/gcs_connections/...
  4. /teamspace/s3_folders/...
  5. /teamspace/gcs_folders/...

Changes

Added
  • Add support for resolving directories in /teamspace/lightning_storage by @bhimrazy in #695
  • Add support for direct upload to r2 buckets by @pwgardipee in #705
  • Add readme docs for references to data connection dirs by @pwgardipee in #708
Changed
  • Remove unnecessary fixed sleep by adding predicate-based path check by @Red-Eyed in #700
  • ref(resolver): Refactors data connection resolution by adding a helper function and eliminating code duplication. by @bhimrazy in #706
Chores
  • chore(deps): bump actions/first-interaction from 2 to 3 in the gha-updates group by @dependabot[bot] in #693
  • chore(deps): update coverage requirement from ==7.8.* to ==7.10.* by @dependabot[bot] in #701
  • chore(deps): bump pytest-random-order from 1.1.1 to 1.2.0 by @dependabot[bot] in #703
  • chore(deps): bump cryptography from 45.0.4 to 45.0.7 by @dependabot[bot] in #704
  • chore(deps): bump the gha-updates group with 3 updates by @dependabot[bot] in #707

Full Changelog: v0.2.52...v0.2.53

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@bhimrazy, @pwgardipee

Thank you ❤️ and we hope you'll keep them coming!

LitData v0.2.52

12 Aug 10:14
810ed23

Lightning AI ⚡ is excited to announce the release of LitData v0.2.52

Highlights

Grouping Support in StreamingRawDataset

StreamingRawDataset now supports flexible grouping of items during setup—ideal for pairing related files like images and masks.

from litdata import StreamingRawDataset
from litdata.raw import FileMetadata
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Group consecutive files in pairs, e.g. [[image_1, mask_1], ...]
        return [files[i : i + 2] for i in range(0, len(files), 2)]

dataset = CustomStreamingRawDataset("s3://bucket/files/")
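
Each dataset item then corresponds to one group rather than one individual file, so related files such as an image and its mask (under the pairing above) are delivered together.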

Remote Index Caching for Faster Startup

StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from the local cache first, then tries the remote cache, and rebuilds the index only if neither is available. Use recompute_index=True to force a rebuild.

from litdata import StreamingRawDataset

dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild

Shuffle Control Added to train_test_split

Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.

from litdata import train_test_split

train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)

Changes

Added
  • Added grouping functionality to StreamingRawDataset allowing flexible item structuring in setup method (#665 by @bhimrazy)
  • Added shuffle parameter to train_test_split (#675 by @otogamer)
  • Added CI workflow to check for broken links (#676 by @Vimal-Shady)
  • Added remote and local index caching in StreamingRawDataset to speed up dataset initialization with multi-level cache system (#666 by @bhimrazy)
Changed
  • Removed asyncio from requirements.txt since it’s included in the Python standard library (#670 by @deependujha)
  • Moved raw dataset code to litdata/raw, expose StreamingRawDataset at top-level (#671 by @bhimrazy)
  • Updated README with a story for the raw vs optimized streaming options (#677 by @bhimrazy)
Fixed
  • Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
  • Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
  • Clean up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
  • Ensure all tests run correctly with ignore pattern fix (#679 by @Borda)

Full Changelog: v0.2.51...v0.2.52

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy

Thank you ❤️ and we hope you'll keep them coming!

LitData v0.2.51

29 Jul 06:23
f88a139

Lightning AI ⚡ is excited to announce the release of LitData v0.2.51

Highlights

Stream Raw Datasets from Cloud Storage (Beta)

Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.

from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass

Benchmarks

Streaming speed for raw ImageNet (1.2M images) from cloud storage:

  Storage                 Images/s (no transform)   Images/s (with transform)
  AWS S3                  ~6,400 ± 100              ~3,200 ± 100
  Google Cloud Storage    ~5,650 ± 100              ~3,100 ± 100

Note: use StreamingRawDataset to stream data directly in its original format; use StreamingDataset with pre-optimized data for maximum speed.

Resume ParallelStreamingDataset

The ParallelStreamingDataset now supports a resume option, letting you continue from the previous epoch's state when cycling through datasets. Enable it with resume=True so each new epoch picks up where the last one left off instead of restarting at index 0, ensuring consistent sample progression across epochs.

from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass

Per-Dataset Batch Sizes in CombinedStreamingDataset

The CombinedStreamingDataset now supports per-dataset batch sizes when using batching_method="per_stream". Specify unique batch sizes for each dataset using set_batch_size() with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.

from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass

Changes

Added
  • Added support for setting the cache directory via the LITDATA_CACHE_DIR environment variable (#639 by @deependujha); see the sketch after this list
  • Added CLI option to clear default cache (#627 by @deependujha)
  • Added resume support to ParallelStreamingDataset (#650 by @philgzl)
  • Added verbose option to optimize_fn (#654 by @deependujha)
  • Added support for multiple transform_fn in StreamingDataset (#655 by @deependujha)
  • Enabled per-dataset batch size support in CombinedStreamingDataset (#635 by @MagellaX)
  • Added support for StreamingRawDataset to stream raw datasets from cloud storage (#652 by @bhimrazy)
  • Added GCP support for directory resolution in resolve_dir (#659 by @bhimrazy)
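
The LITDATA_CACHE_DIR variable just needs to be set before the dataset is constructed; a minimal sketch, assuming the variable is read at construction time (the local path and S3 URI are hypothetical):

import os

# Point LitData's chunk cache at a custom location; set this before the
# dataset is created so it is picked up when the dataset initializes.
os.environ["LITDATA_CACHE_DIR"] = "/mnt/scratch/litdata-cache"

from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/optimized/")
for sample in dataset:
    pass  # downloaded chunks are cached under /mnt/scratch/litdata-cache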
Changed
  • Cleaned up logic in _loop by removing hacky index assignment (#640 by @deependujha)
  • Updated CODEOWNERS (#646 by @Borda)
  • Switched to astral-sh/setup-uv for Python setup and used uv pip for package installation (#656 by @bhimrazy)
  • Replaced PIL with torchvision's decode_image for more robust JPEG deserialization (#660 by @bhimrazy)
Fixed
  • Fixed performance issue with StreamingDataLoader when using ≥5 workers on Parquet data (#616 by @bhimrazy)
  • Fixed performance bottleneck in train_test_split (#647 by @lukemerrick)
  • Fixed async handling in StreamingRawDataset (#661 by @bhimrazy)

Full Changelog: v0.2.50...v0.2.51

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy, @philgzl

Thank you ❤️ and we hope you'll keep them coming!