Release LitData v0.2.52 · Lightning-AI/litData

Lightning AI ⚡ is excited to announce the release of LitData v0.2.52

Highlights

Grouping Support in StreamingRawDataset

StreamingRawDataset now supports flexible grouping of items during setup, which is ideal for pairing related files such as images and masks.

from litdata import StreamingRawDataset
from litdata.raw import FileMetadata
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
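        # Return a flat list for single items, or a list of groups
        # such as [[image_1, mask_1], ...] to pair related files.
        return files  # placeholder: returns the files ungrouped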

dataset = CustomStreamingRawDataset("s3://bucket/files/")
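
The snippet above returns the files unchanged. As a rough sketch of what a grouping step could look like, the helper below pairs images with masks by file stem. It is not part of LitData: `FileEntry` is a hypothetical stand-in for `FileMetadata` (only a `path` attribute is assumed), `group_image_mask_pairs` is an illustrative name, and the `images/` and `masks/` folder layout is an assumption.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class FileEntry:
    """Hypothetical stand-in for litdata's FileMetadata; only `path` is assumed."""

    path: str


def group_image_mask_pairs(files):
    """Pair each image with the mask that shares its file stem.

    Assumes a layout like `.../images/<name>.jpg` and `.../masks/<name>.png`.
    Files without a partner are dropped.
    """
    images, masks = {}, {}
    for f in files:
        p = PurePosixPath(f.path)
        if p.parent.name == "images":
            images[p.stem] = f
        elif p.parent.name == "masks":
            masks[p.stem] = f
    # Sort by stem so the grouping is deterministic.
    return [[images[stem], masks[stem]] for stem in sorted(images) if stem in masks]


files = [
    FileEntry("s3://bucket/files/images/0001.jpg"),
    FileEntry("s3://bucket/files/masks/0001.png"),
    FileEntry("s3://bucket/files/images/0002.jpg"),
    FileEntry("s3://bucket/files/masks/0002.png"),
    FileEntry("s3://bucket/files/images/0003.jpg"),  # no matching mask, dropped
]
pairs = group_image_mask_pairs(files)
```

A custom `setup` could return the result of such a helper in place of `files`, provided the real metadata objects expose the file path in a comparable way.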

Remote Index Caching for Faster Startup

StreamingRawDataset now caches its file index both locally and remotely, which speeds up initialization for large cloud datasets. It checks the local cache first, then the remote cache, and rebuilds the index only if neither is available. Pass recompute_index=True to force a rebuild.

from litdata import StreamingRawDataset

dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild
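
The lookup order described above (local cache, then remote cache, then rebuild) can be illustrated with a small standalone sketch. This is not LitData's implementation: `load_index`, the JSON file format, and the plain dict standing in for remote storage are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_index(local_path, remote_store, key, build_index, recompute_index=False):
    """Return (index, source), trying local cache, then remote cache, then a rebuild.

    `remote_store` is a plain dict standing in for remote storage (hypothetical).
    """
    if not recompute_index:
        if local_path.exists():
            return json.loads(local_path.read_text()), "local"
        if key in remote_store:
            index = json.loads(remote_store[key])
            local_path.write_text(json.dumps(index))  # warm the local cache
            return index, "remote"
    index = build_index()  # slow path: list every file again
    payload = json.dumps(index)
    local_path.write_text(payload)  # refresh both cache levels
    remote_store[key] = payload
    return index, "rebuilt"


def list_files():
    """Hypothetical stand-in for an expensive remote listing."""
    return ["a.jpg", "b.jpg"]


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "index.json"
    remote = {}
    sources = [
        load_index(local, remote, "idx", list_files)[1],  # nothing cached yet
        load_index(local, remote, "idx", list_files)[1],  # local cache hit
    ]
    local.unlink()  # simulate a fresh machine with no local cache
    sources.append(load_index(local, remote, "idx", list_files)[1])  # remote hit
    sources.append(
        load_index(local, remote, "idx", list_files, recompute_index=True)[1]
    )  # forced rebuild
```

In this sketch a forced rebuild also overwrites both cache levels, so later runs pick up the fresh index; whether LitData does the same is not stated in these notes.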

Shuffle Control Added to `train_test_split`

Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.

from litdata import train_test_split

train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)
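
To show what shuffling before a split changes, here is a minimal standalone sketch that operates on plain indices. It does not reproduce how LitData splits a streaming dataset internally; `split_indices` and its `seed` argument are hypothetical names used only for this illustration.

```python
import random


def split_indices(n, splits, shuffle=False, seed=42):
    """Split range(n) into consecutive parts sized by the given fractions."""
    indices = list(range(n))
    if shuffle:
        # A seeded generator keeps the shuffled split reproducible.
        random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for fraction in splits:
        size = int(n * fraction)
        parts.append(indices[start:start + size])
        start += size
    return parts


# Without shuffling, each split is a contiguous block in the original order.
ordered_train, ordered_test = split_indices(10, [0.8, 0.2])

# With shuffling, items are mixed first, then cut into the same sizes.
shuffled_train, shuffled_test = split_indices(10, [0.8, 0.2], shuffle=True)
```

Shuffling matters most when the source data is ordered, for example sorted by class or by date, because an unshuffled split would then put a skewed slice into the test set.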

Changes

Added

Added grouping functionality to StreamingRawDataset allowing flexible item structuring in setup method (#665 by @bhimrazy)
Added shuffle parameter to train_test_split (#675 by @otogamer)
Added CI workflow to check for broken links (#676 by @Vimal-Shady)
Added remote and local index caching in StreamingRawDataset to speed up dataset initialization with multi-level cache system (#666 by @bhimrazy)

Changed

Removed asyncio from requirements.txt since it’s included in the Python standard library (#670 by @deependujha)
Moved raw dataset code to litdata/raw and exposed StreamingRawDataset at top level (#671 by @bhimrazy)
Updated README with stories for raw vs optimized streaming options (#677 by @bhimrazy)

Fixed

Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
Cleaned up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
Ensured all tests run correctly with ignore pattern fix (#679 by @Borda)

Chores

Bumped lightning-sdk from 0.1.46 to 2025.8.1 (#668 by @dependabot[bot])
Bumped pytest-rerunfailures from 14.0 to 15.1 (#667 by @dependabot[bot])
Bumped pytest-cov from 6.1.1 to 6.2.1 (#669 by @dependabot[bot])
Bumped the gha-updates group with 2 updates (#690 by @dependabot[bot])
Bumped litdata version to 0.2.52 (#691 by @bhimrazy)

Full Changelog: v0.2.51...v0.2.52

🧑‍💻 Contributors

We thank all folks who submitted issues, features, fixes and doc changes. It's the only way we can collectively make LitData better for everyone, nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy

New Contributors

@Vimal-Shady made their first contribution in #674
@otogamer made their first contribution in #675
@jwills made their first contribution in #683

Thank you ❤️ and we hope you'll keep them coming!
