LitData v0.2.52

@bhimrazy bhimrazy released this 12 Aug 10:14
810ed23

Lightning AI ⚡ is excited to announce the release of LitData v0.2.52

Highlights

Grouping Support in StreamingRawDataset

StreamingRawDataset now supports flexible grouping of items during setup, which is useful for pairing related files such as images and masks.

from litdata import StreamingRawDataset
from litdata.raw import FileMetadata
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Return the files unchanged (as here), or return groups of related
        # files instead, e.g. pairs such as [[image_1, mask_1], ...]
        return files

dataset = CustomStreamingRawDataset("s3://bucket/files/")
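
As a rough sketch of what such a grouping step could look like, the helper below pairs each image with its mask by shared file stem. It uses a stand-in `FileRecord` dataclass so that it runs without LitData installed. The `_mask` suffix convention, the helper name `pair_images_with_masks`, and the assumption that `FileMetadata` exposes a comparable `path` attribute are illustrative and not taken from the release.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class FileRecord:
    """Stand-in for litdata's FileMetadata (assumed to expose a `path`)."""
    path: str


def pair_images_with_masks(files, mask_suffix="_mask"):
    """Group files into [image, mask] pairs keyed by shared file stem.

    Hypothetical naming convention: `cat.jpg` pairs with `cat_mask.png`.
    Files without a partner are dropped.
    """
    images, masks = {}, {}
    for f in files:
        stem = PurePosixPath(f.path).stem
        if stem.endswith(mask_suffix):
            masks[stem[: -len(mask_suffix)]] = f
        else:
            images[stem] = f
    # Sort keys so the grouping is deterministic across runs.
    return [[images[k], masks[k]] for k in sorted(images) if k in masks]
```

Inside a `setup` override, returning the result of a helper like this instead of `files` would yield one group per dataset item.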

Remote Index Caching for Faster Startup

StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from the local cache first, then tries the remote cache, and rebuilds the index only if neither is available. Use recompute_index=True to force a rebuild.

from litdata import StreamingRawDataset

dataset = StreamingRawDataset("s3://bucket/files/")  # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild
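
The lookup order described above (local cache, then remote cache, then rebuild) can be sketched in plain Python. This is not LitData's implementation: `load_index`, `fetch_remote`, and `rebuild` are hypothetical names, the index is stored as JSON purely for illustration, and uploading a rebuilt index back to remote storage is left out.

```python
import json
import os


def load_index(local_path, fetch_remote, rebuild, recompute_index=False):
    """Return (index, source), preferring the cheapest available source.

    `fetch_remote` and `rebuild` are hypothetical callables: the first
    returns a cached index from remote storage (or None if absent), the
    second lists the files from scratch.
    """
    if not recompute_index:
        # 1. Local cache: cheapest, no network access.
        if os.path.exists(local_path):
            with open(local_path) as fh:
                return json.load(fh), "local"
        # 2. Remote cache: one download instead of a full listing.
        index = fetch_remote()
        if index is not None:
            _save(local_path, index)
            return index, "remote"
    # 3. Rebuild: list everything, then refresh the local cache.
    index = rebuild()
    _save(local_path, index)
    return index, "rebuilt"


def _save(path, index):
    """Write the index to the local cache file as JSON."""
    with open(path, "w") as fh:
        json.dump(index, fh)
```

Passing `recompute_index=True` skips both cache lookups, which mirrors the role of the flag in the example above.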

Shuffle Control Added to train_test_split

Splitting streaming datasets is now more flexible: train_test_split accepts a new shuffle parameter, so you can choose whether the dataset is shuffled before it is split into training, validation, and test sets.

from litdata import train_test_split

train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)
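
To illustrate why shuffling before the split matters, here is a minimal sketch on plain integer indices. It is not how LitData splits a streaming dataset internally; `split_indices` and its `seed` parameter are hypothetical names used only for this example.

```python
import random


def split_indices(n_items, splits, shuffle=False, seed=42):
    """Split range(n_items) into consecutive parts sized by `splits`.

    With shuffle=False the parts keep the original order, so later
    parts always come from the tail of the dataset; with shuffle=True
    the indices are permuted before they are cut into parts.
    """
    if sum(splits) > 1.0 + 1e-9:
        raise ValueError("splits must sum to at most 1")
    indices = list(range(n_items))
    if shuffle:
        # A seeded generator keeps the permutation reproducible.
        random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for fraction in splits:
        size = int(n_items * fraction)
        parts.append(indices[start:start + size])
        start += size
    return parts
```

If the underlying data is ordered (for example, sorted by class or by date), the unshuffled split would put only the last items into the test set, which is the situation the shuffle option helps avoid.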

Changes

Added
  • Added grouping functionality to StreamingRawDataset allowing flexible item structuring in setup method (#665 by @bhimrazy)
  • Added shuffle parameter to train_test_split (#675 by @otogamer)
  • Added CI workflow to check for broken links (#676 by @Vimal-Shady)
  • Added remote and local index caching in StreamingRawDataset to speed up dataset initialization with multi-level cache system (#666 by @bhimrazy)
Changed
  • Removed asyncio from requirements.txt since it’s included in the Python standard library (#670 by @deependujha)
  • Moved raw dataset code to litdata/raw and exposed StreamingRawDataset at the top level (#671 by @bhimrazy)
  • Updated README with stories for raw vs optimized streaming option (#677 by @bhimrazy)
Fixed
  • Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
  • Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
  • Cleaned up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
  • Ensured all tests run correctly with ignore pattern fix (#679 by @Borda)
Chores

Full Changelog: v0.2.51...v0.2.52

🧑‍💻 Contributors

We thank everyone who submitted issues, features, fixes, and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy

New Contributors

Thank you ❤️ and we hope you'll keep them coming!