LitData v0.2.52
Lightning AI ⚡ is excited to announce the release of LitData v0.2.52.
Highlights
Grouping Support in StreamingRawDataset
StreamingRawDataset now supports flexible grouping of items during setup—ideal for pairing related files like images and masks.
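To make the image/mask pairing concrete, here is a minimal standalone sketch of grouping logic that a setup override could return. It assumes each file entry exposes a path attribute and that a mask shares its image's file stem plus a `_mask` suffix; both the `FileStub` class and the `pair_images_with_masks` helper are hypothetical names used only for illustration and are not part of litdata.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class FileStub:
    """Hypothetical stand-in for a file entry; only `path` is used here."""

    path: str


def pair_images_with_masks(files, mask_suffix="_mask"):
    """Return [[image, mask], ...] for files whose stems match."""
    images, masks = {}, {}
    for f in files:
        stem = PurePosixPath(f.path).stem
        if stem.endswith(mask_suffix):
            # Index the mask under the stem of the image it belongs to.
            masks[stem[: -len(mask_suffix)]] = f
        else:
            images[stem] = f
    # Keep only images that have a mask; sort keys for a stable order.
    return [[images[k], masks[k]] for k in sorted(images) if k in masks]


files = [
    FileStub("s3://bucket/files/cat.png"),
    FileStub("s3://bucket/files/cat_mask.png"),
    FileStub("s3://bucket/files/dog.png"),
    FileStub("s3://bucket/files/dog_mask.png"),
    FileStub("s3://bucket/files/orphan.png"),  # no mask, so it is dropped
]
pairs = pair_images_with_masks(files)
```

Files without a partner are silently dropped in this sketch; a real dataset might prefer to raise or log instead. The release's own skeleton for overriding setup follows.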
from litdata import StreamingRawDataset
from litdata.raw import FileMetadata
from typing import Union

class CustomStreamingRawDataset(StreamingRawDataset):
    def setup(self, files: list[FileMetadata]) -> Union[list[FileMetadata], list[list[FileMetadata]]]:
        # Example: group files in pairs [[image_1, mask_1], ...]
        return files

dataset = CustomStreamingRawDataset("s3://bucket/files/")
Remote Index Caching for Faster Startup
StreamingRawDataset now caches its file index both locally and remotely, speeding up initialization for large cloud datasets. It loads from local cache first, then tries remote cache, and rebuilds only if needed. Use recompute_index=True to force rebuild.
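The lookup order described above (local cache, then remote cache, then rebuild) can be sketched generically. This is only an illustration of the fallback pattern, not litdata's implementation: `load_index` and the three loader callables are hypothetical, and only the `recompute_index` flag name is taken from the release notes.

```python
from typing import Callable, Optional


def load_index(
    load_local: Callable[[], Optional[list]],
    load_remote: Callable[[], Optional[list]],
    rebuild: Callable[[], list],
    recompute_index: bool = False,
):
    """Return (index, source) following local -> remote -> rebuild."""
    if not recompute_index:
        for source, loader in (("local", load_local), ("remote", load_remote)):
            index = loader()
            if index is not None:
                # First cache level that has an index wins.
                return index, source
    # Either both caches missed or a rebuild was forced.
    return rebuild(), "rebuilt"


# Hypothetical loaders: the local cache misses, the remote cache hits.
index, source = load_index(lambda: None, lambda: ["a.png"], lambda: ["a.png", "b.png"])
forced, forced_source = load_index(
    lambda: None, lambda: ["a.png"], lambda: ["a.png", "b.png"], recompute_index=True
)
```

In the first call the remote cache supplies the index; in the second, the forced rebuild bypasses both caches.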
from litdata import StreamingRawDataset
dataset = StreamingRawDataset("s3://bucket/files/") # Loads cached index if available
dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)  # Force rebuild
Shuffle Control Added to train_test_split
Splitting your streaming datasets is now more flexible with the new shuffle parameter. You can choose whether to shuffle your dataset before splitting, giving you better control over how your training, testing, and validation sets are created.
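To show what shuffling before a split changes, here is a conceptual sketch on a plain Python list. It is not how litdata splits streaming datasets internally; `split_items` and its `seed` argument are hypothetical and exist only to show that the split sizes stay the same while membership changes.

```python
import random


def split_items(items, splits, shuffle=False, seed=42):
    """Split items into consecutive parts sized by the given fractions."""
    items = list(items)
    if shuffle:
        # A seeded generator keeps the shuffled order repeatable across runs.
        random.Random(seed).shuffle(items)
    parts, start = [], 0
    for fraction in splits:
        count = int(len(items) * fraction)
        parts.append(items[start : start + count])
        start += count
    return parts


ordered_train, ordered_test = split_items(range(10), [0.8, 0.2])
shuffled_train, shuffled_test = split_items(range(10), [0.8, 0.2], shuffle=True)
```

Without shuffling, the test split is always the tail of the data, which can bias evaluation when items are stored in a meaningful order.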
from litdata import train_test_split
train_ds, test_ds = train_test_split(streaming_dataset, splits=[0.8, 0.2], shuffle=True)
Changes
Added
- Added grouping functionality to StreamingRawDataset, allowing flexible item structuring in the setup method (#665 by @bhimrazy)
- Added shuffle parameter to train_test_split (#675 by @otogamer)
- Added CI workflow to check for broken links (#676 by @Vimal-Shady)
- Added remote and local index caching in StreamingRawDataset to speed up dataset initialization with a multi-level cache system (#666 by @bhimrazy)
Changed
Fixed
- Fixed broken 'Get Started' link in README (#674 by @Vimal-Shady)
- Fixed and enabled parallel test execution with pytest-xdist in CI workflow (#620 by @deependujha)
- Clean up leftover chunk lock files by prefix during Reader delete operation (#683 by @jwills)
- Ensure all tests run correctly with ignore pattern fix (#679 by @Borda)
Chores
- Bumped lightning-sdk from 0.1.46 to 2025.8.1 (#668 by @dependabot[bot])
- Bumped pytest-rerunfailures from 14.0 to 15.1 (#667 by @dependabot[bot])
- Bumped pytest-cov from 6.1.1 to 6.2.1 (#669 by @dependabot[bot])
- Bumped the gha-updates group with 2 updates (#690 by @dependabot[bot])
- Bumped litdata version to 0.2.52 (#691 by @bhimrazy)
Full Changelog: v0.2.51...v0.2.52
🧑‍💻 Contributors
We thank everyone who submitted issues, features, fixes, and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!
Key Contributors
@deependujha, @Borda, @bhimrazy
New Contributors
- @Vimal-Shady made their first contribution in #674
- @otogamer made their first contribution in #675
- @jwills made their first contribution in #683
Thank you ❤️ and we hope you'll keep them coming!