
Commit e2135bc

docs: add anchor links to feature sections in README for easy referencing (Lightning-AI#743)
1 parent a2238e7 commit e2135bc

File tree

1 file changed (+33, -33 lines)

README.md

Lines changed: 33 additions & 33 deletions
@@ -241,7 +241,7 @@ ld.map(
## Features for optimizing and streaming datasets for model training

<details>
- <summary>✅ Stream raw datasets from cloud storage (beta)</summary>
+ <summary> ✅ Stream raw datasets from cloud storage (beta) <a id="stream-raw" href="#stream-raw">🔗</a> </summary>
&nbsp;

Effortlessly stream raw files (images, text, etc.) directly from S3, GCS, and Azure cloud storage without any optimization or conversion. Ideal for workflows requiring instant access to original data in its native format.
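
For reference, a minimal sketch of the feature this anchor links to, based on the `StreamingRawDataset` call visible in the next hunk context (the bucket path is a placeholder, and items are assumed to arrive as raw file bytes):

```python
from litdata import StreamingRawDataset

# Placeholder bucket path; files are streamed as-is, with no optimization step.
dataset = StreamingRawDataset("s3://bucket/files/")

for raw_bytes in dataset:
    ...  # decode the image/text exactly as it was stored
```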
@@ -317,7 +317,7 @@ dataset = StreamingRawDataset("s3://bucket/files/", recompute_index=True)
</details>

<details>
- <summary> ✅ Stream large cloud datasets</summary>
+ <summary> ✅ Stream large cloud datasets <a id="stream-large" href="#stream-large">🔗</a> </summary>
&nbsp;

Use data stored on the cloud without needing to download it all to your computer, saving time and space.
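
A minimal sketch of streaming an optimized cloud dataset (placeholder bucket path):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Points at a dataset previously written with litdata.optimize(); placeholder path.
dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # chunks are downloaded lazily as iteration progresses
```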
@@ -367,7 +367,7 @@ dataset = StreamingDataset('s3://my-bucket/my-data', cache_dir="/path/to/cache")
</details>

<details>
- <summary> ✅ Stream Hugging Face 🤗 datasets</summary>
+ <summary> ✅ Stream Hugging Face 🤗 datasets <a id="stream-hf" href="#stream-hf">🔗</a> </summary>

&nbsp;
@@ -480,7 +480,7 @@ Below is the benchmark for the `Imagenet dataset (155 GB)`, demonstrating that *
</details>

<details>
- <summary> ✅ Streams on multi-GPU, multi-node</summary>
+ <summary> ✅ Streams on multi-GPU, multi-node <a id="multi-gpu" href="#multi-gpu">🔗</a> </summary>

&nbsp;
@@ -512,7 +512,7 @@ for batch in val_dataloader:
</details>

<details>
- <summary> ✅ Stream from multiple cloud providers</summary>
+ <summary> ✅ Stream from multiple cloud providers <a id="cloud-providers" href="#cloud-providers">🔗</a> </summary>

&nbsp;
@@ -570,7 +570,7 @@ dataset = ld.StreamingDataset("azure://my-bucket/my-data", storage_options=azure
</details>

<details>
- <summary> ✅ Pause, resume data streaming</summary>
+ <summary> ✅ Pause, resume data streaming <a id="pause-resume" href="#pause-resume">🔗</a> </summary>
&nbsp;

Stream data during long training, if interrupted, pick up right where you left off without any issues.
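
A minimal sketch of pausing and resuming via the loader state (placeholder path; the checkpoint filename is arbitrary):

```python
import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data")  # placeholder path
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch_idx, batch in enumerate(dataloader):
    if batch_idx == 1000:
        # Save the loader position, e.g. right before a preemption or shutdown.
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
        break

# Later: restore the position and keep iterating from where training stopped.
dataloader.load_state_dict(torch.load("dataloader_state.pt"))
for batch in dataloader:
    ...
```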
@@ -604,7 +604,7 @@ for batch_idx, batch in enumerate(dataloader):
<details>
- <summary> ✅ Use shared queue for Optimizing</summary>
+ <summary> ✅ Use shared queue for Optimizing <a id="shared-queue" href="#shared-queue">🔗</a> </summary>
&nbsp;

If you are using multiple workers to optimize your dataset, you can use a shared queue to speed up the process.
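
A sketch of the idea only: the `keep_data_ordered` flag below is an assumption of this example (how a shared work queue is typically enabled in recent LitData versions), not something shown in this diff:

```python
from litdata import optimize

def pre_process(index):
    return index, index ** 2

if __name__ == "__main__":
    optimize(
        fn=pre_process,
        inputs=list(range(1_000)),
        output_dir="fast_optimized_data",  # placeholder output path
        chunk_bytes="64MB",
        num_workers=4,
        keep_data_ordered=False,  # assumed flag: workers pull items from one shared queue
    )
```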
@@ -661,7 +661,7 @@ if __name__ == "__main__":
<details>
- <summary> ✅ Use a <code>Queue</code> as input for optimizing data</summary>
+ <summary> ✅ Use a <code>Queue</code> as input for optimizing data <a id="queue-input" href="#queue-input">🔗</a> </summary>
&nbsp;

Sometimes you don’t have a static list of inputs to optimize — instead, you have a stream of data coming in over time. In such cases, you can use a multiprocessing.Queue to feed data into the optimize() function.
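
A rough sketch of that flow. Passing the queue through `inputs` and using `None` as an end-of-stream sentinel are assumptions of this example, not details taken from the diff:

```python
import multiprocessing as mp
from litdata import optimize

def process(item):
    return item  # placeholder transformation

def producer(queue):
    for i in range(1_000):
        queue.put(i)   # data arriving over time
    queue.put(None)    # assumed end-of-stream sentinel

if __name__ == "__main__":
    queue = mp.Queue()
    feeder = mp.Process(target=producer, args=(queue,))
    feeder.start()

    optimize(
        fn=process,
        inputs=queue,                        # assumed: queue in place of a static list
        output_dir="optimized_from_queue",   # placeholder path
        chunk_bytes="64MB",
        num_workers=2,
    )
    feeder.join()
```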
@@ -718,7 +718,7 @@ if __name__ == "__main__":
<details>
- <summary> ✅ LLM Pre-training </summary>
+ <summary> ✅ LLM Pre-training <a id="llm-training" href="#llm-training">🔗</a> </summary>
&nbsp;

LitData is highly optimized for LLM pre-training. First, we need to tokenize the entire dataset and then we can consume it.
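
A condensed sketch of that two-step flow (tokenize with `optimize`, then stream fixed-size token blocks). The `TokensLoader` import path, the toy tokenizer, and the chunk/block sizes are assumptions of this sketch:

```python
import numpy as np
from litdata import optimize, StreamingDataset, StreamingDataLoader
from litdata.streaming.item_loader import TokensLoader  # assumed import path

def tokenize(filepath):
    # Toy tokenizer for illustration only; swap in your real tokenizer.
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            yield np.asarray([ord(c) for c in line], dtype=np.uint16)

if __name__ == "__main__":
    # Step 1: tokenize the corpus into an optimized token stream.
    optimize(
        fn=tokenize,
        inputs=["shard-0.txt", "shard-1.txt"],  # placeholder text shards
        output_dir="optimized_tokens",          # placeholder path
        chunk_size=2049 * 1024,                 # tokens per chunk (assumed value)
        item_loader=TokensLoader(),
    )

    # Step 2: consume contiguous blocks of tokens during pre-training.
    dataset = StreamingDataset("optimized_tokens", item_loader=TokensLoader(block_size=2048))
    dataloader = StreamingDataLoader(dataset, batch_size=8)
```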
@@ -781,7 +781,7 @@ for batch in tqdm(train_dataloader):
</details>

<details>
- <summary> ✅ Filter illegal data </summary>
+ <summary> ✅ Filter illegal data <a id="filter-data" href="#filter-data">🔗</a> </summary>
&nbsp;

Sometimes, you have bad data that you don't want to include in the optimized dataset. With LitData, yield only the good data sample to include.
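
A minimal sketch: the optimize function is written as a generator, and samples failing a hypothetical `is_valid` check are simply never yielded, so they never reach the optimized output:

```python
from litdata import optimize

def is_valid(text):
    # Hypothetical validity check; replace with your own filtering rule.
    return len(text.strip()) > 0

def keep_only_valid(filepath):
    with open(filepath, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    if is_valid(text):
        yield text  # bad samples are skipped by simply not yielding them

if __name__ == "__main__":
    optimize(
        fn=keep_only_valid,
        inputs=["a.txt", "b.txt"],   # placeholder inputs
        output_dir="clean_dataset",  # placeholder path
        chunk_bytes="64MB",
    )
```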
@@ -843,7 +843,7 @@ if __name__ == "__main__":
</details>

<details>
- <summary> ✅ Combine datasets</summary>
+ <summary> ✅ Combine datasets <a id="combine-datasets" href="#combine-datasets">🔗</a> </summary>
&nbsp;

Mix and match different sets of data to experiment and create better models.
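
A minimal sketch of mixing two streaming datasets with sampling weights (placeholder paths):

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

dset_a = StreamingDataset("s3://my-bucket/dataset-a")  # placeholder paths
dset_b = StreamingDataset("s3://my-bucket/dataset-b")

# Draw roughly 70% of samples from A and 30% from B.
combined = CombinedStreamingDataset(datasets=[dset_a, dset_b], weights=[0.7, 0.3], seed=42)
dataloader = StreamingDataLoader(combined, batch_size=64)
```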
@@ -915,7 +915,7 @@ combined_dataset = CombinedStreamingDataset(
</details>

<details>
- <summary> ✅ Parallel streaming</summary>
+ <summary> ✅ Parallel streaming <a id="parallel-streaming" href="#parallel-streaming">🔗</a> </summary>
&nbsp;

While `CombinedDataset` allows to fetch a sample from one of the datasets it wraps at each iteration, `ParallelStreamingDataset` can be used to fetch a sample from all the wrapped datasets at each iteration:
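
A minimal sketch, assuming `ParallelStreamingDataset` is importable from the top-level `litdata` package and that each iteration yields one sample per wrapped dataset:

```python
from litdata import StreamingDataset, ParallelStreamingDataset

dset_1 = StreamingDataset("s3://my-bucket/images")    # placeholder paths
dset_2 = StreamingDataset("s3://my-bucket/captions")

parallel_dataset = ParallelStreamingDataset([dset_1, dset_2])

for sample_1, sample_2 in parallel_dataset:
    ...  # one sample from each wrapped dataset per step
```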
@@ -965,7 +965,7 @@ parallel_dataset = ParallelStreamingDataset([dset_1, dset_2], transform=transfor
</details>

<details>
- <summary> ✅ Cycle datasets</summary>
+ <summary> ✅ Cycle datasets <a id="cycle-datasets" href="#cycle-datasets">🔗</a> </summary>
&nbsp;

`ParallelStreamingDataset` can also be used to cycle a `StreamingDataset`. This allows to dissociate the epoch length from the number of samples in the dataset.
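
A minimal sketch of cycling; the `length` argument is taken from the next hunk context, everything else is a placeholder:

```python
from litdata import StreamingDataset, ParallelStreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data")  # placeholder path

# One "epoch" now spans 1_000 samples, cycling through the data as needed.
cycled = ParallelStreamingDataset([dataset], length=1_000)
print(len(cycled))  # 1000
```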
@@ -992,7 +992,7 @@ You can even set `length` to `float("inf")` for an infinite dataset!
</details>

<details>
- <summary> ✅ Merge datasets</summary>
+ <summary> ✅ Merge datasets <a id="merge-datasets" href="#merge-datasets">🔗</a> </summary>
&nbsp;

Merge multiple optimized datasets into one.
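
A minimal sketch (placeholder directories, both assumed to contain datasets written with `litdata.optimize`):

```python
from litdata import merge_datasets

if __name__ == "__main__":
    merge_datasets(
        input_dirs=["optimized_part_1", "optimized_part_2"],
        output_dir="optimized_merged",
    )
```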
@@ -1027,7 +1027,7 @@ if __name__ == "__main__":
</details>

<details>
- <summary> ✅ Transform datasets while Streaming</summary>
+ <summary> ✅ Transform datasets while Streaming <a id="transform-streaming" href="#transform-streaming">🔗</a> </summary>
&nbsp;

Transform datasets on-the-fly while streaming them, allowing for efficient data processing without the need to store intermediate results.
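
The next hunk context references a `StreamingDatasetWithTransform` subclass; a minimal sketch of that pattern (the transform itself is a placeholder):

```python
from litdata import StreamingDataset

class StreamingDatasetWithTransform(StreamingDataset):
    """Apply a transform to each sample as it is streamed."""

    def __getitem__(self, index):
        sample = super().__getitem__(index)
        return self.transform(sample)

    def transform(self, sample):
        return sample  # placeholder: resize, tokenize, normalize, ...

dataset = StreamingDatasetWithTransform("s3://my-bucket/my-data", shuffle=True)  # placeholder path
```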
@@ -1083,7 +1083,7 @@ dataset = StreamingDatasetWithTransform(data_dir, cache_dir=str(cache_dir), shuf
</details>

<details>
- <summary> ✅ Split datasets for train, val, test</summary>
+ <summary> ✅ Split datasets for train, val, test <a id="split-datasets" href="#split-datasets">🔗</a> </summary>

&nbsp;
@@ -1112,7 +1112,7 @@ print(test_dataset)
</details>

<details>
- <summary> ✅ Load a subset of the remote dataset</summary>
+ <summary> ✅ Load a subset of the remote dataset <a id="load-subset" href="#load-subset">🔗</a> </summary>

&nbsp;
Work on a smaller, manageable portion of your data to save time and resources.
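
A minimal sketch; the `subsample` argument is an assumption of this example (placeholder path):

```python
from litdata import StreamingDataset

# Stream roughly 10% of the remote dataset instead of all of it.
dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.1)
print(len(dataset))
```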
@@ -1130,7 +1130,7 @@ print(len(dataset)) # display the length of your data
</details>

<details>
- <summary> ✅ Upsample from your source datasets </summary>
+ <summary> ✅ Upsample from your source datasets <a id="upsample-datasets" href="#upsample-datasets">🔗</a> </summary>

&nbsp;
Use to control the size of one iteration of a StreamingDataset using repeats. Contains `floor(N)` possibly shuffled copies of the source data, then a subsampling of the remainder.
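
A minimal sketch; using a `subsample` value greater than 1 for upsampling is an assumption of this example (placeholder path):

```python
from litdata import StreamingDataset

# One iteration now sees ~2.5x the source data: 2 full (possibly shuffled) copies
# of the dataset plus a random half of it.
dataset = StreamingDataset("s3://my-bucket/my-data", subsample=2.5, shuffle=True)
print(len(dataset))
```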
@@ -1148,7 +1148,7 @@ print(len(dataset)) # display the length of your data
</details>

<details>
- <summary> ✅ Easily modify optimized cloud datasets</summary>
+ <summary> ✅ Easily modify optimized cloud datasets <a id="modify-datasets" href="#modify-datasets">🔗</a> </summary>
&nbsp;

Add new data to an existing dataset or start fresh if needed, providing flexibility in data management.
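
A minimal sketch of appending to an existing optimized dataset; the `mode` values follow the `overwrite` behaviour described in the next hunk context (placeholder paths):

```python
from litdata import optimize

def fn(index):
    return index

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(100, 200)),         # the new samples to add
        output_dir="s3://my-bucket/my-data",  # existing optimized dataset (placeholder)
        chunk_bytes="64MB",
        mode="append",  # use "overwrite" to delete the existing data and start fresh
    )
```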
@@ -1189,7 +1189,7 @@ The `overwrite` mode will delete the existing data and start from fresh.
</details>

<details>
- <summary> ✅ Stream parquet datasets</summary>
+ <summary> ✅ Stream parquet datasets <a id="stream-parquet" href="#stream-parquet">🔗</a> </summary>
&nbsp;

Stream Parquet datasets directly with LitData—no need to convert them into LitData’s optimized binary format! If your dataset is already in Parquet format, you can efficiently index and stream it using `StreamingDataset` and `StreamingDataLoader`.
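
A rough sketch only: the `ParquetLoader` import path and its use as `item_loader` are assumptions of this example, not details confirmed by the diff (placeholder path):

```python
from litdata import StreamingDataset, StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader  # assumed import path

# Point directly at a directory of Parquet files; no conversion step required.
dataset = StreamingDataset("s3://my-bucket/parquet-data", item_loader=ParquetLoader())
dataloader = StreamingDataLoader(dataset, batch_size=64)

for sample in dataloader:
    ...
```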
@@ -1248,7 +1248,7 @@ for sample in dataloader:
</details>

<details>
- <summary> ✅ Use compression</summary>
+ <summary> ✅ Use compression <a id="compression" href="#compression">🔗</a> </summary>
&nbsp;

Reduce your data footprint by using advanced compression algorithms.
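
A minimal sketch of writing a zstd-compressed dataset, matching the zstd reference in the next hunk context (placeholder path):

```python
from litdata import optimize

def fn(index):
    return {"index": index, "payload": list(range(100))}

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(1_000)),
        output_dir="compressed_dataset",  # placeholder path
        chunk_bytes="64MB",
        compression="zstd",               # chunks are compressed before being written
    )
```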
@@ -1281,7 +1281,7 @@ Using [zstd](https://github.com/facebook/zstd), you can achieve high compression
</details>

<details>
- <summary> ✅ Access samples without full data download</summary>
+ <summary> ✅ Access samples without full data download <a id="access-samples" href="#access-samples">🔗</a> </summary>
&nbsp;

Look at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.
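
A minimal sketch of random access (placeholder path); only the chunk containing the requested sample is fetched:

```python
from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data")  # placeholder path

print(len(dataset))  # number of samples, read from the dataset index
print(dataset[42])   # downloads just the chunk holding sample 42
```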
@@ -1299,7 +1299,7 @@ print(dataset[42]) # show the 42th element of the dataset
</details>

<details>
- <summary> ✅ Use any data transforms</summary>
+ <summary> ✅ Use any data transforms <a id="data-transforms" href="#data-transforms">🔗</a> </summary>
&nbsp;

Customize how your data is processed to better fit your needs.
@@ -1327,7 +1327,7 @@ for batch in dataloader:
</details>

<details>
- <summary> ✅ Profile data loading speed</summary>
+ <summary> ✅ Profile data loading speed <a id="profile-loading" href="#profile-loading">🔗</a> </summary>
&nbsp;

Measure and optimize how fast your data is being loaded, improving efficiency.
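
A sketch only: the `profile_batches` argument is an assumption of this example; the next hunk context confirms that profiling produces a Chrome trace named `result.json`:

```python
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data")  # placeholder path

# Assumed argument: record the first few batches and write a Chrome trace (result.json),
# which can then be opened in chrome://tracing.
dataloader = StreamingDataLoader(dataset, batch_size=64, profile_batches=5)

for batch in dataloader:
    pass
```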
@@ -1345,7 +1345,7 @@ This generates a Chrome trace called `result.json`. Then, visualize this trace b
</details>

<details>
- <summary> ✅ Reduce memory use for large files</summary>
+ <summary> ✅ Reduce memory use for large files <a id="reduce-memory" href="#reduce-memory">🔗</a> </summary>
&nbsp;

Handle large data files efficiently without using too much of your computer's memory.
@@ -1383,7 +1383,7 @@ outputs = optimize(
</details>

<details>
- <summary> ✅ Limit local cache space</summary>
+ <summary> ✅ Limit local cache space <a id="limit-cache" href="#limit-cache">🔗</a> </summary>
&nbsp;

Limit the amount of disk space used by temporary files, preventing storage issues.
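
A one-line sketch, matching the `max_cache_size` call shown in the next hunk context (placeholder path):

```python
from litdata import StreamingDataset

# Older chunks are evicted once the local cache exceeds 10 GB.
dataset = StreamingDataset("s3://my-bucket/my-data", max_cache_size="10GB")
```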
@@ -1399,7 +1399,7 @@ dataset = StreamingDataset(..., max_cache_size="10GB")
</details>

<details>
- <summary> ✅ Change cache directory path</summary>
+ <summary> ✅ Change cache directory path <a id="cache-directory" href="#cache-directory">🔗</a> </summary>
&nbsp;

Specify the directory where cached files should be stored, ensuring efficient data retrieval and management. This is particularly useful for organizing your data storage and improving access times.
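
A sketch based on the `Dir(path=..., url=...)` call in the next hunk context; the `Dir` import path is an assumption of this example:

```python
from litdata import StreamingDataset
from litdata.streaming.resolver import Dir  # assumed import path

cache_dir = "/path/to/cache"         # where chunks should be cached locally
data_dir = "s3://my-bucket/my-data"  # remote optimized dataset (placeholder)

dataset = StreamingDataset(input_dir=Dir(path=cache_dir, url=data_dir))
```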
@@ -1417,7 +1417,7 @@ dataset = StreamingDataset(input_dir=Dir(path=cache_dir, url=data_dir))
</details>

<details>
- <summary> ✅ Optimize loading on networked drives</summary>
+ <summary> ✅ Optimize loading on networked drives <a id="networked-drives" href="#networked-drives">🔗</a> </summary>
&nbsp;

Optimize data handling for computers on a local network to improve performance for on-site setups.
@@ -1433,7 +1433,7 @@ dataset = StreamingDataset(input_dir="local:/data/shared-drive/some-data")
</details>

<details>
- <summary> ✅ Optimize dataset in distributed environment</summary>
+ <summary> ✅ Optimize dataset in distributed environment <a id="distributed-optimization" href="#distributed-optimization">🔗</a> </summary>
&nbsp;

Lightning can distribute large workloads across hundreds of machines in parallel. This can reduce the time to complete a data processing task from weeks to minutes by scaling to enough machines.
@@ -1475,7 +1475,7 @@ print(dataset[:])
</details>

<details>
- <summary> ✅ Encrypt, decrypt data at chunk/sample level</summary>
+ <summary> ✅ Encrypt, decrypt data at chunk/sample level <a id="encrypt-decrypt" href="#encrypt-decrypt">🔗</a> </summary>
&nbsp;

Secure data by applying encryption to individual samples or chunks, ensuring sensitive information is protected during storage.
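
A rough sketch only: the `FernetEncryption` import path and constructor arguments are assumptions of this example, not details confirmed by the diff:

```python
from litdata import optimize, StreamingDataset
from litdata.utilities.encryption import FernetEncryption  # assumed import path

def fn(index):
    return {"index": index}

if __name__ == "__main__":
    # level="sample" encrypts each sample individually; "chunk" would encrypt whole chunks.
    fernet = FernetEncryption(password="pass", level="sample")

    optimize(
        fn=fn,
        inputs=list(range(100)),
        output_dir="encrypted_dataset",  # placeholder path
        chunk_bytes="64MB",
        encryption=fernet,
    )

    # The same encryption object is required to read the data back.
    dataset = StreamingDataset("encrypted_dataset", encryption=fernet)
```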
@@ -1544,7 +1544,7 @@ This allows the data to remain secure while maintaining flexibility in the encry
</details>

<details>
- <summary> ✅ Debug & Profile LitData with logs & Litracer</summary>
+ <summary> ✅ Debug & Profile LitData with logs & Litracer <a id="debug-profile" href="#debug-profile">🔗</a> </summary>

&nbsp;
@@ -1612,7 +1612,7 @@ if __name__ == "__main__":
</details>

<details>
- <summary> ✅ Lightning AI Data Connections - Direct download and upload </summary>
+ <summary> ✅ Lightning AI Data Connections - Direct download and upload <a id="lightning-connections" href="#lightning-connections">🔗</a> </summary>

&nbsp;
@@ -1666,7 +1666,7 @@ References to any of the following directories will work similarly:
## Features for transforming datasets

<details>
- <summary> ✅ Parallelize data transformations (map)</summary>
+ <summary> ✅ Parallelize data transformations (map) <a id="map" href="#map">🔗</a> </summary>
&nbsp;

Apply the same change to different parts of the dataset at once to save time and effort.
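
A minimal sketch of `map`, which the first hunk context (`ld.map(`) also references; the per-item function signature and paths are assumptions of this example:

```python
from litdata import map as ld_map

def resize_image(image_path, output_dir):
    # Hypothetical per-item job: read one input file and write a processed copy
    # into output_dir. Each call handles exactly one element of `inputs`.
    ...

if __name__ == "__main__":
    ld_map(
        fn=resize_image,
        inputs=["img_0.jpeg", "img_1.jpeg"],  # placeholder inputs
        output_dir="s3://my-bucket/resized",  # placeholder output location
        num_workers=4,
    )
```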
