## Features for optimizing and streaming datasets for model training
<details>
<summary>✅ Stream raw datasets from cloud storage (beta) <a id="stream-raw" href="#stream-raw">🔗</a></summary>

Effortlessly stream raw files (images, text, etc.) directly from S3, GCS, and Azure cloud storage without any optimization or conversion. Ideal for workflows requiring instant access to original data in its native format.
</details>

<details>
<summary>✅ Pause, resume data streaming <a id="pause-resume" href="#pause-resume">🔗</a></summary>

Stream data during long training runs; if interrupted, pick up right where you left off without any issues.
</details>

<details>
<summary>✅ Use shared queue for optimizing <a id="shared-queue" href="#shared-queue">🔗</a></summary>

If you are using multiple workers to optimize your dataset, you can use a shared queue to speed up the process.
</details>

<details>
<summary>✅ Use a <code>Queue</code> as input for optimizing data <a id="queue-input" href="#queue-input">🔗</a></summary>

Sometimes you don’t have a static list of inputs to optimize; instead, you have a stream of data coming in over time. In such cases, you can use a `multiprocessing.Queue` to feed data into the `optimize()` function.
</details>

While `CombinedDataset` fetches a sample from one of the datasets it wraps at each iteration, `ParallelStreamingDataset` fetches a sample from all the wrapped datasets at each iteration.

`ParallelStreamingDataset` can also be used to cycle a `StreamingDataset`. This decouples the epoch length from the number of samples in the dataset.
You can even set `length` to `float("inf")` for an infinite dataset!

<details>
<summary>✅ Split datasets for train, val, test <a id="split-datasets" href="#split-datasets">🔗</a></summary>

Split a dataset into train, validation, and test subsets in a single call.
</details>
<details>
<summary>✅ Load a subset of the remote dataset <a id="load-subset" href="#load-subset">🔗</a></summary>

Work on a smaller, manageable portion of your data to save time and resources.
</details>
<details>
<summary>✅ Upsample from your source datasets <a id="upsample-datasets" href="#upsample-datasets">🔗</a></summary>

Control the size of one iteration of a `StreamingDataset` using repeats: the result contains `floor(N)` possibly shuffled copies of the source data, followed by a subsample of the remainder.
</details>

<details>
<summary>✅ Stream Parquet datasets</summary>

Stream Parquet datasets directly with LitData—no need to convert them into LitData’s optimized binary format! If your dataset is already in Parquet format, you can efficiently index and stream it using `StreamingDataset` and `StreamingDataLoader`.
</details>
<details>
<summary>✅ Use compression <a id="compression" href="#compression">🔗</a></summary>

Reduce your data footprint by using advanced compression algorithms.
Using [zstd](https://github.com/facebook/zstd), you can achieve high compression ratios.

</details>
<details>
<summary>✅ Access samples without full data download <a id="access-samples" href="#access-samples">🔗</a></summary>

Look at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.
</details>
<details>
<summary>✅ Use any data transforms <a id="data-transforms" href="#data-transforms">🔗</a></summary>

Customize how your data is processed to better fit your needs.
</details>
<details>
<summary>✅ Profile data loading speed <a id="profile-loading" href="#profile-loading">🔗</a></summary>

Measure and optimize how fast your data is being loaded, improving efficiency.
This generates a Chrome trace called `result.json`. You can then visualize it by opening `chrome://tracing` in a Chrome browser and loading the file.

</details>
<details>
<summary>✅ Reduce memory use for large files <a id="reduce-memory" href="#reduce-memory">🔗</a></summary>

Handle large data files efficiently without using too much of your computer's memory.
</details>
<details>
<summary>✅ Limit local cache space <a id="limit-cache" href="#limit-cache">🔗</a></summary>

Limit the amount of disk space used by temporary files, preventing storage issues.
Specify the directory where cached files should be stored, ensuring efficient data retrieval and management. This is particularly useful for organizing your data storage and improving access times.
</details>

<details>
<summary>✅ Optimize dataset in distributed environment <a id="distributed-optimization" href="#distributed-optimization">🔗</a></summary>

Lightning can distribute large workloads across hundreds of machines in parallel. This can reduce the time to complete a data processing task from weeks to minutes by scaling to enough machines.
</details>
<details>
<summary>✅ Encrypt, decrypt data at chunk/sample level <a id="encrypt-decrypt" href="#encrypt-decrypt">🔗</a></summary>

Secure data by applying encryption to individual samples or chunks, ensuring sensitive information is protected during storage.
This allows the data to remain secure while maintaining flexibility in the encryption process.

</details>
<details>
<summary>✅ Debug & Profile LitData with logs & Litracer</summary>

</details>