Commit 6e4a409
Feat: Update indexing of parquet dataset and also add streaming support to huggingface datasets (#505)
* moved to constants
* moved constants
* update hf downloader
* updated writer if file obj
* updated num workers
* add existence check for chunk file before loading in ParquetLoader
* add close method to ParquetLoader for memory management
* fix closing of parquet chunks
* refactor: replace shutil.copy2 with shutil.copyfile
* update preload
* update documentation for default_cache_dir function
* added test case for hf downloader
* update test cases for parquet
* update index hf dataset
* updated parquet writer
* added test case for index hf dataset
* validate item_loader type for hf datasets and improve error handling
* add support for ParquetLoader in StreamingDataset tests
* simplified the parquet indexing process from different file services
* update num workers
* cleanup
* update the order
* add validation for low memory mode with ParquetLoader in StreamingDataset
* update params
* update item loader for low memory usage
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* update naming conventions
* fix type error
* fix type errors
* fix patch
* add read count
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent a2b2570 commit 6e4a409
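For readers unfamiliar with the two functions named in the `shutil.copy2` to `shutil.copyfile` refactor above: `copy2` copies file contents and also attempts to preserve metadata such as modification time, while `copyfile` copies contents only. The sketch below illustrates that difference with the standard library. It is not code from this commit, and the helper name `copy_both_ways` and the file names are hypothetical.

```python
import os
import shutil
import tempfile
import time


def copy_both_ways(workdir: str) -> dict:
    """Copy one file with copy2 and with copyfile, then report the results."""
    src = os.path.join(workdir, "chunk-0.bin")  # hypothetical file name
    with open(src, "wb") as f:
        f.write(b"example bytes")

    # Backdate the source by an hour so a metadata-preserving copy
    # can be told apart from a contents-only copy.
    old = time.time() - 3600
    os.utime(src, (old, old))

    dst_copy2 = os.path.join(workdir, "via_copy2.bin")
    dst_copyfile = os.path.join(workdir, "via_copyfile.bin")
    shutil.copy2(src, dst_copy2)        # contents plus metadata (e.g. mtime)
    shutil.copyfile(src, dst_copyfile)  # contents only

    with open(dst_copy2, "rb") as a, open(dst_copyfile, "rb") as b:
        same_bytes = a.read() == b.read()

    return {
        "src_mtime": os.path.getmtime(src),
        "copy2_mtime": os.path.getmtime(dst_copy2),
        "copyfile_mtime": os.path.getmtime(dst_copyfile),
        "same_bytes": same_bytes,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for key, value in copy_both_ways(tmp).items():
            print(key, value)
```

Both destinations hold identical bytes; only the `copy2` destination carries over the backdated modification time. When only the data matters, as is often the case for cached files, `copyfile` does less work.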
File tree
13 files changed, +536 -294 lines changed
- src/litdata
  - processing
  - streaming
  - utilities
- tests
  - processing
  - streaming