Less zip false positives #5640

lhoestq · 2023-03-15T16:48:59Z

zipfile.is_zipfile returns false positives for some Parquet files. This causes errors when loading certain Parquet datasets, because some of their files are wrongly treated as ZIP files.

This is a known issue: python/cpython#72680

At first I wanted to rely only on magic numbers, but then I found that someone contributed a fix to is_zipfile. Do you think we should use it or not, @albertvillanova?

IMO it's ok to rely on magic numbers only for now, since in streaming mode we've had no issue checking only the magic number so far.
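For readers following along, here is a rough sketch of what a magic-number-only check could look like. It is not the code in this PR: `ZIP_MAGIC_NUMBERS` and `starts_with_zip_magic` are hypothetical names, and the payload is a hand-made stand-in for a Parquet file whose tail happens to resemble a ZIP end-of-central-directory record.

```python
import io
import zipfile

# Common ZIP signatures: local file header, empty archive, spanned archive.
# (Assumed list for illustration; not taken from the PR diff.)
ZIP_MAGIC_NUMBERS = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def starts_with_zip_magic(data: bytes) -> bool:
    """Hypothetical helper: True if the buffer begins with a ZIP signature."""
    return data.startswith(ZIP_MAGIC_NUMBERS)


def make_real_zip() -> bytes:
    """Build a small, genuine ZIP archive in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello")
    return buf.getvalue()


# Stand-in for a Parquet file: starts with the Parquet magic "PAR1" but
# ends with bytes shaped like a ZIP end-of-central-directory record.
# Depending on the Python version, zipfile.is_zipfile may accept this.
fake_parquet = b"PAR1" + b"\x00" * 64 + b"PK\x05\x06" + b"\x00" * 18
real_zip = make_real_zip()

print("magic check, real zip:    ", starts_with_zip_magic(real_zip))
print("magic check, fake parquet:", starts_with_zip_magic(fake_parquet))
print("is_zipfile, fake parquet: ", zipfile.is_zipfile(io.BytesIO(fake_parquet)))
```

The trade-off is that a check on the first bytes only will miss archives with data prepended to them, such as self-extracting executables.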

Close #5639

github-actions · 2023-03-15T16:50:59Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006998 / 0.011353 (-0.004355)	0.005093 / 0.011008 (-0.005916)	0.100490 / 0.038508 (0.061982)	0.032736 / 0.023109 (0.009627)	0.297738 / 0.275898 (0.021840)	0.322255 / 0.323480 (-0.001225)	0.005583 / 0.007986 (-0.002402)	0.004007 / 0.004328 (-0.000321)	0.075863 / 0.004250 (0.071613)	0.044212 / 0.037052 (0.007159)	0.300033 / 0.258489 (0.041544)	0.341997 / 0.293841 (0.048156)	0.036172 / 0.128546 (-0.092374)	0.012176 / 0.075646 (-0.063471)	0.356052 / 0.419271 (-0.063220)	0.050438 / 0.043533 (0.006905)	0.294677 / 0.255139 (0.039538)	0.318050 / 0.283200 (0.034850)	0.104733 / 0.141683 (-0.036950)	1.435681 / 1.452155 (-0.016474)	1.534793 / 1.492716 (0.042076)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.242815 / 0.018006 (0.224809)	0.565983 / 0.000490 (0.565494)	0.006800 / 0.000200 (0.006600)	0.000124 / 0.000054 (0.000070)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026548 / 0.037411 (-0.010863)	0.104816 / 0.014526 (0.090290)	0.116222 / 0.176557 (-0.060335)	0.172143 / 0.737135 (-0.564992)	0.121631 / 0.296338 (-0.174707)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.400126 / 0.215209 (0.184917)	4.004538 / 2.077655 (1.926883)	1.798822 / 1.504120 (0.294702)	1.595191 / 1.541195 (0.053996)	1.645777 / 1.468490 (0.177287)	0.705643 / 4.584777 (-3.879134)	3.750887 / 3.745712 (0.005175)	2.136547 / 5.269862 (-3.133315)	1.475881 / 4.565676 (-3.089795)	0.086921 / 0.424275 (-0.337354)	0.012379 / 0.007607 (0.004771)	0.505824 / 0.226044 (0.279779)	5.052364 / 2.268929 (2.783435)	2.279983 / 55.444624 (-53.164641)	1.932253 / 6.876477 (-4.944224)	2.051359 / 2.142072 (-0.090714)	0.851906 / 4.805227 (-3.953321)	0.169566 / 6.500664 (-6.331098)	0.064600 / 0.075469 (-0.010869)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.165859 / 1.841788 (-0.675929)	15.049950 / 8.074308 (6.975642)	14.095981 / 10.191392 (3.904589)	0.151779 / 0.680424 (-0.528645)	0.017537 / 0.534201 (-0.516664)	0.420164 / 0.579283 (-0.159119)	0.418932 / 0.434364 (-0.015432)	0.488749 / 0.540337 (-0.051588)	0.582359 / 1.386936 (-0.804577)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007426 / 0.011353 (-0.003927)	0.005248 / 0.011008 (-0.005761)	0.074118 / 0.038508 (0.035610)	0.034223 / 0.023109 (0.011114)	0.337780 / 0.275898 (0.061882)	0.376300 / 0.323480 (0.052820)	0.006142 / 0.007986 (-0.001843)	0.004246 / 0.004328 (-0.000083)	0.074177 / 0.004250 (0.069926)	0.052698 / 0.037052 (0.015646)	0.340229 / 0.258489 (0.081740)	0.396172 / 0.293841 (0.102331)	0.037293 / 0.128546 (-0.091253)	0.012514 / 0.075646 (-0.063132)	0.087144 / 0.419271 (-0.332128)	0.051922 / 0.043533 (0.008390)	0.333188 / 0.255139 (0.078049)	0.355420 / 0.283200 (0.072220)	0.110273 / 0.141683 (-0.031410)	1.447826 / 1.452155 (-0.004329)	1.561135 / 1.492716 (0.068419)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.269203 / 0.018006 (0.251197)	0.551997 / 0.000490 (0.551508)	0.001558 / 0.000200 (0.001359)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029511 / 0.037411 (-0.007900)	0.108614 / 0.014526 (0.094089)	0.123438 / 0.176557 (-0.053118)	0.171596 / 0.737135 (-0.565539)	0.126828 / 0.296338 (-0.169511)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.420520 / 0.215209 (0.205310)	4.175672 / 2.077655 (2.098017)	1.982220 / 1.504120 (0.478101)	1.788575 / 1.541195 (0.247381)	1.860840 / 1.468490 (0.392349)	0.706730 / 4.584777 (-3.878047)	3.858718 / 3.745712 (0.113005)	3.069389 / 5.269862 (-2.200472)	1.827603 / 4.565676 (-2.738073)	0.087893 / 0.424275 (-0.336382)	0.012613 / 0.007607 (0.005006)	0.524177 / 0.226044 (0.298132)	5.177077 / 2.268929 (2.908148)	2.494397 / 55.444624 (-52.950227)	2.189484 / 6.876477 (-4.686992)	2.217626 / 2.142072 (0.075554)	0.846326 / 4.805227 (-3.958901)	0.176558 / 6.500664 (-6.324106)	0.065018 / 0.075469 (-0.010451)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.268618 / 1.841788 (-0.573170)	15.132711 / 8.074308 (7.058403)	14.585530 / 10.191392 (4.394138)	0.163454 / 0.680424 (-0.516970)	0.017442 / 0.534201 (-0.516759)	0.421746 / 0.579283 (-0.157537)	0.425412 / 0.434364 (-0.008952)	0.499178 / 0.540337 (-0.041159)	0.595458 / 1.386936 (-0.791478)

HuggingFaceDocBuilderDev · 2023-03-15T16:52:47Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-03-15T17:11:58Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007980 / 0.011353 (-0.003373)	0.005414 / 0.011008 (-0.005594)	0.099226 / 0.038508 (0.060718)	0.035442 / 0.023109 (0.012332)	0.304851 / 0.275898 (0.028952)	0.337144 / 0.323480 (0.013664)	0.006162 / 0.007986 (-0.001823)	0.004151 / 0.004328 (-0.000177)	0.074708 / 0.004250 (0.070458)	0.049690 / 0.037052 (0.012638)	0.307658 / 0.258489 (0.049168)	0.358472 / 0.293841 (0.064631)	0.037181 / 0.128546 (-0.091365)	0.012259 / 0.075646 (-0.063387)	0.335426 / 0.419271 (-0.083846)	0.050790 / 0.043533 (0.007257)	0.301715 / 0.255139 (0.046576)	0.320834 / 0.283200 (0.037634)	0.102357 / 0.141683 (-0.039326)	1.454750 / 1.452155 (0.002596)	1.571994 / 1.492716 (0.079278)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.218708 / 0.018006 (0.200702)	0.444391 / 0.000490 (0.443901)	0.005717 / 0.000200 (0.005517)	0.000089 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028017 / 0.037411 (-0.009395)	0.112753 / 0.014526 (0.098227)	0.121003 / 0.176557 (-0.055554)	0.181085 / 0.737135 (-0.556050)	0.127211 / 0.296338 (-0.169127)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.400803 / 0.215209 (0.185594)	4.007315 / 2.077655 (1.929660)	1.826911 / 1.504120 (0.322791)	1.637799 / 1.541195 (0.096605)	1.699754 / 1.468490 (0.231264)	0.709413 / 4.584777 (-3.875364)	4.008904 / 3.745712 (0.263192)	3.916540 / 5.269862 (-1.353322)	1.902102 / 4.565676 (-2.663575)	0.089048 / 0.424275 (-0.335227)	0.012763 / 0.007607 (0.005155)	0.498957 / 0.226044 (0.272913)	4.979865 / 2.268929 (2.710937)	2.301987 / 55.444624 (-53.142637)	1.929404 / 6.876477 (-4.947073)	2.107839 / 2.142072 (-0.034233)	0.857253 / 4.805227 (-3.947974)	0.171935 / 6.500664 (-6.328729)	0.066753 / 0.075469 (-0.008716)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.186811 / 1.841788 (-0.654977)	15.866319 / 8.074308 (7.792011)	14.738555 / 10.191392 (4.547163)	0.142879 / 0.680424 (-0.537544)	0.017679 / 0.534201 (-0.516522)	0.422840 / 0.579283 (-0.156443)	0.450307 / 0.434364 (0.015943)	0.491802 / 0.540337 (-0.048536)	0.588837 / 1.386936 (-0.798099)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007659 / 0.011353 (-0.003694)	0.005331 / 0.011008 (-0.005678)	0.075360 / 0.038508 (0.036852)	0.034011 / 0.023109 (0.010902)	0.354488 / 0.275898 (0.078590)	0.401781 / 0.323480 (0.078301)	0.005806 / 0.007986 (-0.002179)	0.004029 / 0.004328 (-0.000300)	0.073822 / 0.004250 (0.069572)	0.049067 / 0.037052 (0.012015)	0.364483 / 0.258489 (0.105994)	0.405637 / 0.293841 (0.111796)	0.037166 / 0.128546 (-0.091380)	0.012397 / 0.075646 (-0.063249)	0.087346 / 0.419271 (-0.331926)	0.050888 / 0.043533 (0.007355)	0.334796 / 0.255139 (0.079657)	0.387681 / 0.283200 (0.104481)	0.105056 / 0.141683 (-0.036627)	1.471630 / 1.452155 (0.019475)	1.554764 / 1.492716 (0.062047)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.231825 / 0.018006 (0.213819)	0.449746 / 0.000490 (0.449256)	0.000888 / 0.000200 (0.000688)	0.000078 / 0.000054 (0.000023)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030363 / 0.037411 (-0.007049)	0.115234 / 0.014526 (0.100708)	0.123005 / 0.176557 (-0.053551)	0.172772 / 0.737135 (-0.564363)	0.127818 / 0.296338 (-0.168520)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425761 / 0.215209 (0.210552)	4.237950 / 2.077655 (2.160295)	1.992045 / 1.504120 (0.487925)	1.801622 / 1.541195 (0.260427)	1.918477 / 1.468490 (0.449987)	0.722730 / 4.584777 (-3.862047)	4.015968 / 3.745712 (0.270256)	3.720412 / 5.269862 (-1.549450)	1.763111 / 4.565676 (-2.802566)	0.089041 / 0.424275 (-0.335234)	0.012608 / 0.007607 (0.005001)	0.522645 / 0.226044 (0.296601)	5.227108 / 2.268929 (2.958180)	2.444714 / 55.444624 (-52.999910)	2.109745 / 6.876477 (-4.766732)	2.194042 / 2.142072 (0.051969)	0.871781 / 4.805227 (-3.933447)	0.173149 / 6.500664 (-6.327515)	0.066192 / 0.075469 (-0.009277)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.312051 / 1.841788 (-0.529737)	16.024315 / 8.074308 (7.950007)	15.123823 / 10.191392 (4.932431)	0.163997 / 0.680424 (-0.516427)	0.017595 / 0.534201 (-0.516606)	0.426379 / 0.579283 (-0.152904)	0.467709 / 0.434364 (0.033345)	0.498308 / 0.540337 (-0.042030)	0.591426 / 1.386936 (-0.795510)

lhoestq · 2023-03-15T18:27:16Z

CI is failing due to unrelated issues, hopefully #5642 fixes it

albertvillanova

Thanks for the improvement.

I agree we should tweak the zipfile.is_zipfile default implementation if that is flaky.

albertvillanova · 2023-03-16T08:26:10Z

src/datasets/utils/extract.py

+                                if centdir[_CD_SIGNATURE] == stringCentralDir:
+                                    return True  # First central directory entry  has correct magic number
+        except Exception:  # catch all errors in case future python versions change the zipfile internals
+            return False


Please note that this function as it is could return None. To fix this:

Suggested change

-            return False
+            return False
+        return False

albertvillanova · 2023-03-16T08:38:56Z

tests/test_extract.py

+    with not_a_zip_file.open("wb") as f:
+        f.write(data)
+    assert zipfile.is_zipfile(str(not_a_zip_file))  # is a false positive for `zipfile`
+    assert not ZipExtractor.is_extractable(not_a_zip_file)  # but we're right


The test passes because not None is True.
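A minimal illustration of the pitfall, using made-up function names rather than the real `ZipExtractor.is_extractable`: a function that falls off the end without a `return` yields `None`, and `assert not None` passes just like `assert not False`.

```python
def is_extractable_buggy(looks_valid: bool):
    """Hypothetical stand-in: no return on the fall-through path."""
    try:
        if looks_valid:
            return True
    except Exception:
        return False
    # Execution reaches here when looks_valid is False: implicit `return None`.


def is_extractable_fixed(looks_valid: bool) -> bool:
    """Same logic with an explicit final return."""
    try:
        if looks_valid:
            return True
    except Exception:
        return False
    return False


# Both assertions pass, so `assert not ...` cannot tell None from False.
assert not is_extractable_buggy(False)
assert not is_extractable_fixed(False)

# An identity check does distinguish them.
assert is_extractable_buggy(False) is None
assert is_extractable_fixed(False) is False
```

Asserting `is False` in the test, instead of `not ...`, would have caught the missing return.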

src/datasets/utils/extract.py

Co-authored-by: Albert Villanova del Moral <[email protected]>

github-actions · 2023-03-16T13:25:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006478 / 0.011353 (-0.004875)	0.004347 / 0.011008 (-0.006661)	0.097103 / 0.038508 (0.058595)	0.027650 / 0.023109 (0.004541)	0.372355 / 0.275898 (0.096457)	0.408794 / 0.323480 (0.085314)	0.005034 / 0.007986 (-0.002952)	0.003252 / 0.004328 (-0.001076)	0.074068 / 0.004250 (0.069818)	0.035542 / 0.037052 (-0.001510)	0.367392 / 0.258489 (0.108903)	0.409644 / 0.293841 (0.115803)	0.031745 / 0.128546 (-0.096801)	0.011501 / 0.075646 (-0.064145)	0.323355 / 0.419271 (-0.095917)	0.043065 / 0.043533 (-0.000467)	0.377313 / 0.255139 (0.122174)	0.395326 / 0.283200 (0.112127)	0.087101 / 0.141683 (-0.054582)	1.461228 / 1.452155 (0.009073)	1.529413 / 1.492716 (0.036696)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.199245 / 0.018006 (0.181239)	0.409978 / 0.000490 (0.409488)	0.002655 / 0.000200 (0.002455)	0.000070 / 0.000054 (0.000016)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023903 / 0.037411 (-0.013508)	0.097855 / 0.014526 (0.083330)	0.106405 / 0.176557 (-0.070152)	0.166889 / 0.737135 (-0.570247)	0.110256 / 0.296338 (-0.186082)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.440351 / 0.215209 (0.225142)	4.382848 / 2.077655 (2.305194)	2.049602 / 1.504120 (0.545482)	1.824638 / 1.541195 (0.283443)	1.850519 / 1.468490 (0.382029)	0.702652 / 4.584777 (-3.882125)	3.394571 / 3.745712 (-0.351141)	1.940608 / 5.269862 (-3.329254)	1.263961 / 4.565676 (-3.301716)	0.083985 / 0.424275 (-0.340290)	0.013046 / 0.007607 (0.005439)	0.538272 / 0.226044 (0.312228)	5.407563 / 2.268929 (3.138634)	2.519207 / 55.444624 (-52.925418)	2.153379 / 6.876477 (-4.723098)	2.394512 / 2.142072 (0.252439)	0.812840 / 4.805227 (-3.992387)	0.152868 / 6.500664 (-6.347796)	0.067823 / 0.075469 (-0.007646)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.220031 / 1.841788 (-0.621757)	13.781237 / 8.074308 (5.706929)	14.203975 / 10.191392 (4.012583)	0.141077 / 0.680424 (-0.539347)	0.016518 / 0.534201 (-0.517682)	0.379079 / 0.579283 (-0.200204)	0.378916 / 0.434364 (-0.055448)	0.434589 / 0.540337 (-0.105749)	0.521129 / 1.386936 (-0.865807)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006997 / 0.011353 (-0.004356)	0.004599 / 0.011008 (-0.006410)	0.078700 / 0.038508 (0.040192)	0.027902 / 0.023109 (0.004793)	0.344406 / 0.275898 (0.068508)	0.392918 / 0.323480 (0.069438)	0.005175 / 0.007986 (-0.002811)	0.004755 / 0.004328 (0.000427)	0.077707 / 0.004250 (0.073457)	0.039409 / 0.037052 (0.002357)	0.343250 / 0.258489 (0.084761)	0.405544 / 0.293841 (0.111703)	0.032286 / 0.128546 (-0.096260)	0.011674 / 0.075646 (-0.063972)	0.087633 / 0.419271 (-0.331639)	0.043346 / 0.043533 (-0.000186)	0.355076 / 0.255139 (0.099937)	0.382155 / 0.283200 (0.098955)	0.090914 / 0.141683 (-0.050769)	1.518369 / 1.452155 (0.066215)	1.583530 / 1.492716 (0.090813)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.160369 / 0.018006 (0.142362)	0.406844 / 0.000490 (0.406354)	0.002651 / 0.000200 (0.002451)	0.000080 / 0.000054 (0.000025)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025295 / 0.037411 (-0.012116)	0.101490 / 0.014526 (0.086964)	0.108825 / 0.176557 (-0.067732)	0.161673 / 0.737135 (-0.575462)	0.113610 / 0.296338 (-0.182729)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.443514 / 0.215209 (0.228305)	4.436722 / 2.077655 (2.359067)	2.144008 / 1.504120 (0.639888)	2.005324 / 1.541195 (0.464129)	2.123356 / 1.468490 (0.654866)	0.697217 / 4.584777 (-3.887560)	3.401105 / 3.745712 (-0.344607)	1.874621 / 5.269862 (-3.395240)	1.165069 / 4.565676 (-3.400608)	0.082799 / 0.424275 (-0.341476)	0.012806 / 0.007607 (0.005199)	0.542688 / 0.226044 (0.316644)	5.420963 / 2.268929 (3.152034)	2.579034 / 55.444624 (-52.865590)	2.240201 / 6.876477 (-4.636276)	2.261309 / 2.142072 (0.119237)	0.800246 / 4.805227 (-4.004981)	0.150380 / 6.500664 (-6.350285)	0.066880 / 0.075469 (-0.008589)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.281721 / 1.841788 (-0.560067)	13.906361 / 8.074308 (5.832053)	14.135336 / 10.191392 (3.943944)	0.128865 / 0.680424 (-0.551559)	0.016452 / 0.534201 (-0.517749)	0.373563 / 0.579283 (-0.205720)	0.385321 / 0.434364 (-0.049043)	0.437198 / 0.540337 (-0.103139)	0.530720 / 1.386936 (-0.856216)

github-actions · 2023-03-16T13:47:37Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008099 / 0.011353 (-0.003254)	0.005093 / 0.011008 (-0.005916)	0.106258 / 0.038508 (0.067750)	0.037051 / 0.023109 (0.013942)	0.347960 / 0.275898 (0.072062)	0.370849 / 0.323480 (0.047369)	0.006122 / 0.007986 (-0.001863)	0.004094 / 0.004328 (-0.000235)	0.079549 / 0.004250 (0.075299)	0.046563 / 0.037052 (0.009510)	0.332735 / 0.258489 (0.074246)	0.417061 / 0.293841 (0.123220)	0.038105 / 0.128546 (-0.090441)	0.011886 / 0.075646 (-0.063760)	0.342103 / 0.419271 (-0.077169)	0.053233 / 0.043533 (0.009700)	0.344754 / 0.255139 (0.089615)	0.355354 / 0.283200 (0.072155)	0.101059 / 0.141683 (-0.040624)	1.518561 / 1.452155 (0.066406)	1.558652 / 1.492716 (0.065935)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.225919 / 0.018006 (0.207913)	0.518539 / 0.000490 (0.518049)	0.006230 / 0.000200 (0.006030)	0.000124 / 0.000054 (0.000070)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026782 / 0.037411 (-0.010629)	0.108457 / 0.014526 (0.093931)	0.125203 / 0.176557 (-0.051353)	0.175726 / 0.737135 (-0.561409)	0.127051 / 0.296338 (-0.169287)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.416427 / 0.215209 (0.201217)	4.168851 / 2.077655 (2.091196)	1.962238 / 1.504120 (0.458118)	1.825224 / 1.541195 (0.284029)	1.831200 / 1.468490 (0.362710)	0.765526 / 4.584777 (-3.819250)	4.303957 / 3.745712 (0.558245)	2.193467 / 5.269862 (-3.076395)	1.654605 / 4.565676 (-2.911071)	0.096709 / 0.424275 (-0.327566)	0.013792 / 0.007607 (0.006185)	0.537862 / 0.226044 (0.311818)	5.152230 / 2.268929 (2.883302)	2.520938 / 55.444624 (-52.923686)	2.108422 / 6.876477 (-4.768054)	2.214220 / 2.142072 (0.072147)	0.834320 / 4.805227 (-3.970907)	0.170635 / 6.500664 (-6.330029)	0.063131 / 0.075469 (-0.012338)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.215767 / 1.841788 (-0.626020)	15.254781 / 8.074308 (7.180473)	14.360764 / 10.191392 (4.169372)	0.172511 / 0.680424 (-0.507913)	0.020161 / 0.534201 (-0.514040)	0.426936 / 0.579283 (-0.152347)	0.438771 / 0.434364 (0.004407)	0.486973 / 0.540337 (-0.053364)	0.584238 / 1.386936 (-0.802698)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006777 / 0.011353 (-0.004576)	0.005304 / 0.011008 (-0.005704)	0.073717 / 0.038508 (0.035209)	0.033604 / 0.023109 (0.010494)	0.340448 / 0.275898 (0.064550)	0.351861 / 0.323480 (0.028381)	0.005786 / 0.007986 (-0.002199)	0.005013 / 0.004328 (0.000685)	0.071263 / 0.004250 (0.067012)	0.048189 / 0.037052 (0.011137)	0.339457 / 0.258489 (0.080968)	0.384383 / 0.293841 (0.090542)	0.035563 / 0.128546 (-0.092983)	0.011509 / 0.075646 (-0.064137)	0.083722 / 0.419271 (-0.335550)	0.048886 / 0.043533 (0.005353)	0.350184 / 0.255139 (0.095045)	0.361037 / 0.283200 (0.077837)	0.105191 / 0.141683 (-0.036492)	1.503247 / 1.452155 (0.051093)	1.582298 / 1.492716 (0.089581)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.221687 / 0.018006 (0.203681)	0.466489 / 0.000490 (0.465999)	0.000484 / 0.000200 (0.000284)	0.000069 / 0.000054 (0.000015)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027978 / 0.037411 (-0.009434)	0.119572 / 0.014526 (0.105047)	0.133530 / 0.176557 (-0.043026)	0.177892 / 0.737135 (-0.559243)	0.127045 / 0.296338 (-0.169294)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.430198 / 0.215209 (0.214989)	4.435512 / 2.077655 (2.357858)	2.007183 / 1.504120 (0.503063)	1.799230 / 1.541195 (0.258036)	1.884750 / 1.468490 (0.416260)	0.745232 / 4.584777 (-3.839545)	4.088069 / 3.745712 (0.342357)	4.114669 / 5.269862 (-1.155193)	2.374086 / 4.565676 (-2.191590)	0.089154 / 0.424275 (-0.335121)	0.012938 / 0.007607 (0.005331)	0.505954 / 0.226044 (0.279909)	5.194226 / 2.268929 (2.925298)	2.487230 / 55.444624 (-52.957394)	2.163353 / 6.876477 (-4.713124)	2.177879 / 2.142072 (0.035807)	0.828728 / 4.805227 (-3.976499)	0.171157 / 6.500664 (-6.329507)	0.062883 / 0.075469 (-0.012586)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.275906 / 1.841788 (-0.565882)	15.235484 / 8.074308 (7.161176)	14.467396 / 10.191392 (4.276004)	0.198994 / 0.680424 (-0.481430)	0.020203 / 0.534201 (-0.513998)	0.447904 / 0.579283 (-0.131380)	0.454210 / 0.434364 (0.019846)	0.528062 / 0.540337 (-0.012275)	0.619311 / 1.386936 (-0.767625)

lhoestq added 2 commits March 15, 2023 17:41

use magic number for zip

885e9c5

test

ab77e58

alternative version of zipfile.is_zipfile

13488cc

lhoestq marked this pull request as ready for review March 15, 2023 17:08

lhoestq requested a review from albertvillanova March 15, 2023 17:13

albertvillanova approved these changes Mar 16, 2023

src/datasets/utils/extract.py (resolved)

Update src/datasets/utils/extract.py

e2f8e17

Co-authored-by: Albert Villanova del Moral

lhoestq merged commit 11cd0f7 into main Mar 16, 2023

lhoestq deleted the less-zip-false-positives branch March 16, 2023 13:40

lhoestq mentioned this pull request May 23, 2023

ImageFolder BadZipFile: Bad offset for central directory #5451

Closed
