- 
                Notifications
    
You must be signed in to change notification settings  - Fork 3k
 
Less zip false positives #5640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Less zip false positives #5640
Conversation
          Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
  | 
    
| 
           The documentation is not available anymore as the PR was closed or merged.  | 
    
          Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
  | 
    
| 
           CI is failing due to unrelated issues, hopefully #5642 fixes it  | 
    
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the improvement.
I agree we should tweak the zipfile.is_zipfile default implementation if that is flaky.
| if centdir[_CD_SIGNATURE] == stringCentralDir: | ||
| return True # First central directory entry has correct magic number | ||
| except Exception: # catch all errors in case future python versions change the zipfile internals | ||
| return False | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that this function as it is could return None. To fix this:
| return False | |
| return False | |
| return False | 
| with not_a_zip_file.open("wb") as f: | ||
| f.write(data) | ||
| assert zipfile.is_zipfile(str(not_a_zip_file)) # is a false positive for `zipfile` | ||
| assert not ZipExtractor.is_extractable(not_a_zip_file) # but we're right | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test passes because not None is True.
Co-authored-by: Albert Villanova del Moral <[email protected]>
          Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
  | 
    
          Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
 Show updated benchmarks!Benchmark: benchmark_array_xd.json
 Benchmark: benchmark_getitem_100B.json
 Benchmark: benchmark_indices_mapping.json
 Benchmark: benchmark_iterating.json
 Benchmark: benchmark_map_filter.json
  | 
    




zipfile.is_zipfilereturn false positives for some Parquet files. It causes errors when loading certain parquet datasets, where some files are considered ZIP files byzipfile.is_zipfileThis is a known issue: python/cpython#72680
At first I wanted to rely only on magic numbers, but then I found that someone contributed a fix to is_zipfile - do you think we should use it @albertvillanova or not ?
IMO it's ok to rely on magic numbers only for now, since in streaming mode we've had no issue checking only the magic number so far.
Close #5639