feat: S3 + HTTP integration including unit tests #1
I did not read the entire checkin, but here are some thoughts about things that need to be in it:
|
|
The mocking required a new dependency which made all the CI runs fail -- but now it's working :)
I added a new section "Remote File System Support" to the README.rst and created a new CONTRIBUTING.rst that explains how to run the S3 live tests.
Since it covers not only S3 but also HTTP, the extra is warcio[remote] and not warcio[s3].
Added in the "Remote File System Support" section
Done.
It's not a separate script but an environment variable to enable S3 live tests (explained in CONTRIBUTING) and a separate CI workflow that can be manually run. We could also run this only on pushes to main.
There are already a couple of tests for bad WARC files: https://github.com/commoncrawl/warcio-s3/blob/master/test/test_archiveiterator.py#L257-L271
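For readers following along, the environment-variable gate for the S3 live tests mentioned above could look roughly like the sketch below. The variable name `WARCIO_ENABLE_S3_TESTS`, the helper `live_tests_enabled`, and the test class are hypothetical names used for illustration only; the actual variable is the one documented in CONTRIBUTING.rst.

```python
import os
import unittest

# Hypothetical variable name; the real one is documented in CONTRIBUTING.rst.
LIVE_FLAG = "WARCIO_ENABLE_S3_TESTS"


def live_tests_enabled(environ=None):
    """Return True when the opt-in flag is set to a truthy value."""
    environ = os.environ if environ is None else environ
    return environ.get(LIVE_FLAG, "").strip().lower() in {"1", "true", "yes"}


class S3LiveTest(unittest.TestCase):
    def setUp(self):
        # Checked at run time, so exporting the variable before a run takes effect.
        if not live_tests_enabled():
            self.skipTest(f"set {LIVE_FLAG}=1 to run the S3 live tests")

    def test_bucket_roundtrip(self):
        # Placeholder body: a real test would read and write a WARC in the bucket.
        self.assertTrue(live_tests_enabled())
```

With a gate like this, the default CI run skips the live tests, and the manually triggered workflow only needs to export the variable before invoking the test runner.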
|
I think we should use the fsspec name "s3" -- even though it's not the best choice. Here's the full fsspec list:
For CI, remember that this is going to live in the webrecorder GitHub, not ours. I think what you have now is fine. Let me make a couple of cosmetic changes and then I'll open a PR over at webrecorder.
|
Quirk: NOTALL in the CI installs [testing] which installs warcio[s3], whoops. |
|
@malteos I see that fsspec now successfully installs with Python 3.8. Is that correct?
|
Looks like every job ran twice: once for push and once for pull_request.
|
Grrrrr: Tip version of s3fs is 2026.1.0 -- 0.4.2 was March 30, 2020. Presumably this is related to urllib3 and httpbin constraints.
This PR enables reading from S3 + HTTP and writing to S3 using fsspec (instead of smart-open, because fsspec is used by major projects like pandas).
The unit tests check the S3 integration using the CC temp bucket.