Skip to content

Conversation

@malteos
Copy link

@malteos malteos commented Nov 19, 2025

This PR enables reading from S3 + HTTP and writing S3 to using fsspec (instead of smart-open due to fsspec being used by major projects like pandas).

The unit tests check the S3 integration using the CC temp bucket.

@malteos malteos changed the title feat: S3 integration including unit tests feat: S3 + HTTP integration including unit tests Nov 21, 2025
@malteos malteos marked this pull request as ready for review November 21, 2025 12:18
@jenenglish jenenglish requested a review from wumpus November 21, 2025 16:11
@jenenglish
Copy link

@wumpus
Copy link
Member

wumpus commented Jan 1, 2026

I did not read the entire checkin, but here are some thoughts about things that need to be in it:

  • doc this new capability
  • add [s3], add s3 to [all]
  • doc should mention installing fsspec[all] if you want more than https/s3
  • main ci script should mock writing to s3
  • live s3 test should be a separate ci script
  • add a test to be sure that an incorrectly gzipped warc throws an error when iterated (might already be tested?)

@malteos
Copy link
Author

malteos commented Jan 8, 2026

The mocking required a new dependency which made all the CI runs fail -- but now it's working :)

doc this new capability

I added a new section "Remote File System Support" to the README.rst and created a new CONTRIBUTING.rst that explains how to run the S3 live tests.

add [s3], add s3 to [all]

Since it is not only s3 but also http, it is warcio[remote] and not warcio[s3]

doc should mention installing fsspec[all] if you want more than https/s3

Added in the "Remote File System Support" section

main ci script should mock writing to s3

Done.

live s3 test should be a separate ci script

It's not a separate script but an environment variable to enable S3 live tests (explained in CONTRIBUTING) and a separate CI workflow that can be manually run. We could also run this only on pushes to main.

add a test to be sure that an incorrectly gzipped warc throws an error when iterated (might already be tested

There are already couple of tests for bad WARC files: https://github.com/commoncrawl/warcio-s3/blob/master/test/test_archiveiterator.py#L257-L271

@wumpus
Copy link
Member

wumpus commented Jan 9, 2026

I think we should use the fsspec name "s3" -- even though it's not the best choice. Here's the full fsspec list:

'abfs', 'adl', 'arrow_hdfs', 'async_wrapper', 'asynclocal', 'asyncwrapper', 'az', 'blockcache', 'box', 'cached', 'dask', 'data', 'dbfs', 'dir', 'dropbox', 'dvc', 'file', 'filecache', 'ftp', 'gcs', 'gdrive', 'generic', 'gist', 'git', 'github', 'gs', 'hdfs', 'hf', 'http', 'https', 'jlab', 'jupyter', 'lakefs', 'libarchive', 'local', 'memory', 'oci', 'ocilake', 'oss', 'pyscript', 'reference', 'root', 's3', 's3a', 'sftp', 'simplecache', 'smb', 'ssh', 'tar', 'tos', 'tosfs', 'wandb', 'webdav', 'webhdfs', 'zip']

For CI, remember that this is going to live in the webrecorder github, not ours. I think what you have now is fine.

Let me make a couple of cosmetic changes and then I'll open a PR over at webrecorder.

@wumpus
Copy link
Member

wumpus commented Jan 9, 2026

Quirk: NOTALL in the CI installs [testing] which installs warcio[s3], whoops.

@wumpus
Copy link
Member

wumpus commented Jan 10, 2026

@malteos I see that fsspec now successfully installs with python 3.8. Is that correct?

@wumpus
Copy link
Member

wumpus commented Jan 10, 2026

Looks like every job ran twice? once for push and once for pull_request.

@wumpus
Copy link
Member

wumpus commented Jan 10, 2026

Grrrrr:

warcio index s3://commoncrawl/crawl-data/CC-MAIN-2025-51/segments/1764871645602.73/warc/CC-MAIN-20251215005813-20251215035813-00995.warc.gz | head -n 10
/home/ccgreg/venv/warcio-s3/lib/python3.10/site-packages/fsspec/registry.py:301: UserWarning: Your installed version of s3fs is very old and known to cause
severe performance issues, see also https://github.com/dask/dask/issues/10276

To fix, you should specify a lower version bound on s3fs, or
update the current installation.

  warnings.warn(s3_msg)

>>> s3fs.__version__
'0.4.2'

Tip version of s3fs is 2026.1.0 -- 0.4.2 was March 30, 2020. Presumably this is related to urllib3 and httpbin onstraints.

@wumpus wumpus merged commit c9513ea into master Jan 11, 2026
12 checks passed
@wumpus wumpus deleted the feat/s3-integration branch January 11, 2026 07:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants