Merged
14 changes: 9 additions & 5 deletions .github/workflows/ci.yaml
@@ -2,7 +2,11 @@ name: CI

on:
push:
branches:
- master
pull_request:
branches:
- master

jobs:
unit-tests:
@@ -11,28 +15,28 @@ jobs:
fail-fast: false
#max-parallel: 3
matrix:
python-version: ['3.7', '3.8', '3.9', '3.10', '3.11', '3.12', '3.13']
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12', '3.13']
os: [ubuntu-latest]
EXTRA: [false] # used to force includes to get included
include:
- python-version: '3.12'
os: ubuntu-latest
EXTRA: true
NOTALL: true # without warcio[all], currently not brotli
NOTALL: true # without warcio[all]
- python-version: '3.11'
os: macos-latest
EXTRA: true
- python-version: '3.13'
os: macos-latest
EXTRA: true
- python-version: '3.7'
- python-version: '3.8'
os: windows-latest
EXTRA: true
- python-version: '3.13'
os: windows-latest
EXTRA: true
- python-version: '3.7'
os: ubuntu-20.04 # oldest version on github actions
- python-version: '3.8'
os: ubuntu-22.04 # oldest version on github actions
EXTRA: true

steps:
85 changes: 85 additions & 0 deletions .github/workflows/ci_s3_live.yaml
@@ -0,0 +1,85 @@
name: CI with live S3 tests

on:
workflow_dispatch:

# These permissions are needed to interact with AWS S3 via GitHub's OIDC Token endpoint
permissions:
id-token: write
contents: read
pull-requests: read

jobs:
unit-tests:
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
#max-parallel: 3
matrix:
python-version: [
# '3.7', # not supported by GitHub actions anymore
# '3.8', # disabled for S3
'3.9', '3.10', '3.11', '3.12', '3.13']
os: [ubuntu-latest]
EXTRA: [false] # used to force includes to get included
include:
- python-version: '3.12'
os: ubuntu-latest
EXTRA: true
NOTALL: true # without warcio[all], currently not brotli
- python-version: '3.11'
os: macos-latest
EXTRA: true
- python-version: '3.13'
os: macos-latest
EXTRA: true
# disabled for S3
# - python-version: '3.8'
# os: windows-latest
# EXTRA: true
- python-version: '3.13'
os: windows-latest
EXTRA: true
- python-version: '3.8'
os: ubuntu-22.04 # oldest version on github actions
EXTRA: true

steps:
- name: checkout
uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}

- name: Install setuptools on python 3.12+
if: ${{ matrix.python-version >= '3.12' }}
run: |
pip install setuptools

- name: Install warcio ALL
if: ${{ ! matrix.NOTALL }}
run: pip install .[all,testing]
- name: Install warcio NOTALL
if: ${{ matrix.NOTALL }}
run: pip install .[testing]

- name: Configure AWS credentials from OIDC (disabled for forks)
if: github.event.pull_request.head.repo.full_name == github.repository || github.event_name == 'push'
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::837454214164:role/GitHubActions-Role
aws-region: us-east-1

- name: Enable S3 unit tests
uses: actions/github-script@v7
with:
script: |
core.exportVariable('WARCIO_ENABLE_S3_TESTS', '1')

- name: Run tests
run: python -m pytest

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
37 changes: 37 additions & 0 deletions CONTRIBUTING.rst
@@ -0,0 +1,37 @@
Contributing to warcio
======================

We welcome contributions to warcio! Whether you're adding new features, improving documentation, or fixing bugs, your help is greatly appreciated.


Local installation
------------------

Clone the repository, set up a virtual environment, and install your local checkout together with the test dependencies:

::

pip install -e .[testing]


Tests
-----

To test code changes, please run our test suite before submitting pull requests:

::

pytest test

By default, all remote requests to S3 are mocked. To perform live S3 reads and writes instead (requires AWS credentials), set the following environment variable:

::

WARCIO_ENABLE_S3_TESTS=1

The S3 bucket used for testing can be set via an environment variable (default: ``commoncrawl-ci-temp``):

::

WARCIO_TEST_S3_BUCKET=my-s3-bucket
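Put together, a live-S3 test session might look like the following sketch (``my-s3-bucket`` is a placeholder; AWS credentials are assumed to be configured already, e.g. via ``aws configure`` or OIDC):

```shell
# Enable live S3 tests for this shell session.
export WARCIO_ENABLE_S3_TESTS=1

# Optional: override the default test bucket (commoncrawl-ci-temp).
export WARCIO_TEST_S3_BUCKET=my-s3-bucket

# Then run the suite as usual:
# pytest test
```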

33 changes: 32 additions & 1 deletion README.rst
@@ -314,6 +314,7 @@ The block and payload digests are computed automatically.


The library also includes additional semantics for:

- Creating ``warcinfo`` and ``revisit`` records
- Writing ``response`` and ``request`` records together
- Writing custom WARC records
@@ -403,8 +404,38 @@ Specifying --payload or --headers will output only the payload or only the WARC
warcio extract [--payload | --headers] filename offset


Remote File System Support
--------------------------

The library supports reading and writing WARC files to a remote file system such as HTTP or S3.
To enable this feature, you need to install the optional dependencies with ``pip install warcio[s3]``.
For example, you can then read WARC files directly from `Common Crawl's S3 bucket <https://commoncrawl.org/get-started>`_.

This command will read a WARC file from outside AWS, using HTTPS, and print the first 10 records to stdout:

::

warcio index https://data.commoncrawl.org/crawl-data/CC-MAIN-2025-51/segments/1764871645602.73/warc/CC-MAIN-20251215005813-20251215035813-00995.warc.gz | head -n 10

This command will read a WARC file from inside AWS, using S3, and print the first 10 records to stdout:

::

warcio index s3://commoncrawl/crawl-data/CC-MAIN-2025-51/segments/1764871645602.73/warc/CC-MAIN-20251215005813-20251215035813-00995.warc.gz | head -n 10

This is implemented with `fsspec <https://filesystem-spec.readthedocs.io/en/latest/index.html>`_.
By default, only HTTP, S3, and other built-in fsspec file systems are integrated.
To support other file systems, install the corresponding fsspec dependencies, such as ``fsspec[gcs]`` for Google Cloud Storage or ``fsspec[all]`` for all available file systems.


Contributing
------------

See `CONTRIBUTING.rst <CONTRIBUTING.rst>`__ for guidelines on contributing and running tests.


License
~~~~~~~
-------

``warcio`` is licensed under the Apache 2.0 License and is part of the
Webrecorder project.
14 changes: 12 additions & 2 deletions setup.py
@@ -37,20 +37,30 @@
'requests',
'wsgiprox',
'hookdns',
# fsspec testing
'warcio[s3]', # note: drags in fsspec
'moto>=4',
'flask',
'flask_cors',
'botocore',
],
'all': [
'brotlipy',
'warcio[s3]',
],
's3': [
'smart_open[s3]',
'fsspec',
's3fs',
'aiohttp',
'requests',
]
},
classifiers=[
'Development Status :: 5 - Production/Stable',
'Environment :: Web Environment',
'License :: OSI Approved :: Apache Software License',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.7',
#'Programming Language :: Python :: 3.7', # no longer in github actions
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
115 changes: 115 additions & 0 deletions test/conftest.py
@@ -0,0 +1,115 @@
import os
import uuid
import pytest

# IMPORTANT:
# Import capture_http before any other indirect imports of urllib3/requests
# This ensures the monkey patch is applied before botocore imports urllib3
from warcio.capture_http import capture_http # noqa: F401

try:
import botocore.session # noqa: F401

HAS_AWS_DEPS = True
except ImportError:
HAS_AWS_DEPS = False


TEST_S3_BUCKET = os.environ.get("WARCIO_TEST_S3_BUCKET", "commoncrawl-ci-temp")
ENABLE_S3_TESTS = bool(os.environ.get("WARCIO_ENABLE_S3_TESTS", False))

# Cache for AWS access check to avoid repeated network calls
_aws_s3_access_cache = None


def check_aws_s3_access():
"""Check if AWS S3 access is available (cached result)."""
global _aws_s3_access_cache

if _aws_s3_access_cache is not None:
return _aws_s3_access_cache

if not HAS_AWS_DEPS:
return False

from botocore.config import Config
from botocore.exceptions import (
NoCredentialsError,
ClientError,
EndpointConnectionError,
)

try:
config = Config(retries={"max_attempts": 1, "mode": "standard"})
session = botocore.session.Session()
s3_client = session.create_client("s3", config=config)

# Try list objects on test bucket
s3_client.list_objects_v2(Bucket=TEST_S3_BUCKET, MaxKeys=1)
_aws_s3_access_cache = True
except (NoCredentialsError, ClientError, ConnectionError,
EndpointConnectionError):
_aws_s3_access_cache = False

return _aws_s3_access_cache


def requires_aws_s3(func):
"""Pytest decorator that checks if AWS S3 test can be run."""
return pytest.mark.skipif(
not ENABLE_S3_TESTS,
reason="S3 tests are NOT enabled via environment variable."
)(
pytest.mark.skipif(
not HAS_AWS_DEPS,
reason="S3 unavailable (missing dependency)",
)(
pytest.mark.skipif(
not check_aws_s3_access(),
reason="S3 unavailable (no access)",
)(func)
)
)


@pytest.fixture
def s3_tmpdir():
"""S3 equivalent of tmpdir: provides a temporary S3 path and cleans up."""
from botocore.exceptions import (
NoCredentialsError,
ClientError,
EndpointConnectionError,
)

bucket_name = TEST_S3_BUCKET

# Generate unique prefix using UUID to avoid collisions
temp_prefix = f'warcio/ci/tmpdirs/{uuid.uuid4().hex}'

# Yield the S3 path
yield f's3://{bucket_name}/{temp_prefix}'

try:
# Cleanup: delete all objects with this prefix
session = botocore.session.Session()
s3_client = session.create_client('s3')

# List all objects with the temp prefix
response = s3_client.list_objects_v2(
Bucket=bucket_name,
Prefix=temp_prefix
)

if 'Contents' in response:
# Delete all objects
objects_to_delete = [
{'Key': obj['Key']} for obj in response['Contents']
]
s3_client.delete_objects(
Bucket=bucket_name,
Delete={'Objects': objects_to_delete}
)
except (NoCredentialsError, ClientError, ConnectionError,
EndpointConnectionError):
# Ignore cleanup errors - test objects will eventually expire
pass
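The ``requires_aws_s3`` helper above is just three stacked ``pytest.mark.skipif`` decorators, each contributing an independent skip condition; the test runs only if all three pass. A minimal sketch of the same stacking pattern, with dummy conditions standing in for the real environment, dependency, and access checks:

```python
import pytest

def requires_two_conditions(func):
    # Same shape as requires_aws_s3: each skipif wraps the function in
    # turn, so the test is skipped if ANY condition evaluates true.
    return pytest.mark.skipif(False, reason="condition A failed")(
        pytest.mark.skipif(False, reason="condition B failed")(func)
    )

@requires_two_conditions
def test_example():
    assert True
```

In a real test module, the decorator combines naturally with the ``s3_tmpdir`` fixture: a function decorated with ``@requires_aws_s3`` that takes ``s3_tmpdir`` as an argument gets a unique, auto-cleaned ``s3://`` prefix when live tests are enabled, and is skipped otherwise.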