@malteos
Collaborator

@malteos malteos commented Nov 19, 2025

This PR integrates several general changes from the EOT PR (#54):

  • Settings variables are loaded from environment variables in settings.py.
  • Common CLI methods are moved to utils.py.
  • Reading from and writing to S3 is supported (via fsspec).
  • The missing writer.close() statement is added to the CLI and the example.
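
As a rough illustration of the first bullet, loading settings from environment variables with defaults might look like the sketch below. The variable names (`EXAMPLE_CACHE_DIR`, `EXAMPLE_MAX_RETRIES`) and the helper `env_int` are hypothetical and are not taken from the PR's settings.py.

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Hypothetical settings; the real names in settings.py may differ.
CACHE_DIR = os.environ.get("EXAMPLE_CACHE_DIR", os.path.expanduser("~/.cache/example"))
MAX_RETRIES = env_int("EXAMPLE_MAX_RETRIES", 5)
```

Keeping the parsing in one small helper means a malformed value fails at import time rather than deep inside a CLI command.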

@malteos malteos changed the title from "feat: Adding settings, utils, and writer close" to "feat: Adding settings, utils, write to S3, and writer close" Nov 19, 2025
@codecov

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 96.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.05%. Comparing base (b6f5f02) to head (b9f01d4).

Files with missing lines      Patch %   Lines
cdx_toolkit/cli.py            75.00%    1 Missing ⚠️
cdx_toolkit/commoncrawl.py    80.00%    1 Missing ⚠️
cdx_toolkit/utils.py          97.22%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #67      +/-   ##
==========================================
+ Coverage   95.78%   97.05%   +1.27%     
==========================================
  Files           7        9       +2     
  Lines         877      918      +41     
==========================================
+ Hits          840      891      +51     
+ Misses         37       27      -10     


@malteos malteos requested a review from wumpus December 8, 2025 17:52
@malteos
Collaborator Author

malteos commented Dec 8, 2025

Given that the test coverage check fails on this one, what about changing the coverage check to use a fixed threshold instead of the previous coverage value?

@wumpus
Member

wumpus commented Dec 11, 2025

While the coverage check failed, I don't think it will cause a problem in the next CI run? It's just a slightly easier hurdle next time.

@malteos malteos closed this Dec 29, 2025
@malteos malteos reopened this Dec 29, 2025
@malteos
Collaborator Author

malteos commented Dec 30, 2025

> While the coverage check failed, I don't think it will cause a problem in the next CI run? It's just a slightly easier hurdle next time.

I added additional unit tests which should increase the coverage above the threshold.

However, Codecov seems to have stopped working after the repo was moved to the CC org. I have already requested access to the org. I guess you just need to approve Codecov again, @wumpus.

@malteos malteos requested a review from damian0815 January 7, 2026 08:36
@malteos
Collaborator Author

malteos commented Jan 7, 2026

The Codecov checks pass now.

@damian0815 damian0815 left a comment

Looks generally good to me

LOGGER.warning('revisit record being resolved for url %s %s', url, timestamp)
writer.write_record(record)

writer.close()

knee-jerk reaction: can this be a with ...: context manager sorta deal?
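
One way to get that behaviour without changing the writer class is `contextlib.closing`, which calls `close()` on exit even if a write raises. This is a minimal sketch; `DemoWriter` is a hypothetical stand-in and not the toolkit's actual writer.

```python
from contextlib import closing


class DemoWriter:
    """Hypothetical stand-in for a writer exposing write_record() and close()."""

    def __init__(self):
        self.records = []
        self.closed = False

    def write_record(self, record):
        if self.closed:
            raise ValueError("writer is closed")
        self.records.append(record)

    def close(self):
        self.closed = True


def write_all(writer, records):
    # closing() guarantees writer.close() runs, even if a write raises.
    with closing(writer):
        for record in records:
            writer.write_record(record)


writer = DemoWriter()
write_all(writer, ["a", "b"])
```

If the real writer gained `__enter__` and `__exit__` methods, the `closing()` wrapper would no longer be needed and callers could use `with writer:` directly.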


# remember: keep requires synchronized with requirements.txt
- requires = ['requests', 'warcio']
+ requires = ['requests', 'warcio', 'fsspec[s3]', 'boto3']

boto3 appears to be missing from requirements.txt
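
A small check could catch this kind of drift automatically. The sketch below compares a `requires` list against the text of a requirements file by package name. The function names and the simplified name parsing are assumptions for illustration, the requirements content shown is hypothetical, and markers and URL requirements are ignored.

```python
import re


def package_name(spec: str) -> str:
    """Reduce a requirement spec such as 'fsspec[s3]>=2024.1' to 'fsspec'."""
    return re.split(r"[\[<>=!~; ]", spec.strip(), maxsplit=1)[0].lower()


def missing_from_requirements(requires, requirements_text):
    """Return names in `requires` that have no matching line in requirements_text."""
    listed = {
        package_name(line)
        for line in requirements_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return sorted(package_name(r) for r in requires if package_name(r) not in listed)


requires = ['requests', 'warcio', 'fsspec[s3]', 'boto3']
# Hypothetical requirements.txt content, for illustration only.
requirements_text = "requests\nwarcio\nfsspec[s3]\n"
missing = missing_from_requirements(requires, requirements_text)
print(missing)
```

Run as a unit test, a non-empty result would fail CI before the two lists drift apart.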

- LOGGER.warning('surprised that status code is now=%d orig=%s %s %s',
-                status_code, capture['status'], url, timestamp)
+ LOGGER.warning(
+     'surprised that status code is now=%d orig=%s %s %s', status_code, capture['status'], url, timestamp

'surprised that' is unusual phrasing. I guess you've just inherited older code here but it might be worth improving these messages. who is surprised? should I, the reader of the logs, be surprised? is it something I should be concerned about? is it my responsibility to fix?
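
As a sketch of what a more neutral message might look like, the version below states what was observed and labels each value. The wording and the helper name `warn_status_changed` are suggestions only and are not code from the PR.

```python
import logging

LOGGER = logging.getLogger(__name__)


def warn_status_changed(url, timestamp, original_status, status_code):
    """Log that the fetched status code differs from the one stored in the capture."""
    LOGGER.warning(
        'status code changed since capture: captured=%s now=%d url=%s timestamp=%s',
        original_status, status_code, url, timestamp,
    )
```

Labelling every field (`captured=`, `now=`, `url=`, `timestamp=`) also makes the line easier to grep than the positional `%s %s` tail.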

return cdx_toolkit.__version__


def setup(cmd):

can we give this function a more descriptive name? what's it setting up exactly? is there a suitable type hint for 'cmd' (is it from argparse?)
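
If `cmd` is the result of `parse_args()`, its type would be `argparse.Namespace`. The sketch below shows a rename under that assumption. The name `configure_logging_from_args` and the `verbose` attribute are hypothetical, since the body of `setup` is not shown in this excerpt.

```python
import argparse
import logging


def configure_logging_from_args(cmd: argparse.Namespace) -> int:
    """Set the root log level from a hypothetical -v/--verbose count; return the level."""
    verbose = getattr(cmd, 'verbose', 0) or 0
    level = logging.WARNING
    if verbose == 1:
        level = logging.INFO
    elif verbose >= 2:
        level = logging.DEBUG
    logging.getLogger().setLevel(level)
    return level


parser = argparse.ArgumentParser()
parser.add_argument('-v', '--verbose', action='count', default=0)
cmd = parser.parse_args(['-vv'])
level = configure_logging_from_args(cmd)
```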

@damian0815 damian0815 left a comment

'Request changes': I think the only blocking issue is the boto3 requirement.

4 participants