Add Supplements spider + integrated crawling/postprocessing #5
Conversation
- Introduce ScpSupplement class for item representation (a rough sketch follows below)
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets
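A rough sketch of what an item class plus CrawlSpider pair like this might look like in Scrapy; the field names, start URL, and selectors are assumptions for illustration, not the code from this PR.

```python
# Illustrative sketch only. The ScpSupplement / ScpSupplementSpider names come
# from the commit message; fields, start URL, and CSS selectors are assumptions.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ScpSupplement(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    raw_content = scrapy.Field()


class ScpSupplementSpider(CrawlSpider):
    name = "scp_supplement"
    allowed_domains = ["scp-wiki.wikidot.com"]
    # Assumed entry point: the wiki's tag listing for pages tagged "supplement"
    start_urls = ["https://scp-wiki.wikidot.com/system:page-tags/tag/supplement"]

    rules = (
        Rule(
            LinkExtractor(restrict_css="div.pages-list"),
            callback="parse_supplement",
        ),
    )

    def parse_supplement(self, response):
        # Collect the page title, URL, and raw page body for post-processing
        yield ScpSupplement(
            title=response.css("div#page-title::text").get("").strip(),
            url=response.url,
            raw_content=response.css("div#page-content").get(),
        )
```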
- Implement run_postproc_supplement to process SCP supplement data (see the sketch below)
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use
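A minimal sketch of what such a post-processing step could look like, assuming the raw crawl output sits at `data/scp_supplement.json` and the processed files go under `data/supplement`; paths, keys, and the output schema are assumptions, not the PR's actual implementation.

```python
# Hypothetical sketch of run_postproc_supplement; paths and field names are assumed.
import json
from pathlib import Path


def run_postproc_supplement(raw_path="data/scp_supplement.json",
                            out_dir="data/supplement"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create target directories

    with open(raw_path, encoding="utf-8") as fh:
        supplements = json.load(fh)

    content, index = {}, {}
    for item in supplements:
        key = item["url"].rstrip("/").rsplit("/", 1)[-1]  # page slug as the key
        content[key] = item  # full content + history/source/images
        index[key] = {
            "title": item.get("title"),
            "url": item.get("url"),
            "content_file": "content_supplement.json",
        }

    (out / "content_supplement.json").write_text(json.dumps(content, indent=2))
    (out / "index.json").write_text(json.dumps(index, indent=2))
```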
- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements
- Updated the LinkExtractor in ScpTaleSpider to deny links matching specific patterns, improving the relevance of parsed tales.
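For reference, a `deny` argument on Scrapy's `LinkExtractor` looks roughly like this; the patterns shown are placeholders, not the ones actually added to `ScpTaleSpider`.

```python
# Placeholder patterns: the real deny list in ScpTaleSpider is not reproduced here.
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = (
    Rule(
        LinkExtractor(
            restrict_css="div.pages-list",
            deny=(r"/forum/", r"/system:"),  # links matching these regexes are skipped
        ),
        callback="parse_tale",
    ),
)
```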
- for reference
tedivm left a comment:
Overall this looks really good, but there are some changes to be made to the GitHub workflows.
- This change prevents automatic pushing to the SCP API during CI; it should run on pull requests to allow for manual review first
- Remove clone step as it's unnecessary
- Resolves scp-data#5
- This fixes the unintended removal of a LinkExtractor Rule
I adjusted the workflow file, but keep in mind I really added it by accident. So if the edits are not working out, I would prefer removing the yml file; I really do not know much about GitHub Actions :)

Also, for testing purposes I forked the api repo, too. See:

Result Data:

Note: I actually removed a link rule for "TALES", so in the test run "Tales" is empty. I reverted it with the last commit. I will start another workflow run to make sure this works out.
- Add checks for empty responses and missing 'body' in JSON (see the sketch below)
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
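A sketch of the kind of defensive checks described, assuming the history endpoint returns JSON whose `body` field holds the HTML revision listing; the function and field names are illustrative.

```python
# Illustrative only; the assumption is that the history endpoint answers with
# JSON of the form {"body": "<html ...>"}.
import json
import logging

logger = logging.getLogger(__name__)


def extract_history_body(response):
    """Return the 'body' HTML from a history response, or None on any failure."""
    if not response.text:
        logger.error("Empty history response for %s", response.url)
        return None
    try:
        payload = json.loads(response.text)
    except json.JSONDecodeError:
        logger.error("History response for %s is not valid JSON", response.url)
        return None
    body = payload.get("body")
    if not body:
        logger.error("History JSON for %s is missing 'body'", response.url)
        return None
    return body
```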
- Allows manual triggering of the workflow
- Improves flexibility for testing and updates
- Handle empty history cases by returning an empty list (a rough sketch of this normalization follows below)
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
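The normalization described above might look roughly like this; the revision schema and the date format are assumptions rather than the PR's exact code.

```python
# Sketch of the history normalization; the "date" key and its format are assumed.
from datetime import datetime


def normalize_history(history):
    if not history:                # empty or missing history -> empty list
        return []
    if isinstance(history, dict):  # accept both dict and list inputs
        revisions = list(history.values())
    else:
        revisions = list(history)

    def parsed_date(rev):
        raw = rev.get("date", "")
        try:
            return datetime.strptime(raw, "%d %b %Y %H:%M")
        except (TypeError, ValueError):
            return datetime.min    # unparseable or missing dates sort first

    return sorted(revisions, key=parsed_date)
```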
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError when the history key is absent
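As a minimal illustration of the difference, with a hypothetical record:

```python
item = {"title": "SCP-XXXX"}        # hypothetical record with no "history" key
history = item.get("history", [])   # falls back to [] instead of raising KeyError
```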
This PR adds first-class support for “supplement” pages from the SCP Wiki and integrates them into the existing crawl + post-processing workflow.
- `scp_supplement` spider to crawl pages tagged `supplement` and export them to `scp_supplement.json`.
- `run_postproc_supplement` to generate processed outputs under `supplement`:
  - `content_supplement.json` (full content + history/source/images)
  - `index.json` (metadata + `content_file`)
  - `parent_scp` (best-effort extracted from the link) and `parent_tale` (best-effort extracted from `*-` patterns); see the sketch after this list
- Integrated into `scp_crawl` and `scp_postprocess` (and still available via dedicated `supplement_*` targets).
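To illustrate what "best-effort" could mean here, a guess at the extraction logic (the actual patterns used in the PR may differ):

```python
# Hypothetical helper; the real extraction in the PR may use different patterns.
import re


def guess_parents(link):
    """Best-effort parent_scp / parent_tale guesses from a supplement page link."""
    slug = link.rstrip("/").rsplit("/", 1)[-1]            # e.g. "scp-4231-offices"
    scp_match = re.search(r"scp-\d+", slug, re.IGNORECASE)
    parent_scp = scp_match.group(0).lower() if scp_match else None
    # "<parent>-<suffix>" style slugs: take everything before the last dash
    parent_tale = slug.rsplit("-", 1)[0] if "-" in slug and not scp_match else None
    return parent_scp, parent_tale
```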
How to test

- `make scp` (includes supplements crawl + postprocess)

Or individually:
- `make supplement_crawl`
- `make supplement_postprocess`
- `scrapy crawl scp_supplement -o data/scp_supplement.json`

This PR should fix #2