
Conversation

@heroheman (Contributor) commented Dec 15, 2025

This PR adds first-class support for “supplement” pages from the SCP Wiki and integrates them into the existing crawl + post-processing workflow.

  • Adds a new Scrapy spider scp_supplement to crawl pages tagged supplement and export them to scp_supplement.json.
  • Introduces the ScpSupplement item type.
  • Extends post-processing with run_postproc_supplement to generate processed outputs under supplement/:
    • content_supplement.json (full content + history/source/images)
    • index.json (metadata + content_file)
  • Adds parent_scp (best-effort extracted from the link) and parent_tale (best-effort extracted from *- patterns).
  • Updates the Makefile so Supplements are included in scp_crawl and scp_postprocess (and still available via dedicated supplement_* targets).
  • Updates README documentation to mention the new spider, output file, and processed output.
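The best-effort parent extraction mentioned above could be sketched roughly like this (the function name and exact patterns are assumptions; the PR only states that parent_scp is derived from the link and parent_tale from `*-` patterns):

```python
import re

def extract_parents(url):
    """Best-effort guess of a supplement's parent SCP or parent tale from its URL.

    Returns a (parent_scp, parent_tale) tuple; either or both may be None.
    """
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    # Slugs like "scp-173-supplement" point at a parent SCP entry.
    m = re.match(r"scp-(\d+)", slug)
    if m:
        return "SCP-" + m.group(1), None
    # Otherwise treat the text before the first hyphen as a tale stem ("*-" pattern).
    m = re.match(r"(.+?)-", slug)
    if m:
        return None, m.group(1)
    return None, None
```

Since the extraction is heuristic, both fields should be treated as hints rather than authoritative links.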

How to test

  • make scp (includes supplements crawl + postprocess)

Or individually:

  • make supplement_crawl
  • make supplement_postprocess
  • scrapy crawl scp_supplement -o data/scp_supplement.json

This PR should fix #2

- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update Makefile to include supplement in data targets
- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use
- Add instructions for crawling pages tagged as 'supplement'
- Update content structure to include multiple content types
- Clarify post-processing details for supplements
- Updated the LinkExtractor in ScpTaleSpider to deny links
  matching specific patterns, improving the relevance of parsed tales.
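The deny behaviour described above can be illustrated without Scrapy: a LinkExtractor's `deny` argument rejects any URL matching one of the given regular expressions. A minimal stand-alone sketch (the actual patterns in ScpTaleSpider are assumptions):

```python
import re

# Hypothetical deny patterns; the real ScpTaleSpider expressions may differ.
DENY_PATTERNS = [re.compile(p) for p in (r"/forum/", r"/system:", r"scp-series")]

def is_relevant_tale_link(url):
    """Mirror LinkExtractor's deny semantics: drop URLs matching any deny pattern."""
    return not any(p.search(url) for p in DENY_PATTERNS)
```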
@tedivm (Member) left a comment


Overall this looks really good, but there are some changes to be made to the GitHub workflows.

- This change prevents automatic pushing to the SCP API during CI
- It should run on pull requests to allow for manual review first
- Remove the clone step, as it's unnecessary
- Resolves scp-data#5
- This fixes the unintended removal of a LinkExtractor rule
@heroheman (Contributor, Author) commented

I adjusted the workflow file, but keep in mind that I originally added it by accident. So if the edits don't work out, I would prefer removing the .yml file; I really don't know much about GitHub Actions :)


Also, for testing purposes I forked the API repo, too.

See:
https://github.com/heroheman/scp-api/
https://github.com/heroheman/scp-api/actions/runs/20237800863

Result Data:
https://github.com/heroheman/scp-api/tree/main/docs/data/scp/supplement

Note: I actually removed a link rule for "TALES", so in the test run "Tales" is empty. I reverted it with the last commit and will start another workflow run to make sure this works out.

@heroheman heroheman requested a review from tedivm December 15, 2025 16:55
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
- Allow manual triggering of the workflow
- Improve flexibility for testing and updates
- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
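Taken together, the history handling described in these commits might look roughly like this (the function name, field names, and date format are assumptions, not the PR's actual code):

```python
from datetime import datetime

def normalize_history(history):
    """Return history as a list of revisions sorted by date, oldest first.

    Accepts either a dict of revisions or a list; empty or missing history
    yields an empty list, and unparseable dates sort first instead of crashing.
    """
    if not history:
        return []
    revisions = list(history.values()) if isinstance(history, dict) else list(history)

    def revision_date(rev):
        raw = rev.get("date") if isinstance(rev, dict) else None
        if not raw:
            return datetime.min
        try:
            # Assumed Wikidot-style date format; adjust to the real data.
            return datetime.strptime(raw, "%d %b %Y %H:%M")
        except ValueError:
            return datetime.min

    return sorted(revisions, key=revision_date)
```

Downstream code would then call `item.get("history")` and pass the result through a helper like this, so a missing key never raises a KeyError.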
heroheman added a commit to heroheman/scp_crawler that referenced this pull request Dec 17, 2025
@heroheman heroheman closed this Dec 17, 2025
@heroheman (Contributor, Author) commented

I have split all the changes into feature branches and created #6, #7 and #8



Development

Successfully merging this pull request may close these issues.

Feature request: crawling of supplements

2 participants