feature: fetch links from paginated hubs and add them to references #9
Open: heroheman wants to merge 15 commits into scp-data:main from heroheman:feature/fetch-hub-pages
- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets

- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use

- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements

- Updated the LinkExtractor in ScpTaleSpider to deny links matching specific patterns, improving the relevance of parsed tales.

- for reference

- This change prevents automatic pushing to the SCP API during CI
- It should run on pull requests to allow for manual review first
- Remove clone step as it's unnecessary
- Resolves scp-data#5

- Fix unintended removal of LinkExtractor rule

- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes

- Allows manual triggering of the workflow
- Improves flexibility for testing and updates

- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values

- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present

- Implement a new pipeline to merge paginated content into hub items
- Create a script to fetch links from paginated hub pages
- Update Makefile to include pagination processing
- Enhance ScpHub item to store paginated content
- Add documentation for hub pagination support
- Introduce tests for pagination functionality

- Extract pagination URLs from hub content
- Fetch links from all paginated pages via HTTP
- Merge links into hub's references array during postprocessing
- Generate intermediate and final files for paginated data
Member comment: Can you resolve the conflicts?
tedivm (Member) requested changes on Dec 30, 2025 and left a comment:
I don't mind AI usage, but please clean it up a bit: remove the extra generated markdown files, the random files outside of the actual project, tests in the root, etc.
Note: it seems that I have some code from #6 in this. Maybe have a look at that first.
Hub Pagination Support
Summary
Adds support for extracting links from paginated hub pages on the SCP Wiki. Many hub pages have pagination (e.g., /chaos-insurgency-hub/p/2) that loads additional links to SCP documents. This PR implements a reliable, efficient solution to capture all links from these paginated pages.
Problem
Hub pages with pagination contain additional links to SCP items, tales, and other content that are only accessible via paginated URLs like /hub-name/p/1, /hub-name/p/2, etc. These links were previously not being captured during crawling.
Solution
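Instead of crawling full paginated pages with Scrapy (which caused rate limiting and low success rates), this implementation:
- Extracts pagination URLs from hub raw_content using regex pattern matching
- Fetches links from all paginated pages via HTTP
- Merges the links into hub references arrays during postprocessing
This approach is intended to be reliable and efficient.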
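As a rough illustration of the extraction and fetching described above, the sketch below finds pagination URLs in a hub's raw_content with a regex and collects internal links from the #page-content div of a fetched page. It is not the code from fetch_paginated_links.py: the regex, the function names (find_pagination_urls, extract_links, fetch_links), the slug filter, and the de-duplication are all assumptions made for the example. Only the base URL, the /hub-name/p/N URL shape, urllib.request, and the #page-content div come from this PR.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "https://scp-wiki.wikidot.com"  # host taken from the example URL in this PR

# Assumption: pagination links appear in raw_content as href="/<hub-name>/p/<n>".
PAGINATION_RE = re.compile(r'href="(/[a-z0-9-]+/p/(\d+))"')
# Assumption: a content link is a single-segment internal path such as /scp-173.
SLUG_RE = re.compile(r"/[a-z0-9-]+")


def find_pagination_urls(raw_content: str, hub_name: str) -> dict[str, str]:
    """Map page number -> absolute URL for the pagination links of one hub."""
    pages = {}
    for path, number in PAGINATION_RE.findall(raw_content):
        if path.startswith(f"/{hub_name}/p/"):
            pages[number] = urljoin(BASE_URL, path)
    return pages


class PageContentLinks(HTMLParser):
    """Collect internal link slugs found inside <div id="page-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # > 0 while the parser is inside the #page-content div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1  # nested div inside the content area
            elif attrs.get("id") == "page-content":
                self.depth = 1
        elif tag == "a" and self.depth > 0:
            href = attrs.get("href") or ""
            slug = href.lstrip("/")
            if SLUG_RE.fullmatch(href) and slug not in self.links:
                self.links.append(slug)

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html: str) -> list[str]:
    """Return de-duplicated link slugs from the #page-content div, in page order."""
    parser = PageContentLinks()
    parser.feed(html)
    return parser.links


def fetch_links(url: str) -> list[str]:
    """Download one paginated page and extract its links (performs an HTTP request)."""
    with urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html)
```

Requesting only the handful of pagination URLs per hub, rather than re-crawling them through Scrapy, keeps the number of requests small, which is in line with the rate-limiting concern mentioned above.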
Testing:
See the test action in my forked repo:
https://github.com/heroheman/scp_crawler/actions/runs/20395230587/job/58609786632?pr=1
Changes
New Files
- fetch_paginated_links.py: Script to extract and fetch links from paginated hub pages
  - Fetches pages with urllib.request
  - Extracts links from the #page-content div
  - Writes results to data/paginated_links.json
- PAGINATION.md: Comprehensive documentation of the pagination implementation
Modified Files
- scp_crawler/postprocessing.py:
  - Loads paginated_links.json during hub processing
  - Merges paginated links into hub references arrays
- makefile:
  - data/scp_hubs.json target now runs fetch_paginated_links.py after crawling
  - scp_postprocess now includes data/processed/hubs
- .github/workflows/scp-items.yml
- README.md
Data Structure
data/paginated_links.json:

    {
      "chaos-insurgency-hub": {
        "1": {
          "url": "https://scp-wiki.wikidot.com/chaos-insurgency-hub/p/1",
          "links": ["scp-xxxx", "tale-yyyy", ...],
          "link_count": 50
        }
      }
    }

Result in data/processed/hubs/index.json:

    {
      "chaos-insurgency-hub": {
        "references": [
          "scp-001", "scp-002",     // ... original links from main page
          "scp-9999", "scp-10000"   // ... links from paginated pages (merged)
        ]
      }
    }

Usage
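To show how the two structures above relate, here is a minimal sketch of a merge step that appends links from a paginated_links.json-shaped mapping onto each hub's references array. It is an illustration, not the code in scp_crawler/postprocessing.py: the function name merge_paginated_links, the page ordering, and the de-duplication are assumptions.

```python
def merge_paginated_links(hubs: dict, paginated: dict) -> dict:
    """Append paginated links to each hub's references, skipping duplicates.

    `hubs` follows the shape of data/processed/hubs/index.json and
    `paginated` the shape of data/paginated_links.json shown above.
    """
    for hub_name, pages in paginated.items():
        hub = hubs.get(hub_name)
        if hub is None:
            continue  # assumption: pagination data for unknown hubs is ignored
        references = hub.setdefault("references", [])
        seen = set(references)
        # Page keys are strings ("1", "2", ...), so sort them numerically.
        for page_number in sorted(pages, key=int):
            for link in pages[page_number].get("links", []):
                if link not in seen:
                    seen.add(link)
                    references.append(link)
    return hubs


hubs = {"chaos-insurgency-hub": {"references": ["scp-001", "scp-002"]}}
paginated = {
    "chaos-insurgency-hub": {
        "1": {"links": ["scp-002", "scp-9999"], "link_count": 2},
    }
}
merged = merge_paginated_links(hubs, paginated)
print(merged["chaos-insurgency-hub"]["references"])
# ['scp-001', 'scp-002', 'scp-9999']
```

Because existing entries are kept in place and new links are only appended, consumers that already read references keep working with the enriched data.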
Breaking Changes
None. The references array is simply enriched with additional links from paginated pages. Existing functionality remains unchanged.
Related Documentation