heroheman (Contributor) commented Dec 24, 2025

Note: it seems that I have some code from #6 in this. Maybe have a look at that first.

Hub Pagination Support

Summary

Adds support for extracting links from paginated hub pages on the SCP Wiki. Many hub pages use pagination (e.g., /chaos-insurgency-hub/p/2) to load additional links to SCP documents. This PR implements a reliable, efficient way to capture all links from these paginated pages.

Problem

Hub pages with pagination contain additional links to SCP items, tales, and other content that are only accessible via paginated URLs like /hub-name/p/1, /hub-name/p/2, etc. These links were previously not being captured during crawling.

Solution

Instead of crawling full paginated pages with Scrapy (which caused rate limiting and low success rates), this implementation:

  1. Extracts pagination URLs from hub raw_content using regex pattern matching
  2. Fetches links directly from paginated pages via simple HTTP requests, not Scrapy (see the sketch below)
  3. Merges links into hub references arrays during postprocessing

This approach is:

  • Reliable: Simple HTTP requests avoid Scrapy rate limiting issues
  • Efficient: Only fetches what's needed (links, not full page content)
  • Complete: Successfully processes all paginated URLs
  • Clean: No duplicate content, all links merged into existing structure
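
For illustration, here is a minimal sketch of steps 1 and 2, with a hypothetical URL pattern and helper names; the actual logic lives in fetch_paginated_links.py:

```python
import re
import urllib.request

# Hypothetical pattern; the real regex in fetch_paginated_links.py may differ.
PAGINATION_RE = re.compile(r'href="/([\w:-]+)/p/(\d+)"')

def extract_pagination_urls(raw_content: str) -> list[str]:
    """Step 1: find paginated hub URLs (e.g. /chaos-insurgency-hub/p/2) in raw HTML."""
    return sorted(
        {f"https://scp-wiki.wikidot.com/{slug}/p/{page}"
         for slug, page in PAGINATION_RE.findall(raw_content)}
    )

def fetch_page_links(url: str) -> list[str]:
    """Step 2: fetch one paginated page with plain urllib and pull out article slugs.

    The real script scopes extraction to the #page-content div; a bare regex
    keeps this sketch short.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="/((?:scp-|tale)[\w-]*)"', html)
```

Fetching with urllib keeps these requests outside Scrapy's scheduler, which is what sidesteps the rate-limiting problem described above.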

Testing

See the test action in my forked repo:
https://github.com/heroheman/scp_crawler/actions/runs/20395230587/job/58609786632?pr=1

Changes

New Files

  • fetch_paginated_links.py: Script to extract and fetch links from paginated hub pages

    • Scans hub content for pagination URLs
    • Fetches each URL via urllib.request
    • Extracts links from the #page-content div
    • Outputs to data/paginated_links.json (see the sketch after this list)
  • PAGINATION.md: Comprehensive documentation of the pagination implementation
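
To round this out, here is a hedged sketch of the output stage of fetch_paginated_links.py, assuming the helpers sketched above and the layout documented under Data Structure below:

```python
import json

def write_paginated_links(pages: dict[str, list[str]],
                          out_path: str = "data/paginated_links.json") -> None:
    """Write per-hub, per-page link lists in the layout shown under Data Structure."""
    result: dict[str, dict] = {}
    for url, links in pages.items():
        # URL shape assumed: https://scp-wiki.wikidot.com/<hub>/p/<page>
        hub, _, page = url.rsplit("/", 3)[1:]
        result.setdefault(hub, {})[page] = {
            "url": url,
            "links": links,
            "link_count": len(links),
        }
    with open(out_path, "w") as f:
        json.dump(result, f, indent=2)
```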

Modified Files

  • scp_crawler/postprocessing.py:

    • Loads paginated_links.json during hub processing
    • Merges paginated links into hub references arrays (see the merge sketch after this list)
    • Removes duplicates
    • Removed old backwards-compatibility code
  • makefile:

    • data/scp_hubs.json target now runs fetch_paginated_links.py after crawling
    • scp_postprocess now includes data/processed/hubs
    • Simplified and cleaned up targets
  • .github/workflows/scp-items.yml:

    • Added "Process Hubs" step after "Crawl Hubs"
  • README.md:

    • Added "Hub Pagination" section explaining the feature
    • Documented generated files and output structure
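
To make the postprocessing change concrete, here is a hedged sketch of the merge step; the function and parameter names are illustrative, not the exact ones in scp_crawler/postprocessing.py:

```python
import json

def merge_paginated_links(hub_name: str, references: list[str],
                          path: str = "data/paginated_links.json") -> list[str]:
    """Append links from every paginated page of a hub onto its existing
    references, dropping duplicates while preserving order."""
    with open(path) as f:
        paginated = json.load(f)
    merged = list(references)
    seen = set(merged)
    for page in paginated.get(hub_name, {}).values():
        for link in page.get("links", []):
            if link not in seen:
                seen.add(link)
                merged.append(link)
    return merged
```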

Data Structure

data/paginated_links.json:

```json
{
  "chaos-insurgency-hub": {
    "1": {
      "url": "https://scp-wiki.wikidot.com/chaos-insurgency-hub/p/1",
      "links": ["scp-xxxx", "tale-yyyy", ...],
      "link_count": 50
    }
  }
}
```

Result in data/processed/hubs/index.json:

```json
{
  "chaos-insurgency-hub": {
    "references": [
      "scp-001",
      "scp-002",
      // ... original links from main page
      "scp-9999",
      "scp-10000"
      // ... links from paginated pages (merged)
    ]
  }
}
```
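
As a quick sanity check, the merged output can be inspected like this, assuming the structure above:

```python
import json

with open("data/processed/hubs/index.json") as f:
    hubs = json.load(f)

for hub, data in hubs.items():
    print(f"{hub}: {len(data.get('references', []))} references")
```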

Usage

```shell
# Crawl hubs and fetch paginated links
make data/scp_hubs.json

# Process and merge pagination data
make data/processed/hubs
```

Breaking Changes

None. The references array is simply enriched with additional links from paginated pages. Existing functionality remains unchanged.

Related Documentation

  • PAGINATION.md - Detailed implementation guide

Commits

- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets
- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use
- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements
- Updated the LinkExtractor in ScpTaleSpider to deny links
  matching specific patterns, improving the relevance of parsed tales.
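
For context, a deny rule in Scrapy looks roughly like this; the patterns here are assumptions, not the actual ones in ScpTaleSpider:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Hypothetical deny patterns; see ScpTaleSpider for the real ones.
tale_rule = Rule(
    LinkExtractor(deny=(r"/forum/", r"/system:", r"/tags/")),
    callback="parse_tale",
)
```
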
- This change prevents automatic pushing to the SCP API during CI
- It should run on pull requests to allow for manual review first
- Remove the clone step, as it's unnecessary
- Resolves scp-data#5
- This fixes the unintended removal of a LinkExtractor rule
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
- Allows manual triggering of the workflow
- Improves flexibility for testing and updates
- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
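
Taken together, the history-handling items above amount to defensive parsing along these lines (a sketch with hypothetical names, not the actual code):

```python
from datetime import datetime

def normalize_history(history) -> list[dict]:
    """Accept dict or list history, tolerate missing or malformed dates,
    and return revisions sorted oldest-first."""
    if not history:
        return []
    revisions = list(history.values()) if isinstance(history, dict) else list(history)

    def revision_date(rev: dict) -> datetime:
        try:
            return datetime.fromisoformat(rev.get("date", ""))
        except (TypeError, ValueError):
            return datetime.min  # missing or unparseable dates sort first

    return sorted(revisions, key=revision_date)
```
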
- Implement a new pipeline to merge paginated content into hub items
- Create a script to fetch links from paginated hub pages
- Update Makefile to include pagination processing
- Enhance ScpHub item to store paginated content
- Add documentation for hub pagination support
- Introduce tests for pagination functionality
- Extract pagination URLs from hub content
- Fetch links from all paginated pages via HTTP
- Merge links into hub's references array during postprocessing
- Generate intermediate and final files for paginated data
tedivm (Member) commented Dec 29, 2025

Can you resolve the conflicts?

tedivm (Member) left a comment

I don't mind AI usage, but please clean it up a bit: remove the extra generated markdown files, the random files outside of the actual project, tests in the root, etc.
