heroheman (Contributor) commented Dec 24, 2025

Note: it seems that I have some code from #6 in this. Maybe have a look at that first.

Hub Pagination Support

Summary

Adds support for extracting links from paginated hub pages on the SCP Wiki. Many hub pages use pagination (e.g., /chaos-insurgency-hub/p/2) to load additional links to SCP documents. This PR implements a reliable, efficient way to capture all links from these paginated pages.

Problem

Hub pages with pagination contain additional links to SCP items, tales, and other content that are only accessible via paginated URLs like /hub-name/p/1, /hub-name/p/2, etc. These links were previously not being captured during crawling.

Solution

Instead of crawling full paginated pages with Scrapy (which caused rate limiting and low success rates), this implementation:

  1. Extracts pagination URLs from hub raw_content using regex pattern matching
  2. Fetches links directly from paginated pages via simple HTTP requests, not Scrapy (see the sketch below)
  3. Merges links into hub references arrays during postprocessing

This approach is:

  • Reliable: Simple HTTP requests avoid Scrapy rate limiting issues
  • Efficient: Only fetches what's needed (links, not full page content)
  • Complete: Successfully processes all paginated URLs
  • Clean: No duplicate content, all links merged into existing structure
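
For illustration, here is a minimal sketch of steps 1 and 2, with a hypothetical URL pattern and helper names; the actual logic lives in fetch_paginated_links.py:

```python
import re
import urllib.request

# Hypothetical pattern; the real regex in fetch_paginated_links.py may differ.
PAGINATION_RE = re.compile(r'href="/([\w:-]+)/p/(\d+)"')

def extract_pagination_urls(raw_content: str) -> list[str]:
    """Step 1: find paginated hub URLs (e.g. /chaos-insurgency-hub/p/2) in raw HTML."""
    return sorted(
        {f"https://scp-wiki.wikidot.com/{slug}/p/{page}"
         for slug, page in PAGINATION_RE.findall(raw_content)}
    )

def fetch_page_links(url: str) -> list[str]:
    """Step 2: fetch one paginated page with plain urllib and pull out article slugs.

    The real script scopes extraction to the #page-content div; a bare regex
    keeps this sketch short.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="/((?:scp-|tale)[\w-]*)"', html)
```

Fetching with urllib keeps these requests outside Scrapy's scheduler, which is what sidesteps the rate-limiting problem described above.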

Testing

See the test action in my forked repo:
https://github.com/heroheman/scp_crawler/actions/runs/20395230587/job/58609786632?pr=1

Changes

New Files

  • fetch_paginated_links.py: Script to extract and fetch links from paginated hub pages

    • Scans hub content for pagination URLs
    • Fetches each URL via urllib.request
    • Extracts links from the #page-content div
    • Outputs to data/paginated_links.json (see the sketch after this list)
  • PAGINATION.md: Comprehensive documentation of the pagination implementation
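
To round this out, here is a hedged sketch of the output stage of fetch_paginated_links.py, assuming the helpers sketched above and the layout documented under Data Structure below:

```python
import json

def write_paginated_links(pages: dict[str, list[str]],
                          out_path: str = "data/paginated_links.json") -> None:
    """Write per-hub, per-page link lists in the layout shown under Data Structure."""
    result: dict[str, dict] = {}
    for url, links in pages.items():
        # URL shape assumed: https://scp-wiki.wikidot.com/<hub>/p/<page>
        hub, _, page = url.rsplit("/", 3)[1:]
        result.setdefault(hub, {})[page] = {
            "url": url,
            "links": links,
            "link_count": len(links),
        }
    with open(out_path, "w") as f:
        json.dump(result, f, indent=2)
```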

Modified Files

  • scp_crawler/postprocessing.py:

    • Loads paginated_links.json during hub processing
    • Merges paginated links into hub references arrays (see the merge sketch after this list)
    • Removes duplicates
    • Removed old backwards-compatibility code
  • makefile:

    • data/scp_hubs.json target now runs fetch_paginated_links.py after crawling
    • scp_postprocess now includes data/processed/hubs
    • Simplified and cleaned up targets
  • .github/workflows/scp-items.yml:

    • Added "Process Hubs" step after "Crawl Hubs"
  • README.md:

    • Added "Hub Pagination" section explaining the feature
    • Documented generated files and output structure
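
To make the postprocessing change concrete, here is a hedged sketch of the merge step; the function and parameter names are illustrative, not the exact ones in scp_crawler/postprocessing.py:

```python
import json

def merge_paginated_links(hub_name: str, references: list[str],
                          path: str = "data/paginated_links.json") -> list[str]:
    """Append links from every paginated page of a hub onto its existing
    references, dropping duplicates while preserving order."""
    with open(path) as f:
        paginated = json.load(f)
    merged = list(references)
    seen = set(merged)
    for page in paginated.get(hub_name, {}).values():
        for link in page.get("links", []):
            if link not in seen:
                seen.add(link)
                merged.append(link)
    return merged
```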

Data Structure

data/paginated_links.json:

```json
{
  "chaos-insurgency-hub": {
    "1": {
      "url": "https://scp-wiki.wikidot.com/chaos-insurgency-hub/p/1",
      "links": ["scp-xxxx", "tale-yyyy", ...],
      "link_count": 50
    }
  }
}
```

Result in data/processed/hubs/index.json:

```json
{
  "chaos-insurgency-hub": {
    "references": [
      "scp-001",
      "scp-002",
      // ... original links from main page
      "scp-9999",
      "scp-10000"
      // ... links from paginated pages (merged)
    ]
  }
}
```
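
As a quick sanity check, the merged output can be inspected like this, assuming the structure above:

```python
import json

with open("data/processed/hubs/index.json") as f:
    hubs = json.load(f)

for hub, data in hubs.items():
    print(f"{hub}: {len(data.get('references', []))} references")
```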

Usage

```shell
# Crawl hubs and fetch paginated links
make data/scp_hubs.json

# Process and merge pagination data
make data/processed/hubs
```

Breaking Changes

None. The references array is simply enriched with additional links from paginated pages. Existing functionality remains unchanged.

Related Documentation

  • PAGINATION.md - Detailed implementation guide

Commits

- Introduce ScpSupplement class for item representation
- Implement ScpSupplementSpider to crawl supplement pages
- Update makefile to include supplement in data targets
- Implement run_postproc_supplement to process SCP supplement data
- Create necessary directories and handle data extraction
- Store processed supplements in JSON format for further use
- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements
- Updated the LinkExtractor in ScpTaleSpider to deny links
  matching specific patterns, improving the relevance of parsed tales.
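
For context, a deny rule in Scrapy looks roughly like this; the patterns here are assumptions, not the actual ones in ScpTaleSpider:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Hypothetical deny patterns; see ScpTaleSpider for the real ones.
tale_rule = Rule(
    LinkExtractor(deny=(r"/forum/", r"/system:", r"/tags/")),
    callback="parse_tale",
)
```
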
- This change prevents automatic pushing to the SCP API during CI
- It should run on pull requests to allow for manual review first
- Remove the clone step, as it's unnecessary
- Resolves scp-data#5
- This fixes the unintended removal of a LinkExtractor rule
- Add checks for empty responses and missing 'body' in JSON
- Log errors for various failure scenarios to improve debugging
- Ensure robust parsing of history HTML to prevent crashes
- Allows manual triggering of the workflow
- Improves flexibility for testing and updates
- Handle empty history cases by returning an empty list
- Support both dict and list formats for history input
- Safely parse date strings with error handling
- Sort revisions by date, ensuring robustness against missing values
- Use `get` method to safely access history in hubs and items
- Prevent potential KeyError by ensuring history key is present
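
Taken together, the history-handling items above amount to defensive parsing along these lines (a sketch with hypothetical names, not the actual code):

```python
from datetime import datetime

def normalize_history(history) -> list[dict]:
    """Accept dict or list history, tolerate missing or malformed dates,
    and return revisions sorted oldest-first."""
    if not history:
        return []
    revisions = list(history.values()) if isinstance(history, dict) else list(history)

    def revision_date(rev: dict) -> datetime:
        try:
            return datetime.fromisoformat(rev.get("date", ""))
        except (TypeError, ValueError):
            return datetime.min  # missing or unparseable dates sort first

    return sorted(revisions, key=revision_date)
```
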
- Implement a new pipeline to merge paginated content into hub items
- Create a script to fetch links from paginated hub pages
- Update Makefile to include pagination processing
- Enhance ScpHub item to store paginated content
- Add documentation for hub pagination support
- Introduce tests for pagination functionality
- Extract pagination URLs from hub content
- Fetch links from all paginated pages via HTTP
- Merge links into hub's references array during postprocessing
- Generate intermediate and final files for paginated data
tedivm (Member) commented Dec 29, 2025

Can you resolve the conflicts?

tedivm (Member) left a comment

I don't mind AI usage, but please clean it up a bit: remove the extra generated markdown files, the random files outside of the actual project, tests in the root, etc.
