zhuermu/spider-for-wechat

WeChat Article Scraper

A mitmproxy-based tool for extracting WeChat public account article metadata from the desktop client. Supports both manual scraping and scheduled automation.

Features

  • 🔍 Real-time Extraction - Intercepts WeChat traffic to extract article metadata
  • ⏰ Scheduled Tasks - Daily automated scraping of multiple accounts
  • 🕒 Time Filtering - Filter articles by publication date (recent N days)
  • ☁️ S3 Integration - Automatic upload to AWS S3 with aggregation
  • 📊 Smart Parsing - Handles both single and multi-article messages
  • 🪟 Windows Native - UI automation for WeChat desktop client

Quick Start

Prerequisites

  • Python 3.7+
  • WeChat Desktop (Windows)
  • Administrator privileges (for transparent proxy)

Installation

# Clone and install dependencies
git clone <repository-url>
cd spider-wechat
pip install -r requirements.txt

# Copy configuration template
copy config.yaml.template config.yaml

Usage

Method 1: Manual Scraping (Real-time)

For ad-hoc scraping of specific WeChat accounts:

# Start the parser
start_wechat_parser.bat

# In WeChat:
# 1. Find the public account's article history link
# 2. Send to "File Transfer Assistant" 
# 3. Click the link to open

# Check results
dir output\articles_*.json

Method 2: Scheduled Tasks (Automated)

For daily automated collection from multiple accounts:

# 1. Configure AWS credentials (choose one method)

# Option A: config.yaml (recommended)
s3:
  aws_access_key_id: "your_key"
  aws_secret_access_key: "your_secret"

# Option B: Environment variables
$env:AWS_ACCESS_KEY_ID = "your_key"
$env:AWS_SECRET_ACCESS_KEY = "your_secret"

# Option C: AWS CLI
aws configure

# 2. Edit config.yaml with your URLs
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."

# 3. Test run
python scheduled_scraper.py --run-once

# 4. Start scheduler
start_scheduled_scraper.bat
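The three credential options in step 1 resolve in a fixed order: explicit keys in config.yaml win, then environment variables, and otherwise boto3's default chain (which picks up profiles written by `aws configure`). A minimal sketch of that precedence; `resolve_credentials` is an illustrative name, not necessarily the repository's actual helper:

```python
import os

def resolve_credentials(cfg, env=None):
    """Pick AWS credentials by precedence: config.yaml first, then
    environment variables. Returning (None, None) defers to boto3's
    default chain (e.g. profiles created by `aws configure`)."""
    env = os.environ if env is None else env
    key = cfg.get("aws_access_key_id") or env.get("AWS_ACCESS_KEY_ID")
    secret = cfg.get("aws_secret_access_key") or env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return key, secret
    return None, None
```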

Configuration

Key settings in config.yaml:

# WeChat account URLs
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."

# Output settings  
output:
  filter_days: 1        # Keep articles from last N days
  
# S3 settings (for scheduled tasks)
s3:
  bucket: "your-bucket-name"
  prefix: "articles/"
  
# Scheduler settings
scheduler:
  enabled: true
  time: "00:00"         # Execution time (HH:MM format)
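A minimal sketch of loading these settings with PyYAML and filling in the documented defaults (filter_days: 1, time: "00:00"); the function name is illustrative and may differ from the repository's actual config_loader:

```python
import yaml

def load_settings(text):
    """Parse config.yaml content and apply documented defaults
    for optional keys."""
    cfg = yaml.safe_load(text) or {}
    cfg.setdefault("input_urls", [])
    cfg.setdefault("output", {}).setdefault("filter_days", 1)
    cfg.setdefault("scheduler", {}).setdefault("time", "00:00")
    return cfg

cfg = load_settings("""
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home"
s3:
  bucket: "your-bucket-name"
""")
```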

Output Format

{
  "source_url": "https://mp.weixin.qq.com/mp/profile_ext?...",
  "extracted_at": "2025-11-03T20:30:00",
  "article_count": 10,
  "articles": [
    {
      "title": "Article Title",
      "url": "https://mp.weixin.qq.com/s?__biz=...",
      "publish_time": "2025-11-03 10:30:00",
      "digest": "Article summary",
      "cover": "Cover image URL",
      "author": "Author name"
    }
  ]
}
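The filter_days setting keeps only articles whose publish_time falls within the last N days. A stdlib sketch of that cutoff applied to the output format above; `filter_recent` is a hypothetical name for illustration:

```python
import json
from datetime import datetime, timedelta

sample = {
    "source_url": "https://mp.weixin.qq.com/mp/profile_ext?...",
    "extracted_at": "2025-11-03T20:30:00",
    "article_count": 2,
    "articles": [
        {"title": "Recent", "publish_time": "2025-11-03 10:30:00"},
        {"title": "Old", "publish_time": "2025-10-01 09:00:00"},
    ],
}

def filter_recent(data, days, now=None):
    """Keep articles published within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [
        a for a in data["articles"]
        if datetime.strptime(a["publish_time"], "%Y-%m-%d %H:%M:%S") >= cutoff
    ]

recent = filter_recent(sample, days=1, now=datetime(2025, 11, 3, 20, 30))
```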

Commands

Manual Scraping

# Start parser (recommended)
start_wechat_parser.bat

# Alternative: Python script
python run_wechat_parser.py

# Direct mitmproxy
mitmdump --mode local:WeChatAppEx.exe -s src/addons/wechat_article_parser.py

Scheduled Tasks

# Test run (execute once)
python scheduled_scraper.py --run-once

# Custom execution time
python scheduled_scraper.py --time "08:00"

# View logs
type logs\scraper.log

Project Structure

spider-wechat/
├── src/
│   ├── addons/                          # mitmproxy addons
│   │   └── wechat_article_parser.py     # Article parser addon
│   ├── services/                        # Business logic
│   │   ├── scheduler.py                 # Task scheduler
│   │   ├── article_aggregator.py        # Article aggregation & S3 upload
│   │   └── wechat_controller.py         # WeChat client automation
│   ├── models/                          # Data models
│   │   ├── article.py                   # Article dataclass
│   │   ├── config.py                    # Configuration model
│   │   └── execution_result.py          # Execution result model
│   └── utils/                           # Utilities
│       ├── logger.py                    # Logging setup
│       ├── config_loader.py             # Configuration loader
│       ├── url_validator.py             # URL validation
│       └── url_manager.py               # URL management
├── output/                              # Output directory
│   ├── articles_*.json                  # Raw scraped data
│   └── aggregated_articles_*.json       # Aggregated data (scheduled)
├── config.yaml                          # Configuration file
├── requirements.txt                     # Python dependencies
├── start_wechat_parser.bat             # Manual scraping launcher
├── start_scheduled_scraper.bat         # Scheduled task launcher
└── README.md                           # This file

Troubleshooting

Q: No data intercepted?
A: Make sure mitmproxy is running, and open article links in WeChat's built-in browser rather than an external one.

Q: How to change the scheduled execution time?
A: Edit scheduler.time in config.yaml, or pass the --time parameter.

Q: How to change the article filter window?
A: Edit output.filter_days in config.yaml.

Q: S3 upload failed?
A: Check your AWS credentials configuration. Data is still saved locally even when the upload fails.

Q: WeChat client not running?
A: Scheduled tasks retry the process check 5 times at 10-second intervals.

How It Works

Manual Scraping Flow

WeChat Client → mitmproxy Intercept → Parse Article Data → Save JSON
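The "Parse Article Data" step walks the JSON embedded in a profile_ext response, where a single message can carry one lead article plus companions (this is what "Smart Parsing" refers to). A sketch of that walk; the field names (general_msg_list, app_msg_ext_info, multi_app_msg_item_list) follow the commonly observed WeChat profile_ext schema and are assumptions, not verified against src/addons/wechat_article_parser.py:

```python
import json

def parse_msg_list(general_msg_list):
    """Extract article dicts from a WeChat general_msg_list payload,
    handling both single- and multi-article messages."""
    articles = []
    for msg in json.loads(general_msg_list).get("list", []):
        info = msg.get("app_msg_ext_info")
        if not info:
            continue
        # Lead article first, then any companions in a multi-article post
        entries = [info] + info.get("multi_app_msg_item_list", [])
        for e in entries:
            articles.append({
                "title": e.get("title", ""),
                "url": e.get("content_url", ""),
                "digest": e.get("digest", ""),
            })
    return articles
```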

Scheduled Task Flow

Timer Trigger → Check WeChat → Open Account → Scrape Articles 
→ Time Filter → Aggregate Data → Save Local + Upload S3
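The "Check WeChat" step retries before giving up (5 attempts at 10-second intervals, per Troubleshooting). A stdlib sketch of that retry policy; `wait_for_wechat` is a hypothetical name, and the injectable `sleep` exists only to make the sketch testable:

```python
import time

def wait_for_wechat(is_running, retries=5, interval=10, sleep=time.sleep):
    """Poll the WeChat process check up to `retries` times, waiting
    `interval` seconds between attempts; True once the client is up."""
    for attempt in range(1, retries + 1):
        if is_running():
            return True
        if attempt < retries:
            sleep(interval)
    return False
```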

Technology Stack

  • mitmproxy (10.1.0+) - HTTPS traffic interception and addon system
  • pywinauto (0.6.8+) - Windows UI automation
  • pyautogui (0.9.54+) - Mouse/keyboard automation
  • psutil (5.9.0+) - Process detection
  • loguru (0.7.0+) - Structured logging
  • schedule (1.2.0+) - Cron-like job scheduling
  • boto3 (1.28.0+) - AWS S3 client
  • PyYAML (6.0+) - Configuration management

Requirements

  1. WeChat Desktop Client - Must be running during scheduled tasks
  2. mitmproxy Running - Uses transparent proxy mode for traffic interception
  3. Administrator Privileges - Required for transparent proxy
  4. Stable Network - Required for fetching article data and uploading results to S3

License

MIT License
