zhuermu/spider-for-wechat

WeChat Article Scraper

A mitmproxy-based tool for extracting WeChat public account article metadata from the desktop client. Supports both manual scraping and scheduled automation.

Features

  • 🔍 Real-time Extraction - Intercepts WeChat traffic to extract article metadata
  • ⏰ Scheduled Tasks - Daily automated scraping of multiple accounts
  • 🕒 Time Filtering - Filter articles by publication date (recent N days)
  • ☁️ S3 Integration - Automatic upload to AWS S3 with aggregation
  • 📊 Smart Parsing - Handles both single and multi-article messages
  • 🪟 Windows Native - UI automation for WeChat desktop client

Quick Start

Prerequisites

  • Python 3.7+
  • WeChat Desktop (Windows)
  • Administrator privileges (for transparent proxy)

Installation

# Clone and install dependencies
git clone <repository-url>
cd spider-wechat
pip install -r requirements.txt

# Copy configuration template
copy config.yaml.template config.yaml

Usage

Method 1: Manual Scraping (Real-time)

For ad-hoc scraping of specific WeChat accounts:

# Start the parser
start_wechat_parser.bat

# In WeChat:
# 1. Find the public account's article history link
# 2. Send to "File Transfer Assistant" 
# 3. Click the link to open

# Check results
dir output\articles_*.json

Method 2: Scheduled Tasks (Automated)

For daily automated collection from multiple accounts:

# 1. Configure AWS credentials (choose one method)

# Option A: config.yaml (recommended)
s3:
  aws_access_key_id: "your_key"
  aws_secret_access_key: "your_secret"

# Option B: Environment variables
$env:AWS_ACCESS_KEY_ID = "your_key"
$env:AWS_SECRET_ACCESS_KEY = "your_secret"

# Option C: AWS CLI
aws configure

# 2. Edit config.yaml with your URLs
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."

# 3. Test run
python scheduled_scraper.py --run-once

# 4. Start scheduler
start_scheduled_scraper.bat
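The three credential options in step 1 resolve in a fixed order: explicit keys in config.yaml win, then environment variables, and otherwise boto3's default chain (which picks up profiles written by `aws configure`). A minimal sketch of that precedence; `resolve_credentials` is an illustrative name, not necessarily the repository's actual helper:

```python
import os

def resolve_credentials(cfg, env=None):
    """Pick AWS credentials by precedence: config.yaml first, then
    environment variables. Returning (None, None) defers to boto3's
    default chain (e.g. profiles created by `aws configure`)."""
    env = os.environ if env is None else env
    key = cfg.get("aws_access_key_id") or env.get("AWS_ACCESS_KEY_ID")
    secret = cfg.get("aws_secret_access_key") or env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return key, secret
    return None, None
```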

Configuration

Key settings in config.yaml:

# WeChat account URLs
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."

# Output settings  
output:
  filter_days: 1        # Keep articles from last N days
  
# S3 settings (for scheduled tasks)
s3:
  bucket: "your-bucket-name"
  prefix: "articles/"
  
# Scheduler settings
scheduler:
  enabled: true
  time: "00:00"         # Execution time (HH:MM format)
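A minimal sketch of loading these settings with PyYAML and filling in the documented defaults (filter_days: 1, time: "00:00"); the function name is illustrative and may differ from the repository's actual config_loader:

```python
import yaml

def load_settings(text):
    """Parse config.yaml content and apply documented defaults
    for optional keys."""
    cfg = yaml.safe_load(text) or {}
    cfg.setdefault("input_urls", [])
    cfg.setdefault("output", {}).setdefault("filter_days", 1)
    cfg.setdefault("scheduler", {}).setdefault("time", "00:00")
    return cfg

cfg = load_settings("""
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home"
s3:
  bucket: "your-bucket-name"
""")
```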

Output Format

{
  "source_url": "https://mp.weixin.qq.com/mp/profile_ext?...",
  "extracted_at": "2025-11-03T20:30:00",
  "article_count": 10,
  "articles": [
    {
      "title": "Article Title",
      "url": "https://mp.weixin.qq.com/s?__biz=...",
      "publish_time": "2025-11-03 10:30:00",
      "digest": "Article summary",
      "cover": "Cover image URL",
      "author": "Author name"
    }
  ]
}
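The filter_days setting keeps only articles whose publish_time falls within the last N days. A stdlib sketch of that cutoff applied to the output format above; `filter_recent` is a hypothetical name for illustration:

```python
import json
from datetime import datetime, timedelta

sample = {
    "source_url": "https://mp.weixin.qq.com/mp/profile_ext?...",
    "extracted_at": "2025-11-03T20:30:00",
    "article_count": 2,
    "articles": [
        {"title": "Recent", "publish_time": "2025-11-03 10:30:00"},
        {"title": "Old", "publish_time": "2025-10-01 09:00:00"},
    ],
}

def filter_recent(data, days, now=None):
    """Keep articles published within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [
        a for a in data["articles"]
        if datetime.strptime(a["publish_time"], "%Y-%m-%d %H:%M:%S") >= cutoff
    ]

recent = filter_recent(sample, days=1, now=datetime(2025, 11, 3, 20, 30))
```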

Commands

Manual Scraping

# Start parser (recommended)
start_wechat_parser.bat

# Alternative: Python script
python run_wechat_parser.py

# Direct mitmproxy
mitmdump --mode local:WeChatAppEx.exe -s src/addons/wechat_article_parser.py

Scheduled Tasks

# Test run (execute once)
python scheduled_scraper.py --run-once

# Custom execution time
python scheduled_scraper.py --time "08:00"

# View logs
type logs\scraper.log

Project Structure

spider-wechat/
├── src/
│   ├── addons/                          # mitmproxy addons
│   │   └── wechat_article_parser.py     # Article parser addon
│   ├── services/                        # Business logic
│   │   ├── scheduler.py                 # Task scheduler
│   │   ├── article_aggregator.py        # Article aggregation & S3 upload
│   │   └── wechat_controller.py         # WeChat client automation
│   ├── models/                          # Data models
│   │   ├── article.py                   # Article dataclass
│   │   ├── config.py                    # Configuration model
│   │   └── execution_result.py          # Execution result model
│   └── utils/                           # Utilities
│       ├── logger.py                    # Logging setup
│       ├── config_loader.py             # Configuration loader
│       ├── url_validator.py             # URL validation
│       └── url_manager.py               # URL management
├── output/                              # Output directory
│   ├── articles_*.json                  # Raw scraped data
│   └── aggregated_articles_*.json       # Aggregated data (scheduled)
├── config.yaml                          # Configuration file
├── requirements.txt                     # Python dependencies
├── start_wechat_parser.bat             # Manual scraping launcher
├── start_scheduled_scraper.bat         # Scheduled task launcher
└── README.md                           # This file

Troubleshooting

Q: No data intercepted?
A: Make sure mitmproxy is running, and open article links in WeChat's built-in browser rather than an external one.

Q: How to change the scheduled execution time?
A: Edit scheduler.time in config.yaml, or pass the --time parameter.

Q: How to change the article filter window?
A: Edit output.filter_days in config.yaml.

Q: S3 upload failed?
A: Check your AWS credentials configuration. Data is still saved locally even when the upload fails.

Q: WeChat client not running?
A: Scheduled tasks retry the process check 5 times at 10-second intervals.

How It Works

Manual Scraping Flow

WeChat Client → mitmproxy Intercept → Parse Article Data → Save JSON
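The "Parse Article Data" step walks the JSON embedded in a profile_ext response, where a single message can carry one lead article plus companions (this is what "Smart Parsing" refers to). A sketch of that walk; the field names (general_msg_list, app_msg_ext_info, multi_app_msg_item_list) follow the commonly observed WeChat profile_ext schema and are assumptions, not verified against src/addons/wechat_article_parser.py:

```python
import json

def parse_msg_list(general_msg_list):
    """Extract article dicts from a WeChat general_msg_list payload,
    handling both single- and multi-article messages."""
    articles = []
    for msg in json.loads(general_msg_list).get("list", []):
        info = msg.get("app_msg_ext_info")
        if not info:
            continue
        # Lead article first, then any companions in a multi-article post
        entries = [info] + info.get("multi_app_msg_item_list", [])
        for e in entries:
            articles.append({
                "title": e.get("title", ""),
                "url": e.get("content_url", ""),
                "digest": e.get("digest", ""),
            })
    return articles
```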

Scheduled Task Flow

Timer Trigger → Check WeChat → Open Account → Scrape Articles 
→ Time Filter → Aggregate Data → Save Local + Upload S3
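The "Check WeChat" step retries before giving up (5 attempts at 10-second intervals, per Troubleshooting). A stdlib sketch of that retry policy; `wait_for_wechat` is a hypothetical name, and the injectable `sleep` exists only to make the sketch testable:

```python
import time

def wait_for_wechat(is_running, retries=5, interval=10, sleep=time.sleep):
    """Poll the WeChat process check up to `retries` times, waiting
    `interval` seconds between attempts; True once the client is up."""
    for attempt in range(1, retries + 1):
        if is_running():
            return True
        if attempt < retries:
            sleep(interval)
    return False
```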

Technology Stack

  • mitmproxy (10.1.0+) - HTTPS traffic interception and addon system
  • pywinauto (0.6.8+) - Windows UI automation
  • pyautogui (0.9.54+) - Mouse/keyboard automation
  • psutil (5.9.0+) - Process detection
  • loguru (0.7.0+) - Structured logging
  • schedule (1.2.0+) - Cron-like job scheduling
  • boto3 (1.28.0+) - AWS S3 client
  • PyYAML (6.0+) - Configuration management

Requirements

  1. WeChat Desktop Client - Must be running during scheduled tasks
  2. mitmproxy Running - Uses transparent proxy mode for traffic interception
  3. Administrator Privileges - Required for transparent proxy
  4. Stable Network - Required for fetching article data and uploading results to S3

License

MIT License
