# spider-wechat

A mitmproxy-based tool for extracting WeChat public account article metadata from the desktop client. Supports both manual scraping and scheduled automation.
## Features

- 🔍 Real-time Extraction - Intercepts WeChat traffic to extract article metadata
- ⏰ Scheduled Tasks - Daily automated scraping of multiple accounts
- 🕒 Time Filtering - Filter articles by publication date (last N days)
- ☁️ S3 Integration - Automatic upload to AWS S3 with aggregation
- 📊 Smart Parsing - Handles both single- and multi-article messages
- 🪟 Windows Native - UI automation for the WeChat desktop client
## Requirements

- Python 3.7+
- WeChat Desktop (Windows)
- Administrator privileges (for the transparent proxy)
## Installation

```
# Clone and install dependencies
git clone <repository-url>
cd spider-wechat
pip install -r requirements.txt

# Copy the configuration template
copy config.yaml.template config.yaml
```

## Manual Scraping

For ad-hoc scraping of specific WeChat accounts:
```
# Start the parser
start_wechat_parser.bat
```

Then, in WeChat:

1. Find the public account's article history link.
2. Send it to "File Transfer Assistant".
3. Click the link to open it.

Check the results:

```
dir output\articles_*.json
```

## Scheduled Scraping

For daily automated collection from multiple accounts:
1. Configure AWS credentials (choose one method):

   Option A: `config.yaml` (recommended)

   ```yaml
   s3:
     aws_access_key_id: "your_key"
     aws_secret_access_key: "your_secret"
   ```

   Option B: Environment variables (PowerShell)

   ```powershell
   $env:AWS_ACCESS_KEY_ID = "your_key"
   $env:AWS_SECRET_ACCESS_KEY = "your_secret"
   ```

   Option C: AWS CLI

   ```
   aws configure
   ```

2. Edit `config.yaml` with your account URLs:

   ```yaml
   input_urls:
     - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."
   ```

3. Test run:

   ```
   python scheduled_scraper.py --run-once
   ```

4. Start the scheduler:

   ```
   start_scheduled_scraper.bat
   ```

## Configuration

Key settings in `config.yaml`:
```yaml
# WeChat account URLs
input_urls:
  - "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=..."

# Output settings
output:
  filter_days: 1  # Keep articles from the last N days

# S3 settings (for scheduled tasks)
s3:
  bucket: "your-bucket-name"
  prefix: "articles/"

# Scheduler settings
scheduler:
  enabled: true
  time: "00:00"  # Execution time (HH:MM format)
```

## Output Format

```json
{
  "source_url": "https://mp.weixin.qq.com/mp/profile_ext?...",
  "extracted_at": "2025-11-03T20:30:00",
  "article_count": 10,
  "articles": [
    {
      "title": "Article Title",
      "url": "https://mp.weixin.qq.com/s?__biz=...",
      "publish_time": "2025-11-03 10:30:00",
      "digest": "Article summary",
      "cover": "Cover image URL",
      "author": "Author name"
    }
  ]
}
```

## Usage

### Manual Scraping

```
# Start the parser (recommended)
start_wechat_parser.bat

# Alternative: Python script
python run_wechat_parser.py

# Direct mitmproxy invocation
mitmdump --mode local:WeChatAppEx.exe -s src/addons/wechat_article_parser.py
```
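The `filter_days` setting can also be applied offline when post-processing scraped files. A minimal sketch, assuming the output schema shown above; `load_recent_articles` is an illustrative helper, not part of the project:

```python
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import List


def load_recent_articles(path: Path, filter_days: int = 1) -> List[dict]:
    """Read one articles_*.json file and keep articles from the last N days."""
    data = json.loads(path.read_text(encoding="utf-8"))
    cutoff = datetime.now() - timedelta(days=filter_days)
    return [
        a for a in data.get("articles", [])
        # publish_time uses "YYYY-MM-DD HH:MM:SS", as in the sample above
        if datetime.strptime(a["publish_time"], "%Y-%m-%d %H:%M:%S") >= cutoff
    ]
```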
### Scheduled Tasks

```
# Test run (execute once)
python scheduled_scraper.py --run-once

# Custom execution time
python scheduled_scraper.py --time "08:00"

# View logs
type logs\scraper.log
```

## Project Structure

```
spider-wechat/
├── src/
│   ├── addons/                      # mitmproxy addons
│   │   └── wechat_article_parser.py # Article parser addon
│   ├── services/                    # Business logic
│   │   ├── scheduler.py             # Task scheduler
│   │   ├── article_aggregator.py    # Article aggregation & S3 upload
│   │   └── wechat_controller.py     # WeChat client automation
│   ├── models/                      # Data models
│   │   ├── article.py               # Article dataclass
│   │   ├── config.py                # Configuration model
│   │   └── execution_result.py      # Execution result model
│   └── utils/                       # Utilities
│       ├── logger.py                # Logging setup
│       ├── config_loader.py         # Configuration loader
│       ├── url_validator.py         # URL validation
│       └── url_manager.py           # URL management
├── output/                          # Output directory
│   ├── articles_*.json              # Raw scraped data
│   └── aggregated_articles_*.json   # Aggregated data (scheduled runs)
├── config.yaml                      # Configuration file
├── requirements.txt                 # Python dependencies
├── start_wechat_parser.bat          # Manual scraping launcher
├── start_scheduled_scraper.bat      # Scheduled task launcher
└── README.md                        # This file
```
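The heavy lifting lives in `wechat_article_parser.py`. Its actual implementation is not shown here, but the kind of extraction such a mitmproxy addon performs on the profile page can be sketched as follows. The embedded `msgList` variable and the `app_msg_ext_info` / `multi_app_msg_item_list` field names are assumptions based on how WeChat profile pages have historically been structured; real responses may need additional unescaping:

```python
import json
import re
from typing import List

# Profile pages have historically embedded the message list as a JS string:
#   var msgList = '{"list": [...]}';
_MSGLIST_RE = re.compile(r"var\s+msgList\s*=\s*'(.*?)';", re.S)


def extract_articles(html: str) -> List[dict]:
    """Pull article metadata out of a profile page response body."""
    m = _MSGLIST_RE.search(html)
    if not m:
        return []
    # Pages escape double quotes as &quot;; real pages may need more unescaping.
    msg_list = json.loads(m.group(1).replace("&quot;", '"'))
    articles = []
    for msg in msg_list.get("list", []):
        info = msg.get("app_msg_ext_info", {})
        if not info:
            continue
        # The lead article plus any multi_app_msg_item_list entries
        # (multi-article messages) are flattened into one list.
        for item in [info] + info.get("multi_app_msg_item_list", []):
            articles.append({
                "title": item.get("title", ""),
                "url": item.get("content_url", ""),
                "digest": item.get("digest", ""),
            })
    return articles
```

In the real addon this function would be driven by a mitmproxy `response` hook that matches `mp.weixin.qq.com` requests; the parsing core above is the part that handles both single- and multi-article messages.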
## FAQ

**Q: No data is intercepted.**
A: Make sure mitmproxy is running and open WeChat links in WeChat's built-in browser.

**Q: How do I change the scheduled execution time?**
A: Edit `scheduler.time` in `config.yaml`, or pass the `--time` parameter.

**Q: How do I change the article filter window?**
A: Edit `output.filter_days` in `config.yaml`.

**Q: The S3 upload failed.**
A: Check your AWS credentials. Data is always saved locally, even if the upload fails.

**Q: The WeChat client is not running.**
A: Scheduled tasks retry 5 times at 10-second intervals before giving up.
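The retry behavior described above can be sketched with a small stdlib-only helper; the function and its parameters are illustrative, not the project's actual API:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(check: Callable[[], T], attempts: int = 5, interval: float = 10.0) -> T:
    """Call `check` until it succeeds, retrying up to `attempts` times.

    Mirrors the documented behavior: 5 tries, 10 seconds apart, then give up
    by re-raising the last error.
    """
    last_error: Exception = RuntimeError("retry: no attempts made")
    for i in range(attempts):
        try:
            return check()
        except Exception as exc:  # e.g. WeChat client not found
            last_error = exc
            if i < attempts - 1:
                time.sleep(interval)
    raise last_error
```

In the project this would wrap the WeChat process check (a `psutil`-based lookup for the client executable).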
## How It Works

Manual scraping:

```
WeChat Client → mitmproxy Intercept → Parse Article Data → Save JSON
```

Scheduled scraping:

```
Timer Trigger → Check WeChat → Open Account → Scrape Articles
→ Time Filter → Aggregate Data → Save Local + Upload S3
```
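The "Aggregate Data" step above merges the per-account `articles_*.json` files into one document. A minimal sketch; the aggregated schema here is an assumption patterned on the per-file format:

```python
import json
from pathlib import Path
from typing import List


def aggregate(output_dir: Path) -> dict:
    """Merge all articles_*.json files in the output directory into one payload."""
    articles: List[dict] = []
    sources: List[str] = []
    for path in sorted(output_dir.glob("articles_*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        sources.append(data.get("source_url", ""))
        articles.extend(data.get("articles", []))
    return {
        "sources": sources,
        "article_count": len(articles),
        "articles": articles,
    }
```

The real `article_aggregator.py` additionally applies the `filter_days` window and uploads the result to S3.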
## Dependencies

- `mitmproxy` (10.1.0+) - HTTPS traffic interception and the addon system
- `pywinauto` (0.6.8+) - Windows UI automation
- `pyautogui` (0.9.54+) - Mouse/keyboard automation
- `psutil` (5.9.0+) - Process detection
- `loguru` (0.7.0+) - Structured logging
- `schedule` (1.2.0+) - Cron-like job scheduling
- `boto3` (1.28.0+) - AWS S3 client
- `PyYAML` (6.0+) - Configuration management
## Notes

- WeChat Desktop Client - must be running during scheduled tasks
- mitmproxy - uses transparent proxy mode for traffic interception
- Administrator privileges - required for the transparent proxy
- Stable network connection - needed for reliable operation
## License

MIT License