# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MediaCrawler is a multi-platform social media data collection tool supporting Xiaohongshu (Little Red Book), Douyin (TikTok), Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project uses Playwright for browser automation and persists login states, so it can crawl public information without requiring JS reverse engineering.

## Development Environment Setup

### Prerequisites
- **Python**: >= 3.9 (verified with 3.9.6)
- **Node.js**: >= 16.0.0 (required for the Douyin and Zhihu crawlers)
- **uv**: Modern Python package manager (recommended)

### Installation Commands
```bash
# Using uv (recommended)
uv sync
uv run playwright install

# Using traditional pip (fallback)
pip install -r requirements.txt
playwright install
```

### Running the Application
```bash
# Basic crawling command
uv run main.py --platform xhs --lt qrcode --type search

# View all available options
uv run main.py --help

# Using traditional Python
python main.py --platform xhs --lt qrcode --type search
```

## Architecture Overview

### Core Components

1. **Platform Crawlers** (`media_platform/`):
   - Each platform has its own crawler implementation
   - Follows the abstract base class pattern (`base/base_crawler.py`)
   - Platforms: `xhs`, `dy`, `ks`, `bili`, `wb`, `tieba`, `zhihu`

2. **Configuration System** (`config/`):
   - `base_config.py`: Main configuration file with extensive options
   - `db_config.py`: Database configuration
   - Key settings: login types, proxy settings, CDP mode, data storage options

3. **Data Storage** (`store/`):
   - Multiple storage backends: CSV, JSON, MySQL
   - Platform-specific storage implementations
   - Image download capabilities

4. **Caching System** (`cache/`):
   - Local cache and Redis cache implementations
   - Factory pattern for cache selection (see the sketch after this list)

5. **Proxy Support** (`proxy/`):
   - IP proxy pool management
   - Multiple proxy provider support (Kuaidaili, Jishu)

6. **Browser Automation** (`tools/`):
   - Playwright browser launcher
   - CDP (Chrome DevTools Protocol) support
   - Slider validation utilities

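As a rough illustration of the factory pattern named in item 4, cache selection might look like the sketch below. This is a minimal, self-contained sketch; the class names (`AbstractCache`, `ExpiringLocalCache`, `CacheFactory`) are assumptions to verify against the actual code in `cache/`.

```python
# Minimal sketch of the cache-factory idea -- class names are assumed,
# not copied from the repository.
from abc import ABC, abstractmethod
from typing import Any, Dict


class AbstractCache(ABC):
    @abstractmethod
    def get(self, key: str) -> Any: ...

    @abstractmethod
    def set(self, key: str, value: Any, expire_time: int) -> None: ...


class ExpiringLocalCache(AbstractCache):
    """In-process dict-backed cache (expiry bookkeeping omitted for brevity)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        return self._data.get(key)

    def set(self, key: str, value: Any, expire_time: int) -> None:
        self._data[key] = value


class CacheFactory:
    @staticmethod
    def create_cache(cache_type: str) -> AbstractCache:
        # Select a backend by name; fail fast on unknown types.
        if cache_type == "memory":
            return ExpiringLocalCache()
        if cache_type == "redis":
            raise NotImplementedError("Redis backend omitted in this sketch")
        raise ValueError(f"Unknown cache type: {cache_type}")
```

Callers would then obtain a backend with something like `cache = CacheFactory.create_cache("memory")`, keeping the choice of backend in one place.
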
### Key Configuration Options

- `PLATFORM`: Target platform (xhs, dy, ks, bili, wb, tieba, zhihu)
- `KEYWORDS`: Search keywords (comma-separated)
- `CRAWLER_TYPE`: Type of crawling (search, detail, creator)
- `ENABLE_CDP_MODE`: Use Chrome DevTools Protocol for better anti-detection
- `SAVE_DATA_OPTION`: Data storage format (csv, db, json)
- `ENABLE_GET_COMMENTS`: Enable comment crawling
- `ENABLE_IP_PROXY`: Enable proxy IP rotation

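Taken together, a typical search run might be configured like this in `config/base_config.py`. The option names come from the list above; the values are only illustrative:

```python
# config/base_config.py -- illustrative values only, adjust to your run.
PLATFORM = "xhs"            # target platform
KEYWORDS = "python,编程"     # comma-separated search keywords
CRAWLER_TYPE = "search"     # search | detail | creator
ENABLE_CDP_MODE = True      # CDP mode for better anti-detection
SAVE_DATA_OPTION = "json"   # csv | db | json
ENABLE_GET_COMMENTS = True  # also crawl comments
ENABLE_IP_PROXY = False     # proxy IP rotation disabled in this example
```
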
## Testing

### Available Test Commands
```bash
# Run all tests
python -m unittest discover test

# Run specific test files
python -m unittest test.test_expiring_local_cache
python -m unittest test.test_proxy_ip_pool
python -m unittest test.test_redis_cache
python -m unittest test.test_utils

# Install and use pytest (enhanced testing)
uv add pytest
uv run pytest test/
```

### Test Coverage
- Cache functionality tests
- Proxy IP pool tests
- Utility function tests
- Redis cache tests (requires a running Redis server)

## Database Setup

### MySQL Database Initialization
```bash
# Initialize database tables (first time only)
python db.py

# Or with uv
uv run db.py
```

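Connection settings live in `config/db_config.py`. The variable names below are hypothetical placeholders for whatever that file actually defines; check the real file before relying on them:

```python
# config/db_config.py -- hypothetical variable names; verify against the
# actual file in this repository.
MYSQL_DB_HOST = "localhost"
MYSQL_DB_PORT = 3306
MYSQL_DB_USER = "root"
MYSQL_DB_PWD = "your_password"
MYSQL_DB_NAME = "media_crawler"
```
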
### Supported Storage Options
- **MySQL**: Full relational database with deduplication
- **CSV**: Simple file-based storage in the `data/` directory
- **JSON**: Structured file-based storage in the `data/` directory

## Common Development Tasks

### Adding New Platform Support
1. Create a new directory in `media_platform/`
2. Implement a crawler class inheriting from `AbstractCrawler` (see the skeleton below)
3. Add platform-specific client, core, field, and login modules
4. Update `CrawlerFactory` in `main.py`
5. Add a storage implementation in `store/`

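A skeleton for step 2 might look like this. The method names on `AbstractCrawler` are assumptions; check `base/base_crawler.py` for the real abstract interface before implementing:

```python
# media_platform/newplatform/core.py -- hypothetical skeleton; the
# AbstractCrawler method names below are assumed, not confirmed.
from base.base_crawler import AbstractCrawler


class NewPlatformCrawler(AbstractCrawler):
    async def start(self) -> None:
        # Launch the Playwright browser, restore any saved login state,
        # then dispatch on config.CRAWLER_TYPE (search/detail/creator).
        ...

    async def search(self) -> None:
        # Fetch posts for each configured keyword and persist them via
        # the matching implementation added in store/ (step 5).
        ...
```

Step 4 then amounts to adding a `"newplatform"` entry to the `CrawlerFactory` mapping in `main.py` so that `--platform newplatform` resolves to this class.
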
### Debugging CDP Mode
- Set `ENABLE_CDP_MODE = True` in config
- Use `CDP_HEADLESS = False` for visual debugging
- Check the browser console for CDP connection issues

### Managing Login States
- Login states are cached in the `browser_data/` directory
- Platform-specific user data directories maintain session cookies
- Set `SAVE_LOGIN_STATE = True` to preserve login across runs (see the combined example below)

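For a debugging-friendly setup, the settings from the last two subsections combine naturally (option names as given above; values illustrative):

```python
# Visual debugging with a persistent session -- illustrative values.
ENABLE_CDP_MODE = True   # drive Chrome over the DevTools Protocol
CDP_HEADLESS = False     # keep the browser window visible
SAVE_LOGIN_STATE = True  # reuse cookies cached under browser_data/
```
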
## Platform-Specific Notes

### Xiaohongshu (XHS)
- Supports search, detail, and creator crawling
- Requires `xsec_token` and `xsec_source` parameters on specific note URLs (see the sketch below)
- Custom User-Agent configuration available

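For detail crawling, that means each target note URL must carry both query parameters. A sketch, assuming a config list named `XHS_SPECIFIED_NOTE_URL_LIST` (the variable name is an assumption, and the angle-bracket placeholders are not real values):

```python
# Hypothetical config entry -- variable name assumed; placeholder values.
XHS_SPECIFIED_NOTE_URL_LIST = [
    "https://www.xiaohongshu.com/explore/<note_id>?xsec_token=<token>&xsec_source=<source>",
]
```
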
### Douyin (DY)
- Requires a Node.js environment
- Supports publish time filtering
- Uses a specific creator ID format (sec_id)

### Bilibili (BILI)
- Supports date range filtering with `START_DAY` and `END_DAY` (see the example below)
- Can crawl creator fans/following lists
- Uses the BV video ID format

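For example, limiting a Bilibili search to January 2024 might look like this (option names from the bullets above; the date format is an assumption):

```python
# Illustrative date window -- the YYYY-MM-DD format is assumed.
START_DAY = "2024-01-01"
END_DAY = "2024-01-31"
```
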
## Legal and Usage Notes

This project is for educational and research purposes only. Users must:
- Comply with platform terms of service
- Follow robots.txt rules
- Control request frequency appropriately
- Not use it for commercial purposes
- Respect platform rate limits

Comprehensive legal disclaimers and usage guidelines are provided in the project's README.md.