Commit 848df2b

feat: other platforms support the CDP mode
1 parent c892c33 commit 848df2b

File tree

9 files changed: +565 −102 lines changed

CLAUDE.local.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MediaCrawler is a multi-platform social media data collection tool supporting platforms like Xiaohongshu (Little Red Book), Douyin (TikTok), Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project uses Playwright for browser automation and maintains login states to crawl public information without needing JS reverse engineering.

## Development Environment Setup

### Prerequisites
- **Python**: >= 3.9 (verified with 3.9.6)
- **Node.js**: >= 16.0.0 (required for the Douyin and Zhihu crawlers)
- **uv**: Modern Python package manager (recommended)

### Installation Commands
```bash
# Using uv (recommended)
uv sync
uv run playwright install

# Using traditional pip (fallback)
pip install -r requirements.txt
playwright install
```

### Running the Application
```bash
# Basic crawling command
uv run main.py --platform xhs --lt qrcode --type search

# View all available options
uv run main.py --help

# Using traditional Python
python main.py --platform xhs --lt qrcode --type search
```
## Architecture Overview

### Core Components

1. **Platform Crawlers** (`media_platform/`):
   - Each platform has its own crawler implementation
   - Follows the abstract base class pattern (`base/base_crawler.py`); see the sketch after this list
   - Platforms: `xhs`, `dy`, `ks`, `bili`, `wb`, `tieba`, `zhihu`

2. **Configuration System** (`config/`):
   - `base_config.py`: Main configuration file with extensive options
   - `db_config.py`: Database configuration
   - Key settings: login types, proxy settings, CDP mode, data storage options

3. **Data Storage** (`store/`):
   - Multiple storage backends: CSV, JSON, MySQL
   - Platform-specific storage implementations
   - Image download capabilities

4. **Caching System** (`cache/`):
   - Local cache and Redis cache implementations
   - Factory pattern for cache selection

5. **Proxy Support** (`proxy/`):
   - IP proxy pool management
   - Multiple proxy provider support (Kuaidaili, Jishu)

6. **Browser Automation** (`tools/`):
   - Playwright browser launcher
   - CDP (Chrome DevTools Protocol) support
   - Slider validation utilities
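A minimal sketch of that base-class pattern. `AbstractCrawler` and its `start()`/`close()` methods are visible in this commit's Bilibili diff below; the exact abstract-method set and the example subclass are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class AbstractCrawler(ABC):
    """Sketch of the contract in base/base_crawler.py; the real class may differ."""

    @abstractmethod
    async def start(self):
        """Launch a browser context, log in, and run the configured crawl."""

    @abstractmethod
    async def close(self):
        """Release the browser context (or CDP manager) on shutdown."""


class ExamplePlatformCrawler(AbstractCrawler):
    """Illustrative subclass; real crawlers live in media_platform/<platform>/core.py."""

    async def start(self):
        ...  # dispatch on config.CRAWLER_TYPE: search | detail | creator

    async def close(self):
        ...  # close self.browser_context or clean up self.cdp_manager
```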
### Key Configuration Options

- `PLATFORM`: Target platform (xhs, dy, ks, bili, wb, tieba, zhihu)
- `KEYWORDS`: Search keywords (comma-separated)
- `CRAWLER_TYPE`: Type of crawling (search, detail, creator)
- `ENABLE_CDP_MODE`: Use the Chrome DevTools Protocol for better anti-detection
- `SAVE_DATA_OPTION`: Data storage format (csv, db, json)
- `ENABLE_GET_COMMENTS`: Enable comment crawling
- `ENABLE_IP_PROXY`: Enable proxy IP rotation
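Taken together, a typical `config/base_config.py` setup might look like the following; the option names come from this guide and the config diff in this commit, while the values are purely illustrative.

```python
# Illustrative values only; adjust per crawl.
PLATFORM = "bili"               # xhs | dy | ks | bili | wb | tieba | zhihu
KEYWORDS = "python,playwright"  # comma-separated search keywords
CRAWLER_TYPE = "search"         # search | detail | creator
ENABLE_CDP_MODE = True          # drive a real Chrome over CDP for better anti-detection
SAVE_DATA_OPTION = "json"       # csv | db | json
ENABLE_GET_COMMENTS = False     # also crawl comments for each post
ENABLE_IP_PROXY = False         # rotate IPs from the proxy pool
```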
## Testing

### Available Test Commands
```bash
# Run all tests
python -m unittest discover test

# Run specific test files
python -m unittest test.test_expiring_local_cache
python -m unittest test.test_proxy_ip_pool
python -m unittest test.test_redis_cache
python -m unittest test.test_utils

# Install and use pytest (enhanced testing)
uv add pytest
uv run pytest test/
```
### Test Coverage
- Cache functionality tests
- Proxy IP pool tests
- Utility function tests
- Redis cache tests (requires a running Redis server)

## Database Setup

### MySQL Database Initialization
```bash
# Initialize database tables (first time only)
python db.py

# Or with uv
uv run db.py
```

### Supported Storage Options
- **MySQL**: Full relational database with deduplication
- **CSV**: Simple file-based storage in the `data/` directory
- **JSON**: Structured file-based storage in the `data/` directory
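For example, switching from file output to MySQL is a one-line config change on top of the one-time `db.py` initialization above (a sketch; `SAVE_DATA_OPTION` is named earlier in this guide).

```python
# In config/base_config.py: persist to MySQL instead of CSV/JSON files.
# Run `python db.py` once beforehand to create the tables.
SAVE_DATA_OPTION = "db"  # csv | db | json
```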
## Common Development Tasks

### Adding New Platform Support
1. Create a new directory in `media_platform/`
2. Implement a crawler class inheriting from `AbstractCrawler`
3. Add platform-specific client, core, field, and login modules
4. Update `CrawlerFactory` in `main.py` (see the sketch below)
5. Add a storage implementation in `store/`
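A sketch of steps 1 through 4. The module layout mirrors the existing platform packages; the `CRAWLERS` mapping attribute and everything named "newplatform" are hypothetical.

```python
# media_platform/newplatform/ would mirror the existing platform packages:
#   client.py  - HTTP/API client
#   core.py    - NewPlatformCrawler(AbstractCrawler)
#   field.py   - data field definitions
#   login.py   - QR code / cookie login flows

from base.base_crawler import AbstractCrawler


class NewPlatformCrawler(AbstractCrawler):
    """Hypothetical crawler for media_platform/newplatform/core.py."""

    async def start(self):
        ...  # launch browser (standard or CDP), log in, run the crawl

    async def close(self):
        ...  # release the browser context


# In main.py, register the class with the factory; the exact mapping
# attribute is an assumption about how CrawlerFactory is written there:
# CrawlerFactory.CRAWLERS["newplatform"] = NewPlatformCrawler
```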
### Debugging CDP Mode
- Set `ENABLE_CDP_MODE = True` in the config
- Use `CDP_HEADLESS = False` for visual debugging
- Check the browser console for CDP connection issues
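Concretely, in `config/base_config.py` (both names appear in this commit's Bilibili diff):

```python
# Debug-friendly CDP settings: launch through the CDP browser manager
# and keep the browser window visible while debugging.
ENABLE_CDP_MODE = True
CDP_HEADLESS = False
```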
### Managing Login States
- Login states are cached in the `browser_data/` directory
- Platform-specific user data directories maintain session cookies
- Set `SAVE_LOGIN_STATE = True` to preserve logins across runs
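For example (the setting name comes from this guide):

```python
# Reuse the session cookies stored under browser_data/ instead of
# logging in again on every run.
SAVE_LOGIN_STATE = True
```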
## Platform-Specific Notes

### Xiaohongshu (XHS)
- Supports search, detail, and creator crawling
- Requires `xsec_token` and `xsec_source` parameters for specific note URLs (see the example below)
- Custom User-Agent configuration available
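A hypothetical example of that URL shape; the config list name is an assumption, and `<note_id>`/`<token>` are placeholders, but the two query parameters are the ones this guide requires.

```python
# Hypothetical config entry for crawling one specific note.
XHS_SPECIFIED_NOTE_URL_LIST = [
    "https://www.xiaohongshu.com/explore/<note_id>?xsec_token=<token>&xsec_source=pc_search",
]
```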
### Douyin (DY)
- Requires a Node.js environment
- Supports publish-time filtering
- Uses a specific creator ID format (sec_id)

### Bilibili (BILI)
- Supports date-range filtering with `START_DAY` and `END_DAY`
- Can crawl creator fan/following lists
- Uses the BV video ID format

## Legal and Usage Notes

This project is for educational and research purposes only. Users must:
- Comply with platform terms of service
- Follow robots.txt rules
- Control request frequency appropriately
- Not use it for commercial purposes
- Respect platform rate limits

The project includes comprehensive legal disclaimers and usage guidelines in the README.md file.

config/base_config.py

Lines changed: 7 additions & 7 deletions
@@ -22,7 +22,7 @@
     "search"  # Crawl type: search (keyword search) | detail (post detail) | creator (creator homepage data)
 )
 # Custom User-Agent (currently only effective for XHS)
-UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
+UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

 # Whether to enable IP proxying
 ENABLE_IP_PROXY = False

@@ -190,9 +190,9 @@

 # List of Zhihu post IDs to crawl
 ZHIHU_SPECIFIED_ID_LIST = [
-    "https://www.zhihu.com/question/826896610/answer/4885821440",  # answer
-    "https://zhuanlan.zhihu.com/p/673461588",  # article
-    "https://www.zhihu.com/zvideo/1539542068422144000"  # video
+    "https://www.zhihu.com/question/826896610/answer/4885821440",  # answer
+    "https://zhuanlan.zhihu.com/p/673461588",  # article
+    "https://www.zhihu.com/zvideo/1539542068422144000",  # video
 ]

 # Word cloud settings

@@ -212,10 +212,10 @@
 FONT_PATH = "./docs/STZHONGS.TTF"

 # Crawl start date; only supported for bilibili keyword search, YYYY-MM-DD format. If None, no time range is applied and the default keyword search returns at most 1,000 videos.
-START_DAY = '2024-01-01'
+START_DAY = "2024-01-01"

 # Crawl end date; only supported for bilibili keyword search, YYYY-MM-DD format. If None, no time range is applied and the default keyword search returns at most 1,000 videos.
-END_DAY = '2024-01-01'
+END_DAY = "2024-01-01"

 # Whether to crawl day by day; only supported for bilibili keyword search
 # If False, the START_DAY and END_DAY values are ignored

@@ -233,4 +233,4 @@
 CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES = 100

 # Limit on the number of dynamics crawled per creator
-CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50
+CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50

media_platform/bilibili/core.py

Lines changed: 57 additions & 9 deletions
@@ -22,13 +22,14 @@
 from datetime import datetime, timedelta
 import pandas as pd

-from playwright.async_api import (BrowserContext, BrowserType, Page, async_playwright)
+from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright, async_playwright)

 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import bilibili as bilibili_store
 from tools import utils
+from tools.cdp_browser import CDPBrowserManager
 from var import crawler_type_var, source_keyword_var

 from .client import BilibiliClient

@@ -41,10 +42,12 @@ class BilibiliCrawler(AbstractCrawler):
     context_page: Page
     bili_client: BilibiliClient
     browser_context: BrowserContext
+    cdp_manager: Optional[CDPBrowserManager]

     def __init__(self):
         self.index_url = "https://www.bilibili.com"
         self.user_agent = utils.get_user_agent()
+        self.cdp_manager = None

     async def start(self):
         playwright_proxy_format, httpx_proxy_format = None, None

@@ -55,14 +58,23 @@ async def start(self):
                 ip_proxy_info)

         async with async_playwright() as playwright:
-            # Launch a browser context.
-            chromium = playwright.chromium
-            self.browser_context = await self.launch_browser(
-                chromium,
-                None,
-                self.user_agent,
-                headless=config.HEADLESS
-            )
+            # Choose the launch mode based on configuration
+            if config.ENABLE_CDP_MODE:
+                utils.logger.info("[BilibiliCrawler] Launching browser in CDP mode")
+                self.browser_context = await self.launch_browser_with_cdp(
+                    playwright, playwright_proxy_format, self.user_agent,
+                    headless=config.CDP_HEADLESS
+                )
+            else:
+                utils.logger.info("[BilibiliCrawler] Launching browser in standard mode")
+                # Launch a browser context.
+                chromium = playwright.chromium
+                self.browser_context = await self.launch_browser(
+                    chromium,
+                    None,
+                    self.user_agent,
+                    headless=config.HEADLESS
+                )
             # stealth.min.js is a js script to prevent the website from detecting the crawler.
             await self.browser_context.add_init_script(path="libs/stealth.min.js")
             self.context_page = await self.browser_context.new_page()

@@ -434,6 +446,42 @@ async def launch_browser(
         )
         return browser_context

+    async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
+                                      user_agent: Optional[str], headless: bool = True) -> BrowserContext:
+        """
+        Launch the browser in CDP mode
+        """
+        try:
+            self.cdp_manager = CDPBrowserManager()
+            browser_context = await self.cdp_manager.launch_and_connect(
+                playwright=playwright,
+                playwright_proxy=playwright_proxy,
+                user_agent=user_agent,
+                headless=headless
+            )
+
+            # Log browser info
+            browser_info = await self.cdp_manager.get_browser_info()
+            utils.logger.info(f"[BilibiliCrawler] CDP browser info: {browser_info}")
+
+            return browser_context
+
+        except Exception as e:
+            utils.logger.error(f"[BilibiliCrawler] CDP mode failed to start, falling back to standard mode: {e}")
+            # Fall back to standard mode
+            chromium = playwright.chromium
+            return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
+
+    async def close(self):
+        """Close browser context"""
+        # CDP mode needs its own cleanup path
+        if self.cdp_manager:
+            await self.cdp_manager.cleanup()
+            self.cdp_manager = None
+        else:
+            await self.browser_context.close()
+        utils.logger.info("[BilibiliCrawler.close] Browser context closed ...")
+
     async def get_bilibili_video(self, video_item: Dict, semaphore: asyncio.Semaphore):
         """
         download bilibili video

0 commit comments

Comments
 (0)