Commit 848df2b

feat: other platforms support the CDP mode
1 parent c892c33 commit 848df2b

File tree

9 files changed: +565 −102 lines changed

CLAUDE.local.md

Lines changed: 166 additions & 0 deletions
@@ -0,0 +1,166 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MediaCrawler is a multi-platform social media data collection tool supporting platforms like Xiaohongshu (Little Red Book), Douyin (TikTok), Kuaishou, Bilibili, Weibo, Tieba, and Zhihu. The project uses Playwright for browser automation and maintains login states to crawl public information without needing JS reverse engineering.

## Development Environment Setup

### Prerequisites
- **Python**: >= 3.9 (verified with 3.9.6)
- **Node.js**: >= 16.0.0 (required for the Douyin and Zhihu crawlers)
- **uv**: Modern Python package manager (recommended)

### Installation Commands
```bash
# Using uv (recommended)
uv sync
uv run playwright install

# Using traditional pip (fallback)
pip install -r requirements.txt
playwright install
```

### Running the Application
```bash
# Basic crawling command
uv run main.py --platform xhs --lt qrcode --type search

# View all available options
uv run main.py --help

# Using traditional Python
python main.py --platform xhs --lt qrcode --type search
```
## Architecture Overview

### Core Components

1. **Platform Crawlers** (`media_platform/`):
   - Each platform has its own crawler implementation
   - Follows the abstract base class pattern (`base/base_crawler.py`); see the sketch after this list
   - Platforms: `xhs`, `dy`, `ks`, `bili`, `wb`, `tieba`, `zhihu`

2. **Configuration System** (`config/`):
   - `base_config.py`: Main configuration file with extensive options
   - `db_config.py`: Database configuration
   - Key settings: login types, proxy settings, CDP mode, data storage options

3. **Data Storage** (`store/`):
   - Multiple storage backends: CSV, JSON, MySQL
   - Platform-specific storage implementations
   - Image download capabilities

4. **Caching System** (`cache/`):
   - Local cache and Redis cache implementations
   - Factory pattern for cache selection

5. **Proxy Support** (`proxy/`):
   - IP proxy pool management
   - Multiple proxy provider support (Kuaidaili, Jishu)

6. **Browser Automation** (`tools/`):
   - Playwright browser launcher
   - CDP (Chrome DevTools Protocol) support
   - Slider validation utilities
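A minimal sketch of that base-class pattern. `AbstractCrawler` and its `start()`/`close()` methods are visible in this commit's Bilibili diff below; the exact abstract-method set and the example subclass are illustrative assumptions.

```python
from abc import ABC, abstractmethod


class AbstractCrawler(ABC):
    """Sketch of the contract in base/base_crawler.py; the real class may differ."""

    @abstractmethod
    async def start(self):
        """Launch a browser context, log in, and run the configured crawl."""

    @abstractmethod
    async def close(self):
        """Release the browser context (or CDP manager) on shutdown."""


class ExamplePlatformCrawler(AbstractCrawler):
    """Illustrative subclass; real crawlers live in media_platform/<platform>/core.py."""

    async def start(self):
        ...  # dispatch on config.CRAWLER_TYPE: search | detail | creator

    async def close(self):
        ...  # close self.browser_context or clean up self.cdp_manager
```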
### Key Configuration Options

- `PLATFORM`: Target platform (xhs, dy, ks, bili, wb, tieba, zhihu)
- `KEYWORDS`: Search keywords (comma-separated)
- `CRAWLER_TYPE`: Type of crawling (search, detail, creator)
- `ENABLE_CDP_MODE`: Use the Chrome DevTools Protocol for better anti-detection
- `SAVE_DATA_OPTION`: Data storage format (csv, db, json)
- `ENABLE_GET_COMMENTS`: Enable comment crawling
- `ENABLE_IP_PROXY`: Enable proxy IP rotation
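Taken together, a typical `config/base_config.py` setup might look like the following; the option names come from this guide and the config diff in this commit, while the values are purely illustrative.

```python
# Illustrative values only; adjust per crawl.
PLATFORM = "bili"               # xhs | dy | ks | bili | wb | tieba | zhihu
KEYWORDS = "python,playwright"  # comma-separated search keywords
CRAWLER_TYPE = "search"         # search | detail | creator
ENABLE_CDP_MODE = True          # drive a real Chrome over CDP for better anti-detection
SAVE_DATA_OPTION = "json"       # csv | db | json
ENABLE_GET_COMMENTS = False     # also crawl comments for each post
ENABLE_IP_PROXY = False         # rotate IPs from the proxy pool
```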
## Testing

### Available Test Commands
```bash
# Run all tests
python -m unittest discover test

# Run specific test files
python -m unittest test.test_expiring_local_cache
python -m unittest test.test_proxy_ip_pool
python -m unittest test.test_redis_cache
python -m unittest test.test_utils

# Install and use pytest (enhanced testing)
uv add pytest
uv run pytest test/
```
### Test Coverage
- Cache functionality tests
- Proxy IP pool tests
- Utility function tests
- Redis cache tests (requires a running Redis server)

## Database Setup

### MySQL Database Initialization
```bash
# Initialize database tables (first time only)
python db.py

# Or with uv
uv run db.py
```

### Supported Storage Options
- **MySQL**: Full relational database with deduplication
- **CSV**: Simple file-based storage in the `data/` directory
- **JSON**: Structured file-based storage in the `data/` directory
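For example, switching from file output to MySQL is a one-line config change on top of the one-time `db.py` initialization above (a sketch; `SAVE_DATA_OPTION` is named earlier in this guide).

```python
# In config/base_config.py: persist to MySQL instead of CSV/JSON files.
# Run `python db.py` once beforehand to create the tables.
SAVE_DATA_OPTION = "db"  # csv | db | json
```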
## Common Development Tasks

### Adding New Platform Support
1. Create a new directory in `media_platform/`
2. Implement a crawler class inheriting from `AbstractCrawler`
3. Add platform-specific client, core, field, and login modules
4. Update `CrawlerFactory` in `main.py` (see the sketch below)
5. Add a storage implementation in `store/`
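A sketch of steps 1 through 4. The module layout mirrors the existing platform packages; the `CRAWLERS` mapping attribute and everything named "newplatform" are hypothetical.

```python
# media_platform/newplatform/ would mirror the existing platform packages:
#   client.py  - HTTP/API client
#   core.py    - NewPlatformCrawler(AbstractCrawler)
#   field.py   - data field definitions
#   login.py   - QR code / cookie login flows

from base.base_crawler import AbstractCrawler


class NewPlatformCrawler(AbstractCrawler):
    """Hypothetical crawler for media_platform/newplatform/core.py."""

    async def start(self):
        ...  # launch browser (standard or CDP), log in, run the crawl

    async def close(self):
        ...  # release the browser context


# In main.py, register the class with the factory; the exact mapping
# attribute is an assumption about how CrawlerFactory is written there:
# CrawlerFactory.CRAWLERS["newplatform"] = NewPlatformCrawler
```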
### Debugging CDP Mode
- Set `ENABLE_CDP_MODE = True` in the config
- Use `CDP_HEADLESS = False` for visual debugging
- Check the browser console for CDP connection issues
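Concretely, in `config/base_config.py` (both names appear in this commit's Bilibili diff):

```python
# Debug-friendly CDP settings: launch through the CDP browser manager
# and keep the browser window visible while debugging.
ENABLE_CDP_MODE = True
CDP_HEADLESS = False
```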
### Managing Login States
- Login states are cached in the `browser_data/` directory
- Platform-specific user data directories maintain session cookies
- Set `SAVE_LOGIN_STATE = True` to preserve logins across runs
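For example (the setting name comes from this guide):

```python
# Reuse the session cookies stored under browser_data/ instead of
# logging in again on every run.
SAVE_LOGIN_STATE = True
```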
## Platform-Specific Notes

### Xiaohongshu (XHS)
- Supports search, detail, and creator crawling
- Requires `xsec_token` and `xsec_source` parameters for specific note URLs (see the example below)
- Custom User-Agent configuration available
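A hypothetical example of that URL shape; the config list name is an assumption, and `<note_id>`/`<token>` are placeholders, but the two query parameters are the ones this guide requires.

```python
# Hypothetical config entry for crawling one specific note.
XHS_SPECIFIED_NOTE_URL_LIST = [
    "https://www.xiaohongshu.com/explore/<note_id>?xsec_token=<token>&xsec_source=pc_search",
]
```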
### Douyin (DY)
- Requires a Node.js environment
- Supports publish-time filtering
- Uses a specific creator ID format (sec_id)

### Bilibili (BILI)
- Supports date-range filtering with `START_DAY` and `END_DAY`
- Can crawl creator fan/following lists
- Uses the BV video ID format

## Legal and Usage Notes

This project is for educational and research purposes only. Users must:
- Comply with platform terms of service
- Follow robots.txt rules
- Control request frequency appropriately
- Not use it for commercial purposes
- Respect platform rate limits

The project includes comprehensive legal disclaimers and usage guidelines in the README.md file.

config/base_config.py

Lines changed: 7 additions & 7 deletions
@@ -22,7 +22,7 @@
     "search"  # Crawl type: search (keyword search) | detail (post detail) | creator (creator homepage data)
 )
 # Custom User-Agent (currently only effective for XHS)
-UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
+UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0"

 # Whether to enable IP proxying
 ENABLE_IP_PROXY = False

@@ -190,9 +190,9 @@

 # List of Zhihu post IDs to crawl
 ZHIHU_SPECIFIED_ID_LIST = [
-    "https://www.zhihu.com/question/826896610/answer/4885821440",  # answer
-    "https://zhuanlan.zhihu.com/p/673461588",  # article
-    "https://www.zhihu.com/zvideo/1539542068422144000"  # video
+    "https://www.zhihu.com/question/826896610/answer/4885821440",  # answer
+    "https://zhuanlan.zhihu.com/p/673461588",  # article
+    "https://www.zhihu.com/zvideo/1539542068422144000",  # video
 ]

 # Word cloud settings

@@ -212,10 +212,10 @@
 FONT_PATH = "./docs/STZHONGS.TTF"

 # Crawl start date; only supported for bilibili keyword search, YYYY-MM-DD format. If None, no time range is applied and the default keyword search returns at most 1,000 videos.
-START_DAY = '2024-01-01'
+START_DAY = "2024-01-01"

 # Crawl end date; only supported for bilibili keyword search, YYYY-MM-DD format. If None, no time range is applied and the default keyword search returns at most 1,000 videos.
-END_DAY = '2024-01-01'
+END_DAY = "2024-01-01"

 # Whether to crawl day by day; only supported for bilibili keyword search
 # If False, the START_DAY and END_DAY values are ignored

@@ -233,4 +233,4 @@
 CRAWLER_MAX_CONTACTS_COUNT_SINGLENOTES = 100

 # Limit on the number of dynamics crawled per creator
-CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50
+CRAWLER_MAX_DYNAMICS_COUNT_SINGLENOTES = 50

media_platform/bilibili/core.py

Lines changed: 57 additions & 9 deletions
@@ -22,13 +22,14 @@
 from datetime import datetime, timedelta
 import pandas as pd

-from playwright.async_api import (BrowserContext, BrowserType, Page, async_playwright)
+from playwright.async_api import (BrowserContext, BrowserType, Page, Playwright, async_playwright)

 import config
 from base.base_crawler import AbstractCrawler
 from proxy.proxy_ip_pool import IpInfoModel, create_ip_pool
 from store import bilibili as bilibili_store
 from tools import utils
+from tools.cdp_browser import CDPBrowserManager
 from var import crawler_type_var, source_keyword_var

 from .client import BilibiliClient

@@ -41,10 +42,12 @@ class BilibiliCrawler(AbstractCrawler):
     context_page: Page
     bili_client: BilibiliClient
     browser_context: BrowserContext
+    cdp_manager: Optional[CDPBrowserManager]

     def __init__(self):
         self.index_url = "https://www.bilibili.com"
         self.user_agent = utils.get_user_agent()
+        self.cdp_manager = None

     async def start(self):
         playwright_proxy_format, httpx_proxy_format = None, None

@@ -55,14 +58,23 @@ async def start(self):
                 ip_proxy_info)

         async with async_playwright() as playwright:
-            # Launch a browser context.
-            chromium = playwright.chromium
-            self.browser_context = await self.launch_browser(
-                chromium,
-                None,
-                self.user_agent,
-                headless=config.HEADLESS
-            )
+            # Choose the launch mode based on configuration
+            if config.ENABLE_CDP_MODE:
+                utils.logger.info("[BilibiliCrawler] Launching browser in CDP mode")
+                self.browser_context = await self.launch_browser_with_cdp(
+                    playwright, playwright_proxy_format, self.user_agent,
+                    headless=config.CDP_HEADLESS
+                )
+            else:
+                utils.logger.info("[BilibiliCrawler] Launching browser in standard mode")
+                # Launch a browser context.
+                chromium = playwright.chromium
+                self.browser_context = await self.launch_browser(
+                    chromium,
+                    None,
+                    self.user_agent,
+                    headless=config.HEADLESS
+                )
             # stealth.min.js is a js script to prevent the website from detecting the crawler.
             await self.browser_context.add_init_script(path="libs/stealth.min.js")
             self.context_page = await self.browser_context.new_page()

@@ -434,6 +446,42 @@ async def launch_browser(
         )
         return browser_context

+    async def launch_browser_with_cdp(self, playwright: Playwright, playwright_proxy: Optional[Dict],
+                                      user_agent: Optional[str], headless: bool = True) -> BrowserContext:
+        """
+        Launch the browser in CDP mode
+        """
+        try:
+            self.cdp_manager = CDPBrowserManager()
+            browser_context = await self.cdp_manager.launch_and_connect(
+                playwright=playwright,
+                playwright_proxy=playwright_proxy,
+                user_agent=user_agent,
+                headless=headless
+            )
+
+            # Log browser info
+            browser_info = await self.cdp_manager.get_browser_info()
+            utils.logger.info(f"[BilibiliCrawler] CDP browser info: {browser_info}")
+
+            return browser_context
+
+        except Exception as e:
+            utils.logger.error(f"[BilibiliCrawler] CDP mode failed to start, falling back to standard mode: {e}")
+            # Fall back to standard mode
+            chromium = playwright.chromium
+            return await self.launch_browser(chromium, playwright_proxy, user_agent, headless)
+
+    async def close(self):
+        """Close browser context"""
+        # CDP mode needs its own cleanup path
+        if self.cdp_manager:
+            await self.cdp_manager.cleanup()
+            self.cdp_manager = None
+        else:
+            await self.browser_context.close()
+        utils.logger.info("[BilibiliCrawler.close] Browser context closed ...")
+
     async def get_bilibili_video(self, video_item: Dict, semaphore: asyncio.Semaphore):
         """
         download bilibili video

0 commit comments

Comments
 (0)