Commit ebbf86d

Merge pull request #783 from hsparks-codes/feature/excel-export-and-tests

feat: Add Excel export functionality and unit tests

2 parents 31a092c + 324f09c commit ebbf86d

14 files changed: +882 −4 lines

README.md

Lines changed: 7 additions & 0 deletions

````diff
@@ -212,6 +212,10 @@ python main.py --help
 Multiple data storage methods are supported:
 - **CSV files**: data can be saved to CSV (under the `data/` directory)
 - **JSON files**: data can be saved to JSON (under the `data/` directory)
+- **Excel files**: data can be saved to formatted Excel files (under the `data/` directory) ✨ New feature
+  - Multi-sheet support (contents, comments, creators)
+  - Professional formatting (styled headers, auto column widths, borders)
+  - Easy to analyze and share
 - **Database storage**
   - Use the `--init_db` parameter to initialize the database (no other optional arguments are needed with `--init_db`)
   - **SQLite database**: lightweight, serverless, suitable for personal use (recommended)
@@ -224,6 +228,9 @@ python main.py --help
 
 ### Usage examples:
 ```shell
+# Store data in Excel (recommended for data analysis) ✨ New feature
+uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
+
 # Initialize the SQLite database (no other optional arguments are needed with '--init_db')
 uv run main.py --init_db sqlite
 # Store data in SQLite (recommended for personal users)
````

README_en.md

Lines changed: 7 additions & 0 deletions

````diff
@@ -209,6 +209,10 @@ python main.py --help
 Supports multiple data storage methods:
 - **CSV Files**: Supports saving to CSV (under `data/` directory)
 - **JSON Files**: Supports saving to JSON (under `data/` directory)
+- **Excel Files**: Supports saving to formatted Excel files (under `data/` directory) ✨ New Feature
+  - Multi-sheet support (Contents, Comments, Creators)
+  - Professional formatting (styled headers, auto-width columns, borders)
+  - Easy to analyze and share
 - **Database Storage**
   - Use the `--init_db` parameter for database initialization (when using `--init_db`, no other optional arguments are needed)
   - **SQLite Database**: Lightweight database, no server required, suitable for personal use (recommended)
@@ -221,6 +225,9 @@ Supports multiple data storage methods:
 
 ### Usage Examples:
 ```shell
+# Use Excel to store data (recommended for data analysis) ✨ New Feature
+uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel
+
 # Initialize SQLite database (when using '--init_db', no other optional arguments are needed)
 uv run main.py --init_db sqlite
 # Use SQLite to store data (recommended for personal users)
````

config/base_config.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -70,8 +70,8 @@
 # Set to False to keep the browser running, which is convenient for debugging
 AUTO_CLOSE_BROWSER = True
 
-# Data storage option. Four types are supported: csv, db, json, sqlite. DB is preferred because it deduplicates records.
-SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite
+# Data storage option. Five types are supported: csv, db, json, sqlite, excel. DB is preferred because it deduplicates records.
+SAVE_DATA_OPTION = "json"  # csv or db or json or sqlite or excel
 
 # Browser profile directory used for the user's browser cache
 USER_DATA_DIR = "%s_user_data_dir"  # %s will be replaced by platform name
```

docs/excel_export_guide.md

Lines changed: 244 additions & 0 deletions (new file)

# Excel Export Guide

## Overview

MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators.

## Features

- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators
- **Professional formatting**:
  - Styled headers with blue background and white text
  - Auto-adjusted column widths
  - Cell borders and text wrapping
  - Clean, readable layout
- **Smart export**: Empty sheets are automatically removed
- **Organized storage**: Files saved to `data/{platform}/` directory with timestamps
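The styling described above is straightforward to produce with `openpyxl`; here is a minimal, standalone sketch of that kind of formatting (the sheet name, columns, and color are illustrative, not taken from the project's actual store code):

```python
from openpyxl import Workbook, load_workbook
from openpyxl.styles import Alignment, Border, Font, PatternFill, Side
from openpyxl.utils import get_column_letter

wb = Workbook()
ws = wb.active
ws.title = "Contents"
ws.append(["note_id", "title", "liked_count"])  # header row
ws.append(["123", "Example post", 100])         # one data row

# Blue header with white bold text
fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws[1]:
    cell.font = Font(color="FFFFFF", bold=True)
    cell.fill = fill

# Thin borders and wrapped text on every cell
thin = Side(style="thin")
border = Border(left=thin, right=thin, top=thin, bottom=thin)
for row in ws.iter_rows():
    for cell in row:
        cell.border = border
        cell.alignment = Alignment(wrap_text=True, vertical="top")

# openpyxl has no true auto-fit, so approximate each width from the longest value
for idx in range(1, ws.max_column + 1):
    letter = get_column_letter(idx)
    longest = max(len(str(c.value)) for c in ws[letter] if c.value is not None)
    ws.column_dimensions[letter].width = longest + 2

wb.save("styled_demo.xlsx")
```

Note the width loop: since openpyxl cannot ask Excel to auto-fit, estimating from content length is the usual workaround.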
## Installation

Excel export requires the `openpyxl` library:

```bash
# Using uv (recommended)
uv sync

# Or using pip
pip install openpyxl
```

## Usage

### Basic Usage

1. **Configure Excel export** in `config/base_config.py`:

   ```python
   SAVE_DATA_OPTION = "excel"  # Change from json/csv/db to excel
   ```

2. **Run the crawler**:

   ```bash
   # Xiaohongshu example
   uv run main.py --platform xhs --lt qrcode --type search

   # Douyin example
   uv run main.py --platform dy --lt qrcode --type search

   # Bilibili example
   uv run main.py --platform bili --lt qrcode --type search
   ```

3. **Find your Excel file** in the `data/{platform}/` directory:
   - Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx`
   - Example: `xhs_search_20250128_143025.xlsx`
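The filename pattern above can be sketched with a small stdlib helper (the function name is illustrative, not the project's API):

```python
from datetime import datetime

def excel_export_path(platform, crawler_type, now=None):
    """Build data/{platform}/{platform}_{crawler_type}_{timestamp}.xlsx as documented above."""
    ts = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"data/{platform}/{platform}_{crawler_type}_{ts}.xlsx"

print(excel_export_path("xhs", "search", datetime(2025, 1, 28, 14, 30, 25)))
# → data/xhs/xhs_search_20250128_143025.xlsx
```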
### Command Line Examples

```bash
# Search by keywords and export to Excel
uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel

# Crawl specific posts and export to Excel
uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel

# Crawl creator profile and export to Excel
uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel
```
## Excel File Structure

### Contents Sheet

Contains post/video information:

- `note_id`: Unique post identifier
- `title`: Post title
- `desc`: Post description
- `user_id`: Author user ID
- `nickname`: Author nickname
- `liked_count`: Number of likes
- `comment_count`: Number of comments
- `share_count`: Number of shares
- `ip_location`: IP location
- `image_list`: Comma-separated image URLs
- `tag_list`: Comma-separated tags
- `note_url`: Direct link to post
- And more platform-specific fields...

### Comments Sheet

Contains comment information:

- `comment_id`: Unique comment identifier
- `note_id`: Associated post ID
- `content`: Comment text
- `user_id`: Commenter user ID
- `nickname`: Commenter nickname
- `like_count`: Comment likes
- `create_time`: Comment timestamp
- `ip_location`: Commenter location
- `sub_comment_count`: Number of replies
- And more...

### Creators Sheet

Contains creator/author information:

- `user_id`: Unique user identifier
- `nickname`: Display name
- `gender`: Gender
- `avatar`: Profile picture URL
- `desc`: Bio/description
- `fans`: Follower count
- `follows`: Following count
- `interaction`: Total interactions
- And more...
114+
115+
### vs CSV
116+
- ✅ Multiple sheets in one file
117+
- ✅ Professional formatting
118+
- ✅ Better handling of special characters
119+
- ✅ Auto-adjusted column widths
120+
- ✅ No encoding issues
121+
122+
### vs JSON
123+
- ✅ Human-readable tabular format
124+
- ✅ Easy to open in Excel/Google Sheets
125+
- ✅ Better for data analysis
126+
- ✅ Easier to share with non-technical users
127+
128+
### vs Database
129+
- ✅ No database setup required
130+
- ✅ Portable single-file format
131+
- ✅ Easy to share and archive
132+
- ✅ Works offline
133+
134+
## Tips & Best Practices

1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance.

2. **Data analysis**: Excel files work great with:
   - Microsoft Excel
   - Google Sheets
   - LibreOffice Calc
   - Python pandas: `pd.read_excel('file.xlsx')`

3. **Combining data**: You can merge multiple Excel files using:

   ```python
   import pandas as pd

   df1 = pd.read_excel('file1.xlsx', sheet_name='Contents')
   df2 = pd.read_excel('file2.xlsx', sheet_name='Contents')
   combined = pd.concat([df1, df2])
   combined.to_excel('combined.xlsx', index=False)
   ```

4. **File size**: Excel files are typically 2–3× larger than CSV but smaller than JSON.
## Troubleshooting

### "openpyxl not installed" error

```bash
# Install openpyxl
uv add openpyxl
# or
pip install openpyxl
```

### Excel file not created

Check that:

1. `SAVE_DATA_OPTION = "excel"` is set in the config
2. The crawler successfully collected data
3. There are no errors in the console output
4. The `data/{platform}/` directory exists

### Empty Excel file

This happens when:

- No data was crawled (check keywords/IDs)
- Login failed (check login status)
- The platform blocked requests (check IP/rate limits)
## Example Output

After running a successful crawl, you'll see:

```
[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx
[ExcelStoreBase] Stored content to Excel: 7123456789
[ExcelStoreBase] Stored comment to Excel: comment_123
...
[Main] Excel file saved successfully
```

Your Excel file will have:

- Professional blue headers
- Clean borders
- Wrapped text for long content
- Auto-sized columns
- Separate organized sheets
## Advanced Usage

### Programmatic Access

Note that `store_content` is a coroutine, so it must be awaited inside an async function:

```python
import asyncio

from store.excel_store_base import ExcelStoreBase

async def main():
    # Create store
    store = ExcelStoreBase(platform="xhs", crawler_type="search")

    # Store data
    await store.store_content({
        "note_id": "123",
        "title": "Test Post",
        "liked_count": 100,
    })

    # Save to file
    store.flush()

asyncio.run(main())
```

### Custom Formatting

You can extend `ExcelStoreBase` to customize formatting:

```python
from store.excel_store_base import ExcelStoreBase

class CustomExcelStore(ExcelStoreBase):
    def _apply_header_style(self, sheet, row_num=1):
        # Custom header styling
        super()._apply_header_style(sheet, row_num)
        # Add your customizations here
```

## Support

For issues or questions:

- Check the [FAQ](常见问题.md)
- Open an issue on GitHub
- Join the WeChat discussion group

---

**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits.
main.py

Lines changed: 12 additions & 0 deletions

```diff
@@ -84,6 +84,18 @@ async def main():
     crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
     await crawler.start()
 
+    # Flush Excel data if using Excel export
+    if config.SAVE_DATA_OPTION == "excel":
+        try:
+            # Get the store instance and flush data
+            from store.xhs import XhsStoreFactory
+            store = XhsStoreFactory.create_store()
+            if hasattr(store, 'flush'):
+                store.flush()
+            print(f"[Main] Excel file saved successfully")
+        except Exception as e:
+            print(f"Error flushing Excel data: {e}")
+
     # Generate wordcloud after crawling is complete
     # Only for JSON save mode
     if config.SAVE_DATA_OPTION == "json" and config.ENABLE_GET_WORDCLOUD:
```
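The flush step added here is wired to the xhs store factory specifically and guarded with `hasattr`, since only the Excel store implements `flush`. A generalized sketch of the same guard, with stand-in store classes rather than the project's real ones:

```python
def flush_stores(stores):
    """Flush every store that supports it; mirrors the hasattr guard in the diff above."""
    flushed = []
    for store in stores:
        if hasattr(store, "flush"):
            store.flush()
            flushed.append(store)
    return flushed

class FakeExcelStore:
    """Stand-in for a store with a flush() method (e.g. an Excel-backed store)."""
    def __init__(self):
        self.saved = False
    def flush(self):
        self.saved = True

s = FakeExcelStore()
result = flush_stores([s, object()])  # the bare object() has no flush and is skipped
print(s.saved)  # → True
```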

pyproject.toml

Lines changed: 3 additions & 0 deletions

```diff
@@ -35,6 +35,9 @@ dependencies = [
     "wordcloud==1.9.3",
     "xhshow>=0.1.3",
     "pre-commit>=3.5.0",
+    "openpyxl>=3.1.2",
+    "pytest>=7.4.0",
+    "pytest-asyncio>=0.21.0",
 ]
 
 [[tool.uv.index]]
```

requirements.txt

Lines changed: 4 additions & 1 deletion

```diff
@@ -25,4 +25,7 @@ alembic>=1.16.5
 asyncmy>=0.2.10
 sqlalchemy>=2.0.43
 motor>=3.3.0
-xhshow>=0.1.3
+xhshow>=0.1.3
+openpyxl>=3.1.2
+pytest>=7.4.0
+pytest-asyncio>=0.21.0
```
