|
| 1 | +# Excel Export Guide |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +MediaCrawler now supports exporting crawled data to formatted Excel files (.xlsx) with professional styling and multiple sheets for contents, comments, and creators. |
| 6 | + |
| 7 | +## Features |
| 8 | + |
| 9 | +- **Multi-sheet workbooks**: Separate sheets for Contents, Comments, and Creators |
| 10 | +- **Professional formatting**: |
| 11 | + - Styled headers with blue background and white text |
| 12 | + - Auto-adjusted column widths |
| 13 | + - Cell borders and text wrapping |
| 14 | + - Clean, readable layout |
| 15 | +- **Smart export**: Empty sheets are automatically removed |
| 16 | +- **Organized storage**: Files are saved to the `data/{platform}/` directory with timestamped filenames |
| 17 | + |
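The auto-adjusted column widths listed above can be approximated by sizing each column to its longest cell. The sketch below is only an illustration of that idea, not MediaCrawler's implementation: the function name `estimate_column_widths`, the padding, and the 10 to 50 character bounds are all assumptions chosen for the example.

```python
from typing import Any, Sequence


def estimate_column_widths(
    rows: Sequence[Sequence[Any]],
    min_width: int = 10,  # assumed lower bound, not taken from MediaCrawler
    max_width: int = 50,  # assumed cap so long text wraps instead of stretching
    padding: int = 2,
) -> list[int]:
    """Return one width per column, based on the longest cell in that column."""
    if not rows:
        return []
    n_cols = max(len(row) for row in rows)
    widths = [min_width] * n_cols
    for row in rows:
        for col, value in enumerate(row):
            text = "" if value is None else str(value)
            # Cap the padded length at max_width; min_width is the starting floor
            wanted = min(max_width, len(text) + padding)
            widths[col] = max(widths[col], wanted)
    return widths


rows = [
    ["note_id", "title", "desc"],
    ["123", "Test Post", "x" * 200],
]
widths = estimate_column_widths(rows)
```

With openpyxl, each computed width would typically be assigned through `sheet.column_dimensions[letter].width`.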
| 18 | +## Installation |
| 19 | + |
| 20 | +Excel export requires the `openpyxl` library: |
| 21 | + |
| 22 | +```bash |
| 23 | +# Using uv (recommended) |
| 24 | +uv sync |
| 25 | + |
| 26 | +# Or using pip |
| 27 | +pip install openpyxl |
| 28 | +``` |
| 29 | + |
| 30 | +## Usage |
| 31 | + |
| 32 | +### Basic Usage |
| 33 | + |
| 34 | +1. **Configure Excel export** in `config/base_config.py`: |
| 35 | + |
| 36 | +```python |
| 37 | +SAVE_DATA_OPTION = "excel" # Change from json/csv/db to excel |
| 38 | +``` |
| 39 | + |
| 40 | +2. **Run the crawler**: |
| 41 | + |
| 42 | +```bash |
| 43 | +# Xiaohongshu example |
| 44 | +uv run main.py --platform xhs --lt qrcode --type search |
| 45 | + |
| 46 | +# Douyin example |
| 47 | +uv run main.py --platform dy --lt qrcode --type search |
| 48 | + |
| 49 | +# Bilibili example |
| 50 | +uv run main.py --platform bili --lt qrcode --type search |
| 51 | +``` |
| 52 | + |
| 53 | +3. **Find your Excel file** in the `data/{platform}/` directory: |
| 54 | + - Filename format: `{platform}_{crawler_type}_{timestamp}.xlsx` |
| 55 | + - Example: `xhs_search_20250128_143025.xlsx` |
| 56 | + |
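The naming scheme in step 3 can be reproduced when a script needs to predict or locate an export. This sketch assumes the timestamp follows the `%Y%m%d_%H%M%S` pattern, which is inferred from the example filename rather than confirmed by the source; `build_export_path` and `latest_export` are hypothetical helpers, not part of MediaCrawler.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_export_path(
    platform: str, crawler_type: str, when: Optional[datetime] = None
) -> Path:
    """Compose data/{platform}/{platform}_{crawler_type}_{timestamp}.xlsx."""
    when = when or datetime.now()
    # Assumed timestamp format, inferred from xhs_search_20250128_143025.xlsx
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path("data") / platform / f"{platform}_{crawler_type}_{timestamp}.xlsx"


def latest_export(platform: str, data_root: Path = Path("data")) -> Optional[Path]:
    """Return the most recently modified .xlsx for a platform, or None."""
    directory = data_root / platform
    if not directory.is_dir():
        return None
    candidates = directory.glob(f"{platform}_*.xlsx")
    # Compare modification times so different crawler types are ordered correctly
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


example = build_export_path("xhs", "search", datetime(2025, 1, 28, 14, 30, 25))
```

Passing an explicit `when` keeps the result reproducible; omitting it uses the current time.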
| 57 | +### Command Line Examples |
| 58 | + |
| 59 | +```bash |
| 60 | +# Search by keywords and export to Excel |
| 61 | +uv run main.py --platform xhs --lt qrcode --type search --save_data_option excel |
| 62 | + |
| 63 | +# Crawl specific posts and export to Excel |
| 64 | +uv run main.py --platform xhs --lt qrcode --type detail --save_data_option excel |
| 65 | + |
| 66 | +# Crawl creator profile and export to Excel |
| 67 | +uv run main.py --platform xhs --lt qrcode --type creator --save_data_option excel |
| 68 | +``` |
| 69 | + |
| 70 | +## Excel File Structure |
| 71 | + |
| 72 | +### Contents Sheet |
| 73 | +Contains post/video information: |
| 74 | +- `note_id`: Unique post identifier |
| 75 | +- `title`: Post title |
| 76 | +- `desc`: Post description |
| 77 | +- `user_id`: Author user ID |
| 78 | +- `nickname`: Author nickname |
| 79 | +- `liked_count`: Number of likes |
| 80 | +- `comment_count`: Number of comments |
| 81 | +- `share_count`: Number of shares |
| 82 | +- `ip_location`: IP location |
| 83 | +- `image_list`: Comma-separated image URLs |
| 84 | +- `tag_list`: Comma-separated tags |
| 85 | +- `note_url`: Direct link to post |
| 86 | +- And more platform-specific fields... |
| 87 | + |
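Because `image_list` and `tag_list` are stored as comma-separated strings, analysis code usually needs to split them back into lists. A minimal sketch follows; `split_cell` is a hypothetical helper, and it assumes that individual values contain no commas themselves, which may not hold for every URL.

```python
def split_cell(cell: object) -> list[str]:
    """Split a comma-separated cell into trimmed, non-empty items."""
    # Empty cells may arrive as None or NaN, so anything that is not a string yields []
    if not isinstance(cell, str):
        return []
    return [item.strip() for item in cell.split(",") if item.strip()]


tags = split_cell("travel, food,,coffee ")
```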
| 88 | +### Comments Sheet |
| 89 | +Contains comment information: |
| 90 | +- `comment_id`: Unique comment identifier |
| 91 | +- `note_id`: Associated post ID |
| 92 | +- `content`: Comment text |
| 93 | +- `user_id`: Commenter user ID |
| 94 | +- `nickname`: Commenter nickname |
| 95 | +- `like_count`: Comment likes |
| 96 | +- `create_time`: Comment timestamp |
| 97 | +- `ip_location`: Commenter location |
| 98 | +- `sub_comment_count`: Number of replies |
| 99 | +- And more... |
| 100 | + |
| 101 | +### Creators Sheet |
| 102 | +Contains creator/author information: |
| 103 | +- `user_id`: Unique user identifier |
| 104 | +- `nickname`: Display name |
| 105 | +- `gender`: Gender |
| 106 | +- `avatar`: Profile picture URL |
| 107 | +- `desc`: Bio/description |
| 108 | +- `fans`: Follower count |
| 109 | +- `follows`: Following count |
| 110 | +- `interaction`: Total interactions |
| 111 | +- And more... |
| 112 | + |
| 113 | +## Advantages Over Other Formats |
| 114 | + |
| 115 | +### vs CSV |
| 116 | +- ✅ Multiple sheets in one file |
| 117 | +- ✅ Professional formatting |
| 118 | +- ✅ Commas, quotes, and line breaks inside cells need no manual escaping |
| 119 | +- ✅ Auto-adjusted column widths |
| 120 | +- ✅ Text is stored as Unicode, so there is no encoding to guess when opening the file |
| 121 | + |
| 122 | +### vs JSON |
| 123 | +- ✅ Human-readable tabular format |
| 124 | +- ✅ Easy to open in Excel/Google Sheets |
| 125 | +- ✅ Better for data analysis |
| 126 | +- ✅ Easier to share with non-technical users |
| 127 | + |
| 128 | +### vs Database |
| 129 | +- ✅ No database setup required |
| 130 | +- ✅ Portable single-file format |
| 131 | +- ✅ Easy to share and archive |
| 132 | +- ✅ Works offline |
| 133 | + |
| 134 | +## Tips & Best Practices |
| 135 | + |
| 136 | +1. **Large datasets**: For very large crawls (>10,000 rows), consider using database storage instead for better performance |
| 137 | + |
| 138 | +2. **Data analysis**: Excel files work great with: |
| 139 | + - Microsoft Excel |
| 140 | + - Google Sheets |
| 141 | + - LibreOffice Calc |
| 142 | + - Python pandas: `pd.read_excel('file.xlsx')` |
| 143 | + |
| 144 | +3. **Combining data**: You can merge multiple Excel files using: |
| 145 | + ```python |
| 146 | + import pandas as pd |
| 147 | + df1 = pd.read_excel('file1.xlsx', sheet_name='Contents') |
| 148 | + df2 = pd.read_excel('file2.xlsx', sheet_name='Contents') |
| 149 | + combined = pd.concat([df1, df2]) |
| 150 | + combined.to_excel('combined.xlsx', index=False) |
| 151 | + ``` |
| 152 | + |
| 153 | +4. **File size**: `.xlsx` files are ZIP-compressed, so their size relative to CSV or JSON depends on the data; compare actual exports rather than assuming a fixed ratio |
| 154 | + |
| 155 | +## Troubleshooting |
| 156 | + |
| 157 | +### "openpyxl not installed" error |
| 158 | + |
| 159 | +```bash |
| 160 | +# Install openpyxl |
| 161 | +uv add openpyxl |
| 162 | +# or |
| 163 | +pip install openpyxl |
| 164 | +``` |
| 165 | + |
| 166 | +### Excel file not created |
| 167 | + |
| 168 | +Check that: |
| 169 | +1. `SAVE_DATA_OPTION = "excel"` is set in `config/base_config.py`, or `--save_data_option excel` is passed on the command line |
| 170 | +2. Crawler successfully collected data |
| 171 | +3. No errors in console output |
| 172 | +4. `data/{platform}/` directory exists |
| 173 | + |
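Two of these conditions can be verified from Python before starting a crawl: the openpyxl dependency and the output directory. This is a sketch with a hypothetical `preflight` helper; it inspects only the local environment, calls no MediaCrawler code, and assumes the default `data` root used throughout this guide.

```python
import importlib.util
from pathlib import Path


def preflight(platform: str, data_root: Path = Path("data")) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    # find_spec returns None when a top-level module cannot be located
    if importlib.util.find_spec("openpyxl") is None:
        problems.append("openpyxl is not installed")
    out_dir = data_root / platform
    if not out_dir.is_dir():
        problems.append(f"output directory missing: {out_dir.as_posix()}")
    return problems


for problem in preflight("xhs"):
    print(problem)
```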
| 174 | +### Empty Excel file |
| 175 | + |
| 176 | +This happens when: |
| 177 | +- No data was crawled (check keywords/IDs) |
| 178 | +- Login failed (check login status) |
| 179 | +- Platform blocked requests (check IP/rate limits) |
| 180 | + |
| 181 | +## Example Output |
| 182 | + |
| 183 | +After running a successful crawl, you'll see: |
| 184 | + |
| 185 | +``` |
| 186 | +[ExcelStoreBase] Initialized Excel export to: data/xhs/xhs_search_20250128_143025.xlsx |
| 187 | +[ExcelStoreBase] Stored content to Excel: 7123456789 |
| 188 | +[ExcelStoreBase] Stored comment to Excel: comment_123 |
| 189 | +... |
| 190 | +[Main] Excel file saved successfully |
| 191 | +``` |
| 192 | + |
| 193 | +Your Excel file will have: |
| 194 | +- Professional blue headers |
| 195 | +- Clean borders |
| 196 | +- Wrapped text for long content |
| 197 | +- Auto-sized columns |
| 198 | +- Separate organized sheets |
| 199 | + |
| 200 | +## Advanced Usage |
| 201 | + |
| 202 | +### Programmatic Access |
| 203 | + |
| 204 | +```python |
| 205 | +import asyncio |
| 206 | + |
| 207 | +from store.excel_store_base import ExcelStoreBase |
| 208 | + |
| 209 | +async def main(): |
| 210 | +    # Create the store |
| 211 | +    store = ExcelStoreBase(platform="xhs", crawler_type="search") |
| 212 | +    # Store one content row (store_content is awaited, so it must run in a coroutine) |
| 213 | +    await store.store_content({"note_id": "123", "title": "Test Post", "liked_count": 100}) |
| 214 | +    # Save to file |
| 215 | +    store.flush() |
| 216 | + |
| 217 | +# `await` is only valid inside a coroutine, so run it through asyncio.run |
| 218 | +asyncio.run(main()) |
| 219 | +``` |
| 220 | + |
| 221 | +### Custom Formatting |
| 222 | + |
| 223 | +You can extend `ExcelStoreBase` to customize formatting: |
| 224 | + |
| 225 | +```python |
| 226 | +from store.excel_store_base import ExcelStoreBase |
| 227 | + |
| 228 | +class CustomExcelStore(ExcelStoreBase): |
| 229 | +    def _apply_header_style(self, sheet, row_num=1): |
| 230 | +        # Custom header styling |
| 231 | +        super()._apply_header_style(sheet, row_num) |
| 232 | +        # Add your customizations here |
| 233 | +``` |
| 234 | + |
| 235 | +## Support |
| 236 | + |
| 237 | +For issues or questions: |
| 238 | +- Check the [FAQ (常见问题)](常见问题.md) |
| 239 | +- Open an issue on GitHub |
| 240 | +- Join the WeChat discussion group |
| 241 | + |
| 242 | +--- |
| 243 | + |
| 244 | +**Note**: Excel export is designed for learning and research purposes. Please respect platform terms of service and rate limits. |