🌐 ClipVault - Your Personal Internet Knowledge Archive

Local AI-powered tool to collect web content (video/image-text), transcribe, summarize, and archive to Notion/CSV. No API fees required.

English | 简体中文



✨ Features

Transform internet content into a searchable personal knowledge base:

  • 📥 Multi-platform Support: YouTube, Bilibili, Xiaohongshu (video + image-text posts)
  • 🎙️ Local Transcription: Whisper-powered audio-to-text (no API fees)
  • 🤖 AI Summarization: Ollama LLM generates key insights (runs locally)
  • 💾 Flexible Export: Save to Notion database or CSV/Excel
  • 🔌 Automation Ready: OpenClaw skill integration for workflow automation
  • 🔄 Checkpoint Resume: Auto-resume from interruptions

📋 Pipeline Overview

Step           Module                             Technology               VRAM Usage
1. Download    downloader.py                      yt-dlp                   -
2. Transcribe  transcriber.py                     faster-whisper (small)   ~2GB
3. Summarize   summarizer.py                      Ollama qwen2.5:7b        ~4-5GB
4. Archive     notion_writer.py / csv_writer.py   Notion API / CSV         -

Total VRAM: ~6-7GB (sequential execution, 8GB GPU recommended)
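Because the stages run sequentially, only one model occupies VRAM at a time, and a checkpoint after each stage lets an interrupted run resume where it stopped. A minimal sketch of such an orchestrator (the step names, state-dict keys, and file layout are illustrative assumptions, not ClipVault's actual API):

```python
# Hypothetical orchestrator sketch: run pipeline steps in order, saving a
# checkpoint after each one so an interrupted run resumes where it stopped.
# Step names and signatures are assumptions, not the project's real API.
import hashlib
import json
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")

def run_pipeline(url: str, steps) -> dict:
    """steps: list of (name, fn) where fn takes and returns the state dict."""
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    ckpt = CHECKPOINT_DIR / (hashlib.md5(url.encode()).hexdigest() + ".json")
    if ckpt.exists():  # resume from a previous interrupted run
        saved = json.loads(ckpt.read_text())
    else:
        saved = {"done": [], "data": {"url": url}}
    for name, fn in steps:
        if name in saved["done"]:
            continue  # this step already finished in an earlier run
        saved["data"] = fn(saved["data"])
        saved["done"].append(name)
        ckpt.write_text(json.dumps(saved))  # persist progress after each step
    return saved["data"]
```

Each step (download, transcribe, summarize, archive) would load its model, do its work, and release VRAM before the next step begins, which is why the peak stays around 6-7GB rather than the sum of all models.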


🖥️ System Requirements

  • OS: Windows 11 / Linux / macOS
  • GPU: NVIDIA GPU with 8GB+ VRAM (e.g., RTX 4060/5060, RTX 3070)
  • RAM: 16GB+ recommended (32GB optimal)
  • CUDA: 12.x (for GPU acceleration)
  • Python: 3.9+

📦 Installation

1. Basic Environment

# Create project directory
mkdir clipvault
cd clipvault

# Create virtual environment (recommended)
python -m venv venv
.\venv\Scripts\activate    # Windows (Linux/macOS: source venv/bin/activate)

# Install dependencies
pip install -r requirements.txt

2. Install yt-dlp

# Method 1: pip
pip install yt-dlp

# Method 2: winget (Windows)
winget install yt-dlp

3. Install FFmpeg (Required for Bilibili, etc.)

# winget
winget install FFmpeg.FFmpeg

# Or download manually: https://ffmpeg.org/download.html
# Add ffmpeg.exe to PATH

4. Install Ollama

# Download: https://ollama.com/download/windows
# Or use winget
winget install Ollama.Ollama

# Start service (runs in background)
ollama serve

# Pull model
ollama pull qwen2.5:7b-instruct-q4_K_M

# Verify
ollama list
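Once the model is pulled, summarizer.py can talk to the local server over Ollama's standard REST API. A hedged sketch of what that call might look like; the prompt wording and the `build_payload`/`summarize` helpers are assumptions, only the `/api/generate` request shape is standard Ollama:

```python
# Sketch of calling the local Ollama server for summarization. Requires
# `ollama serve` to be running; prompt text is an illustrative assumption.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5:7b-instruct-q4_K_M"

def build_payload(transcript: str) -> dict:
    """Build the non-streaming request body Ollama's /api/generate expects."""
    prompt = ("Summarize the key points of this transcript in 3-5 bullets:\n\n"
              + transcript)
    return {"model": MODEL, "prompt": prompt, "stream": False}

def summarize(transcript: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(transcript)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # fails if ollama serve is down
        return json.loads(resp.read())["response"]
```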

5. Download Whisper Model

The small model (~500MB) will be automatically downloaded on first run.
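A sketch of how transcriber.py might use faster-whisper, with settings matching the pipeline table above (small model, float16 on CUDA); the `join_segments` helper and function names are assumptions for illustration:

```python
# Transcriber sketch using faster-whisper. Instantiating WhisperModel("small")
# triggers the one-time ~500MB download mentioned above.
def join_segments(segments) -> str:
    """Concatenate segment texts into one transcript string (hypothetical helper)."""
    return " ".join(seg.text.strip() for seg in segments)

def transcribe(audio_path: str) -> str:
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency
    model = WhisperModel("small", device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path)
    return join_segments(segments)
```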


⚙️ Configuration

Option 1: Notion (Knowledge Base)

Step 1: Create Integration

  1. Visit https://www.notion.so/my-integrations
  2. Click New integration
  3. Name: ClipVault
  4. Get Internal Integration Token

Step 2: Create Database

Create a Notion database with these properties:

Property     Type          Description
Title        Title         Content title
URL          URL           Source link
Platform     Select        YouTube/Bilibili/...
Transcript   Text          Full transcription
Summary      Text          AI-generated summary
Tags         Multi-select  Auto-generated tags
KeyPoints    Text          Key takeaways
Category     Select        Content category
Sentiment    Select        positive/negative/neutral
CreatedTime  Date          Creation timestamp

Step 3: Share Database with Integration

  1. Open Notion database page
  2. Click ... (top-right) → Connections → Add ClipVault

Step 4: Get Database ID

https://notion.so/{workspace}/{Database_ID}?v=...
                      ↑ This is your Database ID
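The Database ID is the 32-hex-character segment of that URL (sometimes shown dash-separated as a UUID). A small hypothetical helper to pull it out programmatically:

```python
# Hypothetical helper: extract the 32-hex-char Notion Database ID from a
# database URL, stripping dashes if the link uses the UUID form.
import re
from typing import Optional

def extract_database_id(url: str) -> Optional[str]:
    m = re.search(
        r"([0-9a-f]{32}"
        r"|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
        url.lower(),
    )
    return m.group(1).replace("-", "") if m else None
```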

Step 5: Configure Environment

# Copy template
copy .env.example .env

# Edit .env file
notepad .env

Add to .env:

NOTION_TOKEN=your_integration_token_here
NOTION_DATABASE_ID=your_database_id_here
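With those two variables set, notion_writer.py can map each result onto the database properties listed above. A sketch using the official notion-client package; the property names match the table, but the result-dict keys and function names are assumptions:

```python
# Sketch of a Notion writer. Property names match the database schema above;
# the shape of the incoming result dict is an illustrative assumption.
import os

def build_properties(result: dict) -> dict:
    """Translate a result dict into Notion's property payload format."""
    return {
        "Title": {"title": [{"text": {"content": result["title"]}}]},
        "URL": {"url": result["url"]},
        "Platform": {"select": {"name": result["platform"]}},
        # Notion caps a single rich_text block at 2,000 characters
        "Summary": {"rich_text": [{"text": {"content": result["summary"][:2000]}}]},
        "Tags": {"multi_select": [{"name": t} for t in result.get("tags", [])]},
    }

def save_to_notion(result: dict) -> None:
    from notion_client import Client  # pip install notion-client
    client = Client(auth=os.environ["NOTION_TOKEN"])
    client.pages.create(
        parent={"database_id": os.environ["NOTION_DATABASE_ID"]},
        properties=build_properties(result),
    )
```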

Option 2: CSV/Excel (Simple Export)

No configuration needed! If Notion is not configured, results will automatically save to:

  • output/results.csv (CSV format)
  • Can be imported to Excel, Google Sheets, or Airtable

CSV includes all fields: Title, URL, Platform, Transcript, Summary, Tags, KeyPoints, Category, Sentiment, CreatedTime
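A sketch of what the CSV fallback might look like with the standard-library csv module: append one row per processed item and write the header only when the file is new. The function name and row-dict shape are assumptions; the field names mirror the list above:

```python
# CSV fallback sketch: append results to output/results.csv, creating the
# header on first write. Field names mirror the README's field list.
import csv
from pathlib import Path

FIELDS = ["Title", "URL", "Platform", "Transcript", "Summary",
          "Tags", "KeyPoints", "Category", "Sentiment", "CreatedTime"]

def append_result(row: dict, path: str = "output/results.csv") -> None:
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    new_file = not out.exists()
    # utf-8-sig so Excel auto-detects the encoding when opening the file
    with out.open("a", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if new_file:
            writer.writeheader()
        writer.writerow(row)  # missing fields are written as empty cells
```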


🚀 Usage

CLI Basic Usage

# Activate environment
.\venv\Scripts\activate

# Process single video
python main.py "https://www.youtube.com/watch?v=xxx"

# Debug mode
python main.py "url" --log-level DEBUG

# Skip specific steps
python main.py "url" --skip-summary
python main.py "url" --no-cleanup

Batch Processing

# Create URLs file
@"
https://youtube.com/watch?v=xxx1
https://bilibili.com/video/xxx2
https://youtube.com/watch?v=xxx3
"@ | Out-File -Encoding utf8 urls.txt

# Batch process
Get-Content urls.txt | ForEach-Object { python main.py $_ }

OpenClaw Skill Integration

Use the clip_to_vault skill for automation:

# Example: Auto-save interesting videos to knowledge base
skill.clip_to_vault(url="https://youtube.com/watch?v=xxx")

🔍 CUDA Check

# Check CUDA availability
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"

# VRAM info
python -c "import torch; print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f}GB')"

📊 VRAM Usage Estimates

Model          Parameters  Quantization  VRAM    Speed
Whisper small  -           float16       ~2GB    Fast
qwen2.5:7b     7B          Q4_K_M        ~4-5GB  Medium
Total          -           -             ~6-7GB  <3min for 10min video

🐛 Troubleshooting

OOM (Out of Memory) Solutions

  1. Reduce model precision

    # transcriber.py
    COMPUTE_TYPE = "int8"  # Change from float16 to int8
  2. Use smaller LLM

    # summarizer.py
    DEFAULT_MODEL = "llama3.2:3b-instruct-q4_K_M"  # ~2-3GB
  3. Truncate long texts

    # Limit transcript length
    transcript = transcript[:3000]
  4. Explicit VRAM cleanup

    import gc
    import torch

    del model                 # drop the last reference first
    gc.collect()              # let Python actually free the object
    torch.cuda.empty_cache()  # then return cached VRAM to the driver

Common Issues

Issue                    Solution
yt-dlp download fails    Check network or use proxy
Whisper errors           Verify FFmpeg is installed
Ollama connection fails  Run ollama serve
Notion 401 error         Check Token and Database ID
CSV not saving           Check output/ folder permissions

📁 Project Structure

clipvault/
├── main.py              # Main entry point
├── downloader.py        # Video/content downloader
├── transcriber.py       # Whisper transcription
├── summarizer.py        # LLM summarization
├── notion_writer.py     # Notion API writer
├── csv_writer.py        # CSV exporter (future)
├── requirements.txt     # Dependencies
├── .env.example         # Config template
├── .env                 # Local config (gitignore)
├── downloads/           # Temporary audio files
├── logs/                # Execution logs
├── checkpoints/         # Resume points
├── output/              # CSV/Excel exports
│   └── results.csv
├── README.md            # English docs
└── README.zh-CN.md      # Chinese docs

🔧 Optimization Tips

Speed Optimization

  1. Use faster models

    • Whisper: base (faster than small)
    • LLM: phi3.5:3.8b-mini (faster but slightly lower quality)
  2. Cache models

    • Keep Ollama running after first load

Quality Optimization

  1. Use larger models
    • Whisper: medium (requires more VRAM)
    • LLM: qwen2.5:14b (needs 10GB+ VRAM)

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Support for more platforms (Instagram, TikTok, etc.)
  • Web UI interface
  • Batch processing dashboard
  • Excel/Airtable direct integration
  • Multi-language summary support

📝 License

MIT License - Free to use and modify

