Convert video tutorials into structured AI skills with transcripts, on-screen code extraction, and AI enhancement.
Supports YouTube videos, YouTube playlists, local video files, and pre-extracted JSON data.
```bash
# Install transcript-only dependencies (lightweight, ~15 MB)
pip install "skill-seekers[video]"

# Extract a YouTube tutorial (transcript only)
skill-seekers video --url https://www.youtube.com/watch?v=VIDEO_ID

# Install visual extraction dependencies (auto-detects your GPU)
skill-seekers video --setup

# Extract with on-screen code recognition
skill-seekers video --url https://www.youtube.com/watch?v=VIDEO_ID --visual

# Extract with AI enhancement (cleans OCR, synthesizes tutorial)
skill-seekers video --url https://www.youtube.com/watch?v=VIDEO_ID --visual --enhance-level 2
```

This installs yt-dlp and youtube-transcript-api -- everything needed to pull metadata and transcripts from YouTube videos.

```bash
pip install "skill-seekers[video]"
```

Total download size is around 15 MB. No GPU or native libraries required.
Visual extraction adds scene detection, keyframe classification, OCR (optical character recognition), and Whisper speech-to-text. Install the base visual dependencies first:

```bash
pip install "skill-seekers[video-full]"
```

This installs faster-whisper, scenedetect, opencv-python-headless, and pytesseract.

Then run the setup command to install GPU-aware dependencies (PyTorch and EasyOCR):

```bash
skill-seekers video --setup
```

The `--setup` command auto-detects your GPU hardware and installs the correct PyTorch variant along with EasyOCR. These packages are installed at runtime rather than through pip extras because PyTorch requires different builds depending on your GPU.
Detection order:
| GPU Type | Detection Method | PyTorch Variant Installed |
|---|---|---|
| NVIDIA (CUDA) | `nvidia-smi` | torch with CUDA 11.8 / 12.1 / 12.4 (matched to your driver) |
| AMD (ROCm) | `rocminfo` | torch with ROCm 6.2 / 6.3 |
| AMD (no ROCm) | `lspci` | CPU-only (warns to install ROCm first) |
| Apple Silicon | macOS detection | torch with MPS support |
| CPU only | Fallback | torch CPU build |
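The detection order above can be sketched in Python. This is an illustrative sketch, not the tool's actual implementation; the function name and return labels are assumptions:

```python
import platform
import shutil
import subprocess

def detect_backend() -> str:
    """Illustrative sketch of the GPU detection order in the table above."""
    if shutil.which("nvidia-smi"):
        return "cuda"     # CUDA wheel matched to the installed driver
    if shutil.which("rocminfo"):
        return "rocm"     # ROCm build
    if shutil.which("lspci"):
        out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
        if "AMD" in out or "Radeon" in out:
            return "cpu"  # AMD hardware without ROCm: warn, install CPU build
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"      # Apple Silicon
    return "cpu"          # generic fallback
```

Each probe checks for the presence of a vendor tool rather than parsing driver versions, which is why ROCm must already be installed for AMD GPU acceleration to be detected.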
What gets installed:

- PyTorch -- correct build for your GPU
- EasyOCR -- multi-engine OCR for on-screen text extraction
- opencv-python-headless -- frame extraction and image processing
- scenedetect -- scene change detection for keyframe selection
- pytesseract -- Tesseract OCR engine (requires the `tesseract` system binary)
- faster-whisper -- Whisper speech-to-text for audio fallback
System dependency: Tesseract must be installed separately through your system package manager:

```bash
# Ubuntu/Debian
sudo apt install tesseract-ocr

# macOS
brew install tesseract

# Arch/Manjaro
sudo pacman -S tesseract
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--url URL` | string | -- | Video URL (YouTube, Vimeo) |
| `--video-file PATH` | string | -- | Local video file path |
| `--playlist URL` | string | -- | Playlist URL (processes all videos in the playlist) |
| `--visual` | flag | off | Enable visual extraction (requires video-full deps) |
| `--vision-ocr` | flag | off | Use Claude Vision API as fallback for low-confidence code frames (requires ANTHROPIC_API_KEY, ~$0.004/frame) |
| `--whisper-model MODEL` | string | `base` | Whisper model size for speech-to-text fallback |
| `--from-json FILE` | string | -- | Build skill from previously extracted JSON data |
| `--start-time TIME` | string | -- | Start time for extraction (single video only) |
| `--end-time TIME` | string | -- | End time for extraction (single video only) |
| `--setup` | flag | -- | Auto-detect GPU and install visual extraction dependencies, then exit |
| `--visual-interval SECS` | float | `0.7` | How often to sample frames during visual extraction (seconds) |
| `--visual-min-gap SECS` | float | `0.5` | Minimum gap between extracted frames (seconds) |
| `--visual-similarity THRESH` | float | `3.0` | Pixel-diff threshold for duplicate frame detection; lower values keep more frames |
| `--languages LANGS` | string | `en` | Transcript language preference (comma-separated, e.g., `en,es`) |
These flags are available on all Skill Seekers commands:
| Flag | Type | Default | Description |
|---|---|---|---|
| `--name NAME` | string | `video_skill` | Skill name (used for output directory and filenames) |
| `--description TEXT` | string | -- | Skill description (used in SKILL.md) |
| `--output DIR` | string | `output/<name>` | Output directory |
| `--enhance-level LEVEL` | int (0-3) | `0` | AI enhancement level (see AI Enhancement below). Default is 0 (disabled) for the video command. |
| `--enhance-workflow NAME` | string | -- | Enhancement workflow preset to apply (repeatable). Auto-set to `video-tutorial` when `--enhance-level` > 0. |
| `--api-key KEY` | string | -- | Anthropic API key (or set ANTHROPIC_API_KEY env var) |
| `--dry-run` | flag | off | Preview what will happen without executing |
| `--verbose` / `-v` | flag | off | Enable DEBUG-level logging |
| `--quiet` / `-q` | flag | off | Suppress most output (WARNING level only) |
Provide a YouTube URL with --url. The tool extracts metadata (title, channel, duration, chapters, tags, view count) via yt-dlp and fetches transcripts via the YouTube Transcript API.
```bash
skill-seekers video --url https://www.youtube.com/watch?v=dQw4w9WgXcQ --name my-tutorial
```

Shortened URLs also work:

```bash
skill-seekers video --url https://youtu.be/dQw4w9WgXcQ
```

Provide a playlist URL with --playlist. Every video in the playlist is processed sequentially and combined into a single skill.

```bash
skill-seekers video --playlist "https://www.youtube.com/playlist?list=PLxxxxxxx" --name course-name
```

Note: --start-time and --end-time cannot be used with playlists.
Provide a file path with --video-file. Metadata is extracted from the file itself. If a subtitle file (.srt or .vtt) exists alongside the video with the same base name, it is used automatically.
```bash
skill-seekers video --video-file recording.mp4 --name my-recording
```

For transcript extraction from local files without subtitles, the tool falls back to Whisper speech-to-text (requires faster-whisper from the video-full extras).
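The sidecar subtitle lookup described above can be sketched as follows; the helper name is illustrative, not the tool's internal API:

```python
from pathlib import Path

def find_sidecar_subtitles(video_path: str):
    """Return a .srt or .vtt file sharing the video's base name, if any."""
    video = Path(video_path)
    for ext in (".srt", ".vtt"):
        candidate = video.with_suffix(ext)
        if candidate.exists():
            return candidate  # e.g., recording.mp4 -> recording.srt
    return None
```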
If you have already run extraction and saved the JSON data, you can rebuild the skill without re-downloading or re-processing:
```bash
skill-seekers video --from-json output/my-tutorial_video_extracted.json --name my-tutorial
```

This skips all network requests and video processing -- it only runs the skill-building step.
When --visual is enabled, the tool runs a multi-stream pipeline on the video file:
Uses yt-dlp to fetch title, channel, duration, chapters, tags, thumbnails, view/like counts, and upload date.
Transcripts are acquired in priority order:
- YouTube Transcript API -- Fetches official captions. Prefers manually created transcripts over auto-generated ones. Confidence is reduced by 20% for auto-generated captions.
- Subtitle files -- Parses `.srt` or `.vtt` files found alongside local video files.
- Whisper fallback -- Runs `faster-whisper` speech-to-text on the audio track. Requires the `video-full` extras.
If none succeed, the video is processed without a transcript (visual data only).
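The tier fallback and the auto-caption confidence penalty can be sketched as below. The function signature and result shape are illustrative assumptions, not the tool's actual code:

```python
def resolve_transcript(fetch_youtube, parse_subtitle_file, run_whisper):
    """Try each transcript source in priority order. Each callable returns
    (text, confidence, auto_generated) on success, or None."""
    tiers = [("youtube", fetch_youtube),
             ("subtitle-file", parse_subtitle_file),
             ("whisper", run_whisper)]
    for name, fetch in tiers:
        result = fetch()
        if result is None:
            continue  # fall through to the next tier
        text, confidence, auto_generated = result
        if name == "youtube" and auto_generated:
            confidence *= 0.8  # 20% penalty for auto-generated captions
        return {"tier": name, "text": text, "confidence": confidence}
    return None  # no transcript: proceed with visual data only
```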
The visual pipeline has several stages:
1. Scene detection -- Samples frames at `--visual-interval` intervals (default: every 0.7 seconds). Filters duplicates using pixel-diff comparison controlled by `--visual-similarity`.
2. Keyframe classification -- Each extracted frame is classified into one of these types:
   - `CODE_EDITOR` -- IDE or text editor showing code
   - `TERMINAL` -- Command line / terminal window
   - `SLIDE` -- Presentation slide
   - `DIAGRAM` -- Diagrams, flowcharts, architecture drawings
   - `BROWSER` -- Web browser content
   - `WEBCAM` -- Speaker face / webcam feed
   - `SCREENCAST` -- General screen recording
   - `OTHER` -- Anything else
3. Per-panel OCR -- For `CODE_EDITOR` and `TERMINAL` frames, the image is split into panels (e.g., sidebar vs. editor pane) and OCR is run on each panel separately. This avoids mixing IDE UI text with actual code.
4. OCR line cleaning -- Removes line numbers captured by OCR, IDE decorations, button labels, and intra-line duplications from multi-engine results.
5. Code filtering -- The `_is_likely_code()` function checks whether OCR text contains actual programming tokens (e.g., `=`, `{}`, `def`, `import`) rather than UI junk. Only text that passes this filter is included in reference files as code blocks.
6. Text block tracking -- Groups OCR results across sequential frames into "text groups" that track the evolution of on-screen code over time. Detects additions, modifications, and deletions between frames.
7. Language detection -- Uses the `LanguageDetector` to identify the programming language of each text group based on code patterns and keywords.
8. Audio-visual alignment -- Pairs on-screen code with overlapping transcript segments to create annotated code examples (what was on screen + what the narrator was saying).
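The code filter in stage 5 can be approximated with a token heuristic like the following. This is a sketch of the idea; the real `_is_likely_code()` implementation may use different tokens and thresholds:

```python
import re

# Tokens common in real code but rare in IDE chrome or button labels.
CODE_TOKEN = re.compile(r"=|\{|\}|;|=>|\bdef\b|\bimport\b|\bclass\b|\breturn\b")

def is_likely_code(text: str) -> bool:
    """Require a majority of non-empty lines to contain a programming token."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if CODE_TOKEN.search(ln))
    return hits / len(lines) >= 0.5
```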
When --vision-ocr is enabled and OCR confidence is low on a code frame, the tool sends the frame image to the Claude Vision API for higher-quality code extraction. This costs approximately $0.004 per frame and requires ANTHROPIC_API_KEY to be set.
The tool uses a multi-engine OCR ensemble:
- EasyOCR -- Neural network-based, good at recognizing code fonts
- Tesseract (via pytesseract) -- Traditional OCR engine, handles clean text well
Results from both engines are merged with deduplication to maximize accuracy.
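Merging with deduplication can be sketched with a fuzzy line comparison; the threshold and structure here are assumptions for illustration, not the tool's exact logic:

```python
from difflib import SequenceMatcher

def merge_ocr_lines(primary, secondary, threshold=0.9):
    """Keep every line from the primary engine; add secondary-engine lines
    only when they do not near-duplicate an already-kept line."""
    merged = list(primary)
    for line in secondary:
        duplicate = any(
            SequenceMatcher(None, line, kept).ratio() >= threshold
            for kept in merged
        )
        if not duplicate:
            merged.append(line)
    return merged
```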
Enhancement is disabled by default for the video command (--enhance-level defaults to 0). Enable it by setting --enhance-level to 1, 2, or 3.
| Level | What It Does |
|---|---|
| `0` | No AI enhancement. Raw extraction output only. |
| `1` | Enhances SKILL.md only (overview, structure, readability). |
| `2` | Recommended. Two-pass enhancement: first cleans reference files (Code Timeline reconstruction), then runs workflow stages and enhances SKILL.md. |
| `3` | Full enhancement. All level-2 work plus architecture, configuration, and comprehensive documentation analysis. |
Enhancement auto-detects the mode based on environment:
- If `ANTHROPIC_API_KEY` is set, uses API mode (direct Claude API calls).
- Otherwise, uses LOCAL mode (Claude Code CLI, free with Max plan).
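The mode check amounts to an environment lookup, roughly:

```python
import os

def enhancement_mode() -> str:
    """API mode when a key is present; otherwise the local Claude Code CLI."""
    return "api" if os.environ.get("ANTHROPIC_API_KEY") else "local"
```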
At enhancement level 2 or higher, the tool runs two passes:
Pass 1: Reference file cleanup. Each reference file is sent to Claude with a focused prompt to reconstruct the Code Timeline section. The AI uses transcript context to fix OCR errors, remove UI decorations, set correct language tags, and reconstruct garbled code blocks.
Pass 2: Workflow stages + SKILL.md rewrite. The video-tutorial workflow runs four specialized stages, then the traditional SKILL.md enhancer rewrites the final output.
When --enhance-level > 0 and no --enhance-workflow is explicitly specified, the video-tutorial workflow is automatically applied. It has four stages:
1. `ocr_code_cleanup` -- Reviews all code blocks for OCR noise. Removes captured line numbers, UI elements, and common OCR character confusions (l/1, O/0, rn/m). Outputs cleaned blocks with language detection and confidence scoring.
2. `language_detection` -- Determines the programming language for each code block using narrator mentions, code patterns, visible file extensions, framework context, and pre-filled `detected_language` hints from the extraction pipeline.
3. `tutorial_synthesis` -- Groups content by topic rather than timestamp. Identifies main concepts, builds a progressive learning path, and pairs code blocks with narrator explanations. Creates structured tutorial sections with prerequisites and key concepts.
4. `skill_polish` -- Produces the final SKILL.md with clear trigger conditions, a quick reference of 5-10 annotated code examples, a step-by-step guide, and key concept definitions. Ensures all code fences have correct language tags and no raw OCR artifacts remain.
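As an illustration of the kind of fix `ocr_code_cleanup` performs, the unambiguous numeric cases of the l/1 and O/0 confusions can be corrected mechanically (the rn/m case needs dictionary context and is omitted). This is a sketch, not the stage's actual implementation:

```python
import re

def fix_numeric_confusions(line: str) -> str:
    """Fix OCR character confusions that are unambiguous inside numbers."""
    line = re.sub(r"(?<=\d)[Oo](?=\d)", "0", line)  # 1O24 -> 1024
    line = re.sub(r"(?<=\d)l(?=\d)", "1", line)     # 3l4  -> 314
    return line
```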
After extraction completes, the output directory contains:
```
output/<name>/
├── SKILL.md                         # Main skill file (enhanced if --enhance-level > 0)
├── references/
│   └── video_<sanitized-title>.md   # Full transcript + OCR + Code Timeline per video
├── frames/                          # Only present with --visual
│   └── frame_NNN_Ns.jpg             # Extracted keyframes (N = frame number, Ns = timestamp)
└── video_data/
    └── metadata.json                # Full extraction metadata (VideoScraperResult)
```
Additionally, a standalone JSON file is saved outside the skill directory:
```
output/<name>_video_extracted.json   # Raw extraction data (can be re-used with --from-json)
```
Each reference file (references/video_<title>.md) contains:
- Metadata block -- Source channel, duration, publish date, URL, view/like counts, tags
- Table of contents -- From YouTube chapters or auto-generated segments
- Segments -- Transcript text organized by time segment, with keyframe images and OCR text inline
- Code Timeline -- (visual mode) Tracked code groups showing text evolution over time, with edit diffs
- Audio-Visual Alignment -- (visual mode) Paired on-screen code with narrator explanations
- Transcript source -- Which tier provided the transcript (YouTube manual, YouTube auto, subtitle file, Whisper) and confidence score
Use --start-time and --end-time to extract only a portion of a video. This is useful for long videos where you only need a specific section.
Accepted time formats:
| Format | Example | Meaning |
|---|---|---|
| Seconds | `90` or `330.5` | 90 seconds / 330.5 seconds |
| MM:SS | `1:30` | 1 minute 30 seconds |
| HH:MM:SS | `0:05:30` | 5 minutes 30 seconds |
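All three formats reduce to seconds with one positional loop. This helper is illustrative, not the tool's actual parser:

```python
def parse_time(value: str) -> float:
    """Convert '90', '1:30', or '0:05:30' into seconds."""
    parts = value.split(":")
    if len(parts) > 3:
        raise ValueError(f"unrecognized time format: {value!r}")
    seconds = 0.0
    for part in parts:
        # Each colon-separated field shifts the accumulator by a factor of 60.
        seconds = seconds * 60 + float(part)
    return seconds
```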
Both transcript segments and chapters are filtered to the specified range. When visual extraction is enabled, frames outside the range are skipped.
```bash
# Extract only minutes 5 through 15
skill-seekers video --url https://youtu.be/VIDEO_ID --start-time 5:00 --end-time 15:00

# Extract from 2 minutes onward
skill-seekers video --url https://youtu.be/VIDEO_ID --start-time 120

# Extract the first 10 minutes
skill-seekers video --url https://youtu.be/VIDEO_ID --end-time 10:00
```

Restrictions:

- `--start-time` must be less than `--end-time` when both are specified.
- Time clipping cannot be used with `--playlist`.
```bash
# Basic transcript-only extraction
skill-seekers video --url https://www.youtube.com/watch?v=VIDEO_ID --name react-hooks-tutorial

# Visual extraction with on-screen code recognition
skill-seekers video --url https://youtu.be/VIDEO_ID --name godot-signals --visual

# Visual extraction with AI enhancement
export ANTHROPIC_API_KEY=sk-ant-...
skill-seekers video --url https://youtu.be/VIDEO_ID --name django-rest-api \
  --visual --enhance-level 2

# Local video file with a sidecar subtitle file
# Place recording.srt alongside recording.mp4
skill-seekers video --video-file ./recording.mp4 --name my-lecture

# Extract a time range with visual extraction
skill-seekers video --url https://youtu.be/VIDEO_ID --name auth-chapter \
  --start-time 15:30 --end-time 42:00 --visual

# Process a full playlist
skill-seekers video --playlist "https://www.youtube.com/playlist?list=PLxxxxxxx" \
  --name python-crash-course --languages en

# Rebuild from previously extracted JSON, with enhancement
skill-seekers video --from-json output/my-tutorial_video_extracted.json \
  --name my-tutorial --enhance-level 2

# Visual extraction with the Claude Vision OCR fallback
export ANTHROPIC_API_KEY=sk-ant-...
skill-seekers video --url https://youtu.be/VIDEO_ID --name cpp-tutorial \
  --visual --vision-ocr --enhance-level 2
```

You need to install the video extras:
```bash
pip install "skill-seekers[video]"
```

Visual extraction requires the full dependency set:

```bash
pip install "skill-seekers[video-full]"
skill-seekers video --setup
```

- NVIDIA: Ensure `nvidia-smi` is in your PATH and your GPU driver is installed.
- AMD: Ensure ROCm is installed and `rocminfo` is available. If only `lspci` detects the GPU, install ROCm first for GPU acceleration: https://rocm.docs.amd.com/
- Fallback: If no GPU is found, CPU-only PyTorch is installed. OCR and Whisper will still work, just slower.
Install Tesseract via your system package manager:
```bash
sudo apt install tesseract-ocr   # Ubuntu/Debian
brew install tesseract           # macOS
sudo pacman -S tesseract         # Arch/Manjaro
```

Some videos have no captions available. Check:

- The video may have captions disabled by the uploader.
- Try different languages with `--languages en,auto`.
- For local files, place a `.srt` or `.vtt` subtitle file alongside the video.
- Install `faster-whisper` (via `video-full`) for speech-to-text fallback.
yt-dlp can be rate-limited by YouTube. If you hit this:
- Wait a few minutes and retry.
- For playlists, the tool processes videos sequentially with natural delays.
- Consider downloading the video first with `yt-dlp` and using `--video-file`.
- Use `--vision-ocr` to enable the Claude Vision API fallback for low-confidence frames (~$0.004/frame).
- Lower `--visual-similarity` (e.g., `1.5`) to keep more frames, giving the tracker more data points.
- Decrease `--visual-interval` (e.g., `0.3`) to sample frames more frequently.
- Use `--enhance-level 2` to let AI reconstruct code blocks from transcript context.
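To see what the `--visual-similarity` knob is comparing, a mean pixel-diff between consecutive grayscale frames looks roughly like this. It is an illustrative metric only; the tool's exact computation may differ:

```python
def mean_pixel_diff(frame_a, frame_b):
    """Mean absolute difference between two equal-size grayscale frames,
    given as flat lists of 0-255 pixel values. Frames whose diff falls
    below the similarity threshold are treated as duplicates."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be the same non-zero size")
    total = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return total / len(frame_a)
```

Lowering the threshold means smaller on-screen changes (a few edited characters) still count as new frames, which is why it helps OCR quality at the cost of more frames to process.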
- Verify your API key: `echo $ANTHROPIC_API_KEY`
- Check that the key has sufficient quota.
- Try a lower enhancement level: `--enhance-level 1` only enhances SKILL.md.
- Without an API key, enhancement falls back to LOCAL mode (requires Claude Code CLI with a Max plan).
Check the error output for specifics. Common causes:
- Invalid or private YouTube URL.
- Network connectivity issues.
- Video is age-restricted or geo-blocked.
- Local file path does not exist or is not a supported format.