
Commit d19ad7d

feat: video pipeline OCR quality fixes + two-pass AI enhancement
- Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video)
- Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers
- Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts
- Add _is_likely_code() filter to prevent UI junk in reference code fences
- Add language detection to get_text_groups() via LanguageDetector
- Apply OCR cleaning in _assemble_structured_text() pipeline
- Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline using transcript context, Pass 2 generates SKILL.md from cleaned refs
- Update video-tutorial.yaml prompts for pre-cleaned references
- Add 17 new tests (197 total video tests), 2540 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent bb54b3f commit d19ad7d

File tree: 6 files changed (+489 −23 lines)

CHANGELOG.md (13 additions, 3 deletions)
```diff
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
-**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,037 lines since v3.1.3. **2,523 tests passing.**
+**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,500 lines since v3.1.3. **2,540 tests passing.**
 
 ### 🎬 Video Tutorial Scraping Pipeline (BETA)
```
```diff
@@ -23,7 +23,7 @@ Complete video tutorial extraction system that converts YouTube videos and local
 - **`video_metadata.py`** (~270 lines) — YouTube metadata extraction (title, channel, views, chapters, duration) via yt-dlp; local file metadata via ffprobe
 - **`video_transcript.py`** (~370 lines) — Multi-source transcript extraction with 3-tier fallback: YouTube Transcript API → yt-dlp subtitles → faster-whisper local transcription
 - **`video_segmenter.py`** (~220 lines) — Chapter-based and time-window segmentation with configurable overlap
-- **`video_visual.py`** (~2,290 lines) — Visual extraction pipeline:
+- **`video_visual.py`** (~2,410 lines) — Visual extraction pipeline:
   - Keyframe detection via scene change (scenedetect) with configurable threshold
   - Frame classification (code editor, slides, terminal, browser, other)
   - Panel detection — splits IDE screenshots into independent sub-sections (code, terminal, file tree)
```
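The overlapping time-window mode mentioned for `video_segmenter.py` can be pictured with a short sketch. This is a hypothetical helper — the segmenter itself is not part of this diff — but it shows the core idea: windows advance by `window - overlap`, so consecutive segments share transcript context at their seams.

```python
def time_windows(duration_s: float, window_s: float = 120.0, overlap_s: float = 15.0):
    """Yield (start, end) segment bounds covering [0, duration_s] with overlap.

    Hypothetical sketch of time-window segmentation; parameter names and
    defaults are assumptions, not the committed video_segmenter.py API.
    """
    if overlap_s >= window_s:
        raise ValueError("overlap must be smaller than the window")
    step = window_s - overlap_s  # each window starts this far after the previous one
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += step

# A 5-minute video with 2-minute windows and 15 s of overlap:
print(list(time_windows(300.0, window_s=120, overlap_s=15)))
# → [(0.0, 120.0), (105.0, 225.0), (210.0, 300.0)]
```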
```diff
@@ -37,11 +37,13 @@ Complete video tutorial extraction system that converts YouTube videos and local
   - Tesseract circuit breaker (`_tesseract_broken` flag) — disables pytesseract after first failure
 - **Audio-visual alignment** — Code blocks paired with narrator transcript for context
 - **Video-specific AI enhancement** — Custom prompt for OCR denoising, code reconstruction, and tutorial narrative synthesis
+- **Two-pass AI enhancement** — Pass 1 cleans reference files (Code Timeline reconstruction from transcript context), Pass 2 generates SKILL.md from cleaned references
+- **`_ai_clean_reference()`** — Sends reference file to Claude to reconstruct code blocks using transcript context, fixing OCR noise before SKILL.md generation
 - **`video-tutorial.yaml`** workflow preset — 4-stage enhancement pipeline (OCR cleanup → language detection → tutorial synthesis → skill polish)
 - **Video arguments** — `arguments/video.py` with `VIDEO_ARGUMENTS` dict: `--url`, `--video-file`, `--playlist`, `--vision-ocr`, `--keyframe-threshold`, `--max-keyframes`, `--whisper-model`, `--setup`, etc.
 - **Video parser** — `parsers/video_parser.py` for unified CLI parser registry
 - **MCP `scrape_video` tool** — Full video scraping from MCP server with 6 visual params, setup mode, and playlist support
-- **`tests/test_video_scraper.py`** (180 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments
+- **`tests/test_video_scraper.py`** (197 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments, OCR cleaning, code filtering
 
 #### Video `--setup`: GPU Auto-Detection & Dependency Installation
 - **`skill-seekers video --setup`** — One-command GPU auto-detection and dependency installation
```
```diff
@@ -80,6 +82,14 @@ Complete video tutorial extraction system that converts YouTube videos and local
 
 ### Fixed
 
+#### Video Pipeline OCR Quality Fixes (6)
+- **Webcam/OTHER frames skip OCR** — WEBCAM and OTHER frame types no longer get OCR'd, eliminating ~64 junk OCR results per video
+- **`_clean_ocr_line()` helper** — Strips leading line numbers, IDE tab bar text, Unity Inspector labels, and VS Code collapse markers from OCR output
+- **`_fix_intra_line_duplication()`** — Detects and removes token-sequence repetition from multi-engine OCR overlap (e.g., `gpublic class Card Jpublic class Card` → `public class Card`)
+- **`_is_likely_code()` filter** — Reference file code fences now filtered to reject UI junk (Inspector, Hierarchy, Canvas labels) that passed frame classification
+- **Language detection on text groups** — `get_text_groups()` now runs `LanguageDetector.detect_from_code()` on each group, filling the previously-always-None `detected_language` field
+- **OCR cleaning in text assembly** — `_assemble_structured_text()` applies `_clean_ocr_line()` to every line before joining
+
 #### Video Pipeline Fixes (15)
 - **`extract_visual_data` returning 2-tuple instead of 3** — Caused `ValueError` crash when unpacking results
 - **pytesseract in core deps** — Moved from core dependencies to `[video-full]` optional group
```
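The `_clean_ocr_line()` and `_fix_intra_line_duplication()` implementations are not visible in this commit's diff hunks; the sketches below only illustrate the behavior the changelog describes. All regexes and the one-character-prefix heuristic are assumptions, not the committed code.

```python
import re

# Hypothetical patterns — the real _clean_ocr_line() is not shown in this diff.
_LINE_NUMBER = re.compile(r"^\s*\d{1,4}[:|]?\s")      # leading editor line numbers
_COLLAPSE_MARKER = re.compile(r"\s*(?:\.{3}|…)\s*$")  # VS Code fold/collapse markers


def clean_ocr_line(line: str) -> str:
    """Strip a leading line number and a trailing collapse marker from one OCR line."""
    line = _LINE_NUMBER.sub("", line)
    line = _COLLAPSE_MARKER.sub("", line)
    return line.rstrip()


def fix_intra_line_duplication(line: str) -> str:
    """Collapse a repeated token-sequence suffix left by multi-engine OCR overlap.

    Two OCR engines reading the same line often disagree only on a spurious
    leading character (e.g. 'gpublic' vs 'Jpublic'); when the two copies agree
    modulo that first character, keep one copy with the bogus prefix stripped.
    """
    tokens = line.split()
    n = len(tokens)
    for size in range(n // 2, 1, -1):  # try the largest repeated suffix first
        head = tokens[n - 2 * size : n - size]
        tail = tokens[n - size :]
        pairs = list(zip(head, tail))
        if all(a == b or (a[1:] == b[1:] and a[1:]) for a, b in pairs):
            merged = [a if a == b else a[1:] for a, b in pairs]
            return " ".join(tokens[: n - 2 * size] + merged)
    return line


print(fix_intra_line_duplication("gpublic class Card Jpublic class Card"))
# → public class Card
```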

CLAUDE.md (2 additions, 2 deletions)

```diff
@@ -290,7 +290,7 @@ pytest tests/test_mcp_fastmcp.py -v
 **Test Architecture:**
 - 46 test files covering all features
 - CI Matrix: Ubuntu + macOS, Python 3.10-3.13
-- **2,121 tests passing** (current v3.1.0), up from 700+ in v2.x
+- **2,540 tests passing** (current), up from 700+ in v2.x
 - Must run `pip install -e .` before tests (src/ layout requirement)
 - Tests include create command integration tests, CLI refactor E2E tests
```

```diff
@@ -808,7 +808,7 @@ pip install -e .
 
 Per user instructions in `~/.claude/CLAUDE.md`:
 - "never skip any test. always make sure all test pass"
-- All 2,121 tests must pass before commits (v3.1.0)
+- All 2,540 tests must pass before commits
 - Run full test suite: `pytest tests/ -v`
 - New tests added for create command and CLI refactor work
```

src/skill_seekers/cli/video_scraper.py (133 additions, 10 deletions)
```diff
@@ -233,6 +233,86 @@ def _build_audio_visual_alignments(
     return alignments
 
 
+# =============================================================================
+# OCR Quality Filters
+# =============================================================================
+
+_RE_CODE_TOKENS = re.compile(
+    r"[=(){};]|(?:def|class|function|import|return|var|let|const|public|private|void|static|override|virtual|protected)\b"
+)
+_RE_UI_PATTERNS = re.compile(
+    r"\b(?:Inspector|Hierarchy|Project|Console|Image Type|Sorting Layer|Button|Canvas|Scene|Game)\b",
+    re.IGNORECASE,
+)
+
+
+def _is_likely_code(text: str) -> bool:
+    """Return True if text likely contains programming code, not UI junk."""
+    if not text or len(text.strip()) < 10:
+        return False
+    code_tokens = _RE_CODE_TOKENS.findall(text)
+    ui_patterns = _RE_UI_PATTERNS.findall(text)
+    return len(code_tokens) >= 2 and len(code_tokens) > len(ui_patterns)
+
+
+# =============================================================================
+# Two-Pass AI Reference Enhancement
+# =============================================================================
+
+def _ai_clean_reference(ref_path: str, content: str, api_key: str | None = None) -> None:
+    """Use AI to clean Code Timeline section in a reference file.
+
+    Sends the reference file content to Claude with a focused prompt
+    to reconstruct the Code Timeline from noisy OCR + transcript context.
+    """
+    try:
+        import anthropic
+    except ImportError:
+        return
+
+    key = api_key or os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_AUTH_TOKEN")
+    if not key:
+        return
+
+    base_url = os.environ.get("ANTHROPIC_BASE_URL")
+    client_kwargs: dict = {"api_key": key}
+    if base_url:
+        client_kwargs["base_url"] = base_url
+
+    prompt = (
+        "You are cleaning a video tutorial reference file. The Code Timeline section "
+        "contains OCR-extracted code that is noisy (duplicated lines, garbled characters, "
+        "UI decorations mixed in). The transcript sections above provide context about "
+        "what the code SHOULD be.\n\n"
+        "Tasks:\n"
+        "1. Reconstruct each code block in the file using transcript context\n"
+        "2. Fix OCR errors (l/1, O/0, rn/m confusions)\n"
+        "3. Remove any UI text (Inspector, Hierarchy, button labels)\n"
+        "4. Set correct language tags on code fences\n"
+        "5. Keep the document structure but clean the code text\n\n"
+        "Return the COMPLETE reference file with cleaned code blocks. "
+        "Do NOT modify the transcript or metadata sections.\n\n"
+        f"Reference file:\n{content}"
+    )
+
+    try:
+        client = anthropic.Anthropic(**client_kwargs)
+        response = client.messages.create(
+            model="claude-sonnet-4-20250514",
+            max_tokens=8000,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        result = response.content[0].text
+        if result and len(result) > len(content) * 0.5:
+            with open(ref_path, "w", encoding="utf-8") as f:
+                f.write(result)
+            logger.info(f"AI-cleaned reference: {os.path.basename(ref_path)}")
+    except Exception as e:
+        logger.debug(f"Reference enhancement failed: {e}")
+
+
 # =============================================================================
 # Main Converter Class
 # =============================================================================
```
```diff
@@ -675,6 +755,7 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
                 if (
                     ss.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL)
                     and ss.ocr_text
+                    and _is_likely_code(ss.ocr_text)
                 ):
                     lines.append(f"\n```none")
                     lines.append(ss.ocr_text)
```
```diff
@@ -683,15 +764,16 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
                 from skill_seekers.cli.video_models import FrameType
 
                 if kf.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL):
-                    lang_hint = ""
-                    if seg.detected_code_blocks:
-                        for cb in seg.detected_code_blocks:
-                            if cb.language:
-                                lang_hint = cb.language
-                                break
-                    lines.append(f"\n```none")
-                    lines.append(kf.ocr_text)
-                    lines.append("```")
+                    if _is_likely_code(kf.ocr_text):
+                        lang_hint = ""
+                        if seg.detected_code_blocks:
+                            for cb in seg.detected_code_blocks:
+                                if cb.language:
+                                    lang_hint = cb.language
+                                    break
+                        lines.append(f"\n```none")
+                        lines.append(kf.ocr_text)
+                        lines.append("```")
                 elif kf.frame_type == FrameType.SLIDE:
                     for text_line in kf.ocr_text.split("\n"):
                         if text_line.strip():
```
```diff
@@ -779,6 +861,44 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
 
         return "\n".join(lines)
 
+    def _enhance_reference_files(self, enhance_level: int, args) -> None:
+        """First-pass: AI-clean reference files before SKILL.md enhancement.
+
+        When enhance_level >= 2 and an API key is available, sends each
+        reference file to Claude to reconstruct noisy Code Timeline
+        sections using transcript context.
+        """
+        has_api_key = bool(
+            os.environ.get("ANTHROPIC_API_KEY")
+            or os.environ.get("ANTHROPIC_AUTH_TOKEN")
+            or getattr(args, "api_key", None)
+        )
+        if not has_api_key or enhance_level < 2:
+            return
+
+        refs_dir = os.path.join(self.skill_dir, "references")
+        if not os.path.isdir(refs_dir):
+            return
+
+        logger.info("\n📝 Pass 1: AI-cleaning reference files (Code Timeline reconstruction)...")
+        api_key = getattr(args, "api_key", None)
+
+        for ref_file in sorted(os.listdir(refs_dir)):
+            if not ref_file.endswith(".md"):
+                continue
+            ref_path = os.path.join(refs_dir, ref_file)
+            try:
+                with open(ref_path, encoding="utf-8") as f:
+                    content = f.read()
+            except OSError:
+                continue
+
+            # Only enhance if there are code fences to clean
+            if "```" not in content:
+                continue
+
+            _ai_clean_reference(ref_path, content, api_key)
+
     def _generate_skill_md(self) -> str:
         """Generate the main SKILL.md file."""
         lines = []
```
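The file-selection step in `_enhance_reference_files()` above — markdown files under `references/` that actually contain a code fence — reduces to a small pure helper. This sketch factors it out for testing; the committed code inlines the same checks in its loop, and the helper name here is hypothetical.

```python
import os

FENCE = "`" * 3  # a literal triple-backtick, built so this snippet embeds cleanly


def refs_needing_cleanup(refs_dir: str) -> list[str]:
    """Return reference .md files that contain code fences worth AI-cleaning.

    Mirrors the selection logic in _enhance_reference_files(): skip non-markdown
    files, skip references with no code fences, process the rest in sorted order.
    """
    if not os.path.isdir(refs_dir):
        return []
    selected = []
    for name in sorted(os.listdir(refs_dir)):
        if not name.endswith(".md"):
            continue
        with open(os.path.join(refs_dir, name), encoding="utf-8") as f:
            if FENCE in f.read():
                selected.append(name)
    return selected
```

Keeping the fence check cheap matters here: every selected file costs one Claude API call in Pass 1, so references with no code to reconstruct are filtered out before any network traffic.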
```diff
@@ -1044,11 +1164,14 @@ def main() -> int:
     # Enhancement
     enhance_level = getattr(args, "enhance_level", 0)
     if enhance_level > 0:
+        # Pass 1: Clean reference files (Code Timeline reconstruction)
+        converter._enhance_reference_files(enhance_level, args)
+
         # Auto-inject video-tutorial workflow if no workflow specified
         if not getattr(args, "enhance_workflow", None):
             args.enhance_workflow = ["video-tutorial"]
 
-    # Run workflow stages (specialized video analysis)
+    # Pass 2: Run workflow stages (specialized video analysis)
     try:
         from skill_seekers.cli.workflow_runner import run_workflows
```