
Commit d19ad7d

feat: video pipeline OCR quality fixes + two-pass AI enhancement
- Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video)
- Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers
- Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts
- Add _is_likely_code() filter to prevent UI junk in reference code fences
- Add language detection to get_text_groups() via LanguageDetector
- Apply OCR cleaning in _assemble_structured_text() pipeline
- Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline using transcript context, Pass 2 generates SKILL.md from cleaned refs
- Update video-tutorial.yaml prompts for pre-cleaned references
- Add 17 new tests (197 total video tests), 2540 tests passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent bb54b3f commit d19ad7d

File tree: 6 files changed (+489 −23 lines)

CHANGELOG.md (13 additions, 3 deletions)
```diff
@@ -7,7 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
-**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,037 lines since v3.1.3. **2,523 tests passing.**
+**Theme:** Video source support (BETA), Word document support, and quality improvements. 94 files changed, +23,500 lines since v3.1.3. **2,540 tests passing.**
 
 ### 🎬 Video Tutorial Scraping Pipeline (BETA)
```
```diff
@@ -23,7 +23,7 @@ Complete video tutorial extraction system that converts YouTube videos and local
 - **`video_metadata.py`** (~270 lines) — YouTube metadata extraction (title, channel, views, chapters, duration) via yt-dlp; local file metadata via ffprobe
 - **`video_transcript.py`** (~370 lines) — Multi-source transcript extraction with 3-tier fallback: YouTube Transcript API → yt-dlp subtitles → faster-whisper local transcription
 - **`video_segmenter.py`** (~220 lines) — Chapter-based and time-window segmentation with configurable overlap
-- **`video_visual.py`** (~2,290 lines) — Visual extraction pipeline:
+- **`video_visual.py`** (~2,410 lines) — Visual extraction pipeline:
   - Keyframe detection via scene change (scenedetect) with configurable threshold
   - Frame classification (code editor, slides, terminal, browser, other)
   - Panel detection — splits IDE screenshots into independent sub-sections (code, terminal, file tree)
```
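The overlapping time-window mode mentioned for `video_segmenter.py` can be pictured with a short sketch. This is a hypothetical helper — the segmenter itself is not part of this diff — but it shows the core idea: windows advance by `window - overlap`, so consecutive segments share transcript context at their seams.

```python
def time_windows(duration_s: float, window_s: float = 120.0, overlap_s: float = 15.0):
    """Yield (start, end) segment bounds covering [0, duration_s] with overlap.

    Hypothetical sketch of time-window segmentation; parameter names and
    defaults are assumptions, not the committed video_segmenter.py API.
    """
    if overlap_s >= window_s:
        raise ValueError("overlap must be smaller than the window")
    step = window_s - overlap_s  # each window starts this far after the previous one
    start = 0.0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += step

# A 5-minute video with 2-minute windows and 15 s of overlap:
print(list(time_windows(300.0, window_s=120, overlap_s=15)))
# → [(0.0, 120.0), (105.0, 225.0), (210.0, 300.0)]
```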
```diff
@@ -37,11 +37,13 @@ Complete video tutorial extraction system that converts YouTube videos and local
   - Tesseract circuit breaker (`_tesseract_broken` flag) — disables pytesseract after first failure
 - **Audio-visual alignment** — Code blocks paired with narrator transcript for context
 - **Video-specific AI enhancement** — Custom prompt for OCR denoising, code reconstruction, and tutorial narrative synthesis
+- **Two-pass AI enhancement** — Pass 1 cleans reference files (Code Timeline reconstruction from transcript context), Pass 2 generates SKILL.md from cleaned references
+- **`_ai_clean_reference()`** — Sends reference file to Claude to reconstruct code blocks using transcript context, fixing OCR noise before SKILL.md generation
 - **`video-tutorial.yaml`** workflow preset — 4-stage enhancement pipeline (OCR cleanup → language detection → tutorial synthesis → skill polish)
 - **Video arguments** — `arguments/video.py` with `VIDEO_ARGUMENTS` dict: `--url`, `--video-file`, `--playlist`, `--vision-ocr`, `--keyframe-threshold`, `--max-keyframes`, `--whisper-model`, `--setup`, etc.
 - **Video parser** — `parsers/video_parser.py` for unified CLI parser registry
 - **MCP `scrape_video` tool** — Full video scraping from MCP server with 6 visual params, setup mode, and playlist support
-- **`tests/test_video_scraper.py`** (180 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments
+- **`tests/test_video_scraper.py`** (197 tests) — Comprehensive coverage: models, metadata, transcript, segmenter, visual extraction, OCR, panel detection, scraper integration, CLI arguments, OCR cleaning, code filtering
 
 #### Video `--setup`: GPU Auto-Detection & Dependency Installation
 - **`skill-seekers video --setup`** — One-command GPU auto-detection and dependency installation
```
```diff
@@ -80,6 +82,14 @@ Complete video tutorial extraction system that converts YouTube videos and local
 
 ### Fixed
 
+#### Video Pipeline OCR Quality Fixes (6)
+- **Webcam/OTHER frames skip OCR** — WEBCAM and OTHER frame types no longer get OCR'd, eliminating ~64 junk OCR results per video
+- **`_clean_ocr_line()` helper** — Strips leading line numbers, IDE tab bar text, Unity Inspector labels, and VS Code collapse markers from OCR output
+- **`_fix_intra_line_duplication()`** — Detects and removes token-sequence repetition from multi-engine OCR overlap (e.g., `gpublic class Card Jpublic class Card` → `public class Card`)
+- **`_is_likely_code()` filter** — Reference file code fences now filtered to reject UI junk (Inspector, Hierarchy, Canvas labels) that passed frame classification
+- **Language detection on text groups** — `get_text_groups()` now runs `LanguageDetector.detect_from_code()` on each group, filling the previously-always-None `detected_language` field
+- **OCR cleaning in text assembly** — `_assemble_structured_text()` applies `_clean_ocr_line()` to every line before joining
+
 #### Video Pipeline Fixes (15)
 - **`extract_visual_data` returning 2-tuple instead of 3** — Caused `ValueError` crash when unpacking results
 - **pytesseract in core deps** — Moved from core dependencies to `[video-full]` optional group
```
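The `_clean_ocr_line()` and `_fix_intra_line_duplication()` implementations are not visible in this commit's diff hunks; the sketches below only illustrate the behavior the changelog describes. All regexes and the one-character-prefix heuristic are assumptions, not the committed code.

```python
import re

# Hypothetical patterns — the real _clean_ocr_line() is not shown in this diff.
_LINE_NUMBER = re.compile(r"^\s*\d{1,4}[:|]?\s")      # leading editor line numbers
_COLLAPSE_MARKER = re.compile(r"\s*(?:\.{3}|…)\s*$")  # VS Code fold/collapse markers


def clean_ocr_line(line: str) -> str:
    """Strip a leading line number and a trailing collapse marker from one OCR line."""
    line = _LINE_NUMBER.sub("", line)
    line = _COLLAPSE_MARKER.sub("", line)
    return line.rstrip()


def fix_intra_line_duplication(line: str) -> str:
    """Collapse a repeated token-sequence suffix left by multi-engine OCR overlap.

    Two OCR engines reading the same line often disagree only on a spurious
    leading character (e.g. 'gpublic' vs 'Jpublic'); when the two copies agree
    modulo that first character, keep one copy with the bogus prefix stripped.
    """
    tokens = line.split()
    n = len(tokens)
    for size in range(n // 2, 1, -1):  # try the largest repeated suffix first
        head = tokens[n - 2 * size : n - size]
        tail = tokens[n - size :]
        pairs = list(zip(head, tail))
        if all(a == b or (a[1:] == b[1:] and a[1:]) for a, b in pairs):
            merged = [a if a == b else a[1:] for a, b in pairs]
            return " ".join(tokens[: n - 2 * size] + merged)
    return line


print(fix_intra_line_duplication("gpublic class Card Jpublic class Card"))
# → public class Card
```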

CLAUDE.md (2 additions, 2 deletions)

```diff
@@ -290,7 +290,7 @@ pytest tests/test_mcp_fastmcp.py -v
 **Test Architecture:**
 - 46 test files covering all features
 - CI Matrix: Ubuntu + macOS, Python 3.10-3.13
-- **2,121 tests passing** (current v3.1.0), up from 700+ in v2.x
+- **2,540 tests passing** (current), up from 700+ in v2.x
 - Must run `pip install -e .` before tests (src/ layout requirement)
 - Tests include create command integration tests, CLI refactor E2E tests
```

```diff
@@ -808,7 +808,7 @@ pip install -e .
 
 Per user instructions in `~/.claude/CLAUDE.md`:
 - "never skip any test. always make sure all test pass"
-- All 2,121 tests must pass before commits (v3.1.0)
+- All 2,540 tests must pass before commits
 - Run full test suite: `pytest tests/ -v`
 - New tests added for create command and CLI refactor work
```

src/skill_seekers/cli/video_scraper.py (133 additions, 10 deletions)
```diff
@@ -233,6 +233,86 @@ def _build_audio_visual_alignments(
     return alignments
 
 
+# =============================================================================
+# OCR Quality Filters
+# =============================================================================
+
+_RE_CODE_TOKENS = re.compile(
+    r"[=(){};]|(?:def|class|function|import|return|var|let|const|public|private|void|static|override|virtual|protected)\b"
+)
+_RE_UI_PATTERNS = re.compile(
+    r"\b(?:Inspector|Hierarchy|Project|Console|Image Type|Sorting Layer|Button|Canvas|Scene|Game)\b",
+    re.IGNORECASE,
+)
+
+
+def _is_likely_code(text: str) -> bool:
+    """Return True if text likely contains programming code, not UI junk."""
+    if not text or len(text.strip()) < 10:
+        return False
+    code_tokens = _RE_CODE_TOKENS.findall(text)
+    ui_patterns = _RE_UI_PATTERNS.findall(text)
+    return len(code_tokens) >= 2 and len(code_tokens) > len(ui_patterns)
+
+
+# =============================================================================
+# Two-Pass AI Reference Enhancement
+# =============================================================================
+
+def _ai_clean_reference(ref_path: str, content: str, api_key: str | None = None) -> None:
+    """Use AI to clean Code Timeline section in a reference file.
+
+    Sends the reference file content to Claude with a focused prompt
+    to reconstruct the Code Timeline from noisy OCR + transcript context.
+    """
+    try:
+        import anthropic
+    except ImportError:
+        return
+
+    key = api_key or os.environ.get("ANTHROPIC_API_KEY") or os.environ.get("ANTHROPIC_AUTH_TOKEN")
+    if not key:
+        return
+
+    base_url = os.environ.get("ANTHROPIC_BASE_URL")
+    client_kwargs: dict = {"api_key": key}
+    if base_url:
+        client_kwargs["base_url"] = base_url
+
+    prompt = (
+        "You are cleaning a video tutorial reference file. The Code Timeline section "
+        "contains OCR-extracted code that is noisy (duplicated lines, garbled characters, "
+        "UI decorations mixed in). The transcript sections above provide context about "
+        "what the code SHOULD be.\n\n"
+        "Tasks:\n"
+        "1. Reconstruct each code block in the file using transcript context\n"
+        "2. Fix OCR errors (l/1, O/0, rn/m confusions)\n"
+        "3. Remove any UI text (Inspector, Hierarchy, button labels)\n"
+        "4. Set correct language tags on code fences\n"
+        "5. Keep the document structure but clean the code text\n\n"
+        "Return the COMPLETE reference file with cleaned code blocks. "
+        "Do NOT modify the transcript or metadata sections.\n\n"
+        f"Reference file:\n{content}"
+    )
+
+    try:
+        client = anthropic.Anthropic(**client_kwargs)
+        response = client.messages.create(
+            model="claude-sonnet-4-20250514",
+            max_tokens=8000,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        result = response.content[0].text
+        if result and len(result) > len(content) * 0.5:
+            with open(ref_path, "w", encoding="utf-8") as f:
+                f.write(result)
+            logger.info(f"AI-cleaned reference: {os.path.basename(ref_path)}")
+    except Exception as e:
+        logger.debug(f"Reference enhancement failed: {e}")
+
+
 # =============================================================================
 # Main Converter Class
 # =============================================================================
```
```diff
@@ -675,6 +755,7 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
                 if (
                     ss.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL)
                     and ss.ocr_text
+                    and _is_likely_code(ss.ocr_text)
                 ):
                     lines.append(f"\n```none")
                     lines.append(ss.ocr_text)
```
```diff
@@ -683,15 +764,16 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
                 from skill_seekers.cli.video_models import FrameType
 
                 if kf.frame_type in (FrameType.CODE_EDITOR, FrameType.TERMINAL):
-                    lang_hint = ""
-                    if seg.detected_code_blocks:
-                        for cb in seg.detected_code_blocks:
-                            if cb.language:
-                                lang_hint = cb.language
-                                break
-                    lines.append(f"\n```none")
-                    lines.append(kf.ocr_text)
-                    lines.append("```")
+                    if _is_likely_code(kf.ocr_text):
+                        lang_hint = ""
+                        if seg.detected_code_blocks:
+                            for cb in seg.detected_code_blocks:
+                                if cb.language:
+                                    lang_hint = cb.language
+                                    break
+                        lines.append(f"\n```none")
+                        lines.append(kf.ocr_text)
+                        lines.append("```")
                 elif kf.frame_type == FrameType.SLIDE:
                     for text_line in kf.ocr_text.split("\n"):
                         if text_line.strip():
```
```diff
@@ -779,6 +861,44 @@ def _generate_reference_md(self, video: VideoInfo) -> str:
 
         return "\n".join(lines)
 
+    def _enhance_reference_files(self, enhance_level: int, args) -> None:
+        """First-pass: AI-clean reference files before SKILL.md enhancement.
+
+        When enhance_level >= 2 and an API key is available, sends each
+        reference file to Claude to reconstruct noisy Code Timeline
+        sections using transcript context.
+        """
+        has_api_key = bool(
+            os.environ.get("ANTHROPIC_API_KEY")
+            or os.environ.get("ANTHROPIC_AUTH_TOKEN")
+            or getattr(args, "api_key", None)
+        )
+        if not has_api_key or enhance_level < 2:
+            return
+
+        refs_dir = os.path.join(self.skill_dir, "references")
+        if not os.path.isdir(refs_dir):
+            return
+
+        logger.info("\n📝 Pass 1: AI-cleaning reference files (Code Timeline reconstruction)...")
+        api_key = getattr(args, "api_key", None)
+
+        for ref_file in sorted(os.listdir(refs_dir)):
+            if not ref_file.endswith(".md"):
+                continue
+            ref_path = os.path.join(refs_dir, ref_file)
+            try:
+                with open(ref_path, encoding="utf-8") as f:
+                    content = f.read()
+            except OSError:
+                continue
+
+            # Only enhance if there are code fences to clean
+            if "```" not in content:
+                continue
+
+            _ai_clean_reference(ref_path, content, api_key)
+
     def _generate_skill_md(self) -> str:
         """Generate the main SKILL.md file."""
         lines = []
```
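The file-selection step in `_enhance_reference_files()` above — markdown files under `references/` that actually contain a code fence — reduces to a small pure helper. This sketch factors it out for testing; the committed code inlines the same checks in its loop, and the helper name here is hypothetical.

```python
import os

FENCE = "`" * 3  # a literal triple-backtick, built so this snippet embeds cleanly


def refs_needing_cleanup(refs_dir: str) -> list[str]:
    """Return reference .md files that contain code fences worth AI-cleaning.

    Mirrors the selection logic in _enhance_reference_files(): skip non-markdown
    files, skip references with no code fences, process the rest in sorted order.
    """
    if not os.path.isdir(refs_dir):
        return []
    selected = []
    for name in sorted(os.listdir(refs_dir)):
        if not name.endswith(".md"):
            continue
        with open(os.path.join(refs_dir, name), encoding="utf-8") as f:
            if FENCE in f.read():
                selected.append(name)
    return selected
```

Keeping the fence check cheap matters here: every selected file costs one Claude API call in Pass 1, so references with no code to reconstruct are filtered out before any network traffic.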
```diff
@@ -1044,11 +1164,14 @@ def main() -> int:
     # Enhancement
     enhance_level = getattr(args, "enhance_level", 0)
     if enhance_level > 0:
+        # Pass 1: Clean reference files (Code Timeline reconstruction)
+        converter._enhance_reference_files(enhance_level, args)
+
         # Auto-inject video-tutorial workflow if no workflow specified
         if not getattr(args, "enhance_workflow", None):
             args.enhance_workflow = ["video-tutorial"]
 
-    # Run workflow stages (specialized video analysis)
+    # Pass 2: Run workflow stages (specialized video analysis)
     try:
         from skill_seekers.cli.workflow_runner import run_workflows
```