fix(transcript): validate videoId and harden yt-dlp error handling #619

Open
Ashvin-KS wants to merge 1 commit into AOSSIE-Org:main from Ashvin-KS:fix/transcript-endpoint-hardening
@Ashvin-KS (Contributor) commented Mar 22, 2026

Addressed Issues:

N/A (no separate issue filed for this PR)

Screenshots/Recordings:

N/A (backend reliability hardening; no UI changes)

Additional Notes:

  • This PR hardens the getTranscript endpoint for reliability and safer failure behavior.
  • Added strict YouTube videoId format validation before running yt-dlp.
  • Added timeout handling for yt-dlp to avoid hanging requests.
  • Added explicit subprocess error mapping:
    • timeout returns 504
    • yt-dlp failure returns 502
    • invalid videoId returns 400
    • unexpected failure returns 500
  • Switched subtitle extraction to request-scoped temporary directory to reduce cross-request file collisions.
  • Added empty transcript guard with 404 response.
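
As a rough illustration of the validation step listed above, here is a minimal sketch. The PR's actual regex is not shown in this conversation, so the pattern below (exactly 11 characters from `A-Z`, `a-z`, `0-9`, `_` and `-`) is an assumption, and `status_for_video_id` is a hypothetical helper, not part of the PR.

```python
import re

# Assumption: the PR's real pattern is not visible here. This one accepts
# exactly 11 characters drawn from letters, digits, underscore and hyphen.
YOUTUBE_VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def is_valid_youtube_video_id(video_id):
    """Return True only if the whole string matches the assumed ID format."""
    return bool(YOUTUBE_VIDEO_ID_PATTERN.fullmatch(video_id or ""))


def status_for_video_id(raw_video_id):
    """Hypothetical helper: 400 for a missing or malformed ID, otherwise 200."""
    video_id = (raw_video_id or "").strip()
    return 200 if is_valid_youtube_video_id(video_id) else 400
```

Using `fullmatch` rather than `match` means trailing characters (including a newline or shell metacharacters) cause a rejection instead of being silently ignored.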

AI Usage Disclosure:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: GitHub Copilot (GPT-5.3-Codex) for drafting/refinement; all changes were reviewed and validated before submission.

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features

    • Added video ID validation to prevent invalid requests from processing
    • Implemented timeout protection (60s) for transcript extraction
  • Bug Fixes

    • Enhanced error responses with specific HTTP status codes for different failure scenarios

Copilot AI review requested due to automatic review settings March 22, 2026 12:36
coderabbitai bot commented Mar 22, 2026

Walkthrough

The /getTranscript endpoint in backend/server.py was refactored to add YouTube video ID validation via regex, implement per-request temporary directories for subtitle extraction instead of a persistent folder, improve error handling with specific HTTP status codes, and include timeout protection for subprocess execution.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Transcript Extraction & Validation: backend/server.py | Added YOUTUBE_VIDEO_ID_PATTERN regex constant and is_valid_youtube_video_id() function for video ID validation. Modified /getTranscript endpoint to validate and sanitize videoId parameter before processing, use temporary directories for .vtt file output instead of persistent subtitles/ folder, read newest .vtt from temp directory, and return appropriate HTTP errors (400 for invalid IDs, 404 for empty transcripts). Enhanced subprocess execution with 60-second timeout and granular error handling: subprocess.TimeoutExpired → 504, subprocess.CalledProcessError → 502, generic exceptions → 500. Removed legacy persistent file cleanup logic. |
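
The exception-to-status mapping summarized above could look roughly like the sketch below. It is not the PR's code: `run_and_map_status` is a hypothetical name, the messages other than "Internal server error" are made up for illustration, and it returns a plain tuple instead of a Flask response so it can run on its own.

```python
import subprocess
import sys


def run_and_map_status(command, timeout_seconds=60):
    """Run a command and map its outcome to an (http_status, message) pair."""
    try:
        subprocess.run(
            command,
            check=True,            # non-zero exit raises CalledProcessError
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # overrun raises TimeoutExpired
        )
    except subprocess.TimeoutExpired:
        return 504, "Transcript extraction timed out"
    except subprocess.CalledProcessError:
        return 502, "Transcript extraction failed"
    except Exception:
        # Anything else, e.g. the executable is missing (FileNotFoundError).
        return 500, "Internal server error"
    return 200, "ok"


if __name__ == "__main__":
    # A child process that exits non-zero stands in for a failing yt-dlp run.
    print(run_and_map_status([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

The order of the `except` clauses matters: the two `subprocess` exceptions must come before the generic `Exception` handler, otherwise every failure would collapse into a 500.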

Sequence Diagram

sequenceDiagram
    actor Client
    participant Handler as /getTranscript Handler
    participant Validator as Video ID Validator
    participant TempDir as Temp Directory
    participant Subprocess as yt-dlp Process
    participant FileSystem as File System
    participant Response as HTTP Response

    Client->>Handler: GET /getTranscript?videoId=...
    Handler->>Handler: Sanitize videoId (strip whitespace)
    Handler->>Validator: is_valid_youtube_video_id()
    
    alt Invalid or Empty ID
        Validator-->>Response: ❌ 400 Bad Request
        Response-->>Client: Error
    else Valid ID
        Validator->>Handler: ✓ Valid
        Handler->>TempDir: Create TemporaryDirectory()
        Handler->>Subprocess: Run yt-dlp (timeout=60s)
        
        alt Timeout
            Subprocess-->>Response: ⏱️ 504 Gateway Timeout
        else Process Error
            Subprocess-->>Response: ❌ 502 Bad Gateway
        else Success
            Subprocess->>FileSystem: Write .vtt file
            FileSystem->>TempDir: File created
            Handler->>FileSystem: Read newest .vtt
            FileSystem-->>Handler: File content
            
            alt Empty Transcript
                Handler->>Response: ⚠️ 404 Not Found
            else Valid Transcript
                Handler->>Response: ✓ 200 OK + Transcript
            end
        end
        
        Handler->>TempDir: Cleanup (auto on exit)
        Response-->>Client: Response
    end
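
The success branch of the diagram (temporary directory, read the .vtt, guard against an empty transcript, automatic cleanup) can be sketched as follows. This is an illustrative stand-in rather than the PR's handler: `read_transcript` and its `write_subtitles` callback are hypothetical, with the callback taking the place of the yt-dlp invocation, and "Transcript is empty" is a placeholder message.

```python
import glob
import os
import tempfile


def read_transcript(write_subtitles):
    """Call write_subtitles(temp_dir), then return (status, text) for the .vtt it produced."""
    with tempfile.TemporaryDirectory() as temp_dir:
        write_subtitles(temp_dir)  # stand-in for running yt-dlp into temp_dir
        subtitle_files = glob.glob(os.path.join(temp_dir, "*.vtt"))
        if not subtitle_files:
            return 404, "No subtitles found"
        newest = max(subtitle_files, key=os.path.getmtime)
        with open(newest, encoding="utf-8") as handle:
            text = handle.read().strip()
    # The directory and everything in it are removed when the with-block exits,
    # including on the early return above.
    if not text:
        return 404, "Transcript is empty"
    return 200, text
```

Because each call gets its own directory, two concurrent requests cannot pick up each other's subtitle files, which is the collision the PR description refers to.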

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hops through validation gates so tight,
Temporary homes for transcripts bright,
No more folder clutter—clean and neat,
With timeout shields and error receipts,
YouTube videos now checked twice with care!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly identifies the main changes: YouTube video ID validation and improved error handling for the yt-dlp subprocess, which align with the primary objectives of the changeset. |

Copilot AI left a comment

Pull request overview

This PR hardens the /getTranscript backend endpoint by validating YouTube videoId inputs and making yt-dlp execution safer and more reliable (timeouts, error mapping, and request-scoped temp files).

Changes:

  • Added strict YouTube video ID format validation before invoking yt-dlp.
  • Switched subtitle output to a per-request temporary directory and added guards for missing/empty transcripts.
  • Added subprocess timeout + explicit error handling to return appropriate HTTP status codes.


def is_valid_youtube_video_id(video_id):
    return bool(YOUTUBE_VIDEO_ID_PATTERN.fullmatch(video_id or ""))

Copilot AI Mar 22, 2026

There should be a blank line between the top-level helper is_valid_youtube_video_id and the @app.route decorator below it to keep top-level definitions separated consistently (PEP8-style; most other endpoints in this file are separated by blank lines).

Suggested change

if not subtitle_files:
    return jsonify({"error": "No subtitles found"}), 404

latest_subtitle = max(subtitle_files, key=os.path.getctime)
Copilot AI Mar 22, 2026

os.path.getctime is platform-dependent (on Unix it’s metadata-change time, not creation time). For selecting the newest subtitle file, prefer a deterministic strategy (e.g., getmtime, or selecting the expected filename for the requested video_id) to avoid surprising picks when multiple .vtt files exist.

Suggested change
- latest_subtitle = max(subtitle_files, key=os.path.getctime)
+ latest_subtitle = max(subtitle_files, key=os.path.getmtime)
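
To make the mtime-based selection concrete, here is a small self-contained sketch. The helper name `newest_by_mtime` and the file names are hypothetical; modification times are set explicitly with `os.utime` so the result does not depend on how quickly the files were created.

```python
import os
import tempfile
import time


def newest_by_mtime(paths):
    """Return the path whose content was modified most recently."""
    return max(paths, key=os.path.getmtime)


with tempfile.TemporaryDirectory() as temp_dir:
    older = os.path.join(temp_dir, "older.vtt")
    newer = os.path.join(temp_dir, "newer.vtt")
    for path in (older, newer):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("WEBVTT\n")
    now = time.time()
    os.utime(older, (now - 100, now - 100))  # (atime, mtime) set 100 s in the past
    os.utime(newer, (now, now))
    picked = os.path.basename(newest_by_mtime([older, newer]))

print(picked)
```

On Unix, a metadata update such as the `os.utime` call itself refreshes a file's ctime, which is one reason ctime is a less predictable sort key than mtime.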

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/server.py (1)

579-579: Consider using getmtime instead of getctime for consistency.

os.path.getctime returns creation time on Windows but inode change time on Unix/Linux. Since files are freshly created in the temp directory, this works in practice, but os.path.getmtime (modification time) is more portable and semantically clear.

♻️ Suggested change
-            latest_subtitle = max(subtitle_files, key=os.path.getctime)
+            latest_subtitle = max(subtitle_files, key=os.path.getmtime)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` at line 579, Replace usage of os.path.getctime with
os.path.getmtime when selecting the latest subtitle file (the assignment to
latest_subtitle where subtitle_files is used) to rely on modification time
rather than platform-dependent creation/inode-change time; ensure any
surrounding logic that assumes creation-time semantics still works with mtime
(no additional imports needed if os is already used).
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 192fc72c-ce8d-47f5-b4a4-6b60e189d5ec

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 2767b76.

📒 Files selected for processing (1)
  • backend/server.py

        return jsonify({"error": "Internal server error"}), 500

if __name__ == "__main__":
    os.makedirs("subtitles", exist_ok=True)
⚠️ Potential issue | 🟡 Minor

Remove dead code: subtitles directory is no longer used.

The transcript endpoint now uses per-request temporary directories. This os.makedirs("subtitles", ...) line creates a directory that is never used, leaving stale code behind.

🧹 Suggested removal
 if __name__ == "__main__":
-    os.makedirs("subtitles", exist_ok=True)
     app.run()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` at line 596, Remove the dead directory creation call
os.makedirs("subtitles", exist_ok=True) from server.py because the transcript
endpoint now uses per-request temporary directories; locate the invocation of
os.makedirs("subtitles", exist_ok=True) and delete that statement (and any
unused import of os if it becomes unused) so no stale "subtitles" directory is
created.
