fix(transcript): validate videoId and harden yt-dlp error handling by Ashvin-KS · Pull Request #619 · AOSSIE-Org/EduAid

Ashvin-KS · 2026-03-22T12:36:21Z

Addressed Issues:

N/A (no separate issue filed for this PR)

Screenshots/Recordings:

N/A (backend reliability hardening; no UI changes)

Additional Notes:

This PR hardens the getTranscript endpoint for reliability and safer failure behavior.
Added strict YouTube videoId format validation before running yt-dlp.
Added timeout handling for yt-dlp to avoid hanging requests.
Added explicit subprocess error mapping:
- timeout returns 504
- yt-dlp failure returns 502
- invalid videoId returns 400
- unexpected failure returns 500
Switched subtitle extraction to a request-scoped temporary directory to reduce cross-request file collisions.
Added empty transcript guard with 404 response.
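
For reviewers who want to see the status mapping above in one place, here is a minimal sketch of how the subprocess failures could map to HTTP codes. It is not the code from this PR: the helper name `run_with_status`, the injectable `runner` parameter, and the error strings other than "Internal server error" are assumptions made for illustration.

```python
import subprocess


def run_with_status(cmd, timeout=60, runner=subprocess.run):
    """Run cmd and return (error_message, http_status); (None, 200) on success.

    `runner` is injectable only so the mapping can be exercised without yt-dlp.
    """
    try:
        # check=True raises CalledProcessError on a non-zero exit code;
        # timeout raises TimeoutExpired once the limit is exceeded.
        runner(cmd, check=True, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "Transcript extraction timed out", 504  # hypothetical message
    except subprocess.CalledProcessError:
        return "yt-dlp failed", 502  # hypothetical message
    except Exception:
        return "Internal server error", 500
    return None, 200
```

The 400 (invalid videoId) and 404 (empty transcript) cases sit outside this helper because they do not come from the subprocess call itself.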

AI Usage Disclosure:

This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: GitHub Copilot (GPT-5.3-Codex) for drafting/refinement; all changes were reviewed and validated before submission.

Checklist

My PR addresses a single issue, fixes a single bug or makes a single improvement.
My code follows the project's code style and conventions
If applicable, I have made corresponding changes or additions to the documentation
If applicable, I have made corresponding changes or additions to tests
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contribution Guidelines
Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

New Features
- Added video ID validation to reject invalid requests before they are processed
- Implemented timeout protection (60s) for transcript extraction
Bug Fixes
- Enhanced error responses with specific HTTP status codes for different failure scenarios

coderabbitai · 2026-03-22T12:36:34Z

Walkthrough

The /getTranscript endpoint in backend/server.py was refactored to add YouTube video ID validation via regex, implement per-request temporary directories for subtitle extraction instead of a persistent folder, improve error handling with specific HTTP status codes, and include timeout protection for subprocess execution.

Changes

Cohort / File(s)	Summary
Transcript Extraction & Validation `backend/server.py`	Added `YOUTUBE_VIDEO_ID_PATTERN` regex constant and `is_valid_youtube_video_id()` function for video ID validation. Modified `/getTranscript` endpoint to validate and sanitize `videoId` parameter before processing, use temporary directories for `.vtt` file output instead of persistent `subtitles/` folder, read newest `.vtt` from temp directory, and return appropriate HTTP errors (400 for invalid IDs, 404 for empty transcripts). Enhanced subprocess execution with 60-second timeout and granular error handling: `subprocess.TimeoutExpired` → 504, `subprocess.CalledProcessError` → 502, generic exceptions → 500. Removed legacy persistent file cleanup logic.
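
A minimal sketch of the validation step summarized above. The function body matches the diff shown in this review, but the regex itself does not appear in this thread, so the pattern below (11 characters from `A-Z`, `a-z`, `0-9`, `_`, `-`) is an assumption about what `YOUTUBE_VIDEO_ID_PATTERN` looks like. The `sanitize_video_id` helper is hypothetical and only illustrates the "strip whitespace, then validate" order.

```python
import re

# Assumption: standard YouTube video IDs are 11 URL-safe characters.
YOUTUBE_VIDEO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{11}")


def is_valid_youtube_video_id(video_id):
    # fullmatch rejects any extra characters before or after the ID;
    # `or ""` lets None fall through as an invalid (empty) value.
    return bool(YOUTUBE_VIDEO_ID_PATTERN.fullmatch(video_id or ""))


def sanitize_video_id(raw):
    """Hypothetical helper: strip whitespace, return the ID or None if invalid."""
    video_id = (raw or "").strip()
    return video_id if is_valid_youtube_video_id(video_id) else None
```

A caller would return 400 when `sanitize_video_id` yields `None`. Note that this pattern still accepts IDs beginning with `-`, so it is probably safer to pass yt-dlp a full watch URL rather than the bare ID.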

Sequence Diagram

sequenceDiagram
    actor Client
    participant Handler as /getTranscript Handler
    participant Validator as Video ID Validator
    participant TempDir as Temp Directory
    participant Subprocess as yt-dlp Process
    participant FileSystem as File System
    participant Response as HTTP Response

    Client->>Handler: GET /getTranscript?videoId=...
    Handler->>Handler: Sanitize videoId (strip whitespace)
    Handler->>Validator: is_valid_youtube_video_id()
    
    alt Invalid or Empty ID
        Validator-->>Response: ❌ 400 Bad Request
        Response-->>Client: Error
    else Valid ID
        Validator->>Handler: ✓ Valid
        Handler->>TempDir: Create TemporaryDirectory()
        Handler->>Subprocess: Run yt-dlp (timeout=60s)
        
        alt Timeout
            Subprocess-->>Response: ⏱️ 504 Gateway Timeout
        else Process Error
            Subprocess-->>Response: ❌ 502 Bad Gateway
        else Success
            Subprocess->>FileSystem: Write .vtt file
            FileSystem->>TempDir: File created
            Handler->>FileSystem: Read newest .vtt
            FileSystem-->>Handler: File content
            
            alt Empty Transcript
                Handler->>Response: ⚠️ 404 Not Found
            else Valid Transcript
                Handler->>Response: ✓ 200 OK + Transcript
            end
        end
        
        Handler->>TempDir: Cleanup (auto on exit)
        Response-->>Client: Response
    end
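
The success path of the diagram above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the yt-dlp flags, the output template, the `transcript` response key, the second error message, and the injectable `runner` are all assumptions, and the sketch returns raw `.vtt` text without any cleanup the real endpoint may apply.

```python
import glob
import os
import subprocess
import tempfile


def read_transcript(video_id, runner=subprocess.run, timeout=60):
    """Return (payload, status) for an already validated video_id."""
    # The directory and everything in it is removed when the `with` block exits,
    # so concurrent requests never share subtitle files.
    with tempfile.TemporaryDirectory() as temp_dir:
        output_template = os.path.join(temp_dir, "%(id)s.%(ext)s")
        cmd = [
            "yt-dlp", "--skip-download", "--write-auto-subs",
            "--sub-langs", "en", "--sub-format", "vtt",
            "-o", output_template,
            f"https://www.youtube.com/watch?v={video_id}",
        ]  # illustrative flags; the PR's actual command is not shown in this thread
        runner(cmd, check=True, capture_output=True, text=True, timeout=timeout)

        subtitle_files = glob.glob(os.path.join(temp_dir, "*.vtt"))
        if not subtitle_files:
            return {"error": "No subtitles found"}, 404

        latest_subtitle = max(subtitle_files, key=os.path.getmtime)
        with open(latest_subtitle, encoding="utf-8") as handle:
            text = handle.read().strip()

    if not text:
        return {"error": "Transcript is empty"}, 404  # hypothetical message
    return {"transcript": text}, 200
```

Timeout and process errors propagate out of this function unchanged; the caller is expected to map them to 504 and 502.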

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hops through validation gates so tight,
Temporary homes for transcripts bright,
No more folder clutter—clean and neat,
With timeout shields and error receipts,
YouTube videos now checked twice with care! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly identifies the main changes: YouTube video ID validation and improved error handling for the yt-dlp subprocess, which align with the primary objectives of the changeset.


Copilot

Pull request overview

This PR hardens the /getTranscript backend endpoint by validating YouTube videoId inputs and making yt-dlp execution safer and more reliable (timeouts, error mapping, and request-scoped temp files).

Changes:

Added strict YouTube video ID format validation before invoking yt-dlp.
Switched subtitle output to a per-request temporary directory and added guards for missing/empty transcripts.
Added subprocess timeout + explicit error handling to return appropriate HTTP status codes.


Copilot · 2026-03-22T12:39:28Z

backend/server.py

+
+def is_valid_youtube_video_id(video_id):
+    return bool(YOUTUBE_VIDEO_ID_PATTERN.fullmatch(video_id or ""))
+


There should be a blank line between the top-level helper is_valid_youtube_video_id and the @app.route decorator below it to keep top-level definitions separated consistently (PEP8-style; most other endpoints in this file are separated by blank lines).

Suggested change

Copilot · 2026-03-22T12:39:28Z

backend/server.py

+            if not subtitle_files:
+                return jsonify({"error": "No subtitles found"}), 404
+
+            latest_subtitle = max(subtitle_files, key=os.path.getctime)


os.path.getctime is platform-dependent (on Unix it’s metadata-change time, not creation time). For selecting the newest subtitle file, prefer a deterministic strategy (e.g., getmtime, or selecting the expected filename for the requested video_id) to avoid surprising picks when multiple .vtt files exist.

Suggested change

-            latest_subtitle = max(subtitle_files, key=os.path.getctime)
+            latest_subtitle = max(subtitle_files, key=os.path.getmtime)
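
For the other option mentioned in this comment (selecting the expected filename for the requested `video_id`), a rough sketch could look like this. The helper name `pick_subtitle_file` is hypothetical, and it assumes the yt-dlp output template names files `<video_id>.<lang>.vtt`, which this thread does not confirm.

```python
import os


def pick_subtitle_file(subtitle_files, video_id):
    """Prefer files named after video_id; otherwise fall back to newest by mtime."""
    # Assumption: subtitle files are written as "<video_id>.<lang>.vtt".
    matching = [
        path for path in subtitle_files
        if os.path.basename(path).startswith(video_id + ".")
    ]
    candidates = matching or subtitle_files
    # getmtime is modification time on every platform, unlike getctime.
    return max(candidates, key=os.path.getmtime)
```

With a per-request temporary directory there is usually only one `.vtt` file, so the name match mostly serves as a guard for the case where several subtitle tracks are written.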

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

backend/server.py (1)
579-579: Consider using getmtime instead of getctime for consistency.

os.path.getctime returns creation time on Windows but inode change time on Unix/Linux. Since files are freshly created in the temp directory, this works in practice, but os.path.getmtime (modification time) is more portable and semantically clear.
♻️ Suggested change
-            latest_subtitle = max(subtitle_files, key=os.path.getctime)
+            latest_subtitle = max(subtitle_files, key=os.path.getmtime)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` at line 579, Replace usage of os.path.getctime with
os.path.getmtime when selecting the latest subtitle file (the assignment to
latest_subtitle where subtitle_files is used) to rely on modification time
rather than platform-dependent creation/inode-change time; ensure any
surrounding logic that assumes creation-time semantics still works with mtime
(no additional imports needed if os is already used).


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 192fc72c-ce8d-47f5-b4a4-6b60e189d5ec

📥 Commits

Reviewing files that changed from the base of the PR and between 2038116 and 2767b76.

📒 Files selected for processing (1)

backend/server.py

coderabbitai · 2026-03-22T12:39:57Z

backend/server.py

+        return jsonify({"error": "Internal server error"}), 500

 if __name__ == "__main__":
    os.makedirs("subtitles", exist_ok=True)


⚠️ Potential issue | 🟡 Minor

Remove dead code: subtitles directory is no longer used.

The transcript endpoint now uses per-request temporary directories. This os.makedirs("subtitles", ...) line creates a directory that is never used, leaving stale code behind.

🧹 Suggested removal

 if __name__ == "__main__":
-    os.makedirs("subtitles", exist_ok=True)
     app.run()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@backend/server.py` at line 596, Remove the dead directory creation call os.makedirs("subtitles", exist_ok=True) from server.py because the transcript endpoint now uses per-request temporary directories; locate the invocation of os.makedirs("subtitles", exist_ok=True) and delete that statement (and any unused import of os if it becomes unused) so no stale "subtitles" directory is created.

fix(transcript): validate videoId and harden yt-dlp error handling

2767b76

Copilot AI review requested due to automatic review settings March 22, 2026 12:36

Copilot started reviewing on behalf of Ashvin-KS March 22, 2026 12:36

Copilot AI reviewed Mar 22, 2026

coderabbitai bot reviewed Mar 22, 2026

Ashvin-KS wants to merge 1 commit into AOSSIE-Org:main from Ashvin-KS:fix/transcript-endpoint-hardening
