fix(transcript): validate videoId and harden yt-dlp error handling#619
fix(transcript): validate videoId and harden yt-dlp error handling#619Ashvin-KS wants to merge 1 commit intoAOSSIE-Org:mainfrom
Conversation
📝 WalkthroughWalkthroughThe Changes
Sequence DiagramsequenceDiagram
actor Client
participant Handler as /getTranscript Handler
participant Validator as Video ID Validator
participant TempDir as Temp Directory
participant Subprocess as yt-dlp Process
participant FileSystem as File System
participant Response as HTTP Response
Client->>Handler: GET /getTranscript?videoId=...
Handler->>Handler: Sanitize videoId (strip whitespace)
Handler->>Validator: is_valid_youtube_video_id()
alt Invalid or Empty ID
Validator-->>Response: ❌ 400 Bad Request
Response-->>Client: Error
else Valid ID
Validator->>Handler: ✓ Valid
Handler->>TempDir: Create TemporaryDirectory()
Handler->>Subprocess: Run yt-dlp (timeout=60s)
alt Timeout
Subprocess-->>Response: ⏱️ 504 Gateway Timeout
else Process Error
Subprocess-->>Response: ❌ 502 Bad Gateway
else Success
Subprocess->>FileSystem: Write .vtt file
FileSystem->>TempDir: File created
Handler->>FileSystem: Read newest .vtt
FileSystem-->>Handler: File content
alt Empty Transcript
Handler->>Response: ⚠️ 404 Not Found
else Valid Transcript
Handler->>Response: ✓ 200 OK + Transcript
end
end
Handler->>TempDir: Cleanup (auto on exit)
Response-->>Client: Response
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can validate your CodeRabbit configuration file in your editor.If your editor has YAML language server, you can enable auto-completion and validation by adding |
There was a problem hiding this comment.
Pull request overview
This PR hardens the /getTranscript backend endpoint by validating YouTube videoId inputs and making yt-dlp execution safer and more reliable (timeouts, error mapping, and request-scoped temp files).
Changes:
- Added strict YouTube video ID format validation before invoking
yt-dlp. - Switched subtitle output to a per-request temporary directory and added guards for missing/empty transcripts.
- Added subprocess timeout + explicit error handling to return appropriate HTTP status codes.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| def is_valid_youtube_video_id(video_id): | ||
| return bool(YOUTUBE_VIDEO_ID_PATTERN.fullmatch(video_id or "")) | ||
|
|
There was a problem hiding this comment.
There should be a blank line between the top-level helper is_valid_youtube_video_id and the @app.route decorator below it to keep top-level definitions separated consistently (PEP8-style; most other endpoints in this file are separated by blank lines).
| if not subtitle_files: | ||
| return jsonify({"error": "No subtitles found"}), 404 | ||
|
|
||
| latest_subtitle = max(subtitle_files, key=os.path.getctime) |
There was a problem hiding this comment.
os.path.getctime is platform-dependent (on Unix it’s metadata-change time, not creation time). For selecting the newest subtitle file, prefer a deterministic strategy (e.g., getmtime, or selecting the expected filename for the requested video_id) to avoid surprising picks when multiple .vtt files exist.
| latest_subtitle = max(subtitle_files, key=os.path.getctime) | |
| latest_subtitle = max(subtitle_files, key=os.path.getmtime) |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/server.py (1)
579-579: Consider usinggetmtimeinstead ofgetctimefor consistency.
os.path.getctimereturns creation time on Windows but inode change time on Unix/Linux. Since files are freshly created in the temp directory, this works in practice, butos.path.getmtime(modification time) is more portable and semantically clear.♻️ Suggested change
- latest_subtitle = max(subtitle_files, key=os.path.getctime) + latest_subtitle = max(subtitle_files, key=os.path.getmtime)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/server.py` at line 579, Replace usage of os.path.getctime with os.path.getmtime when selecting the latest subtitle file (the assignment to latest_subtitle where subtitle_files is used) to rely on modification time rather than platform-dependent creation/inode-change time; ensure any surrounding logic that assumes creation-time semantics still works with mtime (no additional imports needed if os is already used).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/server.py`:
- Line 596: Remove the dead directory creation call os.makedirs("subtitles",
exist_ok=True) from server.py because the transcript endpoint now uses
per-request temporary directories; locate the invocation of
os.makedirs("subtitles", exist_ok=True) and delete that statement (and any
unused import of os if it becomes unused) so no stale "subtitles" directory is
created.
---
Nitpick comments:
In `@backend/server.py`:
- Line 579: Replace usage of os.path.getctime with os.path.getmtime when
selecting the latest subtitle file (the assignment to latest_subtitle where
subtitle_files is used) to rely on modification time rather than
platform-dependent creation/inode-change time; ensure any surrounding logic that
assumes creation-time semantics still works with mtime (no additional imports
needed if os is already used).
| return jsonify({"error": "Internal server error"}), 500 | ||
|
|
||
| if __name__ == "__main__": | ||
| os.makedirs("subtitles", exist_ok=True) |
There was a problem hiding this comment.
Remove dead code: subtitles directory is no longer used.
The transcript endpoint now uses per-request temporary directories. This os.makedirs("subtitles", ...) line creates a directory that is never used, leaving stale code behind.
🧹 Suggested removal
if __name__ == "__main__":
- os.makedirs("subtitles", exist_ok=True)
app.run()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/server.py` at line 596, Remove the dead directory creation call
os.makedirs("subtitles", exist_ok=True) from server.py because the transcript
endpoint now uses per-request temporary directories; locate the invocation of
os.makedirs("subtitles", exist_ok=True) and delete that statement (and any
unused import of os if it becomes unused) so no stale "subtitles" directory is
created.
Addressed Issues:
N/A (no separate issue filed for this PR)
Screenshots/Recordings:
N/A (backend reliability hardening; no UI changes)
Additional Notes:
AI Usage Disclosure:
I have used the following AI models and tools: GitHub Copilot (GPT-5.3-Codex) for drafting/refinement; all changes were reviewed and validated before submission.
Checklist
Summary by CodeRabbit
New Features
Bug Fixes