This project automates the process of extracting and cleaning lecture transcripts from your university’s online video platform. It uses Selenium to navigate your course folder, scrape the transcript text from each recorded lecture, and save each transcript as a tidy `.txt` file, ready to upload into NotebookLM or other AI-assisted study tools.
- Automatic transcript scraping – Collects transcripts from all videos in a course folder
- Smart deduplication – Skips previously scraped lectures using a local record file
- Clean formatting – Removes interface artifacts and organizes timestamps for clarity
- Organized storage – Saves each transcript as a readable `.txt` file in `/transcripts`
- Error handling – Captures screenshots when issues occur for easy debugging
NotebookLM works best when provided with high-quality text context. For some courses—especially math or theory-heavy ones—there are only recorded lectures and no slides or written notes. This script bridges that gap by turning lecture recordings → structured text, giving you searchable and summarizable course material.

    git clone https://github.com/<your-username>/lecture-transcript-scraper.git
    cd lecture-transcript-scraper

Make sure you have Python 3.9+ and run:

    pip install selenium webdriver-manager

Open the script and replace:

    COURSE_FOLDER_URL = "https://example.com/your-course-folder"

with your own course folder URL that lists all recorded videos.
- Opens the course folder in a headless Chrome browser.
- Collects all available videos and extracts metadata (titles, links, IDs).
- Opens each new video, scrapes the transcript text, cleans it, and saves it to a `.txt` file under:

      transcripts/
      ├─ lecture-1.txt
      ├─ lecture-2.txt
      └─ ...

- Logs scraped videos in `scraped_titles.txt` so future runs only fetch new ones.
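
The record-file step above could look roughly like the following. This is a minimal sketch, not the script's actual code: it assumes `scraped_titles.txt` holds one title per line, and the function names and the `{"title": ...}` dictionary shape are hypothetical.

```python
from pathlib import Path

RECORD_FILE = Path("scraped_titles.txt")  # file name taken from this README


def load_scraped_titles(record_file=RECORD_FILE):
    """Return the set of titles already logged (empty on the first run)."""
    if not record_file.exists():
        return set()
    lines = record_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_scraped(title, record_file=RECORD_FILE):
    """Append one title to the record file (assumed format: one title per line)."""
    with record_file.open("a", encoding="utf-8") as handle:
        handle.write(title.strip() + "\n")


def filter_new(videos, already_scraped):
    """Keep only videos whose title is not in the record yet.

    `videos` is assumed to be a list of dicts with a "title" key.
    """
    return [video for video in videos if video["title"] not in already_scraped]
```

Appending a title only after its transcript has been saved means an interrupted run retries the unfinished lecture next time.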
Run the script:

    python transcript_scraper.py

The program will:
- Launch Chrome in headless mode
- Scrape all lectures not previously processed
- Save cleaned transcripts into the `transcripts/` folder
Progress updates (e.g. “Found 14 videos”, “Saved transcript to transcripts/lecture-2.txt”) are printed to the console.
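
For illustration, saving a transcript and printing the progress message might be sketched as below. The `slugify` and `save_transcript` helpers are hypothetical, and deriving the file name from the video title is an assumption; the real script may name files differently.

```python
import re
from pathlib import Path


def slugify(title):
    """Turn a video title into a safe file stem (assumed naming scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_transcript(title, text, out_dir="transcripts"):
    """Write the cleaned transcript to <out_dir>/<slug>.txt and report it."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # create transcripts/ if missing
    path = folder / f"{slugify(title)}.txt"
    path.write_text(text, encoding="utf-8")
    print(f"Saved transcript to {path.as_posix()}")
    return path
```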
File: transcripts/lecture-5.txt
00:12 Welcome back everyone.
00:30 Today we’re covering partial derivatives.
01:10 Recall from last lecture that...
...
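
One way raw transcript text could be turned into this format is sketched below. It assumes the platform shows each timestamp on its own line followed by the caption text, and the entries in `UI_ARTIFACTS` are made-up placeholders for whatever interface labels the real page contains.

```python
import re

# Matches MM:SS or H:MM:SS on a line by itself.
TIMESTAMP = re.compile(r"^(?:\d{1,2}:)?\d{1,2}:\d{2}$")

# Hypothetical interface labels to drop; adjust to the actual platform.
UI_ARTIFACTS = {"Transcript", "Search transcript", "Show more"}


def clean_transcript(raw):
    """Merge each timestamp with its caption and drop blank or UI-only lines."""
    cleaned = []
    pending = None  # timestamp waiting for its caption text
    for line in raw.splitlines():
        line = " ".join(line.split())  # collapse stray whitespace
        if not line or line in UI_ARTIFACTS:
            continue
        if TIMESTAMP.match(line):
            pending = line
        elif pending is not None:
            cleaned.append(f"{pending} {line}")
            pending = None
        elif cleaned:
            cleaned[-1] += " " + line  # caption wrapped onto a second line
        else:
            cleaned.append(line)  # text before any timestamp
    return "\n".join(cleaned)
```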
| File | Description |
|---|---|
| `transcript_scraper.py` | Main script that extracts and saves transcripts |
| `scraped_titles.txt` | Tracks which lectures have already been processed |
| `transcripts/` | Directory where all `.txt` files are saved |
| `error_screenshot.png` | Screenshot automatically saved on errors |
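
The error screenshot could be produced with a small wrapper like the one below. This is a sketch under assumptions: `scrape_with_screenshot` and its `scrape` callback are hypothetical names, while `save_screenshot` is the standard Selenium WebDriver method.

```python
def scrape_with_screenshot(driver, scrape, screenshot_path="error_screenshot.png"):
    """Run scrape(driver); on any failure, save a screenshot and re-raise."""
    try:
        return scrape(driver)
    except Exception:
        # Capture the page state at the moment of failure for debugging.
        driver.save_screenshot(screenshot_path)
        raise
```

Re-raising keeps the original traceback visible in the console alongside the saved image.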
- You may need to log in manually if your lectures are restricted. To do this, remove `--headless` from the Chrome options, log in once, then re-enable headless mode.
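
Toggling headless mode might look like the sketch below. The helper names and the `--window-size` value are assumptions rather than the script's actual settings; the Selenium and webdriver-manager imports happen inside `make_driver` so the argument helper can be used on its own.

```python
def chrome_arguments(headless=True):
    """Build the Chrome command-line flags; drop --headless to see the browser."""
    args = ["--window-size=1920,1080"]  # assumed value, not from the script
    if headless:
        args.append("--headless")
    return args


def make_driver(headless=True):
    """Create a Chrome WebDriver using the flags above."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling `make_driver(headless=False)` opens a visible browser window for the manual login.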
- Automatic login with cookies or SSO tokens
- Option to export `.pdf` or `.md` files
- Support for concurrent video scraping