Lecture Transcript Scraper

This project automates extracting and cleaning lecture transcripts from your university’s online video platform. It uses Selenium to navigate your course folder, scrape the transcript text from each recorded lecture, and save each transcript as a tidy .txt file, ready to upload into NotebookLM or other AI-assisted study tools.

Features

Automatic transcript scraping – Collects transcripts from all videos in a course folder
Smart deduplication – Skips previously scraped lectures using a local record file
Clean formatting – Removes interface artifacts and organizes timestamps for clarity
Organized storage – Saves each transcript as a readable .txt file in /transcripts
Error handling – Captures screenshots when issues occur for easy debugging

Motivation

NotebookLM works best when provided with high-quality text context. For some courses, especially math or theory-heavy ones, there are only recorded lectures and no slides or written notes. This script bridges that gap by turning lecture recordings into structured text, giving you searchable, summarizable course material.

Setup

1. Clone this repository

git clone https://github.com/<your-username>/lecture-transcript-scraper.git
cd lecture-transcript-scraper

2. Install dependencies

Make sure you have Python 3.9+ and run:

pip install selenium webdriver-manager

3. Configure your course URL

Open the script and replace:

COURSE_FOLDER_URL = "https://example.com/your-course-folder"

with your own course folder URL that lists all recorded videos.

How It Works

Opens the course folder in a headless Chrome browser.
Collects all available videos and extracts metadata (titles, links, IDs).
Opens each new video, scrapes the transcript text, cleans it, and saves it to a .txt file under:
```
transcripts/
  ├─ lecture-1.txt
  ├─ lecture-2.txt
  └─ ...
```
Logs scraped videos in scraped_titles.txt so future runs only fetch new ones.
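
The deduplication step can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it assumes scraped_titles.txt holds one lecture title per line (the format is not documented here), and the function names and the dict shape of each video are hypothetical.

```python
from pathlib import Path

RECORD_FILE = Path("scraped_titles.txt")  # record file named in this README


def load_scraped_titles(record_file: Path = RECORD_FILE) -> set:
    """Return the set of titles already recorded (empty if no file yet)."""
    if not record_file.exists():
        return set()
    lines = record_file.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def mark_scraped(title: str, record_file: Path = RECORD_FILE) -> None:
    """Append one title to the record file after its transcript is saved."""
    with record_file.open("a", encoding="utf-8") as handle:
        handle.write(title.strip() + "\n")


def select_new_videos(videos: list, record_file: Path = RECORD_FILE) -> list:
    """Keep only videos whose title is not yet in the record file.

    Each video is assumed to be a dict with a "title" key (hypothetical shape).
    """
    seen = load_scraped_titles(record_file)
    return [video for video in videos if video["title"].strip() not in seen]
```

Calling mark_scraped only after a transcript has been written to disk means an interrupted run retries the unfinished lecture next time instead of silently skipping it.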

Usage

Run the script:

python transcript_scraper.py

The program will:

Launch Chrome in headless mode
Scrape all lectures not previously processed
Save cleaned transcripts into the transcripts/ folder

Progress updates (e.g. “Found 14 videos”, “Saved transcript to transcripts/lecture-2.txt”) are printed to the console.

Output Example

File: transcripts/lecture-5.txt

00:12 Welcome back everyone.
00:30 Today we’re covering partial derivatives.
01:10 Recall from last lecture that...
...
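
A cleaning pass that produces output in this shape might look like the sketch below. It assumes the raw scraped text puts each timestamp on its own line, followed by one or more caption lines, with a few interface strings mixed in. That layout, the UI_ARTIFACTS entries, and the function name are illustrative assumptions and will differ by platform.

```python
import re

# Matches MM:SS or H:MM:SS on a line by itself.
TIMESTAMP = re.compile(r"^(?:\d{1,2}:)?\d{1,2}:\d{2}$")

# Hypothetical interface strings; replace with whatever your platform emits.
UI_ARTIFACTS = {"Transcript", "Search transcript", "Show more"}


def clean_transcript(raw_text: str) -> str:
    """Merge each timestamp with its caption lines into 'MM:SS text' rows."""
    entries = []  # each entry is [timestamp, list of caption fragments]
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or line in UI_ARTIFACTS:
            continue  # drop blank lines and interface residue
        if TIMESTAMP.match(line):
            entries.append([line, []])
        elif entries:
            entries[-1][1].append(line)
        # Text that appears before the first timestamp is dropped.
    return "\n".join(
        f"{stamp} {' '.join(parts)}" for stamp, parts in entries if parts
    )
```

Timestamps with no caption text are skipped, so the saved file contains only lines that carry spoken content.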

Files

| File | Description |
| --- | --- |
| `transcript_scraper.py` | Main script that extracts and saves transcripts |
| `scraped_titles.txt` | Tracks which lectures have already been processed |
| `transcripts/` | Directory where all `.txt` files are saved |
| `error_screenshot.png` | Screenshot automatically saved on errors |
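
For context on error_screenshot.png, here is one way a screenshot-on-error wrapper could be written. It relies on Selenium's `save_screenshot` method on the driver. The wrapper name `run_with_screenshot` and the idea of passing each scraping step as a callable are assumptions for illustration, not a description of how the script is structured.

```python
from typing import Any, Callable

SCREENSHOT_PATH = "error_screenshot.png"  # filename listed in the Files table


def run_with_screenshot(driver: Any, step: Callable[[], Any],
                        screenshot_path: str = SCREENSHOT_PATH) -> Any:
    """Run one scraping step; on failure, save a screenshot and re-raise."""
    try:
        return step()
    except Exception:
        try:
            # Selenium WebDriver method; writes a PNG of the current page.
            driver.save_screenshot(screenshot_path)
        except Exception as screenshot_error:
            # A failed screenshot should not hide the original error.
            print(f"Could not save screenshot: {screenshot_error}")
        raise
```

Re-raising after the screenshot keeps the original traceback visible in the console, so the image and the error message can be read together.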

Notes

If your lectures are restricted, you may need to log in manually. To do this, remove the --headless flag from the Chrome options, log in once in the visible browser window, then re-enable headless mode.
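
A hedged sketch of how the headless toggle could be made switchable is shown below. One caveat: Selenium normally starts Chrome with a fresh temporary profile, so a manual login typically carries over to later runs only if the same profile directory is reused, for example through Chrome's `--user-data-dir` flag. Whether the script already does this is not stated here; `build_chrome_args`, `make_driver`, and the `profile_dir` parameter are hypothetical names.

```python
from typing import List, Optional


def build_chrome_args(headless: bool = True,
                      profile_dir: Optional[str] = None) -> List[str]:
    """Return the Chrome command-line arguments for one run."""
    args = []
    if headless:
        args.append("--headless")
    if profile_dir:
        # Reusing one profile directory keeps cookies between runs.
        args.append(f"--user-data-dir={profile_dir}")
    return args


def make_driver(headless: bool = True, profile_dir: Optional[str] = None):
    """Create a Chrome driver using the arguments above."""
    # Imported lazily so build_chrome_args works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in build_chrome_args(headless, profile_dir):
        options.add_argument(arg)
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

With this shape, a first run could call `make_driver(headless=False, profile_dir="chrome-profile")` for the manual login, and later runs could pass `headless=True` with the same directory. The directory name is only an example.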

💡 Future Improvements

Automatic login with cookies or SSO tokens
Option to export .pdf or .md files
Support for concurrent video scraping
