minjunminji/mathlecturetranscriptscraper

Lecture Transcript Scraper

This project automates extracting and cleaning lecture transcripts from your university’s online video platform. It uses Selenium to navigate your course folder, scrape the transcript text from each recorded lecture, and save each transcript as a tidy .txt file, ready to upload into NotebookLM or other AI-assisted study tools.

Features

  • Automatic transcript scraping – Collects transcripts from all videos in a course folder
  • Smart deduplication – Skips previously scraped lectures using a local record file
  • Clean formatting – Removes interface artifacts and organizes timestamps for clarity
  • Organized storage – Saves each transcript as a readable .txt file in transcripts/
  • Error handling – Captures screenshots when issues occur for easy debugging
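
As an illustration of the "Clean formatting" feature, here is a minimal sketch of how a cleaning step might look. It is not the code in transcript_scraper.py. It assumes the raw text puts each timestamp on its own line, followed by the caption text, and that interface artifacts are fixed strings. The names clean_transcript, ARTIFACTS, and TIMESTAMP, and the artifact strings themselves, are hypothetical.

```python
import re

# Hypothetical interface strings; the real platform's artifacts may differ.
ARTIFACTS = {"Transcript", "Search transcript", "Show more"}

# Matches MM:SS or H:MM:SS on a line by itself.
TIMESTAMP = re.compile(r"^(?:\d{1,2}:)?\d{1,2}:\d{2}$")


def clean_transcript(raw: str) -> str:
    """Drop artifact lines and put each timestamp on the same line as its caption."""
    out = []
    current_ts = None
    for line in raw.splitlines():
        line = line.strip()
        if not line or line in ARTIFACTS:
            continue  # skip blanks and interface residue
        if TIMESTAMP.match(line):
            current_ts = line  # remember it until the caption text arrives
            continue
        if current_ts is not None:
            out.append(f"{current_ts} {line}")
            current_ts = None
        elif out:
            # Caption continues without a new timestamp: extend the previous entry.
            out[-1] += " " + line
        else:
            out.append(line)
    return "\n".join(out)
```

The result follows the "MM:SS text" layout shown under Output Example. A real page will likely need its own artifact list and timestamp pattern.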

Motivation

NotebookLM works best when provided with high-quality text context. Some courses, especially math or theory-heavy ones, offer only recorded lectures and no slides or written notes. This script bridges that gap by turning lecture recordings into structured text, giving you searchable and summarizable course material.

Setup

1. Clone this repository

git clone https://github.com/<your-username>/lecture-transcript-scraper.git
cd lecture-transcript-scraper

2. Install dependencies

Make sure you have Python 3.9+ and run:

pip install selenium webdriver-manager

3. Configure your course URL

Open the script and replace:

COURSE_FOLDER_URL = "https://example.com/your-course-folder"

with your own course folder URL that lists all recorded videos.


How It Works

  1. Opens the course folder in a headless Chrome browser.
  2. Collects all available videos and extracts metadata (titles, links, IDs).
  3. Opens each new video, scrapes the transcript text, cleans it, and saves it to a .txt file under:
    transcripts/
      ├─ lecture-1.txt
      ├─ lecture-2.txt
      └─ ...
    
  4. Logs scraped videos in scraped_titles.txt so future runs only fetch new ones.
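
The record-keeping in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the script's actual code. It assumes scraped_titles.txt holds one video title per line and that each video is a dict with a "title" key. The function names load_scraped, select_new, and mark_scraped are hypothetical.

```python
from pathlib import Path


def load_scraped(record: Path) -> set[str]:
    """Return the titles already recorded, or an empty set on the first run."""
    if not record.exists():
        return set()
    lines = record.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def select_new(videos: list[dict], scraped: set[str]) -> list[dict]:
    """Keep only the videos whose title has not been recorded yet."""
    return [video for video in videos if video["title"] not in scraped]


def mark_scraped(record: Path, title: str) -> None:
    """Append a title to the record file after its transcript is saved."""
    with record.open("a", encoding="utf-8") as fh:
        fh.write(title + "\n")
```

Appending to the record only after a transcript is written means a run that fails partway retries the unfinished lecture next time. Keying on the title assumes titles are unique within the course folder; a video ID would be a sturdier key if the platform exposes one.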

Usage

Run the script:

python transcript_scraper.py

The program will:

  • Launch Chrome in headless mode
  • Scrape all lectures not previously processed
  • Save cleaned transcripts into the transcripts/ folder

Progress updates (e.g. “Found 14 videos”, “Saved transcript to transcripts/lecture-2.txt”) are printed to the console.


Output Example

File: transcripts/lecture-5.txt

00:12 Welcome back everyone.
00:30 Today we’re covering partial derivatives.
01:10 Recall from last lecture that...
...

Files

File                     Description
transcript_scraper.py    Main script that extracts and saves transcripts
scraped_titles.txt       Tracks which lectures have already been processed
transcripts/             Directory where all .txt files are saved
error_screenshot.png     Screenshot automatically saved on errors
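
For context on error_screenshot.png, here is a minimal sketch of how a screenshot-on-error wrapper could look. It assumes the driver object exposes Selenium's save_screenshot(path) method. The helper name run_with_screenshot is hypothetical and may not match how the script is organized.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_screenshot(
    driver,
    action: Callable[[], T],
    path: str = "error_screenshot.png",
) -> T:
    """Run action(); on any exception, save a screenshot and re-raise."""
    try:
        return action()
    except Exception:
        # Capture the page state at the moment of failure for debugging.
        driver.save_screenshot(path)
        raise
```

Re-raising after the screenshot keeps the original traceback visible in the console, so the image supplements the error message rather than replacing it.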

Notes

  • You may need to log in manually if your lectures are restricted. To do this, remove --headless from the Chrome options, log in once, then re-enable headless mode.
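
A login done in a visible browser carries over to later headless runs only if both runs share the same Chrome profile. Whether the script already does this is not stated here, so the sketch below is an assumption rather than a description of it. It builds the Chrome arguments for either mode. The function name chrome_arguments and the profile directory chrome-profile are hypothetical; --headless and --user-data-dir are standard Chrome flags.

```python
def chrome_arguments(headless: bool, profile_dir: str) -> list[str]:
    """Build Chrome flags that reuse one profile so login cookies persist."""
    args = [f"--user-data-dir={profile_dir}"]
    if headless:
        args.append("--headless")
    return args


# Applying the flags with Selenium (shown as comments, not executed here):
#   from selenium import webdriver
#   options = webdriver.ChromeOptions()
#   for arg in chrome_arguments(headless=False, profile_dir="chrome-profile"):
#       options.add_argument(arg)
#   driver = webdriver.Chrome(options=options)
```

Run once with headless=False to log in, then switch to headless=True with the same profile_dir for unattended runs.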

💡 Future Improvements

  • Automatic login with cookies or SSO tokens
  • Option to export .pdf or .md files
  • Support for concurrent video scraping
