Create a vector search from YouTube audio transcripts #289

@Gautam-Rajeev

Description

Parse all the videos from a YouTube channel or playlist, extract transcripts from their audio, and embed them in a vector DB to enable search and retrieval over the content.

Implementation Details

The work includes the following steps:

  • Receive the channel or playlist link from the user
  • Scrape the audio from all the videos in the channel or playlist
  • Extract the transcript, with timestamps, from each video
  • Split each transcript into chunks (basic chunks such as 4 minutes of audio are fine, or use a fancier chunking algorithm)
  • Summarise each video with an LLM call and store the summary as a separate chunk
  • Embed the chunks in a vector DB using ColBERT (RAGatouille - LangChain). Use this for reference
  • Enable ColBERT search and retrieval over the content embeddings
  • When a question is searched, return the related content along with the YouTube link and the timestamps for the relevant content
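
The chunking and timestamp-link steps above could be sketched roughly as follows. This is a minimal illustration, not the required design: the `Chunk` class, `chunk_segments`, `timestamp_link`, and the `(start, end, text)` segment shape are all assumptions made for the example, and the 240-second window is just the "4 mins of audio" suggestion from the list.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    video_id: str   # hypothetical field: the YouTube video id
    start: float    # chunk start, in seconds
    end: float      # chunk end, in seconds
    text: str


def chunk_segments(video_id, segments, window_seconds=240.0):
    """Group (start, end, text) transcript segments into time windows.

    A window is closed as soon as adding the next segment would make it
    longer than `window_seconds`, so segments are never split.
    """
    chunks = []
    current = []
    window_start = None

    def flush():
        text = " ".join(seg[2] for seg in current)
        chunks.append(Chunk(video_id, window_start, current[-1][1], text))

    for start, end, text in segments:
        if window_start is None:
            window_start = start
        if current and end - window_start > window_seconds:
            flush()
            current = []
            window_start = start
        current.append((start, end, text))

    if current:
        flush()
    return chunks


def timestamp_link(video_id, start_seconds):
    """Build a watch URL that starts playback at the chunk's start time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_seconds)}s"
```

Storing `video_id` and `start` alongside each chunk's text as metadata in the vector DB means a search hit can be turned back into a link with `timestamp_link(chunk.video_id, chunk.start)`.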

Can use https://github.com/ytdl-org/youtube-dl for scraping
Can use https://www.youtube.com/@3blue1brown as initial test set for the above
The ticket for using ColBERT is covered here; you only need to make it work locally using the notebook.
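
As a starting point for the scraping step, listing the videos behind a playlist link with youtube-dl might look like the sketch below. The helper names `entries_to_watch_urls` and `list_video_urls` are made up for illustration. It assumes the `youtube_dl` Python package is installed and that flat playlist entries carry an `id` field; channel links may resolve to nested playlists and behave differently across versions, so treat this as untested against the 3blue1brown channel.

```python
def entries_to_watch_urls(entries):
    """Turn flat playlist entries into watch URLs, skipping entries without an id."""
    return [
        f"https://www.youtube.com/watch?v={entry['id']}"
        for entry in entries
        if entry and entry.get("id")
    ]


def list_video_urls(playlist_url):
    """List the video URLs in a playlist without downloading any media."""
    # Imported lazily so the helper above can be used without youtube-dl installed.
    import youtube_dl

    options = {
        "quiet": True,
        # Only list playlist entries; do not resolve each video individually.
        "extract_flat": "in_playlist",
    }
    with youtube_dl.YoutubeDL(options) as ydl:
        info = ydl.extract_info(playlist_url, download=False)
    return entries_to_watch_urls(info.get("entries") or [])
```

Each returned URL can then be passed to a separate audio download and transcription step.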

Product Name

AI Tools

Organization Name

SamagraX

Domain

NA

Tech Skills Needed

PyTorch/Python, ML

Category

Feature

Mentor(s)

@GautamR-Samagra

Complexity

Medium
