Create a vector search from YouTube audio transcripts

## Description
Parse all the videos in a YouTube channel or playlist, extract transcripts from their audio, and embed them in a vector DB to enable search and retrieval over the content.

## Implementation Details 
The implementation includes the following steps:
- Receive the channel link or playlist link from the user
- Scrape the audio from all the videos in the channel/playlist
- Extract the transcript, along with timestamps, from all the videos
- Create chunks from the transcript (basic chunks such as 4 minutes of audio are fine, or use a fancier chunking algorithm)
- Summarise each video using an LLM call and store the summary as a separate chunk
- Embed the chunks in a vector DB using ColBERT (RAGatouille with LangChain). Use [this](https://colab.research.google.com/drive/1mVApUZJaSdTFBg6Ii9NBQHSrgbON1jvR?usp=sharing) notebook for reference
- Enable ColBERT search and retrieval on the content embeddings
- When a question is searched, return the related content along with the YouTube link and the timestamps of the relevant content
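
The chunking and timestamped-link steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the transcript is already available as a list of timestamped segments; the `Segment` and `Chunk` classes and the `chunk_segments` helper are hypothetical names, not part of any existing library or of the reference notebook.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """One timestamped transcript line (hypothetical shape, an assumption)."""
    start: float  # seconds from the start of the video
    end: float
    text: str


@dataclass
class Chunk:
    """A group of consecutive segments that is embedded as one unit."""
    video_id: str
    start: float
    end: float
    text: str

    @property
    def url(self) -> str:
        # YouTube watch URLs accept a start offset in whole seconds via `t`.
        return f"https://www.youtube.com/watch?v={self.video_id}&t={int(self.start)}s"


def _make_chunk(video_id: str, segs: List[Segment]) -> Chunk:
    text = " ".join(s.text.strip() for s in segs)
    return Chunk(video_id, segs[0].start, segs[-1].end, text)


def chunk_segments(
    video_id: str,
    segments: Iterable[Segment],
    window_seconds: float = 240.0,  # the "4 mins of audio" basic chunk
) -> List[Chunk]:
    """Group consecutive segments into windows of at most `window_seconds`."""
    chunks: List[Chunk] = []
    current: List[Segment] = []
    for seg in segments:
        # Close the current chunk if adding this segment would overflow the window.
        if current and seg.end - current[0].start > window_seconds:
            chunks.append(_make_chunk(video_id, current))
            current = []
        current.append(seg)
    if current:
        chunks.append(_make_chunk(video_id, current))
    return chunks
```

Each `Chunk` keeps its `start` and `end` times, so a search hit can be returned together with `chunk.url`, which opens the video at the relevant timestamp. The per-video LLM summary could be stored as one more chunk with `start=0`.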

You can use https://github.com/ytdl-org/youtube-dl for scraping.
You can use https://www.youtube.com/@3blue1brown as the initial test set for the above.
The ticket for using ColBERT is covered [here](https://github.com/Samagra-Development/ai-tools/issues/288); for this issue you only need to make it work locally using the notebook.
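
As a rough starting point for the scraping step, youtube-dl can be driven from Python through its `YoutubeDL` class. The sketch below is an assumption about how the download could be wired up, not a prescribed implementation: the helper names `build_audio_options` and `download_audio` and the `audio` output directory are hypothetical, and the MP3 post-processing step assumes FFmpeg is installed locally.

```python
from pathlib import Path
from typing import Dict, List


def build_audio_options(output_dir: str) -> Dict[str, object]:
    """Options asking youtube-dl for the best audio stream, saved as <video id>.mp3."""
    return {
        "format": "bestaudio/best",
        # Name files by video id so chunks can be linked back to the video later.
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        # Skip unavailable or private videos instead of aborting the whole run.
        "ignoreerrors": True,
        # Convert the downloaded stream to MP3 (requires FFmpeg).
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }


def download_audio(url: str, output_dir: str = "audio") -> List[str]:
    """Download audio for a video or playlist URL and return the video ids."""
    import youtube_dl  # imported lazily so the options helper works without it

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with youtube_dl.YoutubeDL(build_audio_options(output_dir)) as ydl:
        info = ydl.extract_info(url, download=True)
    if info is None:  # extraction failed and errors were ignored
        return []
    # A playlist result carries its videos under "entries"; a single video does not.
    entries = info.get("entries") or [info]
    return [entry["id"] for entry in entries if entry]
```

A call such as `download_audio("<playlist URL>")` would then leave one audio file per video in `audio/`. Channel URLs may resolve to nested playlists, so the flat handling of `entries` here may need adjusting for the test channel.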



## Product Name
AI Tools

## Organization Name
SamagraX

## Domain
NA

## Tech Skills Needed
PyTorch, Python, ML

## Category
Feature

## Mentor(s)
@GautamR-Samagra 

## Complexity
Medium






