Warning
This codebase was specifically developed to index the "Bref" YouTube playlist. While it can theoretically work for any YouTube playlist with some modifications, those adaptations haven't been implemented yet. The code's structure, logic, and various elements could be improved, but the primary goal was to create a functional system quickly. This documentation describes the current implementation while noting potential improvements.
This project creates a searchable index of YouTube video content by extracting subtitles, generating screenshots and video clips at specific timestamps, and making everything searchable through Algolia. The end result allows users to search for specific dialogue and instantly see the relevant video moment.
The project uses a specific file storage structure that's important to understand:
- Temporary Files: The project expects a `./tmp` directory containing all downloaded MP4 videos (`./tmp/mp4`) and MP3 audio files (`./tmp/mp3`). Those files are not committed to the repository; you will have to fetch them yourself.
- Media Asset Repository: All generated images and video clips are stored in a separate repository (pixelastic/brefsearch-images) to avoid bloating the main codebase. This repository is expected to be stored one level higher (i.e. `./brefsearch` and `./brefsearch-images` side by side in the same directory).
- Media Hosting: The media files are synced to a personal server using `rsync`. Credentials for this server are personal and not shared in that repository either. The URLs in the JSON records point to this server. Cloudinary functions as a CDN, retrieving assets from this origin server.
This separation of code and data repositories is intentional for maintainability, but means that you will probably need to do some plumbing to replicate the structure and establish your own media storage solution when adapting this system.
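To make that structure concrete, the expected layout looks roughly like this (the directory names are the ones mentioned throughout this document; adapt them to your own setup):

```
.
├── brefsearch/             # this repository
│   ├── data/
│   │   ├── counts/         # raw metadata dumps
│   │   ├── episodes/       # one JSON file per episode
│   │   ├── records/        # final Algolia records
│   │   └── vtts/           # proofread subtitles
│   └── tmp/
│       ├── mp3/            # extracted audio (not committed)
│       └── mp4/            # downloaded videos (not committed)
└── brefsearch-images/      # generated thumbnails and animated previews
    ├── animated/
    └── thumbnails/
```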
Each step in the process generates files on disk that serve as input for the next step. This pipeline architecture has several important characteristics:
- Restart at Any Step: You can restart the process at any step without having to run the entire pipeline from the beginning. Each step reads from the disk and outputs to the disk.
- Step Dependencies: While steps are relatively independent, they are still linked in that the output of one step becomes the input of the next.
- Potential Data Loss: Running a single step might overwrite or remove certain information from the record files. Therefore, it's often necessary to run subsequent steps to restore or re-add this information correctly.
- JSON as Source of Truth: The final output consists of JSON files on disk that represent each Algolia record, stored in `./data/records`. These files are the source of truth and are directly pushed to Algolia.
This design allows for flexibility when making changes or corrections at specific stages without reprocessing everything, but requires understanding the dependencies between steps to maintain data integrity.
Understanding the final record structure helps visualize what we're trying to create:
```json
{
"episode": {
"commentCount": 861,
"durationHuman": "2:52",
"durationInSeconds": 172,
"index": 75,
"isAgeRestricted": false,
"likeCount": 89384,
"name": "Bref. J'ai tout cassé.",
"season": 1,
"slug": "brefJaiToutCasse",
"videoId": "9u9X-FzVWZ0",
"viewCount": 7053468
},
"line": {
"content": "Et puis, j'ai eu un point de côté.",
"end": 81,
"heatBucket": 1,
"index": 30,
"start": 77,
"url": "https://www.youtube.com/watch?v=9u9X-FzVWZ0&t=77s"
},
"thumbnail": {
"animatedUrl": "https://assets.pixelastic.com/brefsearch/animated/S01E75_brefJaiToutCasse/077.mp4",
"hash": "d3628d89c6",
"height": 1080,
"lqip": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAJCAIAAAC0SDtlAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAGXRFWHRDb21tZW50AG9yb3NoaV9jb21wcmVzc2VkJoea2wAAAbdJREFUeJwFwW1TmgAAAGC+by5p2VLXOc4imIgoiLyIokA4h4jKiUvHwjNzOk+zW+ZbtvRW81rdvu/7/uaeBzgpFgadtnNSK6nyJ/2jY5m1Qt7MKYaU7p7aty3rey72NUvUecxUREMWAf1YNrV8taiX8rn+efuy3xv2ev1u12nU66lYnUFuDHai0W0RP7MbzWYLKKuSkVNMo5CXRNuqtG2rVlQvRpfr5cKO+Ll3u1+S2FU+PlCI9d3t6vEZUClMZaISR5FIkMMRPobRYXh8PX74tdGpUCLwhgj4MmjApOBvZ87y7geQCCFpisgwMTYa1sSkzNNcFBsN+pvfT5PZ/P2+D97zRAI+1Os5hA5m02sggaFpimCIkEDiusTXCkpOoDtOY/Nwv1zMDvxeaHdn3+N5u+0WyOhsMgaKWV5mSAZHxThRPk4+rafPP6etqraaT1eL+REE7YEguPU6FfReVD+MhkPALmkqT7NhhCcwTUz8ub/59/dx0rGueudlkYV8/h339isXGIe8FYFsfm4AapIlUBg/CmKHUDYR0TJcSebLWcaxKqfVipJObb1wuV66vSAowH5DSv0HVkF6hOQyHA8AAAAASUVORK5CYII=",
"url": "https://assets.pixelastic.com/brefsearch/thumbnails/S01E75_brefJaiToutCasse/077.png",
"width": 1920
}
}
```
Each record represents a specific moment in a video where a line or paragraph is spoken, along with metadata about the video and media assets for the search result display.
- Node.js (v18+)
- yt-dlp
- FFmpeg
- An Algolia account
- A speech-to-text service (like HappyScribe or Whisper)
- A server for hosting media files
- A Vercel account for deployment
I used `yt-dlp` to download all videos of the playlist into `./tmp/mp4`. I think it was something like `yt-dlp "{playlistId}"`.
Tip
You'll want to have good video quality, as we'll use those files to extract the screenshots and previews. Sound quality doesn't matter here.
Tip
Some videos are age restricted, so you might need to pass `--cookies-from-browser firefox` to use your Firefox cookies to bypass the restriction and download them.
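For reference, a plausible version of that download command could look like this (the format selection, output template and playlist URL below are illustrative, not the exact invocation I ran):

```bash
# Download every video of the playlist as high-quality MP4 files into ./tmp/mp4.
# Rename the resulting files afterwards to match the expected S01E75_brefJaiToutCasse naming.
yt-dlp \
  --format "bestvideo[ext=mp4]+bestaudio[ext=m4a]/mp4" \
  --cookies-from-browser firefox \
  --output "./tmp/mp4/%(playlist_index)s_%(title)s.%(ext)s" \
  "https://www.youtube.com/playlist?list={playlistId}"
```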
Caution
There is no script to do that job for you (sorry!). I ran all those commands manually when I built the project, so you'll have to do some trial and error to replicate it. At the end of the day, the other scripts expect files to be named something like `S01E75_brefJaiToutCasse.mp3`.
I used `yt-dlp` with the `--extract-audio` flag to download only the audio from the videos.
Tip
You'll probably only need a mono 16kHz file for the audio, as this is what Whisper accepts as input.
Caution
Still no script to do that for you. Sorry again.
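Here too, a rough sketch of the kind of commands involved (file names and flags are illustrative):

```bash
# Extract the audio of each playlist video as MP3 into ./tmp/mp3.
yt-dlp \
  --extract-audio \
  --audio-format mp3 \
  --output "./tmp/mp3/%(playlist_index)s_%(title)s.%(ext)s" \
  "https://www.youtube.com/playlist?list={playlistId}"

# Optionally downsample one file to mono 16kHz, which is what Whisper expects.
ffmpeg -i ./tmp/mp3/S01E75_brefJaiToutCasse.mp3 -ac 1 -ar 16000 ./tmp/mp3/S01E75_brefJaiToutCasse.16k.mp3
```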
Here, I needed to transform the `.mp3` files I had into `.vtt` (subtitle) files. I used https://www.happyscribe.com/ manually to do that: I uploaded each `.mp3` file to their UI and downloaded the resulting `.vtt` file.
I then proofread them all (re-watching the whole TV show locally with the subtitles added). I would say HappyScribe did a 90% good job, but I still had to fix some proper nouns and passages where multiple people were speaking at the same time.
`.vtt` files are stored and committed in `./data/vtts`, following the same pattern as the video/audio files. If you need to do proofreading, this is where you'll need to edit the content.
Note
You can probably automate that with a call to the OpenAI Whisper API or HappyScribe if you're looking to build an automated pipeline.
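If you go the automated route, a minimal sketch with the OpenAI Whisper API could look like this (this is not part of the repo, just an illustration; the file names are examples):

```bash
# Transcribe one MP3 and save the result directly as a VTT file.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file="@./tmp/mp3/S01E75_brefJaiToutCasse.mp3" \
  -F model="whisper-1" \
  -F response_format="vtt" \
  -o ./data/vtts/S01E75_brefJaiToutCasse.vtt
```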
Tip
YouTube generates automatic captions for its videos, which you can get with `yt-dlp --write-auto-subs`, but the quality on French content was horrible, so I couldn't use them. Maybe if your content is well spoken in clear English it could work for you.
Here, I also did some manual work. I called `yt-dlp --dump-json` on each video of the playlist, extracted the name and duration from each video, and created an entry in `./data/episodes` for each one.
Each entry looked something like this:
```json
{
  "duration": {
    "human": "1:48",
    "inSeconds": 108
  },
  "episode": {
    "index": 2,
    "name": "Bref. Je remets tout à demain.",
    "season": 1,
    "slug": "brefJeRemetsToutADemain"
  },
  "lines": []
}
```
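If you want to script that part, the raw values can be pulled with something like this (a hypothetical helper, not a script from this repo):

```bash
# Print the name and duration of a single video; the jq filter mirrors the episode file structure.
yt-dlp --dump-json "https://www.youtube.com/watch?v=9u9X-FzVWZ0" \
  | jq '{name: .title, durationInSeconds: .duration}'
```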
Here, we are starting to get into script territory. If you run `yarn run update-episode-lines`, it will read all `.vtt` files and update the `.lines` array of the episode files we created earlier.
Each element of the array will contain the `.content` (what is being said) and the `.start` (when it is being said).
Note
There is also some data sanitization in place in that step. Basically, you don't want lines to be cut in the middle of a sentence, so the script merges several subtitles together until they form a complete sentence, while keeping an optimal width.
Note
I also saved the `.end` value in there, but I don't really use it anywhere.
Running the `yarn run update-episode-count` script will fetch all metadata for all videos and store it as huge JSON files in `./data/counts`.
Those JSON files contain a lot of information, but only a few keys are important.

- `.view_count`, `.comment_count` and `.like_count` can give popularity metrics about which videos are the most popular. I only used `.view_count` in my implementation.
- `.heatmap` contains an array representation of exactly 100 segments of equal length of the video, attributing a score (`.value`) between 0 and 1 to each. A high value means that segment has been watched a lot; a low value means it hasn't been watched often.
I will be using both metrics later in the ranking formula of Algolia.
Note
Raw values of the heatmap have too much granularity and are not very useful in an Algolia ranking formula, so I grouped the segments into 5 "buckets" of equal length, basically turning this value into a 5-star rating system for each segment. This makes the ranking way more useful.
Now that I think of it, I should have done the same thing with the `.view_count` ¯\_(ツ)_/¯
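The exact bucketing lives in the scripts, but conceptually it boils down to mapping each 0–1 value to a 1–5 bucket, something like this (illustrative `jq`, not the actual implementation; the file name is an assumption):

```bash
# Turn the 100 heatmap values of one metadata dump into 1-5 buckets.
jq '[.heatmap[].value | (. * 5 | floor) + 1 | if . > 5 then 5 else . end]' \
  ./data/counts/S01E75_brefJaiToutCasse.json
```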
Thumbnails are the static `.png` images displayed alongside a search result. You can generate them by running the `yarn run update-episode-thumbnails` command.
This will use `ffmpeg` to extract a screenshot at the exact timestamp defined in the `./data/episodes` JSON files, for each line. Files will be stored in another repository (expected to be one level higher in the hierarchy, and named `brefsearch-images`).
Warning
Be aware that it will generate many heavy images. I personally ran an optional compression script on them before committing them to the repo. I chose to extract high-quality versions of the images, even if Cloudinary will resize and compress them later, to have a baseline with the highest possible quality, in case I want to make Cloudinary less aggressive in its compression later.
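Under the hood this is essentially a per-line `ffmpeg` frame grab; a rough sketch of the kind of command involved (timestamp and paths are illustrative, not the script's exact settings):

```bash
# Grab a single frame at 77s and save it as the thumbnail for that line.
ffmpeg -ss 77 -i ./tmp/mp4/S01E75_brefJaiToutCasse.mp4 \
  -frames:v 1 \
  ../brefsearch-images/thumbnails/S01E75_brefJaiToutCasse/077.png
```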
Previews are the animated 2s looping videos that start playing when you hover over a result. You can generate them with `yarn run update-episode-animated`.

This is very similar to the thumbnail generation, except that it will generate videos. The output files will already be compressed this time, so they shouldn't take up too much space, but it might require more processing power to run.
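The underlying `ffmpeg` invocation is along these lines (the clip length, encoding settings and paths are assumptions, not the script's exact values):

```bash
# Cut a short, muted, already-compressed MP4 clip starting at the line's timestamp.
ffmpeg -ss 77 -i ./tmp/mp4/S01E75_brefJaiToutCasse.mp4 \
  -t 2 -an \
  -c:v libx264 -crf 28 -preset veryfast -movflags +faststart \
  ../brefsearch-images/animated/S01E75_brefJaiToutCasse/077.mp4
```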
Tip
In an earlier version, I had tried to generate `.gif` files, but the quality was horrible and the file size was even worse. Stick to `.mp4` videos; they work well.
Running `yarn run update-episode-records` will stitch all our previous steps together and output JSON files in `./data/records`. One file will be generated per record, which gives you a good view of what will end up in your Algolia index.
Each record is made of three main keys:
`.episode` contains metadata shared across all subtitles of a given video. Of interest here are the `.durationHuman`, `.index`, `.name`, `.slug`, `.videoId` and `.viewCount` keys that I re-use in the display somehow.
Tip
Other keys like `.durationInSeconds`, `.isAgeRestricted` or `.likeCount` are not really used. I should remove them.
`.line` contains info about the subtitle at that exact moment. `.content` is what is being said. `.start` is when it is being said. `.heatBucket` is the "rating" of that specific line in the whole video.
Tip
`.end` is never used. `.index` indicates that this is the nth subtitle of that video. `.url` is redundant: we could craft it from the other keys in the record rather than storing it.
`.thumbnail` contains info about the static thumbnail and animated preview. Of importance are the `.url` (for the thumbnail) and `.animatedUrl` (for the animated preview). They point to one of my own servers, which is used as the origin server for Cloudinary.
Tip
`.lqip` also contains a very important piece of info: a base64-encoded string of a blurry version of the final image. This can be used to display a Low Quality Image Placeholder of the search result as soon as the record is returned by Algolia, without waiting for the full image to load. `.hash`, `.width` and `.height` are here to help you with cache busting or CSS.
This part will deploy all the static thumbnails and animated previews to a private server of mine, using `rsync`. You will have to replace that step with another way to deploy the static assets and modify the `thumbnail.url` and `thumbnail.animatedUrl` accordingly in the records.

You can try running `yarn run deploy-data-assets`, but it will not work for you.
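If you replicate this step, it boils down to an `rsync` of the images repository to whatever server you use as the origin; for example (host and remote path are placeholders):

```bash
# Sync the generated assets to the origin server used by Cloudinary.
rsync -avz --delete \
  ../brefsearch-images/ \
  user@your-server.example.com:/var/www/assets/brefsearch/
```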
Run `yarn run deploy-data-algolia` to push data to your Algolia index and configure it accordingly. The most notable configuration is the use of `distinct` and `attributeForDistinct` set to `episode.videoId`. This instructs Algolia to only return one match per episode.
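The yarn script applies that configuration for you, but for illustration, the equivalent raw Algolia API call would be (replace `{appId}` and `{indexName}` with your own values):

```bash
# Enable deduplication so that only one matching line is returned per episode.
curl -X PUT "https://{appId}.algolia.net/1/indexes/{indexName}/settings" \
  -H "X-Algolia-Application-Id: {appId}" \
  -H "X-Algolia-API-Key: $ALGOLIA_ADMIN_API_KEY" \
  -d '{"distinct": true, "attributeForDistinct": "episode.videoId"}'
```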
We are technically using two indices here. The main one is used when we first display the website, and will order results by `episode.index` and `line.index` (basically rendering episodes in the order in which they were released, with their first sentence displayed).

But the moment you start typing something in the search bar, a trick on the front-end swaps to a secondary replica (called `popularity`) that ranks episodes based on the number of views per episode, and surfaces the most popular sentence (matching the query) inside each episode.
The indexing uses an atomic strategy, meaning that each update will only add/delete/update the records that actually changed, rather than deleting everything and re-adding them.
Warning
You need to have your own `appId`, `indexName` and `adminApiKey` configured in `./modules/lib/config.js` for this to work.
Go into `./modules/website` and run `yarn run dev` to load the website locally, or `yarn run build` to build it.
- @lukyvj for the front-end
- @sarahdayan for the subtitle best practices
- @fluf22 for the video compression tips