Commit d5f642d

docs: explain HLS segment interception and on-demand loading challenges
- Add new 'On-Demand Segment Loading' section to homepage
- Expand 'Encrypted HLS Streams' explanation with concrete examples
- Add detailed JSDoc to extractLearningSuitePostContent explaining the strategy
- Document why seeking through the video timeline is necessary
1 parent 822db9a commit d5f642d

2 files changed: +43 -3 lines changed

docs/index.html

Lines changed: 19 additions & 2 deletions
@@ -493,10 +493,27 @@ <h2 class="text-3xl sm:text-4xl font-bold text-center mb-4">Under the Hood</h2>
 </div>
 <div>
 <h3 class="text-lg font-semibold mb-2">Encrypted HLS Streams</h3>
-<p class="text-slate-400 text-sm mb-3">Some platforms serve encrypted HLS playlists that are decrypted client-side. Standard downloaders fail because the playlist data is gibberish.</p>
+<p class="text-slate-400 text-sm mb-3">Some platforms serve encrypted HLS playlists that are decrypted client-side. The API returns scrambled data like <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">77a393e51f4b...</code> instead of standard <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">#EXTM3U</code> playlists. Standard downloaders and even ffmpeg can't parse these.</p>
 <div class="bg-slate-900/50 rounded-lg p-4 border border-slate-700/30">
 <p class="text-slate-500 text-xs mb-2 font-mono">Our approach:</p>
-<p class="text-slate-300 text-sm">We intercept the actual <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">.ts</code> video segments as the browser plays them, capture their individual auth tokens, download each segment, then concatenate them with <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">ffmpeg</code>.</p>
+<p class="text-slate-300 text-sm">We let the browser decrypt the playlist, then intercept the actual <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">.ts</code> video segments as they're requested. Each segment has its own auth token that expires quickly. We capture all tokens, download each segment individually, then concatenate them with <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">ffmpeg -f concat</code>.</p>
+</div>
+</div>
+</div>
+</div>
+
+<!-- Challenge 1b: On-Demand Segment Loading -->
+<div class="bg-slate-800/20 rounded-2xl p-6 border border-slate-700/30">
+<div class="flex items-start gap-4">
+<div class="w-10 h-10 bg-pink-500/20 rounded-lg flex items-center justify-center flex-shrink-0 mt-1">
+<img src="icon-play-circle.svg" alt="" class="w-5 h-5" style="filter: invert(70%) sepia(80%) saturate(500%) hue-rotate(280deg);">
+</div>
+<div>
+<h3 class="text-lg font-semibold mb-2">On-Demand Segment Loading</h3>
+<p class="text-slate-400 text-sm mb-3">HLS players don't load all segments upfront—they fetch them on-demand as you watch. If you only watch the first 2 minutes, you only get those segments. A 30-minute video might download as a 2-minute clip.</p>
+<div class="bg-slate-900/50 rounded-lg p-4 border border-slate-700/30">
+<p class="text-slate-500 text-xs mb-2 font-mono">Our approach:</p>
+<p class="text-slate-300 text-sm">We programmatically seek through the entire video timeline, triggering the player to request every segment. For a 5-minute video, we seek to positions every ~12 seconds, ensuring all 75+ segments are requested and captured. The full video downloads correctly every time.</p>
 </div>
 </div>
 </div>
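
Reviewer note: the seek schedule described in the new homepage copy can be sketched as below. This is a minimal illustration, not code from this commit. The function names are hypothetical, and the 4-second segment length is an assumption inferred from the copy's own figures (about 75 segments in a 5-minute video).

```typescript
// Hypothetical helpers illustrating the seek schedule; not part of this commit.

/** Positions (in seconds) to seek to, starting at 0 and advancing by stepSec. */
function seekPositions(durationSec: number, stepSec: number): number[] {
  if (!(durationSec > 0) || !(stepSec > 0)) return [];
  const positions: number[] = [];
  for (let t = 0; t < durationSec; t += stepSec) {
    positions.push(t);
  }
  return positions;
}

/** Number of segments a video should yield, assuming fixed-length segments. */
function expectedSegmentCount(durationSec: number, segmentSec: number): number {
  return Math.ceil(durationSec / segmentSec);
}

// A 5-minute video sought every 12 seconds gives 25 seek positions.
const positions = seekPositions(300, 12);
// Assumed 4-second segments: 300 / 4 = 75 segments to capture.
const expected = expectedSegmentCount(300, 4);
console.log(positions.length, expected);
```

In a browser context each position would be assigned to the video element's `currentTime`, with a short pause between seeks so the player has time to issue its requests. With these numbers each seek has to pull in roughly three segments, so the approach depends on the player buffering at least one seek step ahead. Comparing the number of captured segments against `expected` is one way to detect a partial capture.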

src/scraper/learningsuite/extractor.ts

Lines changed: 24 additions & 1 deletion
@@ -373,6 +373,28 @@ export async function extractAttachmentsFromPage(
 
 /**
 * Extracts complete lesson content using DOM-based extraction with network interception.
+*
+* ## Video Download Strategy for LearningSuite (Bunny CDN)
+*
+* LearningSuite uses encrypted HLS playlists that cannot be downloaded directly:
+*
+* 1. **Encrypted Playlists**: The API returns encrypted data (e.g., `77a393e51f4b...`)
+* instead of standard `#EXTM3U` playlists. JavaScript decrypts them client-side.
+*
+* 2. **Per-Segment Tokens**: Each `.ts` segment has a unique, short-lived token.
+* These tokens are generated when the browser requests each segment.
+*
+* 3. **On-Demand Loading**: HLS players load segments on-demand during playback.
+* Simply loading the page only captures the first ~2 minutes of video.
+*
+* ## Our Solution
+*
+* - Intercept network requests to capture segment URLs with their tokens
+* - Programmatically seek through the entire video timeline
+* - This triggers the player to request ALL segments
+* - Download each segment individually with its token
+* - Concatenate segments using `ffmpeg -f concat`
+*
 * Note: LearningSuite uses persisted GraphQL queries, so we can't make arbitrary API calls.
 */
 export async function extractLearningSuitePostContent(
@@ -385,7 +407,8 @@ export async function extractLearningSuitePostContent(
 // Set up request interception to capture HLS video URLs
 const hlsUrls: string[] = [];
 
-// Handler for requests - capture segment URLs with tokens
+// Capture segment URLs with their individual auth tokens
+// Each .ts segment has a unique token like: video0.ts?token=abc123&expires=...
 const segmentUrls: string[] = [];
 
 const requestHandler = (request: { url: () => string }) => {
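
Reviewer note: the hunk ends before the handler body, so the sketch below is only an illustration of the capture-and-concatenate steps listed in the JSDoc, not the code in `extractor.ts`. Because seeking makes the player request segments out of order and sometimes more than once, captured URLs need deduplicating and sorting before concatenation. All helper names are hypothetical, the host `cdn.example` is a placeholder, and the assumption that segment file names end in a number comes from the `video0.ts` example in the comment above.

```typescript
// Hypothetical helpers; names and URL shapes are assumptions for illustration.

/** True when the URL path ends in ".ts", ignoring the query string. */
function isSegmentUrl(rawUrl: string): boolean {
  try {
    return /\.ts$/.test(new URL(rawUrl).pathname);
  } catch (err) {
    return false; // not a parseable absolute URL
  }
}

/** Extracts the trailing number from a file name such as "video12.ts". */
function segmentIndex(rawUrl: string): number | null {
  const match = /(\d+)\.ts$/.exec(new URL(rawUrl).pathname);
  return match ? Number(match[1]) : null;
}

/** Keeps the most recently seen URL per segment and returns them in playback order. */
function orderSegments(urls: string[]): string[] {
  const byIndex: Record<number, string> = {};
  for (const url of urls) {
    if (!isSegmentUrl(url)) continue;
    const index = segmentIndex(url);
    if (index !== null) byIndex[index] = url;
  }
  return Object.keys(byIndex)
    .map(Number)
    .sort((a, b) => a - b)
    .map((i) => byIndex[i]);
}

/** Builds the list file read by ffmpeg's concat demuxer (one "file '...'" line per path). */
function buildConcatList(localPaths: string[]): string {
  return localPaths.map((p) => `file '${p.replace(/'/g, "'\\''")}'`).join("\n") + "\n";
}

const captured = [
  "https://cdn.example/v/video2.ts?token=c",
  "https://cdn.example/v/video0.ts?token=a",
  "https://cdn.example/v/playlist.m3u8",
  "https://cdn.example/v/video2.ts?token=d",
  "https://cdn.example/v/video1.ts?token=b",
];
console.log(orderSegments(captured));
```

Keeping the latest URL per index is a deliberate choice here: the tokens are described as short-lived, so a later request is more likely to still be valid. Once the segments are saved locally, the list file can be passed to a command along the lines of `ffmpeg -f concat -safe 0 -i list.txt -c copy output.mp4`, where the file names are placeholders.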
