docs: explain HLS segment interception and on-demand loading challenges

swernerx · swernerx · commit d5f642df5072 · 2025-12-27T11:56:16.000+01:00
- Add new 'On-Demand Segment Loading' section to homepage
- Expand 'Encrypted HLS Streams' explanation with concrete examples
- Add detailed JSDoc to extractLearningSuitePostContent explaining the strategy
- Document why seeking through the video timeline is necessary
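
Illustrative sketch of the strategy described above. This is not the code in this commit: the helper names (`seekPositions`, `isSegmentUrl`, `buildConcatList`), the 12-second default step, and the example URLs are assumptions made for illustration. It shows how seek targets could be planned, how `.ts` segment requests could be recognized, and how a list file for `ffmpeg -f concat` could be written.

```typescript
// Hypothetical helpers; names and defaults are illustrative, not from the repository.

/** Seek targets (in seconds) covering [0, duration) at a fixed step. */
function seekPositions(durationSec: number, stepSec: number = 12): number[] {
  if (!(durationSec > 0) || !(stepSec > 0)) return [];
  const positions: number[] = [];
  for (let t = 0; t < durationSec; t += stepSec) {
    positions.push(t);
  }
  return positions;
}

/** True when ".ts" is directly followed by the query string, a fragment, or the end of the URL. */
function isSegmentUrl(url: string): boolean {
  return /\.ts(?:[?#]|$)/.test(url);
}

/** Builds the text of a list file for the ffmpeg concat demuxer, one `file '...'` line per segment. */
function buildConcatList(paths: string[]): string {
  // A single quote inside a quoted path is written as '\'' in ffmpeg's quoting rules.
  const lines = paths.map((p) => "file '" + p.replace(/'/g, "'\\''") + "'");
  return lines.join("\n") + "\n";
}

// Example: a 5-minute video with a 12-second step yields 25 seek targets.
const targets = seekPositions(300);
console.log(targets.length, targets[targets.length - 1]);
```

In a real run, each seek target would be assigned to the player's current time while the request handler collects every URL for which `isSegmentUrl` returns true; the downloaded files would then be listed with `buildConcatList` and passed to `ffmpeg -f concat`.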
diff --git a/docs/index.html b/docs/index.html
@@ -493,10 +493,27 @@ <h2 class="text-3xl sm:text-4xl font-bold text-center mb-4">Under the Hood</h2>
             </div>
             <div>
               <h3 class="text-lg font-semibold mb-2">Encrypted HLS Streams</h3>
-              <p class="text-slate-400 text-sm mb-3">Some platforms serve encrypted HLS playlists that are decrypted client-side. Standard downloaders fail because the playlist data is gibberish.</p>
+              <p class="text-slate-400 text-sm mb-3">Some platforms serve encrypted HLS playlists that are decrypted client-side. The API returns scrambled data like <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">77a393e51f4b...</code> instead of standard <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">#EXTM3U</code> playlists. Standard downloaders and even ffmpeg can't parse these.</p>
               <div class="bg-slate-900/50 rounded-lg p-4 border border-slate-700/30">
                 <p class="text-slate-500 text-xs mb-2 font-mono">Our approach:</p>
-                <p class="text-slate-300 text-sm">We intercept the actual <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">.ts</code> video segments as the browser plays them, capture their individual auth tokens, download each segment, then concatenate them with <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">ffmpeg</code>.</p>
+                <p class="text-slate-300 text-sm">We let the browser decrypt the playlist, then intercept the actual <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">.ts</code> video segments as they're requested. Each segment has its own auth token that expires quickly. We capture all tokens, download each segment individually, then concatenate them with <code class="bg-slate-800 px-1.5 py-0.5 rounded text-xs">ffmpeg -f concat</code>.</p>
+              </div>
+            </div>
+          </div>
+        </div>
+
+        <!-- Challenge 1b: On-Demand Segment Loading -->
+        <div class="bg-slate-800/20 rounded-2xl p-6 border border-slate-700/30">
+          <div class="flex items-start gap-4">
+            <div class="w-10 h-10 bg-pink-500/20 rounded-lg flex items-center justify-center flex-shrink-0 mt-1">
+              <img src="icon-play-circle.svg" alt="" class="w-5 h-5" style="filter: invert(70%) sepia(80%) saturate(500%) hue-rotate(280deg);">
+            </div>
+            <div>
+              <h3 class="text-lg font-semibold mb-2">On-Demand Segment Loading</h3>
+              <p class="text-slate-400 text-sm mb-3">HLS players don't load all segments upfront; they fetch them on demand as you watch. If you only watch the first 2 minutes, you only get those segments, so a 30-minute video might download as a 2-minute clip.</p>
+              <div class="bg-slate-900/50 rounded-lg p-4 border border-slate-700/30">
+                <p class="text-slate-500 text-xs mb-2 font-mono">Our approach:</p>
+                <p class="text-slate-300 text-sm">We programmatically seek through the entire video timeline so that the player requests every segment. For a 5-minute video, we seek to a new position roughly every 12 seconds (about 25 seeks), so that all 75+ segments are requested and captured and the download covers the full video.</p>
               </div>
             </div>
           </div>
diff --git a/src/scraper/learningsuite/extractor.ts b/src/scraper/learningsuite/extractor.ts
@@ -373,6 +373,28 @@ export async function extractAttachmentsFromPage(
 
 /**
  * Extracts complete lesson content using DOM-based extraction with network interception.
+ *
+ * ## Video Download Strategy for LearningSuite (Bunny CDN)
+ *
+ * LearningSuite videos cannot be downloaded directly, for three reasons:
+ *
+ * 1. **Encrypted Playlists**: The API returns encrypted data (e.g., `77a393e51f4b...`)
+ *    instead of standard `#EXTM3U` playlists. JavaScript decrypts them client-side.
+ *
+ * 2. **Per-Segment Tokens**: Each `.ts` segment has a unique, short-lived token.
+ *    These tokens are generated when the browser requests each segment.
+ *
+ * 3. **On-Demand Loading**: HLS players load segments on-demand during playback.
+ *    Simply loading the page only captures the first ~2 minutes of video.
+ *
+ * ## Our Solution
+ *
+ * - Intercept network requests to capture segment URLs with their tokens
+ * - Programmatically seek through the entire video timeline, which triggers
+ *   the player to request every segment
+ * - Download each segment individually with its token
+ * - Concatenate segments using `ffmpeg -f concat`
+ *
  * Note: LearningSuite uses persisted GraphQL queries, so we can't make arbitrary API calls.
  */
 export async function extractLearningSuitePostContent(
@@ -385,7 +407,8 @@ export async function extractLearningSuitePostContent(
   // Set up request interception to capture HLS video URLs
   const hlsUrls: string[] = [];
 
-  // Handler for requests - capture segment URLs with tokens
+  // Capture segment URLs with their individual auth tokens
+  // Each .ts segment has a unique token like: video0.ts?token=abc123&expires=...
   const segmentUrls: string[] = [];
 
   const requestHandler = (request: { url: () => string }) => {