|
| 1 | +# OCW Downloader System Analysis Document |
| 2 | + |
| 3 | +## Executive Summary |
| 4 | + |
| 5 | +The OCW Downloader System is a content acquisition and organization solution designed to systematically download and persist OpenCourseWare materials. The system interfaces with multiple OCW APIs to retrieve course metadata, hierarchical content structure, and binary session files, organizing them into a deterministic filesystem structure. This architecture enables reliable, repeatable downloads with human-readable directory organization following the pattern: `course_title/chapter_sort__chapter_title/session_sort__session_title.ext`. |
| 6 | + |
| 7 | +## System Overview |
| 8 | + |
| 9 | +### Core Components |
| 10 | + |
| 11 | +The system architecture comprises four primary components: |
| 12 | + |
| 13 | +1. #### User/CLI Interface |
| 14 | + |
| 15 | +- Entry point for system interaction |
| 16 | +- Accepts courseId as primary input parameter |
| 17 | +- Receives status updates and completion summaries |
| 18 | + |
| 19 | +2. #### Downloader (Spider/Worker) |
| 20 | + |
| 21 | +- Central orchestration engine |
| 22 | +- Manages API communication sequencing |
| 23 | +- Handles error recovery and retry logic |
| 24 | +- Implements deterministic path generation algorithm |
| 25 | + |
| 26 | +3. #### OCW API Suite |
| 27 | + |
| 28 | +- **Course API**: Provides course-level metadata (title, type) |
| 29 | +- **Sessions API**: Returns hierarchical content structure with sort ordering |
| 30 | +- **Session Link**: Binary content delivery endpoint |
| 31 | + |
| 32 | +4. #### Local Storage (File System) |
| 33 | + |
| 34 | +- Persistent storage layer |
| 35 | +- Maintains hierarchical directory structure |
| 36 | +- Preserves content with deterministic naming convention |
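The Downloader is described as handling error recovery and retry logic, but the document does not specify a retry policy. The following is a minimal sketch of one common choice, exponential backoff, under stated assumptions: the helper name `with_retry`, the attempt count, the base delay, and the decision to retry only on `OSError` (which covers `urllib.error.URLError`) are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    attempts: int = 3,            # assumption: not specified in the document
    base_delay: float = 0.5,      # assumption: seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on OSError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles each time: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.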
| 37 | + |
| 38 | +### Component Interaction Diagram |
| 39 | + |
| 40 | +```plantuml |
| 41 | +@startuml |
| 42 | +title OCW Downloader — System Architecture |
| 43 | +
|
| 44 | +!define RECTANGLE_COLOR #E1F5FE |
| 45 | +!define API_COLOR #FFF3E0 |
| 46 | +!define STORAGE_COLOR #E8F5E9 |
| 47 | +!define WORKER_COLOR #F3E5F5 |
| 48 | +
|
| 49 | +left to right direction |
| 50 | +skinparam componentStyle rectangle |
| 51 | +skinparam wrapWidth 220 |
| 52 | +skinparam maxMessageSize 220 |
| 53 | +skinparam arrowColor #abb2bf |
| 54 | +skinparam actorBackgroundColor #61afef |
| 55 | +skinparam componentBackgroundColor #2c323c |
| 56 | +skinparam componentBorderColor #61afef |
| 57 | +skinparam databaseBackgroundColor #2c323c |
| 58 | +skinparam databaseBorderColor #98c379 |
| 59 | +skinparam nodeBackgroundColor #2c323c |
| 60 | +skinparam nodeBorderColor #e06c75 |
| 61 | +
|
| 62 | +
|
| 63 | +' Dark Mode Theme |
| 64 | +skinparam backgroundColor #282c34 |
| 65 | +skinparam defaultTextAlignment left |
| 66 | +skinparam noteTextAlignment left |
| 67 | +skinparam ArrowColor #abb2bf |
| 68 | +skinparam NoteBorderColor #61afef |
| 69 | +skinparam NoteBackgroundColor #2c323c |
| 70 | +skinparam NoteFontColor #abb2bf |
| 71 | +skinparam Nodesep 100 |
| 72 | +skinparam Ranksep 100 |
| 73 | +skinparam Dpi 96 |
| 74 | +skinparam PageMargin 150 |
| 75 | +skinparam BoxPadding 150 |
| 76 | +
|
| 77 | +
|
| 78 | +
|
| 79 | +actor "User/CLI" as User #61afef |
| 80 | +component "Downloader\n(Spider/Worker)" as D <<core>> #2c323c |
| 81 | +
|
| 82 | +node "OCW API Gateway" as API #2c323c { |
| 83 | + component "Course API\nPOST /api/v1/ocw/course/get" as CourseAPI #2c323c |
| 84 | + component "Sessions API\nPOST /api/v1/ocw/sessions" as SessionsAPI #2c323c |
| 85 | + component "Session Link\nGET /cms/ocw/session_link" as SessionLink #2c323c |
| 86 | +} |
| 87 | +
|
| 88 | +database "Local Storage\n(File System)" as FS #2c323c |
| 89 | +
|
| 90 | +User -[#61afef]-> D : courseId |
| 91 | +D -[#e06c75]-> CourseAPI : POST {"id": courseId} |
| 92 | +CourseAPI -[#98c379]-> D : {title, type} |
| 93 | +
|
| 94 | +D -[#e06c75]-> SessionsAPI : POST {\n "limit": null,\n "order_type": "ASC",\n "course_id": courseId,\n "status": ["free","non-free"]\n} |
| 95 | +SessionsAPI -[#98c379]-> D : chapters[] {title, sort,\n sessions[] {title, link, type, sort}} |
| 96 | +
|
| 97 | +D -[#e06c75]-> SessionLink : GET session.link\n(per session) |
| 98 | +SessionLink -[#98c379]-> D : binary content |
| 99 | +
|
| 100 | +D -UP[#c678dd]-> FS : save as\ncourse_title/\n chapter_sort__chapter_title/\n session_sort__session_title.ext |
| 101 | +
|
| 102 | +note bottom of D #2c323c |
| 103 | + <color:#abb2bf>Orchestrates entire workflow |
| 104 | + Implements retry logic |
| 105 | + Handles path generation</color> |
| 106 | +end note |
| 107 | +
|
| 108 | +note top of API #2c323c |
| 109 | + <color:#abb2bf>RESTful API endpoints |
| 110 | + JSON request/response |
| 111 | + Binary content delivery</color> |
| 112 | +end note |
| 113 | +@enduml |
| 114 | +``` |
| 115 | + |
| 116 | +## Interaction Analysis |
| 117 | + |
| 118 | +The Downloader is the only component that talks to both the OCW APIs and the file system, which keeps orchestration, remote access, and storage separate: |
| 119 | + |
| 120 | +### Key Interaction Patterns |
| 121 | + |
| 122 | +1. **Sequential Dependency Chain**: The course title names the root directory, so course metadata is retrieved before the session listing; the two calls form the critical path for data acquisition |
| 123 | +2. **Hierarchical Data Resolution**: The Sessions API provides complete navigational structure in a single response, minimizing API calls |
| 124 | +3. **Parallel Download Capability**: Individual session downloads are independent, enabling potential parallelization |
| 125 | +4. **Deterministic Path Generation**: Sort keys ensure consistent, reproducible filesystem organization across multiple executions |
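The deterministic path pattern above can be sketched as a pure function of the metadata. This is an illustration, not the project's implementation: the function names are hypothetical, the two-digit zero padding is inferred from the `01__Getting Started` example in the data model diagram, and the extension is assumed to be the suffix of the link's URL path.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def extension_from_link(link: str) -> str:
    """Return the suffix of the URL path (e.g. ".pdf"), or "" if there is none."""
    return PurePosixPath(urlparse(link).path).suffix


def session_path(
    course_title: str,
    chapter_sort: int,
    chapter_title: str,
    session_sort: int,
    session_title: str,
    link: str,
) -> PurePosixPath:
    """Build course_title/chapter_sort__chapter_title/session_sort__session_title.ext."""
    # Assumption: sort keys are zero-padded to two digits so that
    # lexicographic directory order matches numeric order up to 99.
    chapter_dir = f"{chapter_sort:02d}__{chapter_title}"
    file_name = f"{session_sort:02d}__{session_title}{extension_from_link(link)}"
    return PurePosixPath(course_title) / chapter_dir / file_name
```

Because the function depends only on its arguments, repeated runs over the same metadata produce the same paths.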
| 126 | + |
| 127 | +### Communication Protocols |
| 128 | + |
| 129 | +- **Metadata APIs**: JSON-based POST requests with structured payloads |
| 130 | +- **Binary Endpoint**: Simple GET requests with URL-based session identification |
| 131 | +- **Error Handling**: A failed session download is reported and skipped; the remaining sessions are still downloaded |
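The two JSON POST payloads shown in the component diagram can be built with the standard library as follows. This is a sketch under assumptions: `BASE_URL` is a placeholder host, the function names are hypothetical, the integer `course_id` type is a guess, and the `Content-Type` header is assumed rather than documented. The requests are only constructed here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://ocw.example.org"  # hypothetical host, not from the document


def _json_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request."""
    return Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def course_request(course_id: int) -> Request:
    # Payload shape taken from the diagram: {"id": courseId}
    return _json_post("/api/v1/ocw/course/get", {"id": course_id})


def sessions_request(course_id: int) -> Request:
    # Payload shape taken from the diagram; None serializes to JSON null.
    payload = {
        "limit": None,
        "order_type": "ASC",
        "course_id": course_id,
        "status": ["free", "non-free"],
    }
    return _json_post("/api/v1/ocw/sessions", payload)
```

A caller would pass the resulting object to `urllib.request.urlopen` and decode the JSON response.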
| 132 | + |
| 133 | +## Process Flow Analysis |
| 134 | + |
| 135 | +### Sequence Diagram |
| 136 | + |
| 137 | +```plantuml |
| 138 | +@startuml |
| 139 | +title OCW Downloader — Process Flow |
| 140 | +
|
| 141 | +' Dark Mode Theme |
| 142 | +skinparam backgroundColor #282c34 |
| 143 | +skinparam defaultTextAlignment left |
| 144 | +skinparam noteTextAlignment left |
| 145 | +skinparam ArrowColor #abb2bf |
| 146 | +skinparam NoteBorderColor #61afef |
| 147 | +skinparam NoteBackgroundColor #2c323c |
| 148 | +skinparam NoteFontColor #abb2bf |
| 149 | +skinparam Nodesep 100 |
| 150 | +skinparam Ranksep 100 |
| 151 | +skinparam Dpi 96 |
| 152 | +skinparam PageMargin 150 |
| 153 | +skinparam BoxPadding 150 |
| 154 | +
|
| 155 | +skinparam sequenceArrowColor #0f58e0 |
| 156 | +skinparam sequenceLifeLineBorderColor #4b5263 |
| 157 | +skinparam sequenceParticipantBackgroundColor #2c323c |
| 158 | +skinparam sequenceParticipantBorderColor #61afef |
| 159 | +skinparam sequenceActorBackgroundColor #2c323c |
| 160 | +skinparam sequenceActorBorderColor #61afef |
| 161 | +skinparam sequenceGroupBackgroundColor #2c323c |
| 162 | +skinparam sequenceGroupBorderColor #61afef |
| 163 | +skinparam sequenceGroupHeaderFontColor #61afef |
| 164 | +skinparam sequenceDividerBackgroundColor #2c323c |
| 165 | +skinparam sequenceDividerBorderColor #61afef |
| 166 | +skinparam sequenceDividerFontColor #61afef |
| 167 | +skinparam sequenceLifeLineBackgroundColor #2c323c |
| 168 | +
|
| 169 | +autonumber "<b>[00]" |
| 170 | +actor Client #61afef |
| 171 | +participant "Downloader\n(Spider/Worker)" as D #c678dd |
| 172 | +participant "Course API\nPOST /api/v1/ocw/course/get" as CourseAPI #e06c75 |
| 173 | +participant "Sessions API\nPOST /api/v1/ocw/sessions" as SessionsAPI #e06c75 |
| 174 | +participant "Session Link\nGET /cms/ocw/session_link" as CMS #e06c75 |
| 175 | +database "File System" as FS #98c379 |
| 176 | +
|
| 177 | +== Initialization == |
| 178 | +Client -> D : start(courseId) |
| 179 | +activate D |
| 180 | +
|
| 181 | +== Phase 1: Course Metadata Retrieval == |
| 182 | +group #2c323c Fetch Course Metadata |
| 183 | + D -> CourseAPI : POST { "id": courseId } |
| 184 | + activate CourseAPI |
| 185 | + alt #98c379 Success [200 OK] |
| 186 | + CourseAPI --> D : { title: "Data Structures", type: "undergraduate" } |
| 187 | + note right #2c323c: Course metadata fetched\nfor directory naming |
| 188 | + else #e06c75 Error [4xx/5xx] |
| 189 | + CourseAPI --> D : 4xx/5xx |
| 190 | + deactivate CourseAPI |
| 191 | + D --> Client : ERROR: Course fetch failed |
| 192 | + deactivate D |
| 193 | + return |
| 194 | + end |
| 195 | + deactivate CourseAPI |
| 196 | +end |
| 197 | +
|
| 198 | +== Phase 2: Content Hierarchy Discovery == |
| 199 | +group #2c323c Fetch Chapter/Session Hierarchy |
| 200 | + D -> SessionsAPI : POST {\n "limit": null,\n "order_type": "ASC",\n "course_id": courseId,\n "status": ["free","non-free"]\n} |
| 201 | + activate SessionsAPI |
| 202 | + alt #98c379 Success [200 OK] |
| 203 | + SessionsAPI --> D : chapters[] { title, sort,\n sessions[] { title, link, type, sort } } |
| 204 | + note right #2c323c: Complete hierarchy\nretrieved in single call |
| 205 | + else #e06c75 Error [4xx/5xx] |
| 206 | + SessionsAPI --> D : 4xx/5xx |
| 207 | + deactivate SessionsAPI |
| 208 | + D --> Client : ERROR: Sessions fetch failed |
| 209 | + deactivate D |
| 210 | + return |
| 211 | + end |
| 212 | + deactivate SessionsAPI |
| 213 | +end |
| 214 | +
|
| 215 | +== Phase 3: Content Download Execution == |
| 216 | +group #2c323c Download Sessions (Ordered Processing) |
| 217 | + loop for each chapter (ascending by sort) |
| 218 | + note over D #2c323c: Create chapter directory\nif not exists |
| 219 | + loop for each session (ascending by sort) |
| 220 | + D -> CMS : GET session.link |
| 221 | + activate CMS |
| 222 | + alt #98c379 Success [200 OK] |
| 223 | + CMS --> D : content bytes |
| 224 | + deactivate CMS |
| 225 | + D -> FS : write course_title/\n chapter_sort__chapter_title/\n session_sort__session_title.ext |
| 226 | + activate FS |
| 227 | + FS --> D : write confirmation |
| 228 | + deactivate FS |
| 229 | + note right #2c323c: Path deterministically\ngenerated from metadata |
| 230 | + else #e06c75 Error [4xx/5xx] |
| 231 | + CMS --> D : 4xx/5xx |
| 232 | + deactivate CMS |
| 233 | + D --> Client : WARN: Session skipped |
| 234 | + note right #2c323c: Continue with\nnext session |
| 235 | + end |
| 236 | + end |
| 237 | + end |
| 238 | +end |
| 239 | +
|
| 240 | +== Completion == |
| 241 | +D --> Client : done(summary: {\n total_sessions: 47,\n successful_downloads: 45,\n failed_downloads: 2,\n target_paths: "./Data_Structures/"\n}) |
| 242 | +deactivate D |
| 243 | +@enduml |
| 244 | +``` |
| 245 | + |
| 246 | +### Process Flow Characteristics |
| 247 | + |
| 248 | +1. **Three-Phase Execution Model**: |
| 249 | + |
| 250 | + - **Phase 1**: Course metadata acquisition (blocking) |
| 251 | + - **Phase 2**: Session hierarchy retrieval (blocking) |
| 252 | + - **Phase 3**: Content download (a failed session does not block the remaining sessions) |
| 253 | + |
| 254 | +2. **Error Recovery Strategy**: |
| 255 | + |
| 256 | + - Critical failures (Phases 1-2): Terminate execution |
| 257 | + - Non-critical failures (Phase 3): Log and continue |
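The three-phase model and its error strategy can be sketched as a single function. The callables `fetch_course`, `fetch_chapters`, `download`, and `save` are hypothetical stand-ins for the real API and file system code, and treating `OSError` as the recoverable failure type is an assumption. The summary keys follow the completion message in the sequence diagram.

```python
import logging
from typing import Callable

log = logging.getLogger("ocw_downloader")  # hypothetical logger name


def run(
    course_id: int,
    fetch_course: Callable[[int], dict],
    fetch_chapters: Callable[[int], list],
    download: Callable[[str], bytes],
    save: Callable[[dict, dict, dict, bytes], None],
) -> dict:
    # Phases 1 and 2 are critical: any exception propagates and ends the run.
    course = fetch_course(course_id)
    chapters = fetch_chapters(course_id)

    total = succeeded = 0
    # Phase 3: process chapters and sessions in ascending sort order.
    for chapter in sorted(chapters, key=lambda c: c["sort"]):
        for session in sorted(chapter["sessions"], key=lambda s: s["sort"]):
            total += 1
            try:
                content = download(session["link"])
                save(course, chapter, session, content)
                succeeded += 1
            except OSError as exc:
                # Non-critical failure: log and continue with the next session.
                log.warning("Session skipped: %s (%s)", session["title"], exc)
    return {
        "total_sessions": total,
        "successful_downloads": succeeded,
        "failed_downloads": total - succeeded,
    }
```

Injecting the four callables keeps the control flow testable with in-memory fakes.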
| 258 | + |
| 259 | +## Data Model Analysis |
| 260 | + |
| 261 | +### API Data Model Diagram |
| 262 | + |
| 263 | +```plantuml |
| 264 | +@startuml |
| 265 | +title OCW API Data Model |
| 266 | +
|
| 267 | +' Dark Mode Theme |
| 268 | +skinparam backgroundColor #282c34 |
| 269 | +skinparam defaultTextAlignment left |
| 270 | +skinparam noteTextAlignment left |
| 271 | +skinparam ArrowColor #abb2bf |
| 272 | +skinparam NoteBorderColor #61afef |
| 273 | +skinparam NoteBackgroundColor #2c323c |
| 274 | +skinparam NoteFontColor #abb2bf |
| 275 | +skinparam Nodesep 100 |
| 276 | +skinparam Ranksep 100 |
| 277 | +skinparam Dpi 96 |
| 278 | +skinparam PageMargin 150 |
| 279 | +skinparam BoxPadding 150 |
| 280 | +
|
| 281 | +skinparam classBackgroundColor #2c323c |
| 282 | +skinparam classBorderColor #61afef |
| 283 | +skinparam classAttributeIconSize 0 |
| 284 | +skinparam classFontStyle bold |
| 285 | +skinparam classFontColor #abb2bf |
| 286 | +skinparam arrowColor #98c379 |
| 287 | +skinparam entityBackgroundColor #2c323c |
| 288 | +skinparam entityBorderColor #61afef |
| 289 | +skinparam entityFontColor #abb2bf |
| 290 | +skinparam entityAttributeFontColor #abb2bf |
| 291 | +skinparam packageBackgroundColor #2c323c |
| 292 | +hide empty methods |
| 293 | +
|
| 294 | +entity "Course" as Course <<Entity>> #2c323c { |
| 295 | + <color:#61afef>title</color> : String [NOT NULL] |
| 296 | + -- |
| 297 | + <color:#abb2bf><i>Constraints:</i> |
| 298 | + • Title used for root directory</color> |
| 299 | +} |
| 300 | +
|
| 301 | +entity "Chapter" as Chapter <<Entity>> #2c323c { |
| 302 | + <color:#61afef>title</color> : String [NOT NULL] |
| 303 | + <color:#e06c75>sort</color> : Integer [UNIQUE per course] |
| 304 | + -- |
| 305 | + <color:#abb2bf><i>Constraints:</i> |
| 306 | + • Sort determines processing order |
| 307 | + • Sort used in directory naming |
| 308 | + • No direct content storage</color> |
| 309 | +} |
| 310 | +
|
| 311 | +entity "Session" as Session <<Entity>> #2c323c { |
| 312 | + <color:#61afef>title</color> : String [NOT NULL] |
| 313 | + <color:#c678dd>link</color> : URL [NOT NULL] |
| 314 | + <color:#c678dd>ext</color> : String |
| 315 | + <color:#e06c75>sort</color> : Integer [UNIQUE per chapter] |
| 316 | + -- |
| 317 | + <color:#abb2bf><i>Constraints:</i></color> |
| 318 | + <color:#abb2bf>• Link points to binary content</color> |
| 319 | + <color:#abb2bf>• Sort ensures consistent ordering</color> |
| 320 | + <color:#abb2bf>• Ext derived from Link</color> |
| 321 | +} |
| 322 | +
|
| 323 | +
|
| 324 | +Chapter ||-R-o{ Session : "contains\n(1:N)" #98c379 |
| 325 | +
|
| 326 | +note top of Course #2c323c |
| 327 | +<color:#abb2bf><b>Primary Entity</b> |
| 328 | +Identified by external courseId |
| 329 | +Retrieved via Course API |
| 330 | +Forms root of storage hierarchy</color> |
| 331 | +end note |
| 332 | +
|
| 333 | +note top of Chapter #2c323c |
| 334 | +<color:#abb2bf><b>Organizational Container</b> |
| 335 | +Groups related sessions |
| 336 | +Sort-prefixed directory naming |
| 337 | +No direct downloadable content</color> |
| 338 | +end note |
| 339 | +
|
| 340 | +note bottom of Session #2c323c |
| 341 | +<color:#abb2bf><b>Content Unit</b> |
| 342 | +Atomic downloadable resource |
| 343 | +Binary content via link URL |
| 344 | +Sort-prefixed file naming</color> |
| 345 | +end note |
| 346 | +
|
| 347 | +note as StorageFormula #2c323c |
| 348 | +<color:#abb2bf><b>Storage Path Generation Algorithm:</b> |
| 349 | +<code> |
| 350 | +Path = {course.title}/ |
| 351 | + {chapter.sort}__{chapter.title}/ |
| 352 | + {session.sort}__{session.title}.{ext} |
| 353 | +</code> |
| 354 | +
|
| 355 | +<b>Example:</b> |
| 356 | +<code> |
| 357 | +"Introduction to Python/ |
| 358 | + 01__Getting Started/ |
| 359 | + 01__Installation Guide.pdf" |
| 360 | +</code></color> |
| 361 | +end note |
| 362 | +
|
| 363 | +StorageFormula .. Session |
| 364 | +@enduml |
| 365 | +``` |
| 366 | + |
| 367 | +### Data Model Insights |
| 368 | + |
| 369 | +#### Schema Characteristics |
| 370 | + |
| 371 | +1. ##### Course Schema |
| 372 | + |
| 373 | + - Key attribute: title, used to name the root directory |
| 374 | + |
| 375 | +2. ##### Chapter Schema |
| 376 | + |
| 377 | + - Organizational unit without direct content |
| 378 | + - Sort attribute ensures deterministic ordering |
| 379 | + - Relationship: contains zero or more Sessions (1:N) |
| 380 | + |
| 381 | +3. ##### Session Schema |
| 382 | + |
| 383 | + - Atomic content unit with downloadable resource |
| 384 | + - Link attribute provides content access URL |
| 385 | + - Sort attribute maintains consistent ordering within chapter |
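The three schemas can be expressed as immutable records. This is a sketch: the class layout and the `parse_chapters` helper are hypothetical, and the input shape is taken from the Sessions API response shown in the diagrams (`chapters[] {title, sort, sessions[] {title, link, type, sort}}`). The `type` field is omitted because the data model diagram does not include it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    title: str
    link: str
    sort: int


@dataclass(frozen=True)
class Chapter:
    title: str
    sort: int
    sessions: tuple[Session, ...] = ()


@dataclass(frozen=True)
class Course:
    title: str  # names the root directory


def parse_chapters(raw: list[dict]) -> tuple[Chapter, ...]:
    """Convert the decoded Sessions API response into sorted Chapter records."""
    chapters = []
    for item in raw:
        sessions = sorted(
            (Session(s["title"], s["link"], s["sort"]) for s in item.get("sessions", [])),
            key=lambda s: s.sort,
        )
        chapters.append(Chapter(item["title"], item["sort"], tuple(sessions)))
    return tuple(sorted(chapters, key=lambda c: c.sort))
```

Sorting at parse time means downstream code can iterate in processing order without re-sorting.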
| 386 | + |
| 387 | +#### Data Integrity Considerations |
| 388 | + |
| 389 | +- Sort values must be unique within their scope (course/chapter) |
| 390 | +- The path generation algorithm is responsible for filesystem compatibility, for example by replacing characters in titles that are not valid in file names |
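Both integrity considerations can be checked before any file is written. The sketch below is illustrative only: the helper names, the set of characters treated as unsafe (those rejected by common Windows and POSIX file systems), and the `"untitled"` fallback are assumptions, not rules taken from the document.

```python
import re
from collections import Counter


def duplicate_sorts(items: list[dict]) -> list[int]:
    """Return the sort values that occur more than once within one scope."""
    counts = Counter(item["sort"] for item in items)
    return sorted(value for value, n in counts.items() if n > 1)


# Assumption: characters that commonly break paths on Windows or POSIX.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_name(title: str) -> str:
    """Replace unsafe characters and trim leading/trailing spaces and dots."""
    cleaned = _UNSAFE.sub("_", title).strip(" .")
    return cleaned or "untitled"  # hypothetical fallback for empty results
```

Running `duplicate_sorts` once over the chapter list and once per chapter's session list covers both uniqueness scopes.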