Commit 4099a86

Merge branch 'develop'
2 parents ef0888c + e51c196 commit 4099a86

12 files changed: 775 additions and 82 deletions

docs/ocw-system-analysis.md

Lines changed: 390 additions & 0 deletions
@@ -0,0 +1,390 @@
1+
# OCW Downloader System Analysis Document
2+
3+
## Executive Summary
4+
5+
The OCW Downloader System is a content acquisition and organization solution designed to systematically download and persist OpenCourseWare materials. The system interfaces with multiple OCW APIs to retrieve course metadata, hierarchical content structure, and binary session files, organizing them into a deterministic filesystem structure. This architecture enables reliable, repeatable downloads with human-readable directory organization following the pattern: `course_title/chapter_sort__chapter_title/session_sort__session_title.ext`.
6+
7+
## System Overview
8+
9+
### Core Components
10+
11+
The system architecture comprises four primary components:
12+
13+
1. #### User/CLI Interface
14+
15+
- Entry point for system interaction
16+
- Accepts courseId as primary input parameter
17+
- Receives status updates and completion summaries
18+
19+
1. #### Downloader (Spider/Worker)
20+
21+
- Central orchestration engine
22+
- Manages API communication sequencing
23+
- Handles error recovery and retry logic
24+
- Implements deterministic path generation algorithm
25+
26+
1. #### OCW API Suite
27+
28+
- **Course API**: Provides course-level metadata (title, type)
29+
- **Sessions API**: Returns hierarchical content structure with sort ordering
30+
- **Session Link**: Binary content delivery endpoint
31+
32+
1. #### Local Storage (File System)
33+
34+
- Persistent storage layer
35+
- Maintains hierarchical directory structure
36+
- Preserves content with deterministic naming convention
37+
38+
### Component Interaction Diagram
39+
40+
```plantuml
41+
@startuml
42+
title OCW Downloader — System Architecture
43+
44+
!define RECTANGLE_COLOR #E1F5FE
45+
!define API_COLOR #FFF3E0
46+
!define STORAGE_COLOR #E8F5E9
47+
!define WORKER_COLOR #F3E5F5
48+
49+
left to right direction
50+
skinparam componentStyle rectangle
51+
skinparam wrapWidth 220
52+
skinparam maxMessageSize 220
53+
skinparam arrowColor #abb2bf
54+
skinparam actorBackgroundColor #61afef
55+
skinparam componentBackgroundColor #2c323c
56+
skinparam componentBorderColor #61afef
57+
skinparam databaseBackgroundColor #2c323c
58+
skinparam databaseBorderColor #98c379
59+
skinparam nodeBackgroundColor #2c323c
60+
skinparam nodeBorderColor #e06c75
61+
62+
63+
' Dark Mode Theme
64+
skinparam backgroundColor #282c34
65+
skinparam defaultTextAlignment left
66+
skinparam noteTextAlignment left
67+
skinparam ArrowColor #abb2bf
68+
skinparam NoteBorderColor #61afef
69+
skinparam NoteBackgroundColor #2c323c
70+
skinparam NoteFontColor #abb2bf
71+
skinparam Nodesep 100
72+
skinparam Ranksep 100
73+
skinparam Dpi 96
74+
skinparam PageMargin 150
75+
skinparam BoxPadding 150
76+
77+
78+
79+
actor "User/CLI" as User #61afef
80+
component "Downloader\n(Spider/Worker)" as D <<core>> #2c323c
81+
82+
node "OCW API Gateway" as API #2c323c {
83+
component "Course API\nPOST /api/v1/ocw/course/get" as CourseAPI #2c323c
84+
component "Sessions API\nPOST /api/v1/ocw/sessions" as SessionsAPI #2c323c
85+
component "Session Link\nGET /cms/ocw/session_link" as SessionLink #2c323c
86+
}
87+
88+
database "Local Storage\n(File System)" as FS #2c323c
89+
90+
User -[#61afef]-> D : courseId
91+
D -[#e06c75]-> CourseAPI : POST {"id": courseId}
92+
CourseAPI -[#98c379]-> D : {title, type}
93+
94+
D -[#e06c75]-> SessionsAPI : POST {\n "limit": null,\n "order_type": "ASC",\n "course_id": courseId,\n "status": ["free","non-free"]\n}
95+
SessionsAPI -[#98c379]-> D : chapters[] {title, sort,\n sessions[] {title, link, type, sort}}
96+
97+
D -[#e06c75]-> SessionLink : GET session.link\n(per session)
98+
SessionLink -[#98c379]-> D : binary content
99+
100+
D -UP[#c678dd]-> FS : save as\ncourse_title/\n chapter_sort__chapter_title/\n session_sort__session_title.ext
101+
102+
note bottom of D #2c323c
103+
<color:#abb2bf>Orchestrates entire workflow
104+
Implements retry logic
105+
Handles path generation</color>
106+
end note
107+
108+
note top of API #2c323c
109+
<color:#abb2bf>RESTful API endpoints
110+
JSON request/response
111+
Binary content delivery</color>
112+
end note
113+
@enduml
114+
```
115+
116+
## Interaction Analysis
117+
118+
The Downloader is the only component that talks to the APIs and to the file system, which keeps orchestration, data retrieval, and storage cleanly separated:
119+
120+
### Key Interaction Patterns
121+
122+
1. **Sequential Dependency Chain**: Course metadata must be retrieved before session listing, establishing a critical path for data acquisition
123+
2. **Hierarchical Data Resolution**: The Sessions API provides complete navigational structure in a single response, minimizing API calls
124+
3. **Parallel Download Capability**: Individual session downloads are independent, enabling potential parallelization
125+
4. **Deterministic Path Generation**: Sort keys ensure consistent, reproducible filesystem organization across multiple executions
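
The deterministic path rule can be sketched as a small pure function. This is a minimal illustration, not the project's implementation: the names `sanitize` and `session_path`, the two-digit zero padding (inferred from the `01__` example in the data model diagram), and the set of replaced characters are all assumptions.

```python
import re
from pathlib import PurePosixPath


def sanitize(name: str) -> str:
    """Replace characters that are commonly invalid in file names (assumed rule)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", name).strip()
    return cleaned or "untitled"


def session_path(course_title: str, chapter_sort: int, chapter_title: str,
                 session_sort: int, session_title: str, ext: str,
                 width: int = 2) -> PurePosixPath:
    """Build course_title/chapter_sort__chapter_title/session_sort__session_title.ext."""
    chapter_dir = f"{chapter_sort:0{width}d}__{sanitize(chapter_title)}"
    file_name = f"{session_sort:0{width}d}__{sanitize(session_title)}.{ext}"
    return PurePosixPath(sanitize(course_title)) / chapter_dir / file_name


path = session_path("Introduction to Python", 1, "Getting Started",
                    1, "Installation Guide", "pdf")
print(path)
```

Because the function depends only on its arguments, repeated runs over the same API response produce the same paths, which is what makes re-downloads repeatable.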
126+
127+
### Communication Protocols
128+
129+
- **Metadata APIs**: JSON-based POST requests with structured payloads
130+
- **Binary Endpoint**: Simple GET requests with URL-based session identification
131+
- **Error Handling**: A failed session download is reported and skipped; the remaining sessions are still processed
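
The two metadata requests can be sketched as payload builders. The endpoint paths and field names are taken from the diagrams in this document; the function names are hypothetical, and the type of `course_id` is an assumption since the document does not state it.

```python
import json

# Endpoint paths as shown in the architecture and sequence diagrams.
COURSE_ENDPOINT = "/api/v1/ocw/course/get"
SESSIONS_ENDPOINT = "/api/v1/ocw/sessions"


def course_request(course_id):
    """Return (path, JSON body) for the Course API POST request."""
    return COURSE_ENDPOINT, json.dumps({"id": course_id})


def sessions_request(course_id):
    """Return (path, JSON body) for the Sessions API POST request."""
    payload = {
        "limit": None,            # serialized as JSON null
        "order_type": "ASC",
        "course_id": course_id,
        "status": ["free", "non-free"],
    }
    return SESSIONS_ENDPOINT, json.dumps(payload)


path, body = sessions_request(42)
print(path, body)
```

Keeping payload construction separate from the HTTP call makes the request shape easy to check without network access.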
132+
133+
## Process Flow Analysis
134+
135+
### Sequence Diagram
136+
137+
```plantuml
138+
@startuml
139+
title OCW Downloader — Process Flow
140+
141+
' Dark Mode Theme
142+
skinparam backgroundColor #282c34
143+
skinparam defaultTextAlignment left
144+
skinparam noteTextAlignment left
145+
skinparam ArrowColor #abb2bf
146+
skinparam NoteBorderColor #61afef
147+
skinparam NoteBackgroundColor #2c323c
148+
skinparam NoteFontColor #abb2bf
149+
skinparam Nodesep 100
150+
skinparam Ranksep 100
151+
skinparam Dpi 96
152+
skinparam PageMargin 150
153+
skinparam BoxPadding 150
154+
155+
skinparam sequenceArrowColor #0f58e0
156+
skinparam sequenceLifeLineBorderColor #4b5263
157+
skinparam sequenceParticipantBackgroundColor #2c323c
158+
skinparam sequenceParticipantBorderColor #61afef
159+
skinparam sequenceActorBackgroundColor #2c323c
160+
skinparam sequenceActorBorderColor #61afef
161+
skinparam sequenceGroupBackgroundColor #2c323c
162+
skinparam sequenceGroupBorderColor #61afef
163+
skinparam sequenceGroupHeaderFontColor #61afef
164+
skinparam sequenceDividerBackgroundColor #2c323c
165+
skinparam sequenceDividerBorderColor #61afef
166+
skinparam sequenceDividerFontColor #61afef
167+
skinparam sequenceLifeLineBackgroundColor #2c323c
168+
169+
autonumber "<b>[00]"
170+
actor Client #61afef
171+
participant "Downloader\n(Spider/Worker)" as D #c678dd
172+
participant "Course API\nPOST /api/v1/ocw/course/get" as CourseAPI #e06c75
173+
participant "Sessions API\nPOST /api/v1/ocw/sessions" as SessionsAPI #e06c75
174+
participant "Session Link\nGET /cms/ocw/session_link" as CMS #e06c75
175+
database "File System" as FS #98c379
176+
177+
== Initialization ==
178+
Client -> D : start(courseId)
179+
activate D
180+
181+
== Phase 1: Course Metadata Retrieval ==
182+
group #2c323c Fetch Course Metadata
183+
D -> CourseAPI : POST { "id": courseId }
184+
activate CourseAPI
185+
alt #98c379 Success [200 OK]
186+
CourseAPI --> D : { title: "Data Structures", type: "undergraduate" }
187+
note right #2c323c: Course metadata fetched\nfor directory naming
188+
else #e06c75 Error [4xx/5xx]
189+
CourseAPI --> D : 4xx/5xx
191+
D --> Client : ERROR: Course fetch failed
194+
end
195+
deactivate CourseAPI
196+
end
197+
198+
== Phase 2: Content Hierarchy Discovery ==
199+
group #2c323c Fetch Chapter/Session Hierarchy
200+
D -> SessionsAPI : POST {\n "limit": null,\n "order_type": "ASC",\n "course_id": courseId,\n "status": ["free","non-free"]\n}
201+
activate SessionsAPI
202+
alt #98c379 Success [200 OK]
203+
SessionsAPI --> D : chapters[] { title, sort,\n sessions[] { title, link, type, sort } }
204+
note right #2c323c: Complete hierarchy\nretrieved in single call
205+
else #e06c75 Error [4xx/5xx]
206+
SessionsAPI --> D : 4xx/5xx
208+
D --> Client : ERROR: Sessions fetch failed
211+
end
212+
deactivate SessionsAPI
213+
end
214+
215+
== Phase 3: Content Download Execution ==
216+
group #2c323c Download Sessions (Ordered Processing)
217+
loop for each chapter (ascending by sort)
218+
note over D #2c323c: Create chapter directory\nif not exists
219+
loop for each session (ascending by sort)
220+
D -> CMS : GET session.link
221+
activate CMS
222+
alt #98c379 Success [200 OK]
223+
CMS --> D : content bytes
224+
deactivate CMS
225+
D -> FS : write course_title/\n chapter_sort__chapter_title/\n session_sort__session_title.ext
226+
activate FS
227+
FS --> D : write confirmation
228+
deactivate FS
229+
note right #2c323c: Path deterministically\ngenerated from metadata
230+
else #e06c75 Error [4xx/5xx]
231+
CMS --> D : 4xx/5xx
232+
deactivate CMS
233+
D --> Client : WARN: Session skipped
234+
note right #2c323c: Continue with\nnext session
235+
end
236+
end
237+
end
238+
end
239+
240+
== Completion ==
241+
D --> Client : done(summary: {\n total_sessions: 47,\n successful_downloads: 45,\n failed_downloads: 2,\n target_paths: "./Data_Structures/"\n})
242+
deactivate D
243+
@enduml
244+
```
245+
246+
### Process Flow Characteristics
247+
248+
1. **Three-Phase Execution Model**:
249+
250+
- **Phase 1**: Course metadata acquisition (blocking)
251+
- **Phase 2**: Session hierarchy retrieval (blocking)
252+
- **Phase 3**: Content download (a failed session is skipped without stopping the run)
253+
254+
2. **Error Recovery Strategy**:
255+
256+
- Critical failures (Phases 1-2): Terminate execution
257+
- Non-critical failures (Phase 3): Log and continue
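
The three-phase model and its error strategy can be sketched with the network and storage steps injected as callables. This is an illustration under assumptions: `run_download`, `CriticalError`, and the dictionary shapes are hypothetical, and the summary keys mirror the `done(summary: ...)` message in the sequence diagram.

```python
class CriticalError(Exception):
    """Raised when Phase 1 or Phase 2 fails and the run must stop."""


def run_download(course_id, fetch_course, fetch_chapters, download, save, log=print):
    # Phase 1: course metadata (a failure here terminates the run).
    try:
        course = fetch_course(course_id)
    except Exception as exc:
        raise CriticalError("Course fetch failed") from exc

    # Phase 2: chapter/session hierarchy (a failure here terminates the run).
    try:
        chapters = fetch_chapters(course_id)
    except Exception as exc:
        raise CriticalError("Sessions fetch failed") from exc

    # Phase 3: per-session download; failures are logged and skipped.
    summary = {"total_sessions": 0, "successful_downloads": 0, "failed_downloads": 0}
    for chapter in sorted(chapters, key=lambda c: c["sort"]):
        for session in sorted(chapter["sessions"], key=lambda s: s["sort"]):
            summary["total_sessions"] += 1
            try:
                content = download(session["link"])
                save(course, chapter, session, content)
                summary["successful_downloads"] += 1
            except Exception as exc:
                log(f"WARN: Session skipped: {session['title']} ({exc})")
                summary["failed_downloads"] += 1
    return summary
```

Passing `fetch_course`, `fetch_chapters`, `download`, and `save` as parameters keeps the control flow testable with in-memory fakes.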
258+
259+
## Data Model Analysis
260+
261+
### API Data Model Diagram
262+
263+
```plantuml
264+
@startuml
265+
title OCW API Data Model
266+
267+
' Dark Mode Theme
268+
skinparam backgroundColor #282c34
269+
skinparam defaultTextAlignment left
270+
skinparam noteTextAlignment left
271+
skinparam ArrowColor #abb2bf
272+
skinparam NoteBorderColor #61afef
273+
skinparam NoteBackgroundColor #2c323c
274+
skinparam NoteFontColor #abb2bf
275+
skinparam Nodesep 100
276+
skinparam Ranksep 100
277+
skinparam Dpi 96
278+
skinparam PageMargin 150
279+
skinparam BoxPadding 150
280+
281+
skinparam classBackgroundColor #2c323c
282+
skinparam classBorderColor #61afef
283+
skinparam classAttributeIconSize 0
284+
skinparam classFontStyle bold
285+
skinparam classFontColor #abb2bf
286+
skinparam arrowColor #98c379
287+
skinparam entityBackgroundColor #2c323c
288+
skinparam entityBorderColor #61afef
289+
skinparam entityFontColor #abb2bf
290+
skinparam entityAttributeFontColor #abb2bf
291+
skinparam packageBackgroundColor #2c323c
292+
hide empty methods
293+
294+
entity "Course" as Course <<Entity>> #2c323c {
295+
<color:#61afef>title</color> : String [NOT NULL]
296+
--
297+
<color:#abb2bf><i>Constraints:</i>
298+
• Title used for root directory</color>
299+
}
300+
301+
entity "Chapter" as Chapter <<Entity>> #2c323c {
302+
<color:#61afef>title</color> : String [NOT NULL]
303+
<color:#e06c75>sort</color> : Integer [UNIQUE per course]
304+
--
305+
<color:#abb2bf><i>Constraints:</i>
306+
• Sort determines processing order
307+
• Sort used in directory naming
308+
• No direct content storage</color>
309+
}
310+
311+
entity "Session" as Session <<Entity>> #2c323c {
312+
<color:#61afef>title</color> : String [NOT NULL]
313+
<color:#c678dd>link</color> : URL [NOT NULL]
314+
<color:#c678dd>ext</color> : String
315+
<color:#e06c75>sort</color> : Integer [UNIQUE per chapter]
316+
--
317+
<color:#abb2bf><i>Constraints:</i></color>
318+
<color:#abb2bf>• Link points to binary content</color>
319+
<color:#abb2bf>• Sort ensures consistent ordering</color>
320+
<color:#abb2bf>• Ext derived from Link</color>
321+
}
322+
323+
324+
Chapter ||-R-o{ Session : "contains\n(1:N)" #98c379
325+
326+
note top of Course #2c323c
327+
<color:#abb2bf><b>Primary Entity</b>
328+
Identified by external courseId
329+
Retrieved via Course API
330+
Forms root of storage hierarchy</color>
331+
end note
332+
333+
note top of Chapter #2c323c
334+
<color:#abb2bf><b>Organizational Container</b>
335+
Groups related sessions
336+
Sort-prefixed directory naming
337+
No direct downloadable content</color>
338+
end note
339+
340+
note bottom of Session #2c323c
341+
<color:#abb2bf><b>Content Unit</b>
342+
Atomic downloadable resource
343+
Binary content via link URL
344+
Sort-prefixed file naming</color>
345+
end note
346+
347+
note as StorageFormula #2c323c
348+
<color:#abb2bf><b>Storage Path Generation Algorithm:</b>
349+
<code>
350+
Path = {course.title}/
351+
{chapter.sort}__{chapter.title}/
352+
{session.sort}__{session.title}.{ext}
353+
</code>
354+
355+
<b>Example:</b>
356+
<code>
357+
"Introduction to Python/
358+
01__Getting Started/
359+
01__Installation Guide.pdf"
360+
</code></color>
361+
end note
362+
363+
StorageFormula .. Session
364+
@enduml
365+
```
366+
367+
### Data Model Insights
368+
369+
#### Schema Characteristics
370+
371+
1. ##### Course Schema
372+
373+
- Key attributes: title (directory naming)
374+
375+
2. ##### Chapter Schema
376+
377+
- Organizational unit without direct content
378+
- Sort attribute ensures deterministic ordering
379+
- Relationship: Children (Sessions)
380+
381+
3. ##### Session Schema
382+
383+
- Atomic content unit with downloadable resource
384+
- Link attribute provides content access URL
385+
- Sort attribute maintains consistent ordering within chapter
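
The three schemas map naturally onto immutable records. The sketch below is one possible shape, not the project's actual types: the class layout is an assumption, and deriving `ext` from the suffix of the link's URL path is only one plausible reading of "Ext derived from Link".

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from urllib.parse import urlparse


@dataclass(frozen=True)
class Session:
    title: str
    link: str
    sort: int

    @property
    def ext(self) -> str:
        """File extension taken from the URL path, ignoring any query string."""
        return PurePosixPath(urlparse(self.link).path).suffix.lstrip(".")


@dataclass(frozen=True)
class Chapter:
    title: str
    sort: int
    sessions: tuple = ()   # child Session records; a chapter holds no content itself


@dataclass(frozen=True)
class Course:
    title: str
    chapters: tuple = ()


session = Session("Installation Guide", "https://example.com/files/guide.pdf?token=abc", 1)
print(session.ext)
```

A link with no suffix in its path yields an empty `ext`, so a real downloader would need a fallback such as the response's content type.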
386+
387+
#### Data Integrity Considerations
388+
389+
- Sort values must be unique within their scope: chapter sort values within a course, and session sort values within a chapter
390+
- Path generation must produce filesystem-compatible names, which means titles containing path separators or other reserved characters need sanitizing
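
The sort-uniqueness requirement can be checked before any download starts. This is a minimal sketch that assumes the Sessions API response has been parsed into a list of chapter dictionaries with `title`, `sort`, and `sessions` keys; the function names are hypothetical.

```python
from collections import Counter


def duplicate_sorts(items):
    """Return the sort values that occur more than once, in ascending order."""
    counts = Counter(item["sort"] for item in items)
    return sorted(value for value, n in counts.items() if n > 1)


def validate_hierarchy(chapters):
    """Collect human-readable problems; an empty list means the hierarchy is valid."""
    problems = []
    dups = duplicate_sorts(chapters)
    if dups:
        problems.append(f"duplicate chapter sort values: {dups}")
    for chapter in chapters:
        dups = duplicate_sorts(chapter.get("sessions", []))
        if dups:
            problems.append(
                f"chapter {chapter['title']!r}: duplicate session sort values: {dups}"
            )
    return problems
```

Duplicate sort values matter because two sessions with the same sort and title in one chapter would map to the same path and overwrite each other.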
