11# 2-Day Distributed Systems Workshop Agenda
22
3- > ** Workshop Format** : Each teaching block is 45 minutes (≈ 25 min explanation + 10 min Python analysis + 10 min guided coding)
3+ > ** Workshop Format** : Each teaching block is 45 minutes
44> ** Breaks** : 10–15 minutes between sessions
5- > ** Daily Structure** : Day 1: ~ 3.75 hours teaching (5 sessions), Day 2: ~ 3 hours teaching (4 sessions) + ~ 45 minutes of breaks per day
5+ > ** Daily Structure** : Day 1: ~ 3.75 hours teaching (5 sessions), Day 2: ~ 3 hours teaching (4 sessions)
66
77---
88
9- ## 🛠️ ** Workshop Setup** (5 minutes)
9+ ## ** Workshop Setup** (5 minutes)
1010** Python Performance Analysis Tools Setup:**
1111``` bash
1212cd src/main/python
@@ -20,15 +20,15 @@ python queuing_theory.py
2020
2121---
2222
23- ## 📅 Day 1: Foundations & Basic Patterns
23+ ## Day 1: Foundations & Basic Patterns
2424
25- ### ** Session 1** (45 min) 🎯 ** Why Distribute?**
25+ ### ** Session 1** (45 min) ** Why Distribute?**
2626- ** Learning Goals:**
2727 - Resource ceilings and physical limits
2828 - Little's Law and performance modeling
2929 - Motivation for distributed patterns
30- - ** 🛠️ Hands-on Lab:** Run provided disk-perf test; capture own numbers
31- - ** 📊 Performance Analysis (NEW!) :**
30+ - ** Hands-on Lab:** Run provided disk-perf test; capture own numbers
31+ - ** Performance Analysis:**
3232 ``` bash
3333 # Demonstrate system performance limits with queuing theory
3434 cd src/main/python
@@ -45,71 +45,65 @@ python queuing_theory.py
4545 - At 90% load: 100ms latency (manageable)
4646 - At 99% load: 1000ms latency (problematic)
4747 - Beyond 100%: System collapse
48- - ** 💡 Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
48+ - ** Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
4949- ** Break:** 10 minutes
5050
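The latency figures quoted for Session 1 are consistent with an M/M/1 queue whose mean response time is `R = S / (1 - ρ)`. The sketch below is not the workshop's `queuing_theory.py`; the function names are hypothetical, and the 10 ms service time is an assumption chosen because it reproduces 100 ms at 90% load and 1000 ms at 99% load.

```python
def mm1_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: R = S / (1 - rho)."""
    if not 0 <= utilization < 1:
        # At or beyond 100% load the queue grows without bound.
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


def littles_law_in_system(arrival_rate_per_s: float, latency_ms: float) -> float:
    """Little's Law: L = lambda * W (mean number of requests in the system)."""
    return arrival_rate_per_s * (latency_ms / 1000.0)


if __name__ == "__main__":
    SERVICE_TIME_MS = 10.0  # assumed value, not taken from the workshop scripts
    for rho in (0.5, 0.9, 0.99):
        latency = mm1_latency_ms(SERVICE_TIME_MS, rho)
        # Utilization = arrival rate * service time, so lambda = rho / S.
        arrival_rate = rho / (SERVICE_TIME_MS / 1000.0)
        in_system = littles_law_in_system(arrival_rate, latency)
        print(f"load {rho:.0%}: latency {latency:.0f} ms, "
              f"about {in_system:.1f} requests in system")
```

The model only covers loads strictly below 100%, which is why the function rejects a utilization of 1 or more instead of returning a number.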
51- ### ** Session 2** (45 min) 🎯 ** Why Patterns? & Partial Failure Mindset**
51+ ### ** Session 2** (45 min) ** Why Patterns? & Partial Failure Mindset**
5252- ** Learning Goals:**
53- - Understanding the need for distributed patterns
53+ - Understanding the need for patterns
5454 - Pattern-based thinking in distributed systems
5555 - Probability of failure at scale and network partitions
5656 - Process pauses and their impact
57- - ** 🛠️ Hands-on Lab:**
57+ - ** Hands-on Lab:**
5858 - Overview of patterns available in the framework
5959 - Walkthrough of the 'replicate' framework with fault injection
60- - ** 📊 Failure Probability Analysis (NEW!) :**
60+ - ** Failure Probability Analysis:**
6161 ``` bash
6262 # Calculate realistic failure probabilities
6363 python failure_probability.py
64-
65- # Example scenarios to try:
66- # Scenario 1: 3 nodes, 2 failures, 0.1 failure rate → ~2.7% chance of losing majority
67- # Scenario 2: 5 nodes, 3 failures, 0.05 failure rate → ~0.13% chance of losing majority
68- # Scenario 3: Large cluster - 100 nodes, 30 failures, 0.05 failure rate
6964 ```
7065 ** Key Insights:**
71- - Even with 5% individual failure rate, losing quorum is significant risk
72- - Larger clusters provide better fault tolerance
66+ - Script calculates "N or more failures" probability using binomial distribution
7367 - Patterns help us handle these inevitable failures systematically
74- - ** 💡 Connection:** "Patterns solve recurring problems - especially failure handling!"
75- - ** 📁 Reference:** ` src/main/java/replicate/common/ ` and ` src/test/java/replicate/common/ `
68+ - ** Connection:** "Patterns solve recurring problems - especially failure handling!"
69+ - ** Reference:** ` src/main/java/replicate/common/ ` and ` src/test/java/replicate/common/ `
7670- ** Break:** 10 minutes
7771
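The "N or more failures" probability mentioned for Session 2 is the upper tail of a binomial distribution. A minimal sketch, assuming independent node failures with the same per-node probability; `prob_at_least_k_failures` is a hypothetical name and this is not the code of `failure_probability.py`:

```python
from math import comb


def prob_at_least_k_failures(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k of n nodes fail."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # (nodes, failures that break the majority, per-node failure probability)
    for n, k, p in [(3, 2, 0.1), (5, 3, 0.05)]:
        risk = prob_at_least_k_failures(n, k, p)
        print(f"{n} nodes, >= {k} failures, p={p}: {risk:.4%}")
```

The independence assumption is the weak point of this model: correlated failures (shared rack, shared power, a bad deploy) make the real probability higher than the binomial tail suggests.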
78- ### ** Session 3** (45 min) 🎯 ** Write-Ahead Log Pattern**
72+ ### ** Session 3** (45 min) ** Write-Ahead Log Pattern**
7973- ** Learning Goals:**
8074 - Append-only discipline for durability
8175 - Recovery mechanisms and replay
8276 - WAL as foundation for other patterns
83- - ** 🛠️ Hands-on Lab:** Execute and walkthrough ` DurableKVStoreTest ` for persistent key-value store
84- - ** 💡 Connection:** "WAL ensures we can recover from the failures we just discussed!"
85- - ** 📁 Reference:** ` src/test/java/replicate/wal/DurableKVStoreTest.java `
77+ - ** Hands-on Lab:** Execute and walkthrough ` DurableKVStoreTest ` for persistent key-value store
78+ - ** Connection:** "WAL ensures we can recover from the failures we just discussed!"
79+ - ** Reference:** ` src/test/java/replicate/wal/DurableKVStoreTest.java `
8680- ** Break:** 15 minutes
8781
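The append-then-apply discipline and replay-based recovery from Session 3 can be sketched in a few lines. This is an illustration of the pattern only, not the Java `DurableKVStore` exercised by `DurableKVStoreTest`; the class name `TinyWalKVStore` and the JSON-lines file format are assumptions made for the example.

```python
import json
import os
import tempfile


class TinyWalKVStore:
    """Hypothetical key-value store that logs every write before applying it."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()  # rebuild in-memory state from the log
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r", encoding="utf-8") as wal:
            for line in wal:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the entry and force it to disk.
        self._wal.write(json.dumps({"key": key, "value": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        # 2. Only then update the in-memory state.
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def close(self):
        self._wal.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        store = TinyWalKVStore(path)
        store.put("title", "Microservices")
        store.close()  # simulate a restart
        recovered = TinyWalKVStore(path)
        print(recovered.get("title"))
        recovered.close()
```

The sketch leaves out what a real log needs, such as checksums, handling of a partially written last entry, and log truncation.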
88- ### ** Session 4** (45 min) 🎯 ** Core Communication Patterns**
82+ ### ** Session 4** (45 min) ** Code Walkthrough and Core Patterns**
8983- ** Learning Goals:**
9084 - Request-waiting list pattern for async operations
9185 - Singular update queue for thread safety
9286 - Network messaging foundations
9387 - Building blocks for distributed protocols
94- - ** 🛠️ Hands-on Lab:**
88+ - ** Hands-on Lab:**
9589 - Code walkthrough: ` RequestWaitingList ` and ` SingularUpdateQueue ` implementations
9690 - Understand how async requests are tracked and managed
9791 - See how single-threaded execution prevents race conditions
98- - ** 📁 Reference:**
92+ - ** Reference:**
9993 - ` src/main/java/replicate/common/RequestWaitingList.java `
10094 - ` src/main/java/replicate/common/SingularUpdateQueue.java `
10195 - ` src/main/java/replicate/net/ ` - Network communication layer
102- - ** 💡 Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
96+ - ** Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
10397- ** Break:** 10 minutes
10498
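The idea behind the singular update queue from Session 4 is that many threads submit work, but a single thread applies it, so the state needs no locks. The Python sketch below illustrates that idea under stated assumptions: `SingularUpdateQueueSketch` and `run_demo` are hypothetical names, and the code does not mirror the API of `SingularUpdateQueue.java`.

```python
import queue
import threading
from concurrent.futures import Future


class SingularUpdateQueueSketch:
    """Many producers enqueue requests; one worker thread handles them in order."""

    def __init__(self, handler):
        self._handler = handler
        self._tasks = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, request):
        future = Future()  # lets the caller wait for the result later
        self._tasks.put((request, future))
        return future

    def _run(self):
        while True:
            item = self._tasks.get()
            if item is None:  # shutdown sentinel
                break
            request, future = item
            try:
                future.set_result(self._handler(request))
            except Exception as exc:
                future.set_exception(exc)

    def shutdown(self):
        self._tasks.put(None)
        self._worker.join()


def run_demo(clients=8, updates_per_client=1000):
    counter = {"value": 0}

    def increment(amount):
        # Runs only on the worker thread, so no lock is required.
        counter["value"] += amount
        return counter["value"]

    updates = SingularUpdateQueueSketch(increment)

    def client():
        for _ in range(updates_per_client):
            updates.submit(1)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    updates.shutdown()  # sentinel is queued after every update
    return counter["value"]


if __name__ == "__main__":
    print(run_demo())
```

Returning a `Future` from `submit` is one possible way to let callers wait for a result; it is a design choice of this sketch, not a claim about how the framework tracks pending requests.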
105- ### ** Session 5** (45 min) 🎯 ** Replication & Majority Quorum**
99+ ### ** Session 5** (45 min) ** Replication & Majority Quorum**
106100- ** Learning Goals:**
107101 - Write vs read quorums trade-offs
108102 - Quorum intersection properties
109103 - Universal Scalability Law curve analysis
110- - ** 🛠️ Hands-on Lab:** Modify ` QuorumKVStoreTest ` : pass for 5-node/3-node clusters
104+ - ** Hands-on Lab:** Modify ` QuorumKVStoreTest ` : pass for 5-node/3-node clusters
111105 - ** Prerequisite:** Understanding of ` RequestWaitingList ` from Session 4 (used in quorum coordination)
112- - ** 📊 Scalability Analysis (NEW!) :**
106+ - ** Scalability Analysis:**
113107 ``` bash
114108 # Analyze how performance scales with cluster size
115109 python universal_scalability_law_improved.py
@@ -124,150 +118,55 @@ python queuing_theory.py
124118 - Coordination overhead increases with cluster size
125119 - Optimal cluster sizes depend on algorithm choice
126120 - Well-designed systems scale better than legacy systems
127- - ** 💡 Connection:** "This shows the trade-offs in quorum-based replication!"
128- - ** 📁 Reference:** ` src/test/java/replicate/quorum/QuorumKVStoreTest.java `
121+ - ** Connection:** "This shows the trade-offs in quorum-based replication!"
122+ - ** Reference:** ` src/test/java/replicate/quorum/QuorumKVStoreTest.java `
129123- ** End of Day 1**
130124
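The quorum intersection property from Session 5 can be checked by brute force for small clusters: every read quorum shares at least one node with every write quorum exactly when `R + W > N`. The sketch below is a standalone illustration with hypothetical function names; it is unrelated to the code in `QuorumKVStoreTest`.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest quorum size such that any two quorums overlap."""
    return n // 2 + 1


def quorums_intersect(n: int, read_size: int, write_size: int) -> bool:
    """True if every read quorum shares a node with every write quorum."""
    nodes = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(nodes, read_size)
        for writes in combinations(nodes, write_size)
    )


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} nodes: majority quorum = {majority(n)}")
        for r in range(1, n + 1):
            for w in range(1, n + 1):
                # The enumeration must agree with the R + W > N rule.
                assert quorums_intersect(n, r, w) == (r + w > n)
    print("R + W > N matches exhaustive enumeration for 3 and 5 nodes")
```

The trade-off is visible in the rule itself: shrinking the read quorum makes reads cheaper but forces a larger write quorum, and the reverse.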
131- ### 🍽️ ** Lunch Break / Self-Paced Time**
132- ** Offline Activities:**
133- - Review morning labs and concepts
134- - Push completed work to GitHub
135- - Optional: Explore additional resources
136- - ** NEW:** Experiment with different parameters in Python scripts
137-
138125---
139126
140- ## 📅 Day 2: Consensus Algorithms & Advanced Patterns
127+ ## Day 2: Consensus Algorithms & Advanced Patterns
141128
142- ### ** Session 6** (45 min) 🎯 ** Why Simple Replication Fails**
129+ ### ** Session 6** (45 min) ** Why Simple Replication Fails**
143130- ** Learning Goals:**
144131 - Two-phase commit pitfalls
145132 - Recovery ambiguity problems
146133 - The need for consensus algorithms
147- - ** 🛠️ Hands-on Lab:** Step through ` DeferredCommitmentTest ` and ` RecoverableDeferredCommitmentTest ` ; explain why they hang
148- - ** 📊 Realistic System Behavior Analysis (NEW!) :**
134+ - ** Hands-on Lab:** Step through ` DeferredCommitmentTest ` and ` RecoverableDeferredCommitmentTest ` ;
135+ - ** Realistic System Behavior Analysis:**
149136 ``` bash
150137 # Show how systems degrade under stress (unlike theoretical models)
151138 python realistic_system_performance.py
152139 ```
153- ** Key Visualizations:**
154- 1 . ** Realistic Performance Under Load** - Shows system degradation beyond theoretical limits
155- 2 . ** Ideal vs Realistic Comparison** - Why real systems perform worse than theory
156-
157- ** Key Insights:**
158- - Systems don't just hit limits - they degrade badly under stress
159- - Performance collapse happens before theoretical limits
160- - Real systems exhibit much worse behavior than M/M/1 queue models
161- - ** 💡 Connection:** "This is exactly why 2PC fails under load - systems don't gracefully degrade!"
162- - ** 📁 Reference:** ` src/test/java/replicate/twophaseexecution/DeferredCommitmentTest.java `
163- - ** Break:** 10 minutes
164140
165- ### ** Session 7** (45 min) 🎯 ** Single-Value Paxos**
141+ ### ** Session 7** (45 min) ** Single-Value Paxos**
166142- ** Learning Goals:**
167143 - Prepare/Accept phases explained
168144 - Recovery with generation numbers
169145 - Safety and liveness properties
170- - ** 🛠️ Hands-on Lab:** Work with generation voting mechanism using existing Paxos tests
171- - ** 📁 Reference:** ` src/test/java/replicate/paxos/ ` and ` src/test/java/replicate/generationvoting/ `
146+ - ** Hands-on Lab:** Work with generation voting mechanism using existing Paxos tests
147+ - ** Reference:** ` src/test/java/replicate/paxos/ ` and ` src/test/java/replicate/generationvoting/ `
172148- ** Break:** 10 minutes
173149
174- ### ** Session 8** (45 min) 🎯 ** From Paxos to Multi-Paxos**
150+ ### ** Session 8** (45 min) ** From Paxos to Multi-Paxos**
175151- ** Learning Goals:**
176152 - Replicated log concept and implementation
177153 - High-water mark for safe execution
178154 - Heartbeats and failure detection
179- - ** 🛠️ Hands-on Lab:** Extend log to multi-slot using Multi-Paxos and Paxos Log implementations
180- - ** 📁 Reference:** ` src/test/java/replicate/multipaxos/ ` and ` src/test/java/replicate/paxoslog/ `
155+ - ** Hands-on Lab:** Extend log to multi-slot using Multi-Paxos and Paxos Log implementations
156+ - ** Reference:** ` src/test/java/replicate/multipaxos/ ` and ` src/test/java/replicate/paxoslog/ `
181157- ** Break:** 15 minutes
182158
183- ### ** Session 9** (45 min) 🎯 ** RAFT vs Multi-Paxos in Practice**
159+ ### ** Session 9** (45 min) ** RAFT vs Multi-Paxos in Practice**
184160- ** Learning Goals:**
185161 - Implementation optimizations comparison
186162 - Idempotent receiver pattern
187163 - Production considerations and future directions
188- - ** 🛠️ Hands-on Lab:** Compare RAFT & Multi-Paxos implementations; annotate pros/cons
189- - ** 📊 Consensus Algorithm Performance Comparison (NEW!):**
190- ``` bash
191- # Re-run the scalability analysis focusing on consensus algorithms
192- python universal_scalability_law_improved.py
193- # Focus on the "Consensus Algorithm Performance Comparison" graphs
164+ - ** Hands-on Lab:** Compare RAFT & Multi-Paxos implementations; annotate pros/cons
194165 ```
195166 **Discussion Points:**
196167 - **RAFT vs Multi-Paxos**: Which scales better and why?
197168 - **Optimal cluster sizes**: 3, 5, 7, or more nodes?
198- - ** Byzantine Fault Tolerance** : Performance cost analysis
199169 - **Production trade-offs**: Performance vs complexity vs reliability
200-
201- ** Key Insights:**
202- - RAFT typically has lower coordination overhead than basic Paxos
203- - Multi-Paxos (optimized) can outperform RAFT in some scenarios
204- - Byzantine protocols have significant performance penalties
205- - Optimal cluster size is algorithm-dependent
206- - ** 💡 Connection:** "Now you have quantitative data to choose algorithms, not just theoretical knowledge!"
207- - ** 📁 Reference:** ` src/main/java/replicate/raft/ ` and ` src/main/java/replicate/multipaxos/ `
208- - ** End of Day 2**
209170
210- ---
211-
212- ## 📊 Workshop Summary
213-
214- ### 🎯 ** Enhanced Learning Outcomes**
215- - ** 9 teaching blocks** with optimized timing (5 sessions Day 1, 4 sessions Day 2)
216- - ** Pattern-driven learning** progression from motivation to implementation
217- - ** Combined foundational concepts** for efficient learning progression
218- - ** Core patterns foundation** before advanced algorithms
219- - ** Quantitative analysis** integrated with hands-on labs
220- - ** Visual performance data** reinforcing theoretical concepts
221- - ** Data-driven decision making** for distributed system design
222-
223- ### 🛠️ ** Technical Skills Gained**
224- - Understanding distributed systems fundamentals
225- - ** NEW:** Performance modeling and capacity planning
226- - ** NEW:** Failure probability analysis for reliability planning
227- - ** NEW:** Scalability analysis using Universal Scalability Law
228- - Implementing Write-Ahead Log pattern
229- - Working with quorum-based replication
230- - Exploring consensus algorithms (Paxos, RAFT)
231- - Hands-on experience with fault tolerance patterns
232-
233- ### 📊 ** Performance Analysis Tools**
234- - ** Queuing Theory Analysis** : System performance limits and Little's Law
235- - ** Failure Probability Calculator** : Risk assessment for cluster sizing
236- - ** Universal Scalability Law** : Performance scaling analysis
237- - ** Realistic Performance Modeling** : System degradation under stress
238- - ** Consensus Algorithm Comparison** : Quantitative algorithm selection
239-
240- ### 🗂️ ** Available Implementations**
241- - ** Consensus Algorithms:** Paxos, Multi-Paxos, RAFT, ViewStamped Replication
242- - ** Replication Patterns:** Chain Replication, Quorum-based KV Store
243- - ** Foundational Patterns:** WAL, Two-Phase Commit, Heartbeat Detection
244- - ** Network Layer:** Socket-based messaging, Request-waiting lists
245- - ** NEW:** ** Performance Analysis Scripts:** Python-based modeling tools
246-
247- ### 📁 ** Key Files Reference**
248- - ** Core Framework:** ` src/main/java/replicate/common/ `
249- - ** WAL Implementation:** ` src/main/java/replicate/wal/DurableKVStore.java `
250- - ** Quorum KV Store:** ` src/main/java/replicate/quorum/QuorumKVStore.java `
251- - ** Chain Replication:** ` src/main/java/replicate/chain/ChainReplication.java `
252- - ** Paxos Implementation:** ` src/main/java/replicate/paxos/ `
253- - ** RAFT Implementation:** ` src/main/java/replicate/raft/ `
254- - ** Tests Directory:** ` src/test/java/replicate/ `
255- - ** NEW:** ** Performance Scripts:** ` src/main/python/ `
256- - ` queuing_theory.py ` - Little's Law and performance analysis
257- - ` failure_probability.py ` - Cluster reliability analysis
258- - ` universal_scalability_law_improved.py ` - Scaling and algorithm comparison
259- - ` realistic_system_performance.py ` - Real-world performance modeling
260-
261- ### 📚 ** Resources & Next Steps**
262- - All code examples and labs available on GitHub
263- - ** NEW:** Take-home performance analysis scripts for production use
264- - Additional reading materials provided
265- - Follow-up Q&A session for complex topics
266- - ** NEW:** Quantitative foundation for architecture decisions
171+ - **End of Day 2**
267172
268- ### 💡 ** Workshop Enhancement Benefits**
269- - ** Visual Learning** : Graphs and charts reinforce abstract concepts
270- - ** Quantitative Understanding** : Real numbers behind theoretical concepts
271- - ** Practical Tools** : Scripts participants can use in production
272- - ** Data-Driven Decisions** : Choose algorithms based on performance data
273- - ** Business Impact** : Connect technical decisions to business outcomes