# 2-Day Distributed Systems Workshop Agenda

> **Workshop Format**: Each teaching block is 45 minutes (≈ 25 min explanation + 10 min Python analysis + 10 min guided coding)
> **Breaks**: 10–15 minutes between sessions
> **Daily Structure**: Day 1 is ~3.75 hours of teaching (5 sessions), Day 2 is ~3 hours (4 sessions), plus ~45 minutes of breaks each day

---

## 🛠️ **Workshop Setup** (5 minutes)
**Python Performance Analysis Tools Setup:**
```bash
cd src/main/python
python3 -m venv venv
source venv/bin/activate
pip install numpy matplotlib scipy

# Test installation
python queuing_theory.py
```

---

## 📅 Day 1: Foundations & Basic Patterns

### **Session 1** (45 min) 🎯 **Why Distribute?**
- **Learning Goals:**
  - Resource ceilings and physical limits
  - Little's Law and performance modeling
  - Motivation for distributed patterns
- **🛠️ Hands-on Lab:** Run the provided disk-perf test; capture your own numbers
- **📊 Performance Analysis (NEW!):** (see the Python sketch after this session)
  ```bash
  # Demonstrate system performance limits with queuing theory
  cd src/main/python
  source venv/bin/activate

  # Show performance degradation as load increases
  python queuing_theory.py

  # Visualize the performance curves
  python queuing_theory_visualization.py
  ```
  **Key Insights:**
  - System performance degrades dramatically near 100% utilization
  - At 90% load: 100 ms latency (manageable)
  - At 99% load: 1000 ms latency (problematic)
  - Beyond 100% load: system collapse
- **💡 Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
- **Break:** 10 minutes
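
To ground these numbers, here is a minimal Python sketch of the Little's Law / M/M/1 arithmetic. The M/M/1 model and the 10 ms service time are assumptions chosen for illustration, not necessarily what `queuing_theory.py` uses:

```python
# Minimal M/M/1 sketch (assumed model; the workshop's queuing_theory.py may differ).
# Little's Law: L = lambda * W  (requests in system = arrival rate * time in system)
# M/M/1 response time: W = S / (1 - rho), where S is service time and rho is utilization.

def mm1_response_time_ms(service_time_ms: float, utilization: float) -> float:
    """Average time in system for an M/M/1 queue; diverges as utilization approaches 1."""
    if utilization >= 1.0:
        return float("inf")  # beyond 100% the queue grows without bound
    return service_time_ms / (1.0 - utilization)

SERVICE_TIME_MS = 10.0  # assumed per-request service time

for rho in (0.5, 0.9, 0.99, 1.01):
    w = mm1_response_time_ms(SERVICE_TIME_MS, rho)
    arrival_rate = rho / (SERVICE_TIME_MS / 1000.0)   # requests/sec at this load
    in_system = arrival_rate * (w / 1000.0)           # Little's Law: L = lambda * W
    print(f"load {rho:>4.0%}: latency {w:>8.1f} ms, ~{in_system:.1f} requests in system")

# With a 10 ms service time this reproduces the figures above:
# 100 ms at 90% load, 1000 ms at 99% load, unbounded beyond 100%.
```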

### **Session 2** (45 min) 🎯 **Why Patterns? & Partial Failure Mindset**
- **Learning Goals:**
  - Understanding the need for distributed patterns
  - Pattern-based thinking in distributed systems
  - Probability of failure at scale and network partitions
  - Process pauses and their impact
- **🛠️ Hands-on Lab:**
  - Overview of patterns available in the framework
  - Walkthrough of the 'replicate' framework with fault injection
- **📊 Failure Probability Analysis (NEW!):** (see the probability sketch after this session)
  ```bash
  # Calculate realistic failure probabilities
  python failure_probability.py

  # Example scenarios to try:
  # Scenario 1: 3 nodes, 2 failures, 0.1 failure rate → ~2.7% chance of losing majority
  # Scenario 2: 5 nodes, 3 failures, 0.05 failure rate → ~0.13% chance of losing majority
  # Scenario 3: Large cluster - 100 nodes, 30 failures, 0.05 failure rate
  ```
  **Key Insights:**
  - Even with a 5% individual failure rate, losing quorum is a significant risk
  - Larger clusters provide better fault tolerance
  - Patterns help us handle these inevitable failures systematically
- **💡 Connection:** "Patterns solve recurring problems - especially failure handling!"
- **📁 Reference:** `src/main/java/replicate/common/` and `src/test/java/replicate/common/`
- **Break:** 10 minutes
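
A quick sketch of the binomial arithmetic behind these scenarios, assuming independent node failures (the exact model inside `failure_probability.py` may differ):

```python
# Probability of losing a majority quorum, assuming independent node failures.
# A cluster of n nodes loses its majority once more than n // 2 nodes are down.
from math import comb

def p_exactly_k_failures(n: int, k: int, p: float) -> float:
    """Binomial probability that exactly k of n nodes have failed."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_lose_majority(n: int, p: float) -> float:
    """Probability that at least floor(n/2) + 1 nodes have failed."""
    needed_to_break_quorum = n // 2 + 1
    return sum(p_exactly_k_failures(n, k, p) for k in range(needed_to_break_quorum, n + 1))

# Scenario 1 from the agenda: 3 nodes, per-node failure probability 0.1
print(f"3 nodes, p=0.10: exactly 2 down = {p_exactly_k_failures(3, 2, 0.10):.3%}, "
      f"majority lost = {p_lose_majority(3, 0.10):.3%}")
# Scenario 2: 5 nodes, per-node failure probability 0.05
print(f"5 nodes, p=0.05: majority lost = {p_lose_majority(5, 0.05):.3%}")
```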

### **Session 3** (45 min) 🎯 **Write-Ahead Log Pattern**
- **Learning Goals:**
  - Append-only discipline for durability
  - Recovery mechanisms and replay (see the sketch after this session)
  - WAL as foundation for other patterns
- **🛠️ Hands-on Lab:** Execute and walk through `DurableKVStoreTest` for the persistent key-value store
- **💡 Connection:** "WAL ensures we can recover from the failures we just discussed!"
- **📁 Reference:** `src/test/java/replicate/wal/DurableKVStoreTest.java`
- **Break:** 15 minutes
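
To make append-and-replay concrete, here is a minimal Python sketch of a WAL-backed key-value store. It is illustrative only and is not the repo's Java `DurableKVStore`; the record format and file handling there will differ:

```python
# Minimal write-ahead-log sketch: append every mutation before applying it,
# then rebuild in-memory state by replaying the log on startup.
import os

class TinyWalKvStore:
    def __init__(self, wal_path: str):
        self.wal_path = wal_path
        self.kv = {}
        self._replay()                      # recovery: rebuild state from the log

    def _replay(self) -> None:
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, "r") as wal:
            for line in wal:
                key, _, value = line.rstrip("\n").partition("=")
                self.kv[key] = value        # later entries overwrite earlier ones

    def put(self, key: str, value: str) -> None:
        with open(self.wal_path, "a") as wal:
            wal.write(f"{key}={value}\n")   # 1. append the change to the log...
            wal.flush()
            os.fsync(wal.fileno())          # ...and force it to disk (durability)
        self.kv[key] = value                # 2. only then update in-memory state

    def get(self, key: str):
        return self.kv.get(key)

store = TinyWalKvStore("/tmp/tiny_wal.log")
store.put("title", "Microservices")
print(store.get("title"))  # survives a restart because put() logged the change first
```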

### **Session 4** (45 min) 🎯 **Core Communication Patterns**
- **Learning Goals:**
  - Request-waiting list pattern for async operations
  - Singular update queue for thread safety
  - Network messaging foundations
  - Building blocks for distributed protocols
- **🛠️ Hands-on Lab:** (see the sketch after this session)
  - Code walkthrough: `RequestWaitingList` and `SingularUpdateQueue` implementations
  - Understand how async requests are tracked and managed
  - See how single-threaded execution prevents race conditions
- **📁 Reference:**
  - `src/main/java/replicate/common/RequestWaitingList.java`
  - `src/main/java/replicate/common/SingularUpdateQueue.java`
  - `src/main/java/replicate/net/` - Network communication layer
- **💡 Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
- **Break:** 10 minutes
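
A condensed Python sketch of both patterns. The repo's Java classes are the reference implementations; the names and behavior below are simplified for illustration:

```python
# Sketch of the two patterns discussed above.
# RequestWaitingList: track in-flight requests by correlation id and complete them
# when a response arrives. SingularUpdateQueue: funnel all state updates through
# one consumer thread so the state itself never needs locks.
import queue
import threading
from concurrent.futures import Future

class RequestWaitingListSketch:
    def __init__(self):
        self._pending = {}                    # correlation id -> Future awaiting a response

    def add(self, correlation_id) -> Future:
        fut = Future()
        self._pending[correlation_id] = fut   # remember who is waiting for this id
        return fut

    def handle_response(self, correlation_id, response) -> None:
        fut = self._pending.pop(correlation_id, None)
        if fut is not None:
            fut.set_result(response)          # wake up the waiting caller

class SingularUpdateQueueSketch:
    def __init__(self, handler):
        self._handler = handler
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, update) -> None:
        self._queue.put(update)               # producers never touch the state directly

    def _run(self) -> None:
        while True:
            update = self._queue.get()
            self._handler(update)             # one thread applies updates in order

# Usage: all writes to `state` happen on the queue's thread, so there are no races.
state = {}
updates = SingularUpdateQueueSketch(lambda kv: state.update([kv]))
updates.submit(("x", 1))
```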

### **Session 5** (45 min) 🎯 **Replication & Majority Quorum**
- **Learning Goals:**
  - Write vs read quorum trade-offs
  - Quorum intersection properties (R + W > N)
  - Universal Scalability Law curve analysis
- **🛠️ Hands-on Lab:** Modify `QuorumKVStoreTest` so it passes for 5-node and 3-node clusters
  - **Prerequisite:** Understanding of `RequestWaitingList` from Session 4 (used in quorum coordination)
- **📊 Scalability Analysis (NEW!):** (see the USL sketch after this session)
  ```bash
  # Analyze how performance scales with cluster size
  python universal_scalability_law_improved.py
  ```
  **Key Visualizations Generated:**
  1. **Distributed System Performance Scaling** - Shows how coordination overhead affects scaling
  2. **Business Impact Metrics** - Throughput (req/s) and response time (ms) scaling
  3. **Consensus Algorithm Comparison** - Performance differences between Paxos, RAFT, etc.

  **Key Insights:**
  - Adding more nodes doesn't always improve performance
  - Coordination overhead increases with cluster size
  - Optimal cluster sizes depend on algorithm choice
  - Well-designed systems scale better than legacy systems
- **💡 Connection:** "This shows the trade-offs in quorum-based replication!"
- **📁 Reference:** `src/test/java/replicate/quorum/QuorumKVStoreTest.java`
- **End of Day 1**
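
A small sketch of the Universal Scalability Law curve the analysis plots. The contention (alpha) and coherency (beta) coefficients below are illustrative assumptions, not values taken from `universal_scalability_law_improved.py`:

```python
# Universal Scalability Law (Gunther): relative capacity at N nodes is
#   C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
# alpha models contention (serialization), beta models coherency (crosstalk) cost.

def usl_capacity(n: int, alpha: float, beta: float) -> float:
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

ALPHA, BETA = 0.05, 0.002   # assumed coefficients, for illustration only

for n in (1, 3, 5, 9, 17, 33, 64):
    print(f"{n:>3} nodes -> {usl_capacity(n, ALPHA, BETA):5.2f}x single-node throughput")

best_n, best_c = max(
    ((n, usl_capacity(n, ALPHA, BETA)) for n in range(1, 65)), key=lambda t: t[1]
)
print(f"peak at ~{best_n} nodes ({best_c:.2f}x); beyond that, coordination overhead wins")
```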

- Review morning labs and concepts
- Push completed work to GitHub
- Optional: Explore additional resources
- **NEW:** Experiment with different parameters in the Python scripts

---

## 📅 Day 2: Consensus Algorithms & Advanced Patterns

### **Session 6** (45 min) 🎯 **Why Simple Replication Fails**
- **Learning Goals:**
  - Two-phase commit pitfalls (see the sketch after this session)
  - Recovery ambiguity problems
  - The need for consensus algorithms
- **🛠️ Hands-on Lab:** Step through `DeferredCommitmentTest` and `RecoverableDeferredCommitmentTest`; explain why they hang
- **📊 Realistic System Behavior Analysis (NEW!):**
  ```bash
  # Show how systems degrade under stress (unlike theoretical models)
  python realistic_system_performance.py
  ```
  **Key Visualizations:**
  1. **Realistic Performance Under Load** - Shows system degradation beyond theoretical limits
  2. **Ideal vs Realistic Comparison** - Why real systems perform worse than theory

  **Key Insights:**
  - Systems don't just hit limits - they degrade badly under stress
  - Performance collapse happens before theoretical limits
  - Real systems exhibit much worse behavior than M/M/1 queue models
- **💡 Connection:** "This is exactly why 2PC fails under load - systems don't degrade gracefully!"
- **📁 Reference:** `src/test/java/replicate/twophaseexecution/DeferredCommitmentTest.java`
- **Break:** 10 minutes
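
To make the recovery-ambiguity problem concrete before the consensus sessions, a deliberately simplified sketch of the 2PC blocking window (not the repo's `twophaseexecution` code; node names are illustrative):

```python
# Why 2PC hangs: once a participant has voted "prepared", it may neither commit
# nor abort on its own - only the coordinator knows the global decision.
# If the coordinator crashes after collecting votes, prepared participants block.

class Participant:
    def __init__(self, name):
        self.name, self.state = name, "INIT"

    def prepare(self) -> bool:
        self.state = "PREPARED"     # resources locked, awaiting the coordinator's decision
        return True

def two_phase_commit(participants, coordinator_crashes_after_prepare=False):
    votes = [p.prepare() for p in participants]           # phase 1: voting
    if coordinator_crashes_after_prepare:
        return "IN DOUBT: all participants PREPARED, decision unknown, locks held"
    decision = "COMMIT" if all(votes) else "ABORT"         # phase 2: decision
    for p in participants:
        p.state = decision
    return decision

nodes = [Participant(n) for n in ("athens", "byzantium", "cyrene")]
print(two_phase_commit(nodes, coordinator_crashes_after_prepare=True))
print([(p.name, p.state) for p in nodes])  # everyone stuck in PREPARED - this is the hang
```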

### **Session 7** (45 min) 🎯 **Single-Value Paxos**
- **Learning Goals:**
  - Prepare/Accept phases explained (see the acceptor sketch after this session)
  - Recovery with generation numbers

- **📁 Reference:** `src/test/java/replicate/paxos/` and `src/test/java/replicate/generationvoting/`
- **Break:** 10 minutes
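
A compact sketch of the acceptor side of single-decree Paxos to anchor the prepare/accept discussion. Generation numbers play the role of proposal numbers here; this is illustrative Python, not the repo's `paxos` implementation:

```python
# Single-decree Paxos acceptor sketch. Two rules keep the protocol safe:
#  1. never accept a proposal older than one already promised,
#  2. tell a new proposer about any value already accepted, so recovery
#     completes the old proposal instead of inventing a new value.

class Acceptor:
    def __init__(self):
        self.promised_generation = 0      # highest generation we promised to honor
        self.accepted_generation = 0      # generation of the value we accepted, if any
        self.accepted_value = None

    def prepare(self, generation):
        """Phase 1: promise to ignore proposals older than `generation`."""
        if generation <= self.promised_generation:
            return ("rejected", None, None)
        self.promised_generation = generation
        # Return any previously accepted value so the proposer must re-propose it.
        return ("promised", self.accepted_generation, self.accepted_value)

    def accept(self, generation, value):
        """Phase 2: accept the value unless a newer prepare arrived in between."""
        if generation < self.promised_generation:
            return "rejected"
        self.promised_generation = generation
        self.accepted_generation = generation
        self.accepted_value = value
        return "accepted"

a = Acceptor()
print(a.prepare(1))                 # ('promised', 0, None)
print(a.accept(1, "Microservices")) # 'accepted'
print(a.prepare(2))                 # ('promised', 1, 'Microservices') -> proposer must reuse it
```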

### **Session 8** (45 min) 🎯 **From Paxos to Multi-Paxos**
- **Learning Goals:**
  - Replicated log concept and implementation
  - High-water mark for safe execution (see the sketch after this session)

- **📁 Reference:** `src/test/java/replicate/multipaxos/` and `src/test/java/replicate/paxoslog/`
- **Break:** 15 minutes
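
A small sketch of the high-water mark rule. The dictionary of replica log positions is a made-up example; the repo's `paxoslog`/`multipaxos` code tracks this with real replication messages:

```python
# High-water mark sketch: an index is safe to execute only once a majority of
# replicas have stored it. Followers apply entries only up to this mark, so a
# minority partition can never apply uncommitted entries.

def high_water_mark(match_index: dict, cluster_size: int) -> int:
    """Highest log index stored on a majority of the cluster."""
    majority = cluster_size // 2 + 1
    replicated = sorted(match_index.values(), reverse=True)
    # The entry at the majority-th position is present on at least `majority` nodes.
    return replicated[majority - 1]

# Leader's view of how far each replica's log extends (leader itself included).
match_index = {"athens": 7, "byzantium": 5, "cyrene": 3}
print(f"high-water mark = {high_water_mark(match_index, cluster_size=3)}")
# Prints 5: entries 1..5 are on a majority, entries 6..7 are not yet safe to execute.
```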

### **Session 9** (45 min) 🎯 **RAFT vs Multi-Paxos in Practice**
- **Learning Goals:**
  - Implementation optimizations comparison
  - Idempotent receiver pattern (see the sketch after this session)
  - Production considerations and future directions
- **🛠️ Hands-on Lab:** Compare the RAFT & Multi-Paxos implementations; annotate pros/cons
- **📊 Consensus Algorithm Performance Comparison (NEW!):**
  ```bash
  # Re-run the scalability analysis focusing on consensus algorithms
  python universal_scalability_law_improved.py
  # Focus on the "Consensus Algorithm Performance Comparison" graphs
  ```
  **Discussion Points:**
  - **RAFT vs Multi-Paxos**: Which scales better and why?
  - **Optimal cluster sizes**: 3, 5, 7, or more nodes?
  - **Byzantine Fault Tolerance**: Performance cost analysis
  - **Production trade-offs**: Performance vs complexity vs reliability

  **Key Insights:**
  - RAFT typically has lower coordination overhead than basic Paxos
  - Optimized Multi-Paxos can outperform RAFT in some scenarios
  - Byzantine protocols carry significant performance penalties
  - Optimal cluster size is algorithm-dependent
- **💡 Connection:** "Now you have quantitative data to choose algorithms, not just theoretical knowledge!"
- **📁 Reference:** `src/main/java/replicate/raft/` and `src/main/java/replicate/multipaxos/`
- **End of Day 2**
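
A short sketch of the idempotent receiver pattern named in the learning goals. Class and method names here are illustrative, not the repo's Java API:

```python
# Idempotent receiver sketch: clients tag each request with a unique id, and the
# server caches responses so a retried (duplicate) request is answered from the
# cache instead of being executed twice.

class IdempotentReceiver:
    def __init__(self):
        self._responses = {}   # (client_id, request_id) -> cached response

    def handle(self, client_id, request_id, apply_fn):
        key = (client_id, request_id)
        if key in self._responses:
            return self._responses[key]      # duplicate delivery: do NOT re-execute
        response = apply_fn()                # first delivery: execute and remember
        self._responses[key] = response
        return response

counter = {"value": 0}
def increment():
    counter["value"] += 1
    return counter["value"]

receiver = IdempotentReceiver()
print(receiver.handle("client-1", 42, increment))  # 1 - executed
print(receiver.handle("client-1", 42, increment))  # 1 - retry returns the cached result
print(counter["value"])                            # still 1: the retry was not applied
```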

---

## 📊 Workshop Summary

### 🎯 **Enhanced Learning Outcomes**
- **9 teaching blocks** with optimized timing (5 sessions on Day 1, 4 sessions on Day 2)
- **Pattern-driven progression** from motivation to implementation
- **Combined foundational concepts** for an efficient learning path
- **Core communication patterns** covered before advanced consensus algorithms
- **Quantitative analysis** integrated with hands-on labs
- **Visual performance data** reinforcing theoretical concepts
- **Data-driven decision making** for distributed system design

### 🛠️ **Technical Skills Gained**
- Understanding distributed systems fundamentals
- **NEW:** Performance modeling and capacity planning
- **NEW:** Failure probability analysis for reliability planning
- **NEW:** Scalability analysis using the Universal Scalability Law
- Implementing the Write-Ahead Log pattern
- Working with quorum-based replication
- Exploring consensus algorithms (Paxos, RAFT)
- Hands-on experience with fault tolerance patterns

### 📊 **Performance Analysis Tools**
- **Queuing Theory Analysis**: System performance limits and Little's Law
- **Failure Probability Calculator**: Risk assessment for cluster sizing
- **Universal Scalability Law**: Performance scaling analysis
- **Realistic Performance Modeling**: System degradation under stress
- **Consensus Algorithm Comparison**: Quantitative algorithm selection

### 🗂️ **Available Implementations**
- **Consensus Algorithms:** Paxos, Multi-Paxos, RAFT, Viewstamped Replication
- **Replication Patterns:** Chain Replication, Quorum-based KV Store
- **Foundational Patterns:** WAL, Two-Phase Commit, Heartbeat Detection
- **Network Layer:** Socket-based messaging, Request-waiting lists
- **Performance Analysis Scripts (NEW):** Python-based modeling tools

### 📁 **Key Files Reference**
- **Core Framework:** `src/main/java/replicate/common/`

- **Paxos Implementation:** `src/main/java/replicate/paxos/`
- **RAFT Implementation:** `src/main/java/replicate/raft/`
- **Tests Directory:** `src/test/java/replicate/`
- **Performance Scripts (NEW):** `src/main/python/`
  - `queuing_theory.py` - Little's Law and performance analysis
  - `failure_probability.py` - Cluster reliability analysis
  - `universal_scalability_law_improved.py` - Scaling and algorithm comparison
  - `realistic_system_performance.py` - Real-world performance modeling

### 📚 **Resources & Next Steps**
- All code examples and labs available on GitHub
- **NEW:** Take-home performance analysis scripts for production use
- Additional reading materials provided
- Follow-up Q&A session for complex topics
- **NEW:** Quantitative foundation for architecture decisions

### 💡 **Workshop Enhancement Benefits**
- **Visual Learning**: Graphs and charts reinforce abstract concepts
- **Quantitative Understanding**: Real numbers behind theoretical concepts
- **Practical Tools**: Scripts participants can use in production
- **Data-Driven Decisions**: Choose algorithms based on performance data
- **Business Impact**: Connect technical decisions to business outcomes