Commit 2bc02fc
Author: Unmesh Joshi
Commit message: Added agenda and python scripts
1 parent 4db4b56 commit 2bc02fc
9 files changed: +1087 −19 lines changed

agenda.md
Lines changed: 165 additions & 19 deletions
@@ -1,45 +1,130 @@
 # 2-Day Distributed Systems Workshop Agenda
 
-> **Workshop Format**: Each teaching block is 40 minutes (≈ 30 min explanation + 10 min guided coding)
+> **Workshop Format**: Each teaching block is 45 minutes (≈ 25 min explanation + 10 min Python analysis + 10 min guided coding)
 > **Breaks**: 10–15 minutes between sessions
-> **Daily Structure**: ~4 hours teaching + ~45 minutes of breaks per day
+> **Daily Structure**: Day 1: ~3.75 hours teaching (5 sessions), Day 2: ~3 hours teaching (4 sessions) + ~45 minutes of breaks per day
+
+---
+
+## 🛠️ **Workshop Setup** (5 minutes)
+**Python Performance Analysis Tools Setup:**
+```bash
+cd src/main/python
+python3 -m venv venv
+source venv/bin/activate
+pip install numpy matplotlib scipy
+
+# Test installation
+python queuing_theory.py
+```
 
 ---
 
 ## 📅 Day 1: Foundations & Basic Patterns
 
-### **Session 1** (40 min) 🎯 **Why Distribute?**
+### **Session 1** (45 min) 🎯 **Why Distribute?**
 - **Learning Goals:**
   - Resource ceilings and physical limits
   - Little's Law and performance modeling
   - Motivation for distributed patterns
 - **🛠️ Hands-on Lab:** Run provided disk-perf test; capture own numbers
+- **📊 Performance Analysis (NEW!):**
+  ```bash
+  # Demonstrate system performance limits with queuing theory
+  cd src/main/python
+  source venv/bin/activate
+
+  # Show performance degradation as load increases
+  python queuing_theory.py
+
+  # Visualize the performance curves
+  python queuing_theory_visualization.py
+  ```
+  **Key Insights:**
+  - System performance degrades dramatically near 100% utilization
+  - At 90% load: 100ms latency (manageable)
+  - At 99% load: 1000ms latency (problematic)
+  - Beyond 100%: System collapse
+- **💡 Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
 - **Break:** 10 minutes
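The 90%/99% latency figures in the insights above fall out of the M/M/1 response-time formula, W = S / (1 - ρ). A minimal sketch (assuming a 10 ms service time, which is what reproduces those numbers; the repo's `queuing_theory.py` may use different parameters):

```python
def mm1_response_time(service_ms, utilization):
    """M/M/1 mean response time: W = S / (1 - rho). Diverges as rho -> 1."""
    if utilization >= 1.0:
        return float("inf")  # queue grows without bound: "system collapse"
    return service_ms / (1.0 - utilization)

# With a 10 ms service time:
for rho in (0.5, 0.9, 0.99):
    print(f"utilization {rho:.0%}: response time {mm1_response_time(10, rho):.0f} ms")
# utilization 50%: response time 20 ms
# utilization 90%: response time 100 ms
# utilization 99%: response time 1000 ms
```

Note how going from 90% to 99% utilization multiplies latency by 10x while adding only 9% more load; that nonlinearity is the "performance wall".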
 
-### **Session 2** (40 min) 🎯 **Partial Failure Mindset**
+### **Session 2** (45 min) 🎯 **Why Patterns? & Partial Failure Mindset**
 - **Learning Goals:**
-  - Probability of failure at scale
-  - Network partitions and split-brain scenarios
+  - Understanding the need for distributed patterns
+  - Pattern-based thinking in distributed systems
+  - Probability of failure at scale and network partitions
   - Process pauses and their impact
-- **🛠️ Hands-on Lab:** Walkthrough of the 'replicate' framework with an example test to inject faults.
+- **🛠️ Hands-on Lab:**
+  - Overview of patterns available in the framework
+  - Walkthrough of the 'replicate' framework with fault injection
+- **📊 Failure Probability Analysis (NEW!):**
+  ```bash
+  # Calculate realistic failure probabilities
+  python failure_probability.py
+
+  # Example scenarios to try:
+  # Scenario 1: 3 nodes, 2 failures, 0.1 failure rate → ~2.7% chance of losing majority
+  # Scenario 2: 5 nodes, 3 failures, 0.05 failure rate → ~0.13% chance of losing majority
+  # Scenario 3: Large cluster - 100 nodes, 30 failures, 0.05 failure rate
+  ```
+  **Key Insights:**
+  - Even with a 5% individual node failure rate, losing a majority is a significant risk
+  - Larger clusters provide better fault tolerance
+  - Patterns help us handle these inevitable failures systematically
+- **💡 Connection:** "Patterns solve recurring problems - especially failure handling!"
 - **📁 Reference:** `src/main/java/replicate/common/` and `src/test/java/replicate/common/`
 - **Break:** 10 minutes
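The scenario numbers above can be cross-checked with the binomial distribution using only the standard library (a sketch with `math.comb` rather than the workshop's scipy-based script; note the ~2.7% figure matches the probability of exactly 2 failures, while 2 or more gives 2.8%):

```python
from math import comb

def prob_at_least(n, k, p):
    """P(k or more of n nodes fail), each failing independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Scenario 1: 3 nodes, p = 0.1 -> 2 or more failures (majority lost)
print(f"{prob_at_least(3, 2, 0.1):.4f}")   # 0.0280
# Scenario 2: 5 nodes, p = 0.05 -> 3 or more failures (majority lost)
print(f"{prob_at_least(5, 3, 0.05):.6f}")  # 0.001158
```

Comparing the two scenarios shows why 5-node clusters are popular: the same workload tolerates two failures instead of one, and the probability of losing a majority drops by more than an order of magnitude.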
 
-### **Session 3** (40 min) 🎯 **Write-Ahead Log Pattern**
+### **Session 3** (45 min) 🎯 **Write-Ahead Log Pattern**
 - **Learning Goals:**
   - Append-only discipline for durability
   - Recovery mechanisms and replay
   - WAL as foundation for other patterns
-- **🛠️ Hands-on Lab:** Execute and walkthrough `DurableKVStoreTest` for persistent key-value store.
+- **🛠️ Hands-on Lab:** Execute and walkthrough `DurableKVStoreTest` for persistent key-value store
+- **💡 Connection:** "WAL ensures we can recover from the failures we just discussed!"
 - **📁 Reference:** `src/test/java/replicate/wal/DurableKVStoreTest.java`
 - **Break:** 15 minutes
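The append-then-apply discipline behind the WAL pattern can be sketched in a few lines. This is a toy illustration, not the workshop's Java `DurableKVStore`; it assumes `key=value` records without checksums, log truncation, or keys containing `=` or newlines:

```python
import os
import tempfile

class TinyDurableKV:
    """Toy WAL-backed KV store: every write is appended and fsynced to
    the log before the in-memory map is updated, so a restart can rebuild
    state by replaying the log from the start."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()                      # recovery: rebuild from the log
        self.log = open(path, "a")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition("=")
                self.data[key] = value      # last write for a key wins

    def put(self, key, value):
        self.log.write(f"{key}={value}\n")  # 1. append to the log
        self.log.flush()
        os.fsync(self.log.fileno())         # 2. force the record to disk
        self.data[key] = value              # 3. only then update memory

    def get(self, key):
        return self.data.get(key)

# Demo: writes survive a "restart" because the log is replayed.
path = os.path.join(tempfile.mkdtemp(), "kv.wal")
store = TinyDurableKV(path)
store.put("title", "v1")
restarted = TinyDurableKV(path)             # simulates crash + restart
print(restarted.get("title"))               # v1
```

The ordering is the whole pattern: because the record reaches disk before the in-memory state changes, a crash can lose at most un-acknowledged writes, never acknowledged ones.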
 
-### **Session 4** (40 min) 🎯 **Replication & Majority Quorum**
+### **Session 4** (45 min) 🎯 **Core Communication Patterns**
+- **Learning Goals:**
+  - Request-waiting list pattern for async operations
+  - Singular update queue for thread safety
+  - Network messaging foundations
+  - Building blocks for distributed protocols
+- **🛠️ Hands-on Lab:**
+  - Code walkthrough: `RequestWaitingList` and `SingularUpdateQueue` implementations
+  - Understand how async requests are tracked and managed
+  - See how single-threaded execution prevents race conditions
+- **📁 Reference:**
+  - `src/main/java/replicate/common/RequestWaitingList.java`
+  - `src/main/java/replicate/common/SingularUpdateQueue.java`
+  - `src/main/java/replicate/net/` - Network communication layer
+- **💡 Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
+- **Break:** 10 minutes
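The request-waiting-list idea can be sketched as futures keyed by correlation id. This is a Python sketch of the concept only; the repo's Java `RequestWaitingList` differs in detail (e.g. timeout handling):

```python
from concurrent.futures import Future

class RequestWaitingList:
    """Track in-flight async requests by correlation id; complete the
    matching future when a response (or error) arrives."""

    def __init__(self):
        self._pending = {}

    def add(self, correlation_id):
        future = Future()
        self._pending[correlation_id] = future
        return future

    def handle_response(self, correlation_id, response):
        future = self._pending.pop(correlation_id, None)
        if future is not None:             # ignore late or duplicate replies
            future.set_result(response)

    def handle_error(self, correlation_id, error):
        future = self._pending.pop(correlation_id, None)
        if future is not None:
            future.set_exception(error)

# A caller registers before sending, then awaits the reply:
wl = RequestWaitingList()
f = wl.add(42)                   # send message tagged with correlation id 42 ...
wl.handle_response(42, "ACK")    # ... the network layer delivers the reply
print(f.result())                # ACK
```

The design choice to key on a correlation id (not the connection) is what lets replies arrive on any thread or socket and still find the right waiting caller.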
+
+### **Session 5** (45 min) 🎯 **Replication & Majority Quorum**
 - **Learning Goals:**
   - Write vs read quorums trade-offs
   - Quorum intersection properties
   - Universal Scalability Law curve analysis
 - **🛠️ Hands-on Lab:** Modify `QuorumKVStoreTest`: pass for 5-node/3-node clusters
+- **Prerequisite:** Understanding of `RequestWaitingList` from Session 4 (used in quorum coordination)
+- **📊 Scalability Analysis (NEW!):**
+  ```bash
+  # Analyze how performance scales with cluster size
+  python universal_scalability_law_improved.py
+  ```
+  **Key Visualizations Generated:**
+  1. **Distributed System Performance Scaling** - Shows how coordination overhead affects scaling
+  2. **Business Impact Metrics** - Throughput (req/s) and Response Time (ms) scaling
+  3. **Consensus Algorithm Comparison** - Performance differences between Paxos, RAFT, etc.
+
+  **Key Insights:**
+  - Adding more nodes doesn't always improve performance
+  - Coordination overhead increases with cluster size
+  - Optimal cluster sizes depend on algorithm choice
+  - Well-designed systems scale better than legacy systems
+- **💡 Connection:** "This shows the trade-offs in quorum-based replication!"
 - **📁 Reference:** `src/test/java/replicate/quorum/QuorumKVStoreTest.java`
 - **End of Day 1**
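The Universal Scalability Law models throughput as C(N) = λ·N / (1 + σ·(N-1) + κ·N·(N-1)), where σ is contention and κ is coherency (coordination) cost. A sketch with illustrative coefficients; the actual parameters in `universal_scalability_law_improved.py` may differ:

```python
def usl_throughput(n, lam=1000.0, sigma=0.05, kappa=0.001):
    """Universal Scalability Law: lam = per-node throughput, sigma =
    contention (serialized work), kappa = coherency cost that grows
    with every pair of nodes."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Throughput rises, flattens, then falls as coordination dominates:
for n in (1, 3, 5, 9, 31, 100):
    print(f"{n:3d} nodes: {usl_throughput(n):8.0f} req/s")
```

With these coefficients throughput peaks near 31 nodes (the USL maximum is at N ≈ sqrt((1 - σ)/κ)) and then declines, which is why "just add nodes" eventually backfires in quorum-based systems.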
 
@@ -48,21 +133,36 @@
 - Review morning labs and concepts
 - Push completed work to GitHub
 - Optional: Explore additional resources
+- **NEW:** Experiment with different parameters in Python scripts
 
 ---
 
 ## 📅 Day 2: Consensus Algorithms & Advanced Patterns
 
-### **Session 5** (40 min) 🎯 **Why Simple Replication Fails**
+### **Session 6** (45 min) 🎯 **Why Simple Replication Fails**
 - **Learning Goals:**
   - Two-phase commit pitfalls
   - Recovery ambiguity problems
   - The need for consensus algorithms
 - **🛠️ Hands-on Lab:** Step through `DeferredCommitmentTest` and `RecoverableDeferredCommitmentTest`; explain why they hang
+- **📊 Realistic System Behavior Analysis (NEW!):**
+  ```bash
+  # Show how systems degrade under stress (unlike theoretical models)
+  python realistic_system_performance.py
+  ```
+  **Key Visualizations:**
+  1. **Realistic Performance Under Load** - Shows system degradation beyond theoretical limits
+  2. **Ideal vs Realistic Comparison** - Why real systems perform worse than theory
+
+  **Key Insights:**
+  - Systems don't just hit limits - they degrade badly under stress
+  - Performance collapse happens before theoretical limits
+  - Real systems exhibit much worse behavior than M/M/1 queue models
+- **💡 Connection:** "This is exactly why 2PC fails under load - systems don't gracefully degrade!"
 - **📁 Reference:** `src/test/java/replicate/twophaseexecution/DeferredCommitmentTest.java`
 - **Break:** 10 minutes
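One mechanism behind collapse before the theoretical limit is retry amplification: once latency crosses client timeouts, retries add load on top of the original traffic, which raises latency further. A purely illustrative toy model (this is an assumption of mine, not how `realistic_system_performance.py` necessarily models degradation):

```python
def load_with_retries(offered, service_ms=10.0, timeout_ms=500.0, steps=100):
    """Iterate a feedback loop: latency at the current load -> fraction of
    requests that time out and are retried -> new effective load. Returns
    effective utilization, which can run away well before offered load
    reaches 100%. Crude step model, for illustration only."""
    rho = offered
    for _ in range(steps):
        latency = service_ms / (1 - rho) if rho < 1 else float("inf")
        retry_fraction = min(1.0, latency / timeout_ms)
        rho = min(offered * (1 + retry_fraction), 0.999999)
    return rho

for offered in (0.5, 0.8, 0.9):
    print(f"offered {offered:.0%} -> effective {load_with_retries(offered):.0%}")
```

With these numbers, 50% offered load settles near 52% effective load, while 80% offered load spirals to saturation: the system collapses well short of its nominal capacity.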
 
-### **Session 6** (40 min) 🎯 **Single-Value Paxos**
+### **Session 7** (45 min) 🎯 **Single-Value Paxos**
 - **Learning Goals:**
   - Prepare/Accept phases explained
   - Recovery with generation numbers
@@ -71,7 +171,7 @@
 - **📁 Reference:** `src/test/java/replicate/paxos/` and `src/test/java/replicate/generationvoting/`
 - **Break:** 10 minutes
 
-### **Session 7** (40 min) 🎯 **From Paxos to Multi-Paxos**
+### **Session 8** (45 min) 🎯 **From Paxos to Multi-Paxos**
 - **Learning Goals:**
   - Replicated log concept and implementation
   - High-water mark for safe execution
@@ -80,37 +180,69 @@
 - **📁 Reference:** `src/test/java/replicate/multipaxos/` and `src/test/java/replicate/paxoslog/`
 - **Break:** 15 minutes
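The replicated log and high-water mark from this session can be sketched minimally. A toy data structure, not the repo's `paxoslog`/`multipaxos` implementation:

```python
class ReplicatedLog:
    """Toy replicated log: entries are appended at a leader, and the
    high-water mark tracks the highest index known to be committed
    (replicated on a majority). Only entries up to the high-water mark
    are safe to execute, because they can never be lost or overwritten."""

    def __init__(self):
        self.entries = []          # log entries at indexes 0..n-1
        self.high_water_mark = -1  # highest committed index, -1 = none

    def append(self, entry):
        self.entries.append(entry)
        return len(self.entries) - 1

    def commit_up_to(self, index):
        # called once a majority has acknowledged entries through `index`
        self.high_water_mark = max(self.high_water_mark, index)

    def committed_entries(self):
        return self.entries[: self.high_water_mark + 1]
```

The separation matters: appended-but-uncommitted entries may still be replaced during leader changes, so the state machine only ever applies `committed_entries()`.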
 
-### **Session 8** (40 min) 🎯 **RAFT vs Multi-Paxos in Practice**
+### **Session 9** (45 min) 🎯 **RAFT vs Multi-Paxos in Practice**
 - **Learning Goals:**
   - Implementation optimizations comparison
   - Idempotent receiver pattern
   - Production considerations and future directions
 - **🛠️ Hands-on Lab:** Compare RAFT & Multi-Paxos implementations; annotate pros/cons
+- **📊 Consensus Algorithm Performance Comparison (NEW!):**
+  ```bash
+  # Re-run the scalability analysis focusing on consensus algorithms
+  python universal_scalability_law_improved.py
+  # Focus on the "Consensus Algorithm Performance Comparison" graphs
+  ```
+  **Discussion Points:**
+  - **RAFT vs Multi-Paxos**: Which scales better and why?
+  - **Optimal cluster sizes**: 3, 5, 7, or more nodes?
+  - **Byzantine Fault Tolerance**: Performance cost analysis
+  - **Production trade-offs**: Performance vs complexity vs reliability
+
+  **Key Insights:**
+  - RAFT typically has lower coordination overhead than basic Paxos
+  - Multi-Paxos (optimized) can outperform RAFT in some scenarios
+  - Byzantine protocols have significant performance penalties
+  - Optimal cluster size is algorithm-dependent
+- **💡 Connection:** "Now you have quantitative data to choose algorithms, not just theoretical knowledge!"
 - **📁 Reference:** `src/main/java/replicate/raft/` and `src/main/java/replicate/multipaxos/`
 - **End of Day 2**
 
 ---
 
 ## 📊 Workshop Summary
 
-### 🎯 **Learning Outcomes**
-- **8 teaching blocks** × 40 minutes each
-- **Hands-on labs** tied directly to core lecture concepts
-- **Built-in breaks** for focus & recovery
-- **Progressive assignments** that reinforce distributed systems primitives step-by-step
+### 🎯 **Enhanced Learning Outcomes**
+- **9 teaching blocks** with optimized timing (5 sessions Day 1, 4 sessions Day 2)
+- **Pattern-driven learning** progression from motivation to implementation
+- **Combined foundational concepts** for efficient learning progression
+- **Core patterns foundation** before advanced algorithms
+- **Quantitative analysis** integrated with hands-on labs
+- **Visual performance data** reinforcing theoretical concepts
+- **Data-driven decision making** for distributed system design
 
 ### 🛠️ **Technical Skills Gained**
 - Understanding distributed systems fundamentals
+- **NEW:** Performance modeling and capacity planning
+- **NEW:** Failure probability analysis for reliability planning
+- **NEW:** Scalability analysis using Universal Scalability Law
 - Implementing Write-Ahead Log pattern
 - Working with quorum-based replication
 - Exploring consensus algorithms (Paxos, RAFT)
 - Hands-on experience with fault tolerance patterns
 
+### 📊 **Performance Analysis Tools**
+- **Queuing Theory Analysis**: System performance limits and Little's Law
+- **Failure Probability Calculator**: Risk assessment for cluster sizing
+- **Universal Scalability Law**: Performance scaling analysis
+- **Realistic Performance Modeling**: System degradation under stress
+- **Consensus Algorithm Comparison**: Quantitative algorithm selection
+
 ### 🗂️ **Available Implementations**
 - **Consensus Algorithms:** Paxos, Multi-Paxos, RAFT, ViewStamped Replication
 - **Replication Patterns:** Chain Replication, Quorum-based KV Store
 - **Foundational Patterns:** WAL, Two-Phase Commit, Heartbeat Detection
 - **Network Layer:** Socket-based messaging, Request-waiting lists
+- **NEW:** **Performance Analysis Scripts:** Python-based modeling tools
 
 ### 📁 **Key Files Reference**
 - **Core Framework:** `src/main/java/replicate/common/`
@@ -120,8 +252,22 @@
 - **Paxos Implementation:** `src/main/java/replicate/paxos/`
 - **RAFT Implementation:** `src/main/java/replicate/raft/`
 - **Tests Directory:** `src/test/java/replicate/`
+- **NEW:** **Performance Scripts:** `src/main/python/`
+  - `queuing_theory.py` - Little's Law and performance analysis
+  - `failure_probability.py` - Cluster reliability analysis
+  - `universal_scalability_law_improved.py` - Scaling and algorithm comparison
+  - `realistic_system_performance.py` - Real-world performance modeling
 
 ### 📚 **Resources & Next Steps**
 - All code examples and labs available on GitHub
+- **NEW:** Take-home performance analysis scripts for production use
 - Additional reading materials provided
 - Follow-up Q&A session for complex topics
+- **NEW:** Quantitative foundation for architecture decisions
+
+### 💡 **Workshop Enhancement Benefits**
+- **Visual Learning**: Graphs and charts reinforce abstract concepts
+- **Quantitative Understanding**: Real numbers behind theoretical concepts
+- **Practical Tools**: Scripts participants can use in production
+- **Data-Driven Decisions**: Choose algorithms based on performance data
+- **Business Impact**: Connect technical decisions to business outcomes
Binary file (3.28 KB) not shown.
Binary file (4.67 KB) not shown.
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+from scipy.stats import binom
+
+def calculate_node_failures(total_nodes, num_failures, failure_prob):
+    """
+    Calculate probability of exactly N failures and N or more failures
+    in a cluster of given size
+
+    Args:
+        total_nodes: Total number of nodes in system
+        num_failures: Number of failures to calculate probability for
+        failure_prob: Individual node failure probability (between 0 and 1)
+    """
+    # Probability of exactly num_failures
+    exact_prob = binom.pmf(num_failures, total_nodes, failure_prob)
+
+    # Probability of num_failures or more
+    cumulative_prob = 1 - binom.cdf(num_failures - 1, total_nodes, failure_prob)
+
+    # Calculate '1 in X' chance for easier interpretation
+    one_in_x = int(1 / cumulative_prob) if cumulative_prob > 0 else float('inf')
+
+    print("\nFailure Analysis:")
+    print(f"Total Nodes: {total_nodes}")
+    print(f"Number of Failures: {num_failures}")
+    print(f"Individual Node Failure Probability: {failure_prob:.1%}")
+    print("-" * 50)
+    # Convert to percentages with 8 decimal places for readability
+    exact_percentage = exact_prob * 100
+    cumulative_percentage = cumulative_prob * 100
+
+    print(f"Probability of exactly {num_failures} failures: {exact_percentage:.8f}%")
+    print(f"Probability of {num_failures} or more failures: {cumulative_percentage:.8f}%")
+    print(f"This is approximately a 1 in {one_in_x:,} chance")
+
+# Example usage
+if __name__ == "__main__":
+    # Get input from user
+    total_nodes = int(input("Enter total number of nodes: "))
+    num_failures = int(input("Enter number of failures to calculate: "))
+    failure_prob = float(input("Enter individual node failure probability (0-1): "))
+
+    calculate_node_failures(total_nodes, num_failures, failure_prob)
