Commit 38cec15

Fixed agenda

Author: Unmesh Joshi
1 parent 2bc02fc commit 38cec15

1 file changed: +41 -142 lines changed

agenda.md

Lines changed: 41 additions & 142 deletions
Original file line number | Diff line number | Diff line change
@@ -1,12 +1,12 @@
11
# 2-Day Distributed Systems Workshop Agenda
22

3-
> **Workshop Format**: Each teaching block is 45 minutes (≈ 25 min explanation + 10 min Python analysis + 10 min guided coding)
3+
> **Workshop Format**: Each teaching block is 45 minutes
44
> **Breaks**: 10–15 minutes between sessions
5-
> **Daily Structure**: Day 1: ~3.75 hours teaching (5 sessions), Day 2: ~3 hours teaching (4 sessions) + ~45 minutes of breaks per day
5+
> **Daily Structure**: Day 1: ~3.75 hours teaching (5 sessions), Day 2: ~3 hours teaching (4 sessions)
66
77
---
88

9-
## 🛠️ **Workshop Setup** (5 minutes)
9+
## **Workshop Setup** (5 minutes)
1010
**Python Performance Analysis Tools Setup:**
1111
```bash
1212
cd src/main/python
@@ -20,15 +20,15 @@ python queuing_theory.py
2020

2121
---
2222

23-
## 📅 Day 1: Foundations & Basic Patterns
23+
## Day 1: Foundations & Basic Patterns
2424

25-
### **Session 1** (45 min) 🎯 **Why Distribute?**
25+
### **Session 1** (45 min) **Why Distribute?**
2626
- **Learning Goals:**
2727
- Resource ceilings and physical limits
2828
- Little's Law and performance modeling
2929
- Motivation for distributed patterns
30-
- **🛠️ Hands-on Lab:** Run provided disk-perf test; capture own numbers
31-
- **📊 Performance Analysis (NEW!):**
30+
- **Hands-on Lab:** Run provided disk-perf test; capture own numbers
31+
- **Performance Analysis:**
3232
```bash
3333
# Demonstrate system performance limits with queuing theory
3434
cd src/main/python
@@ -45,71 +45,65 @@ python queuing_theory.py
4545
- At 90% load: 100ms latency (manageable)
4646
- At 99% load: 1000ms latency (problematic)
4747
- Beyond 100%: System collapse
48-
- **💡 Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
48+
- **Connection:** "This is WHY we need distributed systems - single machines hit performance walls!"
4949
- **Break:** 10 minutes
5050

51-
### **Session 2** (45 min) 🎯 **Why Patterns? & Partial Failure Mindset**
51+
### **Session 2** (45 min) **Why Patterns? & Partial Failure Mindset**
5252
- **Learning Goals:**
53-
- Understanding the need for distributed patterns
53+
- Understanding the need for patterns
5454
- Pattern-based thinking in distributed systems
5555
- Probability of failure at scale and network partitions
5656
- Process pauses and their impact
57-
- **🛠️ Hands-on Lab:**
57+
- **Hands-on Lab:**
5858
- Overview of patterns available in the framework
5959
- Walkthrough of the 'replicate' framework with fault injection
60-
- **📊 Failure Probability Analysis (NEW!):**
60+
- **Failure Probability Analysis:**
6161
```bash
6262
# Calculate realistic failure probabilities
6363
python failure_probability.py
64-
65-
# Example scenarios to try:
66-
# Scenario 1: 3 nodes, 2 failures, 0.1 failure rate → ~2.7% chance of losing majority
67-
# Scenario 2: 5 nodes, 3 failures, 0.05 failure rate → ~0.13% chance of losing majority
68-
# Scenario 3: Large cluster - 100 nodes, 30 failures, 0.05 failure rate
6964
```
7065
**Key Insights:**
71-
- Even with 5% individual failure rate, losing quorum is significant risk
72-
- Larger clusters provide better fault tolerance
66+
- Script calculates "N or more failures" probability using binomial distribution
7367
- Patterns help us handle these inevitable failures systematically
74-
- **💡 Connection:** "Patterns solve recurring problems - especially failure handling!"
75-
- **📁 Reference:** `src/main/java/replicate/common/` and `src/test/java/replicate/common/`
68+
- **Connection:** "Patterns solve recurring problems - especially failure handling!"
69+
- **Reference:** `src/main/java/replicate/common/` and `src/test/java/replicate/common/`
7670
- **Break:** 10 minutes
7771

78-
### **Session 3** (45 min) 🎯 **Write-Ahead Log Pattern**
72+
### **Session 3** (45 min) **Write-Ahead Log Pattern**
7973
- **Learning Goals:**
8074
- Append-only discipline for durability
8175
- Recovery mechanisms and replay
8276
- WAL as foundation for other patterns
83-
- **🛠️ Hands-on Lab:** Execute and walkthrough `DurableKVStoreTest` for persistent key-value store
84-
- **💡 Connection:** "WAL ensures we can recover from the failures we just discussed!"
85-
- **📁 Reference:** `src/test/java/replicate/wal/DurableKVStoreTest.java`
77+
- **Hands-on Lab:** Execute and walkthrough `DurableKVStoreTest` for persistent key-value store
78+
- **Connection:** "WAL ensures we can recover from the failures we just discussed!"
79+
- **Reference:** `src/test/java/replicate/wal/DurableKVStoreTest.java`
8680
- **Break:** 15 minutes
8781

88-
### **Session 4** (45 min) 🎯 **Core Communication Patterns**
82+
### **Session 4** (45 min) **Code Walkthrough and Core Patterns**
8983
- **Learning Goals:**
9084
- Request-waiting list pattern for async operations
9185
- Singular update queue for thread safety
9286
- Network messaging foundations
9387
- Building blocks for distributed protocols
94-
- **🛠️ Hands-on Lab:**
88+
- **Hands-on Lab:**
9589
- Code walkthrough: `RequestWaitingList` and `SingularUpdateQueue` implementations
9690
- Understand how async requests are tracked and managed
9791
- See how single-threaded execution prevents race conditions
98-
- **📁 Reference:**
92+
- **Reference:**
9993
- `src/main/java/replicate/common/RequestWaitingList.java`
10094
- `src/main/java/replicate/common/SingularUpdateQueue.java`
10195
- `src/main/java/replicate/net/` - Network communication layer
102-
- **💡 Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
96+
- **Connection:** "These patterns are the foundation for quorum-based systems and consensus algorithms!"
10397
- **Break:** 10 minutes
10498

105-
### **Session 5** (45 min) 🎯 **Replication & Majority Quorum**
99+
### **Session 5** (45 min) **Replication & Majority Quorum**
106100
- **Learning Goals:**
107101
- Write vs read quorums trade-offs
108102
- Quorum intersection properties
109103
- Universal Scalability Law curve analysis
110-
- **🛠️ Hands-on Lab:** Modify `QuorumKVStoreTest`: pass for 5-node/3-node clusters
104+
- **Hands-on Lab:** Modify `QuorumKVStoreTest`: pass for 5-node/3-node clusters
111105
- **Prerequisite:** Understanding of `RequestWaitingList` from Session 4 (used in quorum coordination)
112-
- **📊 Scalability Analysis (NEW!):**
106+
- **Scalability Analysis:**
113107
```bash
114108
# Analyze how performance scales with cluster size
115109
python universal_scalability_law_improved.py
@@ -124,150 +118,55 @@ python queuing_theory.py
124118
- Coordination overhead increases with cluster size
125119
- Optimal cluster sizes depend on algorithm choice
126120
- Well-designed systems scale better than legacy systems
127-
- **💡 Connection:** "This shows the trade-offs in quorum-based replication!"
128-
- **📁 Reference:** `src/test/java/replicate/quorum/QuorumKVStoreTest.java`
121+
- **Connection:** "This shows the trade-offs in quorum-based replication!"
122+
- **Reference:** `src/test/java/replicate/quorum/QuorumKVStoreTest.java`
129123
- **End of Day 1**
130124

131-
### 🍽️ **Lunch Break / Self-Paced Time**
132-
**Offline Activities:**
133-
- Review morning labs and concepts
134-
- Push completed work to GitHub
135-
- Optional: Explore additional resources
136-
- **NEW:** Experiment with different parameters in Python scripts
137-
138125
---
139126

140-
## 📅 Day 2: Consensus Algorithms & Advanced Patterns
127+
## Day 2: Consensus Algorithms & Advanced Patterns
141128

142-
### **Session 6** (45 min) 🎯 **Why Simple Replication Fails**
129+
### **Session 6** (45 min) **Why Simple Replication Fails**
143130
- **Learning Goals:**
144131
- Two-phase commit pitfalls
145132
- Recovery ambiguity problems
146133
- The need for consensus algorithms
147-
- **🛠️ Hands-on Lab:** Step through `DeferredCommitmentTest` and `RecoverableDeferredCommitmentTest`; explain why they hang
148-
- **📊 Realistic System Behavior Analysis (NEW!):**
134+
- **Hands-on Lab:** Step through `DeferredCommitmentTest` and `RecoverableDeferredCommitmentTest`;
135+
- **Realistic System Behavior Analysis:**
149136
```bash
150137
# Show how systems degrade under stress (unlike theoretical models)
151138
python realistic_system_performance.py
152139
```
153-
**Key Visualizations:**
154-
1. **Realistic Performance Under Load** - Shows system degradation beyond theoretical limits
155-
2. **Ideal vs Realistic Comparison** - Why real systems perform worse than theory
156-
157-
**Key Insights:**
158-
- Systems don't just hit limits - they degrade badly under stress
159-
- Performance collapse happens before theoretical limits
160-
- Real systems exhibit much worse behavior than M/M/1 queue models
161-
- **💡 Connection:** "This is exactly why 2PC fails under load - systems don't gracefully degrade!"
162-
- **📁 Reference:** `src/test/java/replicate/twophaseexecution/DeferredCommitmentTest.java`
163-
- **Break:** 10 minutes
164140

165-
### **Session 7** (45 min) 🎯 **Single-Value Paxos**
141+
### **Session 7** (45 min) **Single-Value Paxos**
166142
- **Learning Goals:**
167143
- Prepare/Accept phases explained
168144
- Recovery with generation numbers
169145
- Safety and liveness properties
170-
- **🛠️ Hands-on Lab:** Work with generation voting mechanism using existing Paxos tests
171-
- **📁 Reference:** `src/test/java/replicate/paxos/` and `src/test/java/replicate/generationvoting/`
146+
- **Hands-on Lab:** Work with generation voting mechanism using existing Paxos tests
147+
- **Reference:** `src/test/java/replicate/paxos/` and `src/test/java/replicate/generationvoting/`
172148
- **Break:** 10 minutes
173149

174-
### **Session 8** (45 min) 🎯 **From Paxos to Multi-Paxos**
150+
### **Session 8** (45 min) **From Paxos to Multi-Paxos**
175151
- **Learning Goals:**
176152
- Replicated log concept and implementation
177153
- High-water mark for safe execution
178154
- Heartbeats and failure detection
179-
- **🛠️ Hands-on Lab:** Extend log to multi-slot using Multi-Paxos and Paxos Log implementations
180-
- **📁 Reference:** `src/test/java/replicate/multipaxos/` and `src/test/java/replicate/paxoslog/`
155+
- **Hands-on Lab:** Extend log to multi-slot using Multi-Paxos and Paxos Log implementations
156+
- **Reference:** `src/test/java/replicate/multipaxos/` and `src/test/java/replicate/paxoslog/`
181157
- **Break:** 15 minutes
182158

183-
### **Session 9** (45 min) 🎯 **RAFT vs Multi-Paxos in Practice**
159+
### **Session 9** (45 min) **RAFT vs Multi-Paxos in Practice**
184160
- **Learning Goals:**
185161
- Implementation optimizations comparison
186162
- Idempotent receiver pattern
187163
- Production considerations and future directions
188-
- **🛠️ Hands-on Lab:** Compare RAFT & Multi-Paxos implementations; annotate pros/cons
189-
- **📊 Consensus Algorithm Performance Comparison (NEW!):**
190-
```bash
191-
# Re-run the scalability analysis focusing on consensus algorithms
192-
python universal_scalability_law_improved.py
193-
# Focus on the "Consensus Algorithm Performance Comparison" graphs
164+
- **Hands-on Lab:** Compare RAFT & Multi-Paxos implementations; annotate pros/cons
194165
```
195166
**Discussion Points:**
196167
- **RAFT vs Multi-Paxos**: Which scales better and why?
197168
- **Optimal cluster sizes**: 3, 5, 7, or more nodes?
198-
- **Byzantine Fault Tolerance**: Performance cost analysis
199169
- **Production trade-offs**: Performance vs complexity vs reliability
200-
201-
**Key Insights:**
202-
- RAFT typically has lower coordination overhead than basic Paxos
203-
- Multi-Paxos (optimized) can outperform RAFT in some scenarios
204-
- Byzantine protocols have significant performance penalties
205-
- Optimal cluster size is algorithm-dependent
206-
- **💡 Connection:** "Now you have quantitative data to choose algorithms, not just theoretical knowledge!"
207-
- **📁 Reference:** `src/main/java/replicate/raft/` and `src/main/java/replicate/multipaxos/`
208-
- **End of Day 2**
209170
210-
---
211-
212-
## 📊 Workshop Summary
213-
214-
### 🎯 **Enhanced Learning Outcomes**
215-
- **9 teaching blocks** with optimized timing (5 sessions Day 1, 4 sessions Day 2)
216-
- **Pattern-driven learning** progression from motivation to implementation
217-
- **Combined foundational concepts** for efficient learning progression
218-
- **Core patterns foundation** before advanced algorithms
219-
- **Quantitative analysis** integrated with hands-on labs
220-
- **Visual performance data** reinforcing theoretical concepts
221-
- **Data-driven decision making** for distributed system design
222-
223-
### 🛠️ **Technical Skills Gained**
224-
- Understanding distributed systems fundamentals
225-
- **NEW:** Performance modeling and capacity planning
226-
- **NEW:** Failure probability analysis for reliability planning
227-
- **NEW:** Scalability analysis using Universal Scalability Law
228-
- Implementing Write-Ahead Log pattern
229-
- Working with quorum-based replication
230-
- Exploring consensus algorithms (Paxos, RAFT)
231-
- Hands-on experience with fault tolerance patterns
232-
233-
### 📊 **Performance Analysis Tools**
234-
- **Queuing Theory Analysis**: System performance limits and Little's Law
235-
- **Failure Probability Calculator**: Risk assessment for cluster sizing
236-
- **Universal Scalability Law**: Performance scaling analysis
237-
- **Realistic Performance Modeling**: System degradation under stress
238-
- **Consensus Algorithm Comparison**: Quantitative algorithm selection
239-
240-
### 🗂️ **Available Implementations**
241-
- **Consensus Algorithms:** Paxos, Multi-Paxos, RAFT, ViewStamped Replication
242-
- **Replication Patterns:** Chain Replication, Quorum-based KV Store
243-
- **Foundational Patterns:** WAL, Two-Phase Commit, Heartbeat Detection
244-
- **Network Layer:** Socket-based messaging, Request-waiting lists
245-
- **NEW:** **Performance Analysis Scripts:** Python-based modeling tools
246-
247-
### 📁 **Key Files Reference**
248-
- **Core Framework:** `src/main/java/replicate/common/`
249-
- **WAL Implementation:** `src/main/java/replicate/wal/DurableKVStore.java`
250-
- **Quorum KV Store:** `src/main/java/replicate/quorum/QuorumKVStore.java`
251-
- **Chain Replication:** `src/main/java/replicate/chain/ChainReplication.java`
252-
- **Paxos Implementation:** `src/main/java/replicate/paxos/`
253-
- **RAFT Implementation:** `src/main/java/replicate/raft/`
254-
- **Tests Directory:** `src/test/java/replicate/`
255-
- **NEW:** **Performance Scripts:** `src/main/python/`
256-
- `queuing_theory.py` - Little's Law and performance analysis
257-
- `failure_probability.py` - Cluster reliability analysis
258-
- `universal_scalability_law_improved.py` - Scaling and algorithm comparison
259-
- `realistic_system_performance.py` - Real-world performance modeling
260-
261-
### 📚 **Resources & Next Steps**
262-
- All code examples and labs available on GitHub
263-
- **NEW:** Take-home performance analysis scripts for production use
264-
- Additional reading materials provided
265-
- Follow-up Q&A session for complex topics
266-
- **NEW:** Quantitative foundation for architecture decisions
171+
- **End of Day 2**
267172
268-
### 💡 **Workshop Enhancement Benefits**
269-
- **Visual Learning**: Graphs and charts reinforce abstract concepts
270-
- **Quantitative Understanding**: Real numbers behind theoretical concepts
271-
- **Practical Tools**: Scripts participants can use in production
272-
- **Data-Driven Decisions**: Choose algorithms based on performance data
273-
- **Business Impact**: Connect technical decisions to business outcomes
