Skip to content

Commit 47b8af2

Browse files
committed
Add design doc for SEA implementation
1 parent 56b8a73 commit 47b8af2

File tree

6 files changed

+5175
-0
lines changed

6 files changed

+5175
-0
lines changed

.claude/commands/PECO.Design.md

Lines changed: 326 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,326 @@
1+
---
2+
description: You are a senior developer and in this session, you need create a detailed design for a feature.
3+
---
4+
5+
### User Input
6+
```text
7+
$ARGUMENTS
8+
```
9+
10+
You **MUST** consider the user input before proceeding. If empty, ask the user for what to work on.
11+
12+
# Collect requirement from user
13+
14+
always collect enough information from user, this might be one or more of the following
15+
16+
## an existing design doc
17+
in this case, we are working an existing doc either addressing review comments or change some of the designs.
18+
19+
## an PM requirement document
20+
you will need review the doc and reconcile with existing design doc if there is one to make sure everything is in sync, if not, working on the change of design
21+
22+
## conversation with user
23+
always good to ask a lot of clarification questions and probe users on overall requirement and corner cases
24+
25+
26+
27+
# Start the design by following below best practises
28+
29+
## Overview
30+
This guide outlines best practices for writing technical design documents, based on lessons learned from code reviews.
31+
32+
---
33+
## 1. Use Visual Diagrams Over Text
34+
35+
### ✅ DO:
36+
- Use **mermaid diagrams** for all architectural illustrations
37+
- Include **class diagrams** to show relationships between components
38+
- Use **sequence diagrams** to illustrate data flow and interactions
39+
- Render diagrams inline in markdown using mermaid code blocks
40+
41+
### ❌ DON'T:
42+
- Use ASCII art for diagrams
43+
- Describe flows in long text paragraphs
44+
- Include large blocks of code to explain architecture
45+
46+
### Example:
47+
```markdown
48+
## Architecture
49+
```mermaid
50+
classDiagram
51+
class TelemetryCollector {
52+
+Record(event: TelemetryEvent)
53+
+Flush()
54+
}
55+
class TelemetryExporter {
56+
+Export(events: List~Event~)
57+
}
58+
TelemetryCollector --> TelemetryExporter
59+
```
60+
```
61+
62+
---
63+
64+
## 2. Focus on Interfaces and Contracts
65+
66+
### ✅ DO:
67+
- Document **public APIs** and **interfaces**
68+
- Show **contracts between components**
69+
- Specify **input/output** parameters
70+
- Define **error handling contracts**
71+
- Document **async/await patterns** where applicable
72+
73+
### ❌ DON'T:
74+
- Include detailed implementation code
75+
- Show private method implementations
76+
- Include complete class implementations
77+
78+
### Example:
79+
```markdown
80+
## ITelemetryCollector Interface
81+
82+
```csharp
83+
public interface ITelemetryCollector
84+
{
85+
// Records a telemetry event asynchronously
86+
Task RecordAsync(TelemetryEvent event, CancellationToken ct);
87+
88+
// Flushes pending events
89+
Task FlushAsync(CancellationToken ct);
90+
}
91+
```
92+
93+
**Contract:**
94+
- RecordAsync: Must be non-blocking, returns immediately
95+
- FlushAsync: Waits for all pending events to export
96+
- Both methods must never throw exceptions to caller
97+
```
98+
99+
---
100+
101+
## 3. Remove Implementation Details
102+
103+
### ✅ DO:
104+
- Focus on **what** the system does
105+
- Explain **why** design decisions were made
106+
- Document **integration points**
107+
- Describe **configuration options**
108+
109+
### ❌ DON'T:
110+
- Include internal implementation details
111+
- Show vendor-specific backend implementations
112+
- Document internal database schemas (unless part of public contract)
113+
- Include proprietary or confidential information
114+
- **Include customer names, customer data, or any personally identifiable information (PII)**
115+
116+
---
117+
118+
## 4. Simplify Code Examples
119+
120+
### ✅ DO:
121+
- Use **minimal code snippets** to illustrate concepts
122+
- Show only **signature changes** to existing APIs
123+
- Replace code with **diagrams** where possible
124+
- Use **pseudocode** for complex flows
125+
126+
### ❌ DON'T:
127+
- Include complete class implementations
128+
- Show detailed algorithm implementations
129+
- Copy-paste large code blocks
130+
131+
### Example:
132+
```markdown
133+
## DatabricksConnection Changes
134+
135+
**Modified Methods:**
136+
```csharp
137+
// Add telemetry initialization
138+
public override async Task OpenAsync(CancellationToken ct)
139+
{
140+
// ... existing code ...
141+
await InitializeTelemetryAsync(ct); // NEW
142+
}
143+
```
144+
145+
**New Fields:**
146+
- `_telemetryCollector`: Optional collector instance
147+
- `_telemetryConfig`: Configuration from connection string
148+
```
149+
150+
---
151+
152+
## 5. Simplify Test Sections
153+
154+
### ✅ DO:
155+
- List **test case names** with brief descriptions
156+
- Group tests by **category** (unit, integration, performance)
157+
- Document **test strategy** and coverage goals
158+
- Include **edge cases** to be tested
159+
160+
### ❌ DON'T:
161+
- Include complete test code implementations
162+
- Show detailed assertion logic
163+
- Copy test method bodies
164+
165+
### Example:
166+
```markdown
167+
## Test Strategy
168+
169+
### Unit Tests
170+
- `TelemetryCollector_RecordEvent_AddsToQueue`
171+
- `TelemetryCollector_Flush_ExportsAllEvents`
172+
- `CircuitBreaker_OpensAfter_ConsecutiveFailures`
173+
174+
### Integration Tests
175+
- `Telemetry_EndToEnd_ConnectionToExport`
176+
- `Telemetry_WithFeatureFlag_RespectsServerSide`
177+
```
178+
179+
---
180+
181+
## 6. Consider Existing Infrastructure
182+
183+
### ✅ DO:
184+
- **Research existing solutions** before designing new ones
185+
- Document how your design **integrates with existing systems**
186+
- Explain why existing solutions are **insufficient** (if creating new)
187+
- **Reuse components** where possible
188+
189+
### ❌ DON'T:
190+
- Reinvent the wheel without justification
191+
- Ignore existing patterns in the codebase
192+
- Create parallel systems without explaining why
193+
194+
### Example:
195+
```markdown
196+
## Alternatives Considered
197+
198+
### Option 1: Extend Existing ActivityTrace Framework (PR #3315)
199+
**Pros:** Reuses existing infrastructure, familiar patterns
200+
**Cons:** ActivityTrace is designed for tracing, not metrics aggregation
201+
202+
### Option 2: New Telemetry System (Chosen)
203+
**Rationale:** Requires aggregation across statements, batching, and different export format than traces
204+
```
205+
206+
---
207+
208+
## 7. Address Concurrency and Async Patterns
209+
210+
### ✅ DO:
211+
- Clearly mark **async operations** in interfaces
212+
- Document **thread-safety** guarantees
213+
- Explain **blocking vs non-blocking** operations
214+
- Show **cancellation token** usage
215+
216+
### ❌ DON'T:
217+
- Mix sync and async without explanation
218+
- Leave thread-safety unspecified
219+
- Ignore backpressure and resource exhaustion scenarios
220+
221+
### Example:
222+
```markdown
223+
## Concurrency Model
224+
225+
### Thread Safety
226+
- `TelemetryCollector.RecordAsync()`: Thread-safe, non-blocking
227+
- `TelemetryExporter.ExportAsync()`: Called from background thread only
228+
229+
### Async Operations
230+
All telemetry operations are async to avoid blocking driver operations:
231+
```mermaid
232+
sequenceDiagram
233+
Driver->>+Collector: RecordAsync(event)
234+
Collector->>Queue: Enqueue(event)
235+
Collector-->>-Driver: Task completed (non-blocking)
236+
Collector->>+Exporter: ExportAsync(batch)
237+
Exporter-->>-Collector: Task completed
238+
```
239+
```
240+
241+
---
242+
243+
## 8. Document Edge Cases and Failure Modes
244+
245+
### ✅ DO:
246+
- Explain what happens during **failures**
247+
- Document **circuit breaker** or retry logic
248+
- Address **data loss** scenarios
249+
- Show how **duplicate events** are handled
250+
251+
### ❌ DON'T:
252+
- Only show happy path
253+
- Ignore error scenarios
254+
- Leave failure behavior undefined
255+
256+
### Example:
257+
```markdown
258+
## Error Handling
259+
260+
### Circuit Breaker Behavior
261+
When export fails 5 consecutive times:
262+
1. Circuit opens, drops new events (avoids memory exhaustion)
263+
2. Sends circuit breaker event to server
264+
3. Attempts recovery after 60s
265+
266+
### Duplicate Handling
267+
If same statement reported multiple times:
268+
- Backend merges by `statement_id`
269+
- Uses latest timestamp for each metric type
270+
```
271+
272+
---
273+
274+
## 9. Include Configuration Options
275+
276+
### ✅ DO:
277+
- Document **all configuration parameters**
278+
- Show **default values** and acceptable ranges
279+
- Explain **opt-out mechanisms**
280+
- Document **feature flags** and server-side controls
281+
282+
### Example:
283+
```markdown
284+
## Configuration
285+
286+
| Parameter | Type | Default | Description |
287+
|-----------|------|---------|-------------|
288+
| `telemetry.enabled` | bool | `true` | Enable/disable telemetry |
289+
| `telemetry.batch_size` | int | `100` | Events per batch (1-1000) |
290+
| `telemetry.flush_interval_ms` | int | `5000` | Flush interval (1000-30000) |
291+
292+
**Feature Flag:** `spark.databricks.adbc.telemetry.enabled` (server-side)
293+
```
294+
295+
---
296+
297+
## 10. Keep Sections Focused
298+
299+
### ✅ DO:
300+
- Include only **necessary sections**
301+
- Each section should answer **specific questions**
302+
- Remove sections that don't add value
303+
304+
### ❌ DON'T:
305+
- Include boilerplate sections "just because"
306+
- Add sections that duplicate information
307+
- Keep sections that reviewers flag as unnecessary
308+
309+
---
310+
311+
## Summary Checklist
312+
313+
Before submitting your design doc:
314+
315+
- [ ] All diagrams are in **mermaid format**
316+
- [ ] Focus is on **interfaces, not implementations**
317+
- [ ] **Internal details** removed
318+
- [ ] Code examples are **minimal and relevant**
319+
- [ ] Test sections show **case names, not code**
320+
- [ ] **Existing infrastructure** considered and discussed
321+
- [ ] **Async/thread-safety** clearly documented
322+
- [ ] **Edge cases and failures** addressed
323+
- [ ] **Configuration options** fully documented
324+
- [ ] All sections are **necessary and focused**
325+
- [ ] Design explains **why**, not just **what**
326+

0 commit comments

Comments
 (0)