|
| 1 | +--- |
| 2 | +description: You are a senior developer and in this session, you need create a detailed design for a feature. |
| 3 | +--- |
| 4 | + |
| 5 | +### User Input |
| 6 | +```text |
| 7 | +$ARGUMENTS |
| 8 | +``` |
| 9 | + |
| 10 | +You **MUST** consider the user input before proceeding. If empty, ask the user for what to work on. |
| 11 | + |
| 12 | +# Collect requirement from user |
| 13 | + |
| 14 | +always collect enough information from user, this might be one or more of the following |
| 15 | + |
| 16 | +## an existing design doc |
| 17 | +in this case, we are working an existing doc either addressing review comments or change some of the designs. |
| 18 | + |
| 19 | +## an PM requirement document |
| 20 | +you will need review the doc and reconcile with existing design doc if there is one to make sure everything is in sync, if not, working on the change of design |
| 21 | + |
| 22 | +## conversation with user |
| 23 | +always good to ask a lot of clarification questions and probe users on overall requirement and corner cases |
| 24 | + |
| 25 | + |
| 26 | + |
| 27 | +# Start the design by following below best practises |
| 28 | + |
| 29 | +## Overview |
| 30 | +This guide outlines best practices for writing technical design documents, based on lessons learned from code reviews. |
| 31 | + |
| 32 | +--- |
| 33 | +## 1. Use Visual Diagrams Over Text |
| 34 | + |
| 35 | +### ✅ DO: |
| 36 | +- Use **mermaid diagrams** for all architectural illustrations |
| 37 | +- Include **class diagrams** to show relationships between components |
| 38 | +- Use **sequence diagrams** to illustrate data flow and interactions |
| 39 | +- Render diagrams inline in markdown using mermaid code blocks |
| 40 | + |
| 41 | +### ❌ DON'T: |
| 42 | +- Use ASCII art for diagrams |
| 43 | +- Describe flows in long text paragraphs |
| 44 | +- Include large blocks of code to explain architecture |
| 45 | + |
| 46 | +### Example: |
| 47 | +```markdown |
| 48 | +## Architecture |
| 49 | +```mermaid |
| 50 | +classDiagram |
| 51 | + class TelemetryCollector { |
| 52 | + +Record(event: TelemetryEvent) |
| 53 | + +Flush() |
| 54 | + } |
| 55 | + class TelemetryExporter { |
| 56 | + +Export(events: List~Event~) |
| 57 | + } |
| 58 | + TelemetryCollector --> TelemetryExporter |
| 59 | +``` |
| 60 | +``` |
| 61 | +
|
| 62 | +--- |
| 63 | +
|
| 64 | +## 2. Focus on Interfaces and Contracts |
| 65 | +
|
| 66 | +### ✅ DO: |
| 67 | +- Document **public APIs** and **interfaces** |
| 68 | +- Show **contracts between components** |
| 69 | +- Specify **input/output** parameters |
| 70 | +- Define **error handling contracts** |
| 71 | +- Document **async/await patterns** where applicable |
| 72 | +
|
| 73 | +### ❌ DON'T: |
| 74 | +- Include detailed implementation code |
| 75 | +- Show private method implementations |
| 76 | +- Include complete class implementations |
| 77 | +
|
| 78 | +### Example: |
| 79 | +```markdown |
| 80 | +## ITelemetryCollector Interface |
| 81 | +
|
| 82 | +```csharp |
| 83 | +public interface ITelemetryCollector |
| 84 | +{ |
| 85 | + // Records a telemetry event asynchronously |
| 86 | + Task RecordAsync(TelemetryEvent event, CancellationToken ct); |
| 87 | +
|
| 88 | + // Flushes pending events |
| 89 | + Task FlushAsync(CancellationToken ct); |
| 90 | +} |
| 91 | +``` |
| 92 | + |
| 93 | +**Contract:** |
| 94 | +- RecordAsync: Must be non-blocking, returns immediately |
| 95 | +- FlushAsync: Waits for all pending events to export |
| 96 | +- Both methods must never throw exceptions to caller |
| 97 | +``` |
| 98 | +
|
| 99 | +--- |
| 100 | +
|
| 101 | +## 3. Remove Implementation Details |
| 102 | +
|
| 103 | +### ✅ DO: |
| 104 | +- Focus on **what** the system does |
| 105 | +- Explain **why** design decisions were made |
| 106 | +- Document **integration points** |
| 107 | +- Describe **configuration options** |
| 108 | +
|
| 109 | +### ❌ DON'T: |
| 110 | +- Include internal implementation details |
| 111 | +- Show vendor-specific backend implementations |
| 112 | +- Document internal database schemas (unless part of public contract) |
| 113 | +- Include proprietary or confidential information |
| 114 | +- **Include customer names, customer data, or any personally identifiable information (PII)** |
| 115 | +
|
| 116 | +--- |
| 117 | +
|
| 118 | +## 4. Simplify Code Examples |
| 119 | +
|
| 120 | +### ✅ DO: |
| 121 | +- Use **minimal code snippets** to illustrate concepts |
| 122 | +- Show only **signature changes** to existing APIs |
| 123 | +- Replace code with **diagrams** where possible |
| 124 | +- Use **pseudocode** for complex flows |
| 125 | +
|
| 126 | +### ❌ DON'T: |
| 127 | +- Include complete class implementations |
| 128 | +- Show detailed algorithm implementations |
| 129 | +- Copy-paste large code blocks |
| 130 | +
|
| 131 | +### Example: |
| 132 | +```markdown |
| 133 | +## DatabricksConnection Changes |
| 134 | +
|
| 135 | +**Modified Methods:** |
| 136 | +```csharp |
| 137 | +// Add telemetry initialization |
| 138 | +public override async Task OpenAsync(CancellationToken ct) |
| 139 | +{ |
| 140 | + // ... existing code ... |
| 141 | + await InitializeTelemetryAsync(ct); // NEW |
| 142 | +} |
| 143 | +``` |
| 144 | + |
| 145 | +**New Fields:** |
| 146 | +- `_telemetryCollector`: Optional collector instance |
| 147 | +- `_telemetryConfig`: Configuration from connection string |
| 148 | +``` |
| 149 | +
|
| 150 | +--- |
| 151 | +
|
| 152 | +## 5. Simplify Test Sections |
| 153 | +
|
| 154 | +### ✅ DO: |
| 155 | +- List **test case names** with brief descriptions |
| 156 | +- Group tests by **category** (unit, integration, performance) |
| 157 | +- Document **test strategy** and coverage goals |
| 158 | +- Include **edge cases** to be tested |
| 159 | +
|
| 160 | +### ❌ DON'T: |
| 161 | +- Include complete test code implementations |
| 162 | +- Show detailed assertion logic |
| 163 | +- Copy test method bodies |
| 164 | +
|
| 165 | +### Example: |
| 166 | +```markdown |
| 167 | +## Test Strategy |
| 168 | +
|
| 169 | +### Unit Tests |
| 170 | +- `TelemetryCollector_RecordEvent_AddsToQueue` |
| 171 | +- `TelemetryCollector_Flush_ExportsAllEvents` |
| 172 | +- `CircuitBreaker_OpensAfter_ConsecutiveFailures` |
| 173 | +
|
| 174 | +### Integration Tests |
| 175 | +- `Telemetry_EndToEnd_ConnectionToExport` |
| 176 | +- `Telemetry_WithFeatureFlag_RespectsServerSide` |
| 177 | +``` |
| 178 | + |
| 179 | +--- |
| 180 | + |
| 181 | +## 6. Consider Existing Infrastructure |
| 182 | + |
| 183 | +### ✅ DO: |
| 184 | +- **Research existing solutions** before designing new ones |
| 185 | +- Document how your design **integrates with existing systems** |
| 186 | +- Explain why existing solutions are **insufficient** (if creating new) |
| 187 | +- **Reuse components** where possible |
| 188 | + |
| 189 | +### ❌ DON'T: |
| 190 | +- Reinvent the wheel without justification |
| 191 | +- Ignore existing patterns in the codebase |
| 192 | +- Create parallel systems without explaining why |
| 193 | + |
| 194 | +### Example: |
| 195 | +```markdown |
| 196 | +## Alternatives Considered |
| 197 | + |
| 198 | +### Option 1: Extend Existing ActivityTrace Framework (PR #3315) |
| 199 | +**Pros:** Reuses existing infrastructure, familiar patterns |
| 200 | +**Cons:** ActivityTrace is designed for tracing, not metrics aggregation |
| 201 | + |
| 202 | +### Option 2: New Telemetry System (Chosen) |
| 203 | +**Rationale:** Requires aggregation across statements, batching, and different export format than traces |
| 204 | +``` |
| 205 | + |
| 206 | +--- |
| 207 | + |
| 208 | +## 7. Address Concurrency and Async Patterns |
| 209 | + |
| 210 | +### ✅ DO: |
| 211 | +- Clearly mark **async operations** in interfaces |
| 212 | +- Document **thread-safety** guarantees |
| 213 | +- Explain **blocking vs non-blocking** operations |
| 214 | +- Show **cancellation token** usage |
| 215 | + |
| 216 | +### ❌ DON'T: |
| 217 | +- Mix sync and async without explanation |
| 218 | +- Leave thread-safety unspecified |
| 219 | +- Ignore backpressure and resource exhaustion scenarios |
| 220 | + |
| 221 | +### Example: |
| 222 | +```markdown |
| 223 | +## Concurrency Model |
| 224 | + |
| 225 | +### Thread Safety |
| 226 | +- `TelemetryCollector.RecordAsync()`: Thread-safe, non-blocking |
| 227 | +- `TelemetryExporter.ExportAsync()`: Called from background thread only |
| 228 | + |
| 229 | +### Async Operations |
| 230 | +All telemetry operations are async to avoid blocking driver operations: |
| 231 | +```mermaid |
| 232 | +sequenceDiagram |
| 233 | + Driver->>+Collector: RecordAsync(event) |
| 234 | + Collector->>Queue: Enqueue(event) |
| 235 | + Collector-->>-Driver: Task completed (non-blocking) |
| 236 | + Collector->>+Exporter: ExportAsync(batch) |
| 237 | + Exporter-->>-Collector: Task completed |
| 238 | +``` |
| 239 | +``` |
| 240 | +
|
| 241 | +--- |
| 242 | +
|
| 243 | +## 8. Document Edge Cases and Failure Modes |
| 244 | +
|
| 245 | +### ✅ DO: |
| 246 | +- Explain what happens during **failures** |
| 247 | +- Document **circuit breaker** or retry logic |
| 248 | +- Address **data loss** scenarios |
| 249 | +- Show how **duplicate events** are handled |
| 250 | +
|
| 251 | +### ❌ DON'T: |
| 252 | +- Only show happy path |
| 253 | +- Ignore error scenarios |
| 254 | +- Leave failure behavior undefined |
| 255 | +
|
| 256 | +### Example: |
| 257 | +```markdown |
| 258 | +## Error Handling |
| 259 | +
|
| 260 | +### Circuit Breaker Behavior |
| 261 | +When export fails 5 consecutive times: |
| 262 | +1. Circuit opens, drops new events (avoids memory exhaustion) |
| 263 | +2. Sends circuit breaker event to server |
| 264 | +3. Attempts recovery after 60s |
| 265 | +
|
| 266 | +### Duplicate Handling |
| 267 | +If same statement reported multiple times: |
| 268 | +- Backend merges by `statement_id` |
| 269 | +- Uses latest timestamp for each metric type |
| 270 | +``` |
| 271 | + |
| 272 | +--- |
| 273 | + |
| 274 | +## 9. Include Configuration Options |
| 275 | + |
| 276 | +### ✅ DO: |
| 277 | +- Document **all configuration parameters** |
| 278 | +- Show **default values** and acceptable ranges |
| 279 | +- Explain **opt-out mechanisms** |
| 280 | +- Document **feature flags** and server-side controls |
| 281 | + |
| 282 | +### Example: |
| 283 | +```markdown |
| 284 | +## Configuration |
| 285 | + |
| 286 | +| Parameter | Type | Default | Description | |
| 287 | +|-----------|------|---------|-------------| |
| 288 | +| `telemetry.enabled` | bool | `true` | Enable/disable telemetry | |
| 289 | +| `telemetry.batch_size` | int | `100` | Events per batch (1-1000) | |
| 290 | +| `telemetry.flush_interval_ms` | int | `5000` | Flush interval (1000-30000) | |
| 291 | + |
| 292 | +**Feature Flag:** `spark.databricks.adbc.telemetry.enabled` (server-side) |
| 293 | +``` |
| 294 | + |
| 295 | +--- |
| 296 | + |
| 297 | +## 10. Keep Sections Focused |
| 298 | + |
| 299 | +### ✅ DO: |
| 300 | +- Include only **necessary sections** |
| 301 | +- Each section should answer **specific questions** |
| 302 | +- Remove sections that don't add value |
| 303 | + |
| 304 | +### ❌ DON'T: |
| 305 | +- Include boilerplate sections "just because" |
| 306 | +- Add sections that duplicate information |
| 307 | +- Keep sections that reviewers flag as unnecessary |
| 308 | + |
| 309 | +--- |
| 310 | + |
| 311 | +## Summary Checklist |
| 312 | + |
| 313 | +Before submitting your design doc: |
| 314 | + |
| 315 | +- [ ] All diagrams are in **mermaid format** |
| 316 | +- [ ] Focus is on **interfaces, not implementations** |
| 317 | +- [ ] **Internal details** removed |
| 318 | +- [ ] Code examples are **minimal and relevant** |
| 319 | +- [ ] Test sections show **case names, not code** |
| 320 | +- [ ] **Existing infrastructure** considered and discussed |
| 321 | +- [ ] **Async/thread-safety** clearly documented |
| 322 | +- [ ] **Edge cases and failures** addressed |
| 323 | +- [ ] **Configuration options** fully documented |
| 324 | +- [ ] All sections are **necessary and focused** |
| 325 | +- [ ] Design explains **why**, not just **what** |
| 326 | + |
0 commit comments