| 1 | +# OpenTelemetry Guide |
| 2 | + |
| 3 | +OpenTelemetry integration for distributed tracing in items-service. |
| 4 | + |
| 5 | +## What is OpenTelemetry? |
| 6 | + |
| 7 | +**OpenTelemetry (OTel)** is an open-source observability framework. This guide uses it to create **traces**: records of requests as they flow through your system. |
| 8 | + |
| 9 | +**Key Concepts:** |
| 10 | + |
| 11 | +- **Trace**: The complete journey of a request through your system |
| 12 | +- **Span**: A single operation within a trace (e.g., HTTP request, database query) |
| 13 | +- **Attributes**: Key-value pairs that provide context (e.g., HTTP method, status code) |
| 14 | +- **Events**: Timestamped logs within a span |
| 15 | +- **Exceptions**: Errors raised during a span, recorded on that span as events |
| 16 | +## Use Cases |
| 17 | + |
| 18 | +**Find Performance Bottlenecks:** |
| 19 | +- Identify slow database queries |
| 20 | +- Measure endpoint response times |
| 21 | +- Pinpoint time-consuming operations |
| 22 | + |
| 23 | +**Debug Issues:** |
| 24 | +- See the sequence of operations leading up to an error |
| 25 | +- View exception details and stack traces |
| 26 | +- Understand request context |
| 27 | + |
| 28 | +**Analyze Behavior:** |
| 29 | +- Count database queries per endpoint |
| 30 | +- Track typical response times |
| 31 | +- Identify frequently called operations |
| 32 | + |
| 33 | +## Generate Test Traces |
| 34 | + |
| 35 | +```bash |
| 36 | +# Health check |
| 37 | +curl http://localhost:8081/v1/health |
| 38 | + |
| 39 | +# List items |
| 40 | +curl http://localhost:8081/v1/items |
| 41 | + |
| 42 | +# Create an item |
| 43 | +curl -X POST http://localhost:8081/v1/items \ |
| 44 | + -H "Content-Type: application/json" \ |
| 45 | + -d '{"name":"test-item"}' |
| 46 | +``` |
| 47 | + |
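A single request produces a single trace, which is rarely enough to compare timings. The sketch below repeats the three requests above in a loop. The function name `generate_traffic` and the `BASE_URL` and `CURL` variables are illustrative assumptions, not part of the service; setting `CURL=echo` gives a dry run that only prints the commands.

```bash
# Hypothetical helper: send several rounds of the requests shown above.
# BASE_URL and CURL are assumptions; override them to match your setup.
BASE_URL="${BASE_URL:-http://localhost:8081}"
CURL="${CURL:-curl}"

generate_traffic() {
  local rounds
  local i
  rounds="${1:-10}"
  i=1
  while [ "$i" -le "$rounds" ]; do
    # -s hides progress output; -o /dev/null discards the response body
    $CURL -s -o /dev/null "$BASE_URL/v1/health"
    $CURL -s -o /dev/null "$BASE_URL/v1/items"
    $CURL -s -o /dev/null -X POST "$BASE_URL/v1/items" \
      -H "Content-Type: application/json" \
      -d "{\"name\":\"test-item-$i\"}"
    i=$((i + 1))
  done
}

# Example: generate_traffic 20
```

Each round should yield three traces, one per endpoint, provided tracing is enabled.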
| 48 | +## What's Captured |
| 49 | + |
| 50 | +**HTTP Requests:** |
| 51 | +- Operation: `GET /v1/items`, `POST /v1/items` |
| 52 | +- Attributes: method, URL, status code, host |
| 53 | + |
| 54 | +**Database Operations:** |
| 55 | +- SQL queries and duration |
| 56 | +- Connection details |
| 57 | +- Query parameters (sanitized) |
| 58 | + |
| 59 | +**Custom Attributes:** |
| 60 | +- `items.count`: Number of items returned |
| 61 | +- `item.id`: Item ID |
| 62 | +- `item.name`: Item name |
| 63 | + |
| 64 | +**Errors:** |
| 65 | +- Exception type and message |
| 66 | +- Stack trace |
| 67 | +- Error context |
| 68 | + |
| 69 | +## Common Workflows |
| 70 | + |
| 71 | +**Debug Slow Endpoint:** |
| 72 | +1. Search for endpoint traces |
| 73 | +2. Sort by duration (longest first) |
| 74 | +3. Check timeline for bottlenecks |
| 75 | + |
| 76 | +**Investigate Errors:** |
| 77 | +1. Filter by `error=true` tag |
| 78 | +2. View exception details |
| 79 | +3. Check request context |
| 80 | + |
| 81 | +**Analyze Database Usage:** |
| 82 | +1. Open any trace |
| 83 | +2. Count database spans |
| 84 | +3. Look for N+1 query patterns |
| 85 | + |
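Counting database spans by eye gets tedious for long traces. As a rough aid, the sketch below reads SQL statements from standard input, one per line, and prints those that repeat at least a given number of times. It assumes you have copied the statements out of a single trace yourself (for example from each database span's statement attribute) and that parameters are sanitized, so repeated queries have identical text. The function name `repeated_queries` is hypothetical.

```bash
# Hypothetical helper: list SQL statements that repeat within one trace.
# Input: one statement per line on stdin. Argument: minimum repeat count (default 3).
repeated_queries() {
  local min
  min="${1:-3}"
  # uniq -c prefixes each distinct line with its count; awk keeps counts >= min
  sort | uniq -c | sort -rn | awk -v min="$min" '$1 >= min'
}

# Example: repeated_queries 5 < statements.txt
```

A statement that repeats once per item returned by an earlier query is the usual signature of an N+1 pattern.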
| 86 | +## Filtering Traces |
| 87 | + |
| 88 | +**By Duration:** `minDuration=100ms maxDuration=1s` |
| 89 | + |
| 90 | +**By Tags:** `http.status_code=500`, `http.method=POST` |
| 91 | + |
| 92 | +**By Time:** Use time picker for specific periods |
| 93 | + |
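The duration filter can also be bookmarked as a link. The sketch below builds a Jaeger UI search URL from a service name and duration bounds. It assumes the UI is served at `http://localhost:16686` and accepts `service`, `minDuration`, `maxDuration`, `lookback`, and `limit` as query parameters; compare the result with the URL your own Jaeger version produces after a search. The function name and the `JAEGER_UI` variable are hypothetical.

```bash
# Hypothetical helper: build a Jaeger UI search URL for a duration range.
# JAEGER_UI and the query parameter names are assumptions; verify them in your UI.
JAEGER_UI="${JAEGER_UI:-http://localhost:16686}"

jaeger_search_url() {
  local service
  local min
  local max
  service="$1"
  min="$2"
  max="$3"
  printf '%s/search?service=%s&minDuration=%s&maxDuration=%s&lookback=1h&limit=20\n' \
    "$JAEGER_UI" "$service" "$min" "$max"
}

# Example: jaeger_search_url items-service 100ms 1s
```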
| 94 | +## Health Indicators |
| 95 | + |
| 96 | +**Good:** |
| 97 | +- Requests < 100ms |
| 98 | +- Few errors |
| 99 | +- Consistent timing |
| 100 | +- Minimal DB queries |
| 101 | + |
| 102 | +**Warning:** |
| 103 | +- High variance in response times |
| 104 | +- N+1 query problems |
| 105 | +- Frequent errors |
| 106 | + |
| 107 | +**Critical:** |
| 108 | +- Timeouts |
| 109 | +- Cascading failures |
| 110 | +- Slow database queries (> 1s) |
| 111 | + |
| 112 | +## 🔧 Troubleshooting |
| 113 | + |
| 114 | +### No Traces Appearing |
| 115 | + |
| 116 | +1. **Check if OpenTelemetry is enabled:** |
| 117 | + ```bash |
| 118 | + kubectl get deployment items-service -o yaml | grep OTEL_ENABLED |
| 119 | + ``` |
| 120 | + Should show `value: "true"` |
| 121 | + |
| 122 | +2. **Check items-service logs:** |
| 123 | + Look for "📊 OpenTelemetry initialized" message |
| 124 | + |
| 125 | +3. **Check Jaeger is running:** |
| 126 | + ```bash |
| 127 | + kubectl get pods | grep jaeger |
| 128 | + ``` |
| 129 | + |
| 130 | +4. **Check connectivity** (any HTTP response, even a 404, shows the OTLP port is reachable): |
| 131 | + ```bash |
| 132 | + kubectl exec -it <items-service-pod> -- curl http://jaeger:4318 |
| 133 | + ``` |
| 134 | + |
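A stricter connectivity check is to send an empty export to the OTLP/HTTP traces path and inspect the status code. The sketch below assumes the endpoint `http://jaeger:4318` used above, that it runs somewhere the `jaeger` hostname resolves (for example inside the items-service pod), and that `curl` is available there. `check_otlp`, `OTLP_ENDPOINT`, and `CURL` are hypothetical names.

```bash
# Hypothetical helper: POST an empty OTLP/HTTP trace export and report the result.
# /v1/traces is the standard OTLP/HTTP path for traces; the host is an assumption.
OTLP_ENDPOINT="${OTLP_ENDPOINT:-http://jaeger:4318}"
CURL="${CURL:-curl}"

check_otlp() {
  local code
  # -w '%{http_code}' prints only the status code; curl prints 000 if it cannot connect
  code="$($CURL -s -o /dev/null -w '%{http_code}' -X POST "$OTLP_ENDPOINT/v1/traces" \
    -H 'Content-Type: application/json' -d '{"resourceSpans":[]}')"
  case "$code" in
    2??) echo "OTLP endpoint reachable (HTTP $code)" ;;
    000|"") echo "OTLP endpoint unreachable"; return 1 ;;
    *) echo "OTLP endpoint responded with HTTP $code"; return 1 ;;
  esac
}

# Example: check_otlp
```

A 2xx status suggests the collector accepts exports on that port, so missing traces are more likely caused by the service configuration than by the network.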
| 135 | +### Traces Missing Information |
| 136 | + |
| 137 | +1. **Check OTEL_LOG_LEVEL:** |
| 138 | + Set to "debug" to see detailed logs: |
| 139 | + ```yaml |
| 140 | + - name: OTEL_LOG_LEVEL |
| 141 | + value: "debug" |
| 142 | + ``` |
| 143 | + |
| 144 | +2. **Check auto-instrumentation:** |
| 145 | + Auto-instrumentation does not cover every library. |
| 146 | + Uncovered libraries need manual instrumentation (custom spans). |
| 147 | + |
| 148 | +### Performance Impact |
| 149 | + |
| 150 | +OpenTelemetry has minimal overhead: |
| 151 | +- ~1-5ms per request |
| 152 | +- Sampling can reduce overhead further |
| 153 | +- Can be disabled in production if needed |
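To illustrate the sampling option, the sketch below sets the standard OpenTelemetry sampler environment variables on the deployment with `kubectl set env`. It assumes the service's SDK setup reads `OTEL_TRACES_SAMPLER` and `OTEL_TRACES_SAMPLER_ARG` rather than configuring a sampler in code, which this guide does not confirm. `set_sampling_ratio` and `KUBECTL` are hypothetical names; `KUBECTL=echo` prints the command instead of running it.

```bash
# Hypothetical helper: sample a fraction of traces via standard OTel env vars.
# Assumes the SDK in items-service honors these variables.
KUBECTL="${KUBECTL:-kubectl}"

set_sampling_ratio() {
  local ratio
  ratio="${1:-0.1}"
  # parentbased_traceidratio keeps the parent's decision and samples root spans by ratio
  $KUBECTL set env deployment/items-service \
    OTEL_TRACES_SAMPLER=parentbased_traceidratio \
    OTEL_TRACES_SAMPLER_ARG="$ratio"
}

# Example: set_sampling_ratio 0.25   # keep roughly 25% of traces
```

Changing the environment triggers a rollout of the deployment, so the new ratio applies once the new pods are running.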
| 154 | + |
| 155 | +## 📚 Additional Resources |
| 156 | + |
| 157 | +- [OpenTelemetry Documentation](https://opentelemetry.io/docs/) |
| 158 | +- [Jaeger Documentation](https://www.jaegertracing.io/docs/) |
| 159 | +- [OpenTelemetry JavaScript SDK](https://opentelemetry.io/docs/instrumentation/js/) |
| 160 | +- [Distributed Tracing Best Practices](https://opentelemetry.io/docs/concepts/observability-primer/) |
| 161 | + |
| 162 | +## 🎯 Next Steps |
| 163 | + |
| 164 | +1. **Run the test script** to validate your setup: |
| 165 | + ```bash |
| 166 | + ./scripts/test-otel.sh |
| 167 | + ``` |
| 168 | + |
| 169 | +2. **Explore Jaeger UI** at http://localhost:16686 |
| 170 | + |
| 171 | +3. **Make some requests** and watch the traces appear |
| 172 | + |
| 173 | +4. **Try the use cases** above to get familiar with the UI |
| 174 | + |
| 175 | +5. **Consider adding custom spans** for important business operations |
| 176 | + |
| 177 | +6. **Set up alerts** based on trace data (requires additional setup) |
| 178 | + |
| 179 | + |