| 1 | +--- |
| 2 | +description: Guidelines for creating OpenTelemetry metrics to avoid cardinality issues |
| 3 | +globs: |
| 4 | + - "**/*.ts" |
| 5 | +--- |
| 6 | + |
| 7 | +# OpenTelemetry Metrics Guidelines |
| 8 | + |
| 9 | +When creating or editing OpenTelemetry (OTel) metrics (counters, histograms, gauges), always ensure metric attributes have **low cardinality**. |
| 10 | + |
| 11 | +## What is Cardinality? |
| 12 | + |
| 13 | +Cardinality refers to the number of unique values an attribute can have. Each unique combination of attribute values creates a new time series, which consumes memory and storage in your metrics backend. |
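To make the "unique combination" point concrete, here is a minimal sketch that estimates the worst-case number of time series for a single metric as the product of its attributes' cardinalities. The helper `estimateSeriesCount` and the figure of 10,000 environments are assumptions for illustration, not part of the codebase.

```typescript
// Hypothetical helper: worst-case number of time series for one metric,
// computed as the product of each attribute's distinct-value count.
function estimateSeriesCount(attributeCardinalities: Record<string, number>): number {
  return Object.values(attributeCardinalities).reduce((product, n) => product * n, 1);
}

// Low cardinality: 4 environment types x 2 boolean values = 8 series.
const bounded = estimateSeriesCount({ environment_type: 4, streaming: 2 });

// High cardinality: adding a per-environment ID (assumed 10,000 environments)
// multiplies the total to 80,000 series.
const unbounded = estimateSeriesCount({ environment_type: 4, streaming: 2, envId: 10_000 });

console.log(bounded, unbounded);
```

A single unbounded attribute dominates the product, which is why one stray ID is enough to cause trouble.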
| 14 | + |
| 15 | +## Rules |
| 16 | + |
| 17 | +### DO use low-cardinality attributes: |
| 18 | +- **Enums**: `environment_type` (PRODUCTION, STAGING, DEVELOPMENT, PREVIEW) |
| 19 | +- **Booleans**: `hasFailures`, `streaming`, `success` |
| 20 | +- **Bounded error codes**: A finite, controlled set of error types |
| 21 | +- **Shard IDs**: When sharding is bounded (e.g., 0-15) |
| 22 | + |
| 23 | +### DO NOT use high-cardinality attributes: |
| 24 | +- **UUIDs/IDs**: `envId`, `userId`, `runId`, `projectId`, `organizationId` |
| 25 | +- **Unbounded integers**: `itemCount`, `batchSize`, `retryCount` |
| 26 | +- **Timestamps**: `createdAt`, `startTime` |
| 27 | +- **Free-form strings**: `errorMessage`, `taskName`, `queueName` |
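When a high-cardinality value still carries useful signal, one option is to collapse it into a bounded set before attaching it as an attribute. The sketch below assumes hypothetical helpers `bucketCount` and `classifyError` and made-up bucket boundaries and error codes; adjust them to whatever finite set fits the metric.

```typescript
// Hypothetical bounded label sets; the names are illustrative, not from the codebase.
type SizeBucket = "0" | "1-10" | "11-100" | "101+";
type ErrorCode = "TIMEOUT" | "RATE_LIMITED" | "UNKNOWN";

// Collapse an unbounded integer into one of four fixed buckets.
function bucketCount(count: number): SizeBucket {
  if (count <= 0) return "0";
  if (count <= 10) return "1-10";
  if (count <= 100) return "11-100";
  return "101+";
}

// Map a free-form error message onto a finite set of codes.
function classifyError(message: string): ErrorCode {
  const text = message.toLowerCase();
  if (text.includes("timeout") || text.includes("timed out")) return "TIMEOUT";
  if (text.includes("rate limit")) return "RATE_LIMITED";
  return "UNKNOWN";
}

// Attributes built this way have at most 4 x 3 = 12 combinations.
const attributes = {
  batch_size_bucket: bucketCount(37),
  error_code: classifyError("Request timed out after 30s"),
};
```

The exact count or message is usually better recorded as a histogram measurement or in a log or span, where uniqueness does not create new time series.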
| 28 | + |
| 29 | +## Example |
| 30 | + |
| 31 | +```typescript |
| 32 | +// BAD - High cardinality |
| 33 | +this.counter.add(1, { |
| 34 | + envId: options.environmentId, // UUID - unbounded |
| 35 | + itemCount: options.runCount, // Integer - unbounded |
| 36 | +}); |
| 37 | + |
| 38 | +// GOOD - Low cardinality |
| 39 | +this.counter.add(1, { |
| 40 | + environment_type: options.environmentType, // Enum - 4 values |
| 41 | + streaming: true, // Boolean - 2 values |
| 42 | +}); |
| 43 | +``` |
| 44 | + |
| 45 | +## Prometheus Metric Naming |
| 46 | + |
| 47 | +When metrics are exported via OTLP to Prometheus, the exporter automatically appends unit and type suffixes to metric names: |
| 48 | + |
| 49 | +| OTel Metric Name | Unit (Instrument) | Prometheus Name | |
| 50 | +|------------------|-------------------|-----------------| |
| 51 | +| `my_duration_ms` | `ms` | `my_duration_ms_milliseconds` | |
| 52 | +| `my_counter` | none (counter) | `my_counter_total` | |
| 53 | +| `items_inserted` | `inserts` (counter) | `items_inserted_inserts_total` | |
| 54 | +| `batch_size` | `items` (histogram) | `batch_size_items_bucket` | |
| 55 | + |
| 56 | +Keep this in mind when writing Grafana dashboards or Prometheus queries: the metric names in Prometheus differ from the names defined in code. |
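As a rough aid when writing queries, the renaming can be approximated in code. The sketch below is a simplified model that reproduces the rows of the table above; it is not the exporter's actual algorithm, and the function names, the `InstrumentKind` type, and the partial `UNIT_NAMES` map are all assumptions made for illustration.

```typescript
type InstrumentKind = "counter" | "histogram" | "gauge";

// Partial unit mapping, assumed for illustration; real exporters cover more units.
const UNIT_NAMES: Record<string, string> = {
  ms: "milliseconds",
  s: "seconds",
  By: "bytes",
};

// Approximate the base Prometheus name for an OTel metric.
function prometheusBaseName(name: string, kind: InstrumentKind, unit?: string): string {
  let result = name;
  if (unit) {
    // Expand known abbreviations; otherwise use the unit string as-is.
    const suffix = UNIT_NAMES[unit] ?? unit;
    if (!result.endsWith(`_${suffix}`)) result += `_${suffix}`;
  }
  // Counters additionally receive a `_total` suffix.
  if (kind === "counter" && !result.endsWith("_total")) result += "_total";
  return result;
}

// Histograms are exposed as three series on top of the base name.
function prometheusSeriesNames(name: string, kind: InstrumentKind, unit?: string): string[] {
  const base = prometheusBaseName(name, kind, unit);
  return kind === "histogram" ? [`${base}_bucket`, `${base}_sum`, `${base}_count`] : [base];
}
```

Treat the result as a starting point for a query and confirm the real name in the Prometheus UI, since exporter versions and configuration can change the suffix rules.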
| 57 | + |
| 58 | +## Reference |
| 59 | + |
| 60 | +See the schedule engine (`internal-packages/schedule-engine/src/engine/index.ts`) for a good example of low-cardinality metric attributes. |
| 61 | + |
| 62 | +High-cardinality metrics can cause: |
| 63 | +- Memory bloat in metrics backends (Axiom, Prometheus, etc.) |
| 64 | +- Slow queries and dashboard timeouts |
| 65 | +- Increased costs (many backends charge per time series) |
| 66 | +- Potential data loss or crashes at scale |