Introduce OTel Collector mixin #1465
Status: Open
tpaschalis wants to merge 8 commits into master from add-otel-collector-mixin
Commits (8, all by tpaschalis):
990511c  Introduce OTel Collector mixin
785da33  Remove separate license
c48784d  Use parent Makefile instead
a893f1c  Remove unnecessary files
50135dc  Remove committed alerts and rules
430bb91  Add .lint file
fe6098c  Fix alerts
3f1a169  Add UID in another way
@@ -0,0 +1,7 @@
exclusions:
  template-job-rule:
    reason: Allows All value to be '.*' instead of '.*' to accommodate for non-K8S environments
  template-instance-rule:
    reason: Allows All value to be '.*' instead of '.*' to accommodate for non-K8S environments
  panel-datasource-rule:
    reason: The new Grafonnet promotes the use of datasources at the query level. This should probably end up in the linter as a valid option.
@@ -0,0 +1 @@
include ../Makefile_mixin
@@ -0,0 +1,115 @@
# opentelemetry-collector-mixin

Prometheus Monitoring Mixin for the OpenTelemetry Collector

This mixin contains a set of Prometheus alert rules and Grafana dashboards
based on the metrics exported by the OpenTelemetry Collector's [internal
telemetry](https://opentelemetry.io/docs/collector/internal-telemetry/).

To use it, you need to have `jsonnet` (any sufficiently modern version should
do, but ideally v0.20+) and `jb` installed.

If you have a working Go development environment, you can run the following to
get started:
```
go install github.com/google/go-jsonnet/cmd/jsonnet@latest
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
```

### Usage

First, install the dependencies by running the following command from the repo
root:
```
$ jb install
```

You can then build the Prometheus alert and recording rules by running:
```
$ make prometheus_alerts.yaml
$ make prometheus_rules.yaml
```

You can also render a JSON dashboard file for Grafana by running the following
command. The results are stored in the `dashboards_out/` directory.
```
$ make dashboards_out
```

### OpenTelemetry Collector configuration

By default, the OpenTelemetry Collector exposes its [internal
telemetry](https://opentelemetry.io/docs/collector/internal-telemetry/) as
Prometheus metrics on port 8888.

The following configuration can be used as a starting point for scraping these
metrics and sending them to a Prometheus-compatible store.

```yaml
extensions:
  basicauth/remote_write:
    client_auth:
      username: "username"
      password: "password"

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  batch:

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus/api/prom/push"
    auth:
      authenticator: basicauth/remote_write
    resource_to_telemetry_conversion:
      enabled: true # Convert resource attributes to metric labels

service:
  telemetry:
    metrics:
      level: "detailed"
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
  extensions: [basicauth/remote_write]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```

### Other requirements

The Makefile contains commands for formatting, linting and testing the mixin.
For development purposes you may also need one or more of the following:
```
go install github.com/google/go-jsonnet/cmd/jsonnet-lint@latest
go install github.com/grafana/dashboard-linter@latest
go install github.com/prometheus/prometheus/cmd/promtool@latest
go install github.com/monitoring-mixins/mixtool/cmd/mixtool@main
```

### Contributing

To contribute:

1. Fork the repository
2. Make your changes
3. Run `make all` to verify your changes and test them in a Prometheus/Grafana environment. Screenshots are welcome for new panels/dashboards.
4. Submit a pull request

If you want to make some parameter configurable, use `config.libsonnet` as an
entrypoint.
@@ -0,0 +1 @@
std.manifestYamlDoc((import 'mixin.libsonnet').prometheusAlerts)
@@ -0,0 +1,25 @@
{
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'otelcol',
        rules: [
          {
            alert: 'OtelcolSendingQueueFull',
            expr: |||
              otelcol_exporter_queue_size >= otelcol_exporter_queue_capacity
            |||,
            'for': '30m',
            labels: {
              severity: 'warning',
            },
            annotations: {
              summary: 'The sending queue has filled up.',
              description: 'The sending queue is full for {{ $labels.instance }}. The collector might start dropping data',
            },
          },
        ],
      },
    ],
  },
}
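Because the group is declared with `prometheusAlerts+::` and `groups+:`, additional rules can be layered on with a plain jsonnet merge rather than by editing this file. The following is a hypothetical consumer-side sketch, evaluated from the mixin root; the metric name and thresholds are illustrative assumptions, not part of this change:

```
// Hypothetical overlay: adds one extra alert on top of the mixin's defaults.
// The metric name and thresholds below are assumptions for illustration only.
(import 'mixin.libsonnet') + {
  prometheusAlerts+:: {
    groups+: [
      {
        name: 'otelcol-extras',
        rules: [
          {
            alert: 'OtelcolExporterEnqueueFailures',
            expr: |||
              rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
            |||,
            'for': '15m',
            labels: { severity: 'warning' },
            annotations: {
              summary: 'The exporter is failing to enqueue spans.',
            },
          },
        ],
      },
    ],
  },
}
```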
@@ -0,0 +1,23 @@
{
  _config+:: {
    // Grafana dashboard IDs are necessary for stable links for dashboards
    grafanaDashboardIDs: {
      'collector.json': std.md5('collector.json'),
    },

    // Config for the Grafana dashboards in this mixin
    grafana: {
      // The default refresh time for all dashboards, defaults to 10s
      refresh: '10s',

      // Timezone for Grafana dashboards: UTC, browser, ...
      grafanaTimezone: 'UTC',

      // Tags for Grafana dashboards
      dashboardTags: ['otelcol'],
    },

    // Default datasource name
    datasourceName: 'default',
  },
}
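All tunables live under the hidden `_config` object, and the dashboard files read them via `import 'config.libsonnet'`. A quick way to check the effective values is a small scratch expression evaluated from the mixin root with `jsonnet`; the field names below are taken directly from the file above, while the scratch file name is arbitrary:

```
// Scratch expression for inspecting the effective configuration, e.g. saved as
// inspect.jsonnet and run with `jsonnet inspect.jsonnet` from the mixin root.
local cfg = import 'config.libsonnet';
{
  dashboardUid: cfg._config.grafanaDashboardIDs['collector.json'],  // md5 of the file name
  refresh: cfg._config.grafana.refresh,
  timezone: cfg._config.grafana.grafanaTimezone,
  tags: cfg._config.grafana.dashboardTags,
  datasource: cfg._config.datasourceName,
}
```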
@@ -0,0 +1,11 @@
local dashboards = (import 'mixin.libsonnet').grafanaDashboards;
local cfg = import 'config.libsonnet';

{
  [name]: dashboards[name] {
    timezone: cfg._config.grafana.grafanaTimezone,
    refresh: cfg._config.grafana.refresh,
    tags: cfg._config.grafana.dashboardTags,
  }
  for name in std.objectFields(dashboards)
}
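This wrapper stamps the shared timezone, refresh interval, and tags onto every dashboard exposed under `grafanaDashboards`, so a new dashboard only needs to register itself there to pick them up. A hypothetical sketch of such an addition follows; the file name `custom.libsonnet` and its contents are illustrative only, and grafonnet is assumed to be vendored via `jb install`:

```
// Hypothetical dashboards/custom.libsonnet. To be rendered it would also need
// to be added to the mixin's dashboard index, e.g.
//   (import 'collector.libsonnet') + (import 'custom.libsonnet')
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

{
  grafanaDashboards+:: {
    'custom.json':
      g.dashboard.new('OpenTelemetry Collector custom view')
      + g.dashboard.withUid(std.md5('custom.json'))
      + g.dashboard.withPanels([]),  // panels omitted in this sketch
  },
}
```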
opentelemetry-collector-mixin/dashboards/collector.libsonnet (137 additions, 0 deletions)
@@ -0,0 +1,137 @@
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local row = g.panel.row;
local variables = import './utils/variables.libsonnet';
local panels = import './utils/panels.libsonnet';
local queries = import './utils/queries.libsonnet';
local cfg = import '../config.libsonnet';

{
  grafanaDashboards+:: {
    'collector.json':
      g.dashboard.new(
        'OpenTelemetry Collector Health',
      )
      + g.dashboard.withDescription('A dashboard for monitoring the health of OpenTelemetry Collector instances using their internal metrics.')
      + g.dashboard.graphTooltip.withSharedCrosshair()
      + g.dashboard.withVariables([
        variables.datasourceVariable,
        variables.jobVariable,
        variables.clusterVariable,
        variables.namespaceVariable,
        variables.instanceVariable,
      ])
      + g.dashboard.withUid(cfg._config.grafanaDashboardIDs['collector.json'])
      + g.dashboard.withPanels(
        g.util.grid.wrapPanels([
          // Overview row
          row.new('Overview'),
          panels.stat.base('Running Collectors', [queries.runningCollectors]),
          panels.table.uptime('Collector uptime', [queries.collectorUptime]),

          // Resources row
          row.new('Resources'),
          panels.timeSeries.cpuUsage('CPU usage', [queries.cpuUsage])
          + { gridPos: { w: 8 } },
          panels.timeSeries.memoryUsage('Memory (RSS)', queries.memUsageRSS)
          + { gridPos: { w: 8 } },
          panels.timeSeries.memoryUsage('Memory (Heap Alloc)', queries.memUsageHeapAlloc)
          + { gridPos: { w: 8 } },

          // Receivers row
          row.new('Receivers'),
          panels.timeSeries.short('Accepted metric points', [queries.acceptedMetricPoints])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Accepted log records', [queries.acceptedLogRecords])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Accepted spans', [queries.acceptedSpans])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Total incoming items', [queries.incomingItems])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Refused metric points', [queries.refusedMetricPoints])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Refused log records', [queries.refusedLogRecords])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Refused spans', [queries.refusedSpans])
          + { gridPos: { w: 6 } },
          panels.timeSeries.short('Total outgoing items', [queries.outgoingItems])
          + { gridPos: { w: 6 } },

          // Processors row
          row.new('Processors'),
          panels.heatmap.base('Number of units in the batch', [queries.batchSendSize])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Batch cardinality', [queries.batchCardinality])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Queue current size vs capacity', [queries.queueSize, queries.queueCapacity])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Batch size send trigger', [queries.batchSizeSendTrigger]),
          panels.timeSeries.short('Batch timeout send trigger', [queries.batchTimeoutSendTrigger]),

          // Exporters row
          row.new('Exporters'),
          panels.timeSeries.short('Exported metrics', [queries.exportedMetrics])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Exported logs', [queries.exportedLogs])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Exported spans', [queries.exportedSpans])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Failed metrics', [queries.failedMetrics])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Failed logs', [queries.failedLogs])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Failed spans', [queries.failedSpans])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Enqueue failed metrics', [queries.enqueueFailedMetrics])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Enqueue failed logs', [queries.enqueueFailedLogs])
          + { gridPos: { w: 8 } },
          panels.timeSeries.short('Enqueue failed spans', [queries.enqueueFailedSpans])
          + { gridPos: { w: 8 } },

          // Network traffic row
          row.new('Network traffic'),
          panels.timeSeries.seconds('Inbound gRPC request duration percentiles', [
            queries.grpcInboundDurationP50,
            queries.grpcInboundDurationP90,
            queries.grpcInboundDurationP99,
          ]),
          panels.timeSeries.seconds('Inbound HTTP request duration percentiles', [
            queries.httpInboundDurationP50,
            queries.httpInboundDurationP90,
            queries.httpInboundDurationP99,
          ]),
          panels.timeSeries.bytes('Inbound gRPC request size percentiles', [
            queries.grpcInboundSizeP50,
            queries.grpcInboundSizeP90,
            queries.grpcInboundSizeP99,
          ]),
          panels.timeSeries.bytes('Inbound HTTP request size percentiles', [
            queries.httpInboundSizeP50,
            queries.httpInboundSizeP90,
            queries.httpInboundSizeP99,
          ]),
          panels.timeSeries.seconds('Outgoing gRPC request duration percentiles', [
            queries.grpcOutboundDurationP50,
            queries.grpcOutboundDurationP90,
            queries.grpcOutboundDurationP99,
          ]),
          panels.timeSeries.seconds('Outgoing HTTP request duration percentiles', [
            queries.httpOutboundDurationP50,
            queries.httpOutboundDurationP90,
            queries.httpOutboundDurationP99,
          ]),
          panels.timeSeries.bytes('Outgoing gRPC request size percentiles', [
            queries.grpcOutboundSizeP50,
            queries.grpcOutboundSizeP90,
            queries.grpcOutboundSizeP99,
          ]),
          panels.timeSeries.bytes('Outgoing HTTP request size percentiles', [
            queries.httpOutboundSizeP50,
            queries.httpOutboundSizeP90,
            queries.httpOutboundSizeP99,
          ]),

        ], panelWidth=12, panelHeight=8),
      ),
  },
}
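Because the generated dashboard is plain jsonnet underneath, an overlay can append panels without regenerating the whole grid. A hypothetical sketch, reusing the mixin's own panel and query helpers and assuming it sits next to collector.libsonnet so the relative imports resolve:

```
// Hypothetical overlay that appends one panel to the generated dashboard.
// Width/height follow the 24-unit Grafana grid used by wrapPanels above.
local panels = import './utils/panels.libsonnet';
local queries = import './utils/queries.libsonnet';

(import 'collector.libsonnet') + {
  grafanaDashboards+:: {
    'collector.json'+: {
      panels+: [
        panels.timeSeries.short('Sending queue size', [queries.queueSize])
        + { gridPos: { w: 12, h: 8 } },
      ],
    },
  },
}
```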
@@ -0,0 +1 @@
(import 'collector.libsonnet')
Review comment: We normally use Sentence case for Dashboard titles and panel titles
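For reference, a hedged sketch of what the suggested convention could look like as a jsonnet patch; in practice the title strings in collector.libsonnet would simply be edited in place, and the panel titles would follow the same pattern:

```
// Illustration only: apply sentence case to the dashboard title via an overlay.
(import 'collector.libsonnet') + {
  grafanaDashboards+:: {
    'collector.json'+: {
      title: 'OpenTelemetry collector health',
    },
  },
}
```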