Add opamp supervisor support to support opamp #171
agnello-noronha wants to merge 25 commits into cloudfoundry:main
agnello-noronha commented on Dec 18, 2025
- When opamp is enabled, the opamp-supervisor job starts and manages the lifecycle of otel-collector.
- When opamp is disabled, only otel-collector starts.
- The supervisor should be configured with the opamp server's WebSocket or HTTP URL.
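The behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration, not code from this PR: the function names `startedJob` and `validateServerURL` are hypothetical, and the accepted schemes (`ws`, `wss`, `http`, `https`) are an assumption based on the "WebSocket or HTTP URL" wording.

```go
package main

import (
	"fmt"
	"net/url"
)

// startedJob returns the name of the job that is started for a given
// opamp setting. The job names come from the PR description; the function
// itself is a hypothetical illustration.
func startedJob(opampEnabled bool) string {
	if opampEnabled {
		// The supervisor then manages the otel-collector lifecycle itself.
		return "opamp-supervisor"
	}
	return "otel-collector"
}

// validateServerURL checks that the configured OpAMP server endpoint is a
// WebSocket or HTTP(S) URL with a host. The scheme list is an assumption.
func validateServerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse OpAMP server URL: %w", err)
	}
	switch u.Scheme {
	case "ws", "wss", "http", "https":
		// Accepted transport.
	default:
		return fmt.Errorf("unsupported scheme %q in OpAMP server URL", u.Scheme)
	}
	if u.Host == "" {
		return fmt.Errorf("missing host in OpAMP server URL %q", raw)
	}
	return nil
}
```

A check like `validateServerURL` would most naturally run before the supervisor is started, so that a mistyped endpoint fails fast instead of leaving the collector unmanaged.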
- When opamp is enabled, the opamp-supervisor job starts and manages the lifecycle of otel-collector.
- When opamp is disabled, only otel-collector starts.
- The supervisor should be configured with the opamp server's WebSocket or HTTPS URL.
- Fixed test issues
- Checking in missing vendor files
Hi @agnello-noronha, this PR looks interesting, but there are a few things that we have to think about:
At the moment, the Otel Collector supports pipelines that work the same way as the aggregate Syslog drains. In the future, we have to build a feature similar to the application Syslog drains, where app developers can re-configure the export destination ad hoc. In Otel Collector terms, that would mean reconfigurable pipelines, an observer, a connector, a processor and an exporter. Practically, we need a pipeline creator which creates CF-app-specific pipelines based on the configuration in the Cloud Controller. Think of it as a receivercreator, but for pipelines. I'm writing an RFC about this and will finish it in the next few weeks.
I will ask on CNCF Slack in the Otel Collector and OpAMP channels to get more information on how people reliably do re-configuration (redeployment + OpAMP) without data loss.
At the moment, I see adding the OpAMP Server for monitoring as plausible, but we are far away from managing the collectors with OpAMP. On the other hand, why would we add another monitoring tool for a CF operator to use when everyone already has some platform monitoring solution?
The plan is to use it for monitoring first and extend it later for re-configuration. The Otel Collector can merge configs, which we can leverage to apply distinct configs on reapply. The other way is to update the BOSH manifest whenever a config is pushed, so that it is applied on the next deployment.
The opamp-server can be anywhere as long as it is reachable. In TAS we can deploy it in Ops Manager. The scheduler VM can also be a good fit.
This is an open question that I have not yet evaluated. I will spend some time on it.
The main reason we need OpAMP support is to reconfigure only the collectors without redeploying the entire BOSH deployment, and so avoid downtime. I am open to any other solution that avoids downtime.
There are some good blog posts and YouTube videos about the Otel Collector's reliability. I guess we have to check them :)
If we use OpAMP for re-configuration, we have three problems to figure out:
Ideally, we would not be looking at the BOSH deployment manifest if we are using OpAMP. Also, when BOSH redeploys, the supervisor will resync with the opamp-server (which is the source of truth) and update the config. The other way I can think of is staging the new config in the BOSH manifest whenever a dynamic config is pushed, so that it takes effect on the next deployment. Yes, the Otel Collector will be restarted when we apply new changes. This is the current state as well. But as the document mentioned, the best option would be to use a persistent queue.
Will this RFC propose an approach for reapplying config without a BOSH deployment? Looking forward to it.
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.77.0 to 1.78.0 (direct:production, semver-minor). [Release notes](https://github.com/grpc/grpc-go/releases), [Commits](grpc/grpc-go@v1.77.0...v1.78.0). Signed-off-by: dependabot[bot] <support@github.com>
Bumps [code.cloudfoundry.org/tlsconfig](https://github.com/cloudfoundry/tlsconfig) from 0.41.0 to 0.42.0 (direct:production, semver-minor). [Release notes](https://github.com/cloudfoundry/tlsconfig/releases), [Commits](cloudfoundry/tlsconfig@v0.41.0...v0.42.0). Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.135.0 to 0.142.0 (direct:production, semver-minor). [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases), [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md), [Commits](open-telemetry/opentelemetry-collector@v0.135.0...v0.142.0). Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0 (direct:production, semver-minor). [Release notes](https://github.com/onsi/gomega/releases), [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md), [Commits](onsi/gomega@v1.38.3...v1.39.0). Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.4 (direct:production, semver-patch). [Release notes](https://github.com/onsi/ginkgo/releases), [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md), [Commits](onsi/ginkgo@v2.27.3...v2.27.4). Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.142.0 to 0.143.0 (direct:production, semver-minor). [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases), [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md), [Commits](open-telemetry/opentelemetry-collector@v0.142.0...v0.143.0). Signed-off-by: dependabot[bot] <support@github.com>
- When opamp is enabled, the opamp-supervisor job starts and manages the lifecycle of otel-collector.
- When opamp is disabled, only otel-collector starts.
- The supervisor should be configured with the opamp server's WebSocket or HTTPS URL.
- Rebasing changes with main
- Squashing all changes
- Refactoring code and integration tests.
    default: []
  opamp.enabled:
    description: "Enable OpAMP extension in the collector. When true, collector includes OpAMP extension. Supervisor is managed by separate job."
    default: false
We could also rethink this and enable the collector's OpAMP extension by default, as it simply sends status data to the Supervisor, and the Supervisor forwards it to the OpAMP Server.
  secrets:
    description: "Variables to interpolate into the configuration"
    default: []
  opamp.enabled:
pushd "${release_dir}/src/otel-collector"
go get toolchain@none
go mod tidy
Do we need this? Aren't all the dependencies managed by the otel collector builder?
set -e

echo "🧪 Running OpAMP Acceptance Tests"
The emojis look cool, but I doubt that they will be shown in every terminal
@@ -0,0 +1,98 @@
package main
this supervisor builder looks cool.
Hi @agnello-noronha, I went through the PR in detail and it looks pretty good so far. One thing bothers me: we don't have a concept for how we are going to manage the Otel Collectors. Having an OpAMP Server plus a Supervisor (with, I guess, a blue-green strategy to deploy and update the collectors) is the right way to go. There are still many open questions, though: where and how the OpAMP server will run (I imagine on the scheduler VM, with multiple instances, at least one per availability zone for HA), how we are going to manage the configuration in the OpAMP server, and where the configuration will be stored. I suggest that we pause the work on this PR until we have documented our decisions and have a clear direction on what we want to do, in the form of an RFC.
@chombium We should treat the opamp-server as an external system, like Splunk for logs and Dynatrace for observability. We should allow customers to configure the opamp-server endpoint and so enable them to use this feature. There is no production-grade open source opamp-server available, so whoever uses this feature should have their own implementation. One commercially available opamp-server is https://github.com/yotamloe/bindplane-op. In TAS we are planning to deploy the opamp-server in the ops-manager VM and manage all collectors from there. We can think of the opamp capability as an additional feature for Cloud Foundry users who are willing to use and manage their collectors outside of Cloud Foundry. Once we use an external source (the opamp-server), the source of truth for all otel configs will be the opamp-server and not the BOSH manifest. In case of a BOSH redeployment, we can always control how the effective config is merged by using the config_files parameter described in the OpAMP Supervisor specification (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/opampsupervisor/specification/README.md). Let me know your thoughts.
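To make the "effective config" idea in the comment above concrete, here is a minimal sketch of layered merging in which later layers override earlier ones. It is an illustration only and not the Supervisor's actual merge implementation: the function names `mergeConfig` and `effectiveConfig`, the map-based config representation, and the layer order (BOSH-rendered config first, remote config last) are all assumptions.

```go
package main

// mergeConfig returns a new map with overlay layered on top of base.
// Nested maps are merged recursively; any other overlay value replaces
// the base value. Neither input map is modified.
func mergeConfig(base, overlay map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range overlay {
		baseChild, baseIsMap := out[k].(map[string]any)
		overlayChild, overlayIsMap := v.(map[string]any)
		if baseIsMap && overlayIsMap {
			out[k] = mergeConfig(baseChild, overlayChild)
			continue
		}
		out[k] = v
	}
	return out
}

// effectiveConfig folds the layers in order, so the last layer wins on
// conflicting keys. The ordering is what decides the "source of truth".
func effectiveConfig(layers ...map[string]any) map[string]any {
	result := map[string]any{}
	for _, layer := range layers {
		result = mergeConfig(result, layer)
	}
	return result
}
```

With this ordering, calling `effectiveConfig(boshRendered, remote)` keeps every key that only the BOSH-rendered config defines, while any key the remote config also sets takes the remote value. Swapping the arguments would make the BOSH manifest the source of truth instead.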
@agnello-noronha I agree with most of what you've written and, generally, with what we've discussed so far, and I must say this is why we must write an RFC for this. We have to document our problems, decisions and proposed solutions. There are two general problems that we are dealing with:
IMO, if we do it this way, most people won't even use OpAMP, as they will have to buy an OpAMP server. There are still too many open questions which we have to clear up. I find it good that we now agree on a clear direction on how we want to manage the configuration, either with BOSH or with OpAMP. Nevertheless, we have to document how both approaches work and what their trade-offs are, as an RFC and in cf-docs.
@chombium Thanks for the clarification. I will look at defining the RFC and take it forward. I presume @weili-broadcom had a discussion with you and you agreed to include only the opamp extension as part of otel-collector. I have raised a CL for the same: #185
Thanks for understanding, @agnello-noronha. Yes, we've spoken with @weili-broadcom about how to proceed with this PR and we decided to work on an RFC together. I'll organize a Google Doc in the community GCP account, so that we can work transparently on the RFC. I will leave this PR open for now, as some great work and discussion has been done here.