
Add opamp supervisor support to support opamp#171

Open
agnello-noronha wants to merge 25 commits into cloudfoundry:main from
agnello-noronha:opamp-support

Conversation

@agnello-noronha
Member

  • When OpAMP is enabled, the opamp-supervisor job will start and manage the lifecycle of the otel-collector.
  • When OpAMP is disabled, only the otel-collector will start.
  • The supervisor should be configured with the OpAMP server's WebSocket or HTTP URL.

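The toggle described in the PR summary could be sketched in a BOSH deployment manifest along these lines. This is only an illustration: the job name, release name, and property names below are assumptions, not taken from the PR.

```yaml
# Hypothetical excerpt of a BOSH deployment manifest.
# Job, release, and property names are illustrative, not confirmed by this PR.
instance_groups:
- name: otel-collector
  jobs:
  - name: opamp-supervisor
    release: otel-collector
    properties:
      opamp:
        enabled: true
        # WebSocket (wss://) or HTTPS endpoint of the OpAMP server
        server_endpoint: wss://opamp-server.example.com/v1/opamp
```

With `opamp.enabled: false`, the supervisor job would stay down and the otel-collector would start standalone, as described above.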
@agnello-noronha agnello-noronha requested a review from a team as a code owner December 18, 2025 10:19
@chombium
Contributor

Hi @agnello-noronha,

This PR looks interesting, but there are a few things that we have to think about:

  1. What will the OpAMP Server be used for? Only for monitoring, or for agent updates and re-configuration as well? If it's only for monitoring of the Otel Collectors, it would be easy. If we should also cover updates and re-configuration, things become more complex: the changes sent from the OpAMP Server have to be kept in sync with the BOSH deployment manifest, so that when the Otel Collector is redeployed with BOSH the configuration won't change and everything will continue to work as before.
  2. Where should the OpAMP server be deployed? I guess the Scheduler VMs would be a proper place, as there is only one instance active at a time and the configuration and management would be centralized.
  3. The Otel Collector doesn't support hot reload, which means that applying config changes restarts all pipelines. To avoid data loss we have to add a few more things to make it more reliable and resilient to restarts.

At the moment the Otel Collector supports pipelines working the same way as the aggregate Syslog drains. In the future we have to build a feature similar to the application Syslog drains, where app devs can do ad-hoc re-configuration of the export destination. In Otel Collector terms, that would mean reconfigurable pipelines, observer, connector, processor and an exporter. Practically, we need a pipeline creator which creates cf app specific pipelines based on the configuration in the Cloud Controller. Think of it as a receivercreator, but for pipelines. I'm writing an RFC about this and will finish it in the next few weeks.

I will ask on CNCF Slack in the Otel Collector and OpAMP channels to get more information on how people reliably do re-configuration (redeployment + OpAMP) without data loss.

At the moment, I see adding the OpAMP Server for monitoring as plausible, but we are far away from managing the collectors with OpAMP. On the other hand, why would we add another monitoring tool that a CF operator should use when everyone already has some platform monitoring solution?

@agnello-noronha
Member Author

Hi @agnello-noronha,

This PR looks interesting, but there are a few things that we have to think about:

  1. What will the OpAMP Server be used for? Only for monitoring, or for agent updates and re-configuration as well? If it's only for monitoring of the Otel Collectors, it would be easy. If we should also cover updates and re-configuration, things become more complex: the changes sent from the OpAMP Server have to be kept in sync with the BOSH deployment manifest, so that when the Otel Collector is redeployed with BOSH the configuration won't change and everything will continue to work as before.

The plan is to use it for monitoring and extend it later for re-configuration. The otel-collector has the ability to merge configs, which we can leverage to apply distinct configs on reapply. The other way is, when a config is pushed, to update the BOSH manifest so it is applied on the next deployment.

  1. Where should the OpAMP server be deployed? I guess the Scheduler VMs would be a proper place, as there is only one instance active at a time and the configuration and management would be centralized.

The opamp-server can be anywhere as long as it is reachable. In TAS we can deploy it in Ops Manager. The Scheduler VM could also be a good fit.

  1. The Otel Collector doesn't support hot reload, which means that applying config changes restarts all pipelines. To avoid data loss we have to add a few more things to make it more reliable and resilient to restarts.

This is an open question I have not yet evaluated. I will spend some time on it.

At the moment the Otel Collector supports pipelines working the same way as the aggregate Syslog drains. In the future we have to build a feature similar to the application Syslog drains, where app devs can do ad-hoc re-configuration of the export destination. In Otel Collector terms, that would mean reconfigurable pipelines, observer, connector, processor and an exporter. Practically, we need a pipeline creator which creates cf app specific pipelines based on the configuration in the Cloud Controller. Think of it as a receivercreator, but for pipelines. I'm writing an RFC about this and will finish it in the next few weeks.

I will ask on CNCF Slack in the Otel Collector and OpAMP channels to get more information on how people reliably do re-configuration (redeployment + OpAMP) without data loss.

At the moment, I see adding the OpAMP Server for monitoring as plausible, but we are far away from managing the collectors with OpAMP. On the other hand, why would we add another monitoring tool that a CF operator should use when everyone already has some platform monitoring solution?

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

@chombium
Contributor

chombium commented Jan 5, 2026

This is an open question I have not yet evaluated. I will spend some time on it.

There are some good blog posts and YouTube videos about the Otel Collector's reliability. I guess we have to check them :)

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

If we use OpAmp for re-configuration, we have three problems to figure out:

  1. How to keep the current, running configuration of the collectors in sync with the BOSH deployment manifest.
  2. AFAIK the Otel Collector has to be restarted in order to apply the new configuration. We have to figure out how to avoid data loss. Maybe we can use something as described here and here.
  3. How to implement and handle functionality similar to application Syslog drains. I'm preparing an RFC about that. I hope to finish it by the end of next week.

@agnello-noronha
Member Author

This is an open question I have not yet evaluated. I will spend some time on it.

There are some good blog posts and YouTube videos about the Otel Collector's reliability. I guess we have to check them :)

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

If we use OpAmp for re-configuration, we have three problems to figure out:

  1. How to keep the current, running configuration of the collectors in sync with the BOSH deployment manifest.

Ideally we would not be looking at the BOSH deployment manifest at all if we are using OpAMP. Also, when BOSH redeploys, the supervisor will resync with the opamp-server (which is the source of truth) and update the config.

The other way I can think of is staging the new config into the BOSH manifest whenever a dynamic config is pushed, so it takes effect on the next deployment.

  1. AFAIK the Otel Collector has to be restarted in order to apply the new configuration. We have to figure out how to avoid data loss. Maybe we can use something as described here and here.

Yes, the Otel Collector will be restarted when we apply new changes; this is the current state as well. But as the document mentions, the most optimal approach would be to use a persistent queue.
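The persistent-queue idea can be illustrated with a collector config sketch: the contrib `file_storage` extension backs an exporter's `sending_queue`, so queued telemetry survives a supervisor-triggered restart. Endpoints and paths below are placeholders, and the exact field names should be verified against the collector version in use.

```yaml
# Sketch: persisting the exporter queue across collector restarts.
# Endpoints and paths are placeholders; verify fields for your collector version.
extensions:
  file_storage:
    directory: /var/vcap/data/otel-collector/storage

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage   # queue contents are written to disk, not just memory

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlp]
```

On restart, the exporter drains whatever was persisted in the queue before accepting new data, which narrows (though does not fully eliminate) the data-loss window during re-configuration.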

  1. How to implement and handle functionality similar to application Syslog drains. I'm preparing an RFC about that. I hope to finish it by the end of next week.

Will this RFC propose an approach for reapplying config without a BOSH deployment? Looking forward to it.

dependabot bot and others added 14 commits January 15, 2026 11:45
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.77.0 to 1.78.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.77.0...v1.78.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.78.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [code.cloudfoundry.org/tlsconfig](https://github.com/cloudfoundry/tlsconfig) from 0.41.0 to 0.42.0.
- [Release notes](https://github.com/cloudfoundry/tlsconfig/releases)
- [Commits](cloudfoundry/tlsconfig@v0.41.0...v0.42.0)

---
updated-dependencies:
- dependency-name: code.cloudfoundry.org/tlsconfig
  dependency-version: 0.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.135.0 to 0.142.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md)
- [Commits](open-telemetry/opentelemetry-collector@v0.135.0...v0.142.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/collector/cmd/builder
  dependency-version: 0.142.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.4.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.4)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.4.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.4)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.142.0 to 0.143.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md)
- [Commits](open-telemetry/opentelemetry-collector@v0.142.0...v0.143.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/collector/cmd/builder
  dependency-version: 0.143.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
- When OpAMP is enabled, the opamp-supervisor job will start and manage the lifecycle of the otel-collector
- When OpAMP is disabled, only the otel-collector will start
- The supervisor should be configured with the OpAMP server's WebSocket or HTTPS URL.
@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 15, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: agnello-noronha / name: Agnello Noronha (70a0524)
  • ✅ login: agnello-noronha / name: agnello-noronha (7ff8f86)

default: []
opamp.enabled:
description: "Enable OpAMP extension in the collector. When true, collector includes OpAMP extension. Supervisor is managed by separate job."
default: false
Contributor


We could also re-think this and enable the collector's OpAMP extension by default, as it simply sends status data to the Supervisor, and the Supervisor sends it to the OpAMP Server.
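For context, the extension that `opamp.enabled` toggles would render to roughly the following block in the collector config. The endpoint is a placeholder (a supervisor typically points the collector at its own local OpAMP endpoint), and the field layout should be checked against the contrib `opamp` extension docs for the collector version in use.

```yaml
# Sketch of the collector-side opamp extension; endpoint is a placeholder.
extensions:
  opamp:
    server:
      ws:
        endpoint: wss://127.0.0.1:4320/v1/opamp

service:
  extensions: [opamp]
```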

secrets:
description: "Variables to interpolate into the configuration"
default: []
opamp.enabled:
Contributor


same as above


pushd "${release_dir}/src/otel-collector"
go get toolchain@none
go mod tidy
Contributor


Do we need this? Aren't all the dependencies managed by the otel collector builder?


set -e

echo "🧪 Running OpAMP Acceptance Tests"
Contributor


The emojis look cool, but I doubt they will render in every terminal.

@@ -0,0 +1,98 @@
package main
Contributor

@chombium chombium Jan 19, 2026


this supervisor builder looks cool.

@chombium
Contributor

Hi @agnello-noronha, I went through the PR in detail and it looks pretty good so far. There is one thing that bothers me: we don't have a concept for how we are going to manage the Otel Collectors. Having an OpAMP Server + a Supervisor (with, I guess, a blue-green strategy to deploy/update the collectors) is the right way to go. There are still many open questions, though: where and how will the OpAMP server run (on the Scheduler VM, I imagine, with multiple instances, at least one per availability zone for HA), how are we going to manage the configuration in the OpAMP server, and where will the configuration be stored?

I suggest that we pause the work on this PR until we have documented our decisions and have a clear direction on what we want to do, in the form of an RFC.

@agnello-noronha
Member Author

Hi @agnello-noronha, I went through the PR in detail and it looks pretty good so far. There is one thing that bothers me: we don't have a concept for how we are going to manage the Otel Collectors. Having an OpAMP Server + a Supervisor (with, I guess, a blue-green strategy to deploy/update the collectors) is the right way to go. There are still many open questions, though: where and how will the OpAMP server run (on the Scheduler VM, I imagine, with multiple instances, at least one per availability zone for HA), how are we going to manage the configuration in the OpAMP server, and where will the configuration be stored?

I suggest that we pause the work on this PR until we have documented our decisions and have a clear direction on what we want to do, in the form of an RFC.

@chombium We should treat the opamp-server as an external system, like Splunk for logs or Dynatrace for observability. We should allow customers to configure the endpoint of the opamp-server and enable them to use this feature.

We do not have any production-grade open source opamp-server available, and whoever is going to use this feature will need their own implementation. One commercially available opamp-server is https://github.com/yotamloe/bindplane-op. In TAS we are planning to deploy the opamp-server on the Ops Manager VM and manage all collectors.

We can think of the OpAMP capability as an additional feature for Cloud Foundry users who are willing to use and manage their collectors outside of Cloud Foundry. Once we use an external source (opamp-server), the source of truth for all Otel configs will be the opamp-server and not the BOSH manifest.

In case of a BOSH redeployment, we can always control how the effective config is merged by using the config_files parameter of the opamp-supervisor (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/opampsupervisor/specification/README.md).
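As a sketch, a supervisor config combining a remote OpAMP server with local config files might look like this. The top-level field names follow the supervisor specification linked above, but all paths and the exact shape of `config_files` are assumptions to verify against the spec.

```yaml
# Hypothetical opamp-supervisor config; verify fields against the spec README.
server:
  endpoint: wss://opamp-server.example.com/v1/opamp

agent:
  executable: /var/vcap/packages/otel-collector/otel-collector
  # Local config merged with the remotely delivered one; the merge order
  # determines which settings win after a BOSH redeploy.
  config_files:
    - /var/vcap/jobs/otel-collector/config/base.yml

storage:
  directory: /var/vcap/data/opamp-supervisor
```

The `storage.directory` is where the supervisor would persist the last known remote config, which is what allows it to resync with the opamp-server after a BOSH redeploy.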

Let me know your thoughts.

@chombium
Contributor

@agnello-noronha I agree with most of what you've written and generally with what we've discussed so far, and I must say this is why we must write an RFC for this. We have to document our problems, decisions and proposed solutions.

There are two general problems that we are dealing with:

  1. How are we going to manage the configuration of the Otel Collectors?
  • BOSH deployment manifest and BOSH redeployment (that's what we have now)
  • OpAMP-managed configuration
    • Local, CF-deployed OpAMP server: we don't have one at the moment. If we want one, we have to build one which is production ready, with HA and everything else needed for reliable operation.
    • Remote OpAMP server (the thing that you are suggesting).
    • A general problem here is how the updated Otel Collector configuration will be sent to the OpAMP server and where it will be stored.
      • For a local OpAMP server we could either:
        • use the BOSH way: updated job/BOSH release configuration and redeployment. If the OpAMP server runs on the Scheduler VMs, this would mean updating the Scheduler VMs, and the Supervisors will read the updated configuration from there.
        • build another API to manage Otel Collector configuration updates in the OpAMP server.
      • A remote OpAMP server would have to provide some API for Otel Collector configuration updates (the new config which should be pushed to the Otel Collectors).
  2. How are we going to make sure that we don't have telemetry data drops during the (re-)deployment and re-configuration?
  • Classical BOSH redeploy (that's what we have now): BOSH will drain the VM, Diego will move all of the apps to running cells, and the VM will be updated.
  • In the case of re-configuration without a VM restart, we need an OpAMP server, a Supervisor which restarts the collector in blue-green mode, and an Otel Collector with buffers and probably disk storage extensions configured.

We can think of the OpAMP capability as an additional feature for Cloud Foundry users who are willing to use and manage their collectors outside of Cloud Foundry. Once we use an external source (opamp-server), the source of truth for all Otel configs will be the opamp-server and not the BOSH manifest.

IMO, if we do it this way, most people won't even use OpAMP, as they will have to buy an OpAMP server.

There are still too many open questions which we have to clear up.

I find it good that we now agree on a clear direction for how we want to manage the configuration, either with BOSH or OpAMP. Nevertheless, we have to document the trade-offs of both approaches in an RFC and in cf-docs.

@agnello-noronha
Member Author

@chombium Thanks for the clarification. I will look at drafting the RFC and take it forward. I presume @weili-broadcom had a discussion with you and agreed to include only the OpAMP extension as part of the otel-collector. I have raised a CL for the same: #185

@chombium
Contributor

Thanks for the understanding @agnello-noronha. Yes, we've spoken with @weili-broadcom about how to proceed with this PR, and we decided to work on an RFC together. I'll organize something to get a Google Doc in the community GCP account, so that we can work transparently on the RFC.

I will leave this PR open for now, as there is some great work and discussion being done here.
