
Add opamp supervisor support to support opamp#171

Open
agnello-noronha wants to merge 25 commits into cloudfoundry:main from
agnello-noronha:opamp-support

Conversation

@agnello-noronha
Member

  • When OpAMP is enabled, the opamp-supervisor job will start and manage the lifecycle of the otel-collector.
  • When OpAMP is disabled, only the otel-collector will start.
  • The supervisor should be configured with the OpAMP server's WebSocket or HTTP URL.

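The toggle described in the PR summary could be sketched in a BOSH deployment manifest along these lines. This is only an illustration: the job name, release name, and property names below are assumptions, not taken from the PR.

```yaml
# Hypothetical excerpt of a BOSH deployment manifest.
# Job, release, and property names are illustrative, not confirmed by this PR.
instance_groups:
- name: otel-collector
  jobs:
  - name: opamp-supervisor
    release: otel-collector
    properties:
      opamp:
        enabled: true
        # WebSocket (wss://) or HTTPS endpoint of the OpAMP server
        server_endpoint: wss://opamp-server.example.com/v1/opamp
```

With `opamp.enabled: false`, the supervisor job would stay down and the otel-collector would start standalone, as described above.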
@agnello-noronha agnello-noronha requested a review from a team as a code owner December 18, 2025 10:19
@chombium
Contributor

Hi @agnello-noronha,

This PR looks interesting, but there are a few things that we have to think about:

  1. What will the OpAMP Server be used for? Only for monitoring, or for agent updates and re-configuration as well? If it's only for monitoring of the Otel Collectors, it would be easy. If we should also cover updates and re-configuration, things become more complex: the changes sent from the OpAMP Server have to be kept in sync with the BOSH deployment manifest, so that when the Otel Collector is redeployed with BOSH the configuration won't change and everything will continue to work as before.
  2. Where should the OpAMP server be deployed? I guess the Scheduler VMs would be a proper place, as there is only one instance active at a time and the configuration and management would be centralized.
  3. The Otel Collector doesn't support hot reload, which means that applying config changes restarts all pipelines. To avoid data loss we have to add a few more things to make it more reliable and resilient to restarts.

At the moment the Otel Collector supports pipelines working the same way as the aggregate Syslog drains. In the future we have to build a feature similar to the application Syslog drains, where app devs can do ad-hoc re-configuration of the export destination. In Otel Collector terms, that would mean reconfigurable pipelines, observer, connector, processor and an exporter. Practically, we need a pipeline creator which creates cf app specific pipelines based on the configuration in the Cloud Controller. Think of it as a receivercreator, but for pipelines. I'm writing an RFC about this and will finish it in the next few weeks.

I will ask on CNCF Slack in the Otel Collector and OpAMP channels to get more information on how people reliably do re-configuration (redeployment + OpAMP) without data loss.

At the moment, I see adding the OpAMP Server for monitoring as plausible, but we are far away from managing the collectors with OpAMP. On the other hand, why would we add another monitoring tool that a CF operator should use when everyone already has some platform monitoring solution?

@agnello-noronha
Member Author

Hi @agnello-noronha,

This PR looks interesting, but there are a few things that we have to think about:

  1. What will the OpAMP Server be used for? Only for monitoring, or for agent updates and re-configuration as well? If it's only for monitoring of the Otel Collectors, it would be easy. If we should also cover updates and re-configuration, things become more complex: the changes sent from the OpAMP Server have to be kept in sync with the BOSH deployment manifest, so that when the Otel Collector is redeployed with BOSH the configuration won't change and everything will continue to work as before.

The plan is to use it for monitoring and extend it later for re-configuration. The otel-collector has the ability to merge configs, which we can leverage to apply distinct configs on reapply. The other way is, when a config is pushed, to update the BOSH manifest so it is applied on the next deployment.

  1. Where should the OpAMP server be deployed? I guess the Scheduler VMs would be a proper place, as there is only one instance active at a time and the configuration and management would be centralized.

The opamp-server can be anywhere as long as it is reachable. In TAS we can deploy it in Ops Manager. The Scheduler VM could also be a good fit.

  1. The Otel Collector doesn't support hot reload, which means that applying config changes restarts all pipelines. To avoid data loss we have to add a few more things to make it more reliable and resilient to restarts.

This is an open question I have not yet evaluated. I will spend some time on it.

At the moment the Otel Collector supports pipelines working the same way as the aggregate Syslog drains. In the future we have to build a feature similar to the application Syslog drains, where app devs can do ad-hoc re-configuration of the export destination. In Otel Collector terms, that would mean reconfigurable pipelines, observer, connector, processor and an exporter. Practically, we need a pipeline creator which creates cf app specific pipelines based on the configuration in the Cloud Controller. Think of it as a receivercreator, but for pipelines. I'm writing an RFC about this and will finish it in the next few weeks.

I will ask on CNCF Slack in the Otel Collector and OpAMP channels to get more information on how people reliably do re-configuration (redeployment + OpAMP) without data loss.

At the moment, I see adding the OpAMP Server for monitoring as plausible, but we are far away from managing the collectors with OpAMP. On the other hand, why would we add another monitoring tool that a CF operator should use when everyone already has some platform monitoring solution?

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

@chombium
Contributor

chombium commented Jan 5, 2026

This is an open question I have not yet evaluated. I will spend some time on it.

There are some good blog posts and YouTube videos about the Otel Collector's reliability. I guess we have to check them :)

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

If we use OpAmp for re-configuration, we have three problems to figure out:

  1. How to keep the current, running configuration of the collectors in sync with the BOSH deployment manifest.
  2. AFAIK the Otel Collector has to be restarted in order to apply the new configuration. We have to figure out how to avoid data loss. Maybe we can use something as described here and here.
  3. How to implement and handle functionality similar to application Syslog drains. I'm preparing an RFC about that. I hope to finish it by the end of next week.

@agnello-noronha
Member Author

This is an open question I have not yet evaluated. I will spend some time on it.

There are some good blog posts and YouTube videos about the Otel Collector's reliability. I guess we have to check them :)

The main reason we need OpAMP support is to reconfigure only the collectors, without redeploying the entire BOSH deployment, and to avoid downtime. I am open to any other solution if it avoids downtime.

If we use OpAmp for re-configuration, we have three problems to figure out:

  1. How to keep the current, running configuration of the collectors in sync with the BOSH deployment manifest.

Ideally we would not be looking at the BOSH deployment manifest at all if we are using OpAMP. Also, when BOSH redeploys, the supervisor will resync with the opamp-server (which is the source of truth) and update the config.

The other way I can think of is staging the new config into the BOSH manifest whenever a dynamic config is pushed, so it takes effect on the next deployment.

  1. AFAIK the Otel Collector has to be restarted in order to apply the new configuration. We have to figure out how to avoid data loss. Maybe we can use something as described here and here.

Yes, the Otel Collector will be restarted when we apply new changes; this is the current state as well. But as the document mentions, the most optimal approach would be to use a persistent queue.
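The persistent-queue idea can be illustrated with a collector config sketch: the contrib `file_storage` extension backs an exporter's `sending_queue`, so queued telemetry survives a supervisor-triggered restart. Endpoints and paths below are placeholders, and the exact field names should be verified against the collector version in use.

```yaml
# Sketch: persisting the exporter queue across collector restarts.
# Endpoints and paths are placeholders; verify fields for your collector version.
extensions:
  file_storage:
    directory: /var/vcap/data/otel-collector/storage

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      storage: file_storage   # queue contents are written to disk, not just memory

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlp]
```

On restart, the exporter drains whatever was persisted in the queue before accepting new data, which narrows (though does not fully eliminate) the data-loss window during re-configuration.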

  1. How to implement and handle functionality similar to application Syslog drains. I'm preparing an RFC about that. I hope to finish it by the end of next week.

Will this RFC propose an approach for reapplying config without a BOSH deployment? Looking forward to it.

dependabot bot and others added 14 commits January 15, 2026 11:45
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.77.0 to 1.78.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](grpc/grpc-go@v1.77.0...v1.78.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.78.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [code.cloudfoundry.org/tlsconfig](https://github.com/cloudfoundry/tlsconfig) from 0.41.0 to 0.42.0.
- [Release notes](https://github.com/cloudfoundry/tlsconfig/releases)
- [Commits](cloudfoundry/tlsconfig@v0.41.0...v0.42.0)

---
updated-dependencies:
- dependency-name: code.cloudfoundry.org/tlsconfig
  dependency-version: 0.42.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.135.0 to 0.142.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md)
- [Commits](open-telemetry/opentelemetry-collector@v0.135.0...v0.142.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/collector/cmd/builder
  dependency-version: 0.142.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/gomega](https://github.com/onsi/gomega) from 1.38.3 to 1.39.0.
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.38.3...v1.39.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-version: 1.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.4.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.4)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.4.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.4)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [go.opentelemetry.io/collector/cmd/builder](https://github.com/open-telemetry/opentelemetry-collector) from 0.142.0 to 0.143.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-collector/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-collector/blob/main/CHANGELOG-API.md)
- [Commits](open-telemetry/opentelemetry-collector@v0.142.0...v0.143.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/collector/cmd/builder
  dependency-version: 0.143.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
- When OpAMP is enabled, the opamp-supervisor job will start and manage the lifecycle of the otel-collector
- When OpAMP is disabled, only the otel-collector will start
- The supervisor should be configured with the OpAMP server's WebSocket or HTTPS URL.
@linux-foundation-easycla

linux-foundation-easycla bot commented Jan 15, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: agnello-noronha / name: Agnello Noronha (70a0524)
  • ✅ login: agnello-noronha / name: agnello-noronha (7ff8f86)

default: []
opamp.enabled:
description: "Enable OpAMP extension in the collector. When true, collector includes OpAMP extension. Supervisor is managed by separate job."
default: false
Contributor


We could also re-think this and enable the collector's OpAMP extension by default, as it simply sends status data to the Supervisor, and the Supervisor sends it to the OpAMP Server.
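For context, the extension that `opamp.enabled` toggles would render to roughly the following block in the collector config. The endpoint is a placeholder (a supervisor typically points the collector at its own local OpAMP endpoint), and the field layout should be checked against the contrib `opamp` extension docs for the collector version in use.

```yaml
# Sketch of the collector-side opamp extension; endpoint is a placeholder.
extensions:
  opamp:
    server:
      ws:
        endpoint: wss://127.0.0.1:4320/v1/opamp

service:
  extensions: [opamp]
```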

secrets:
description: "Variables to interpolate into the configuration"
default: []
opamp.enabled:
Contributor


same as above


pushd "${release_dir}/src/otel-collector"
go get toolchain@none
go mod tidy
Contributor


Do we need this? Aren't all the dependencies managed by the otel collector builder?


set -e

echo "🧪 Running OpAMP Acceptance Tests"
Contributor


The emojis look cool, but I doubt they will render in every terminal.

@@ -0,0 +1,98 @@
package main
Contributor

@chombium chombium Jan 19, 2026


this supervisor builder looks cool.

@chombium
Contributor

Hi @agnello-noronha, I went through the PR in detail and it looks pretty good so far. There is one thing that bothers me: we don't have a concept for how we are going to manage the Otel Collectors. Having an OpAMP Server + a Supervisor (with, I guess, a blue-green strategy to deploy/update the collectors) is the right way to go. There are still many open questions, though: where and how will the OpAMP server run (on the Scheduler VM, I imagine, with multiple instances, at least one per availability zone for HA), how are we going to manage the configuration in the OpAMP server, and where will the configuration be stored?

I suggest that we pause the work on this PR until we have documented our decisions and have a clear direction on what we want to do, in the form of an RFC.

@agnello-noronha
Member Author

Hi @agnello-noronha, I went through the PR in detail and it looks pretty good so far. There is one thing that bothers me: we don't have a concept for how we are going to manage the Otel Collectors. Having an OpAMP Server + a Supervisor (with, I guess, a blue-green strategy to deploy/update the collectors) is the right way to go. There are still many open questions, though: where and how will the OpAMP server run (on the Scheduler VM, I imagine, with multiple instances, at least one per availability zone for HA), how are we going to manage the configuration in the OpAMP server, and where will the configuration be stored?

I suggest that we pause the work on this PR until we have documented our decisions and have a clear direction on what we want to do, in the form of an RFC.

@chombium We should treat the opamp-server as an external system, like Splunk for logs or Dynatrace for observability. We should allow customers to configure the endpoint of the opamp-server and enable them to use this feature.

We do not have any production-grade open source opamp-server available, and whoever is going to use this feature will need their own implementation. One commercially available opamp-server is https://github.com/yotamloe/bindplane-op. In TAS we are planning to deploy the opamp-server on the Ops Manager VM and manage all collectors.

We can think of the OpAMP capability as an additional feature for Cloud Foundry users who are willing to use and manage their collectors outside of Cloud Foundry. Once we use an external source (opamp-server), the source of truth for all Otel configs will be the opamp-server and not the BOSH manifest.

In case of a BOSH redeployment, we can always control how the effective config is merged by using the config_files parameter of the opamp-supervisor (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/cmd/opampsupervisor/specification/README.md).
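As a sketch, a supervisor config combining a remote OpAMP server with local config files might look like this. The top-level field names follow the supervisor specification linked above, but all paths and the exact shape of `config_files` are assumptions to verify against the spec.

```yaml
# Hypothetical opamp-supervisor config; verify fields against the spec README.
server:
  endpoint: wss://opamp-server.example.com/v1/opamp

agent:
  executable: /var/vcap/packages/otel-collector/otel-collector
  # Local config merged with the remotely delivered one; the merge order
  # determines which settings win after a BOSH redeploy.
  config_files:
    - /var/vcap/jobs/otel-collector/config/base.yml

storage:
  directory: /var/vcap/data/opamp-supervisor
```

The `storage.directory` is where the supervisor would persist the last known remote config, which is what allows it to resync with the opamp-server after a BOSH redeploy.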

Let me know your thoughts.

@chombium
Contributor

@agnello-noronha I agree with most of what you've written and generally with what we've discussed so far, and I must say this is why we must write an RFC for this. We have to document our problems, decisions and proposed solutions.

There are two general problems that we are dealing with:

  1. How are we going to manage the configuration of the Otel Collectors?
  • BOSH deployment manifest and BOSH redeployment (that's what we have now)
  • OpAMP-managed configuration
    • Local, CF-deployed OpAMP server: we don't have one at the moment. If we want one, we have to build one which is production ready, with HA and everything else needed for reliable operation.
    • Remote OpAMP server (the thing that you are suggesting).
    • A general problem here is how the updated Otel Collector configuration will be sent to the OpAMP server and where it will be stored.
      • For a local OpAMP server we could either:
        • use the BOSH way: updated job/BOSH release configuration and redeployment. If the OpAMP server runs on the Scheduler VMs, this would mean updating the Scheduler VMs, and the Supervisors will read the updated configuration from there.
        • build another API to manage Otel Collector configuration updates in the OpAMP server.
      • A remote OpAMP server would have to provide some API for Otel Collector configuration updates (the new config which should be pushed to the Otel Collectors).
  2. How are we going to make sure that we don't have telemetry data drops during the (re-)deployment and re-configuration?
  • Classical BOSH redeploy (that's what we have now): BOSH will drain the VM, Diego will move all of the apps to running cells, and the VM will be updated.
  • In the case of re-configuration without a VM restart, we need an OpAMP server, a Supervisor which restarts the collector in blue-green mode, and an Otel Collector with buffers and probably disk storage extensions configured.

We can think of the OpAMP capability as an additional feature for Cloud Foundry users who are willing to use and manage their collectors outside of Cloud Foundry. Once we use an external source (opamp-server), the source of truth for all Otel configs will be the opamp-server and not the BOSH manifest.

IMO, if we do it this way, most people won't even use OpAMP, as they will have to buy an OpAMP server.

There are still too many open questions which we have to clear up.

I find it good that we now agree on a clear direction for how we want to manage the configuration, either with BOSH or OpAMP. Nevertheless, we have to document the trade-offs of both approaches in an RFC and in cf-docs.

@agnello-noronha
Member Author

@chombium Thanks for the clarification. I will look at drafting the RFC and take it forward. I presume @weili-broadcom had a discussion with you and agreed to include only the OpAMP extension as part of the otel-collector. I have raised a CL for the same: #185

@chombium
Contributor

Thanks for the understanding @agnello-noronha. Yes, we've spoken with @weili-broadcom about how to proceed with this PR, and we decided to work on an RFC together. I'll organize something to get a Google Doc in the community GCP account, so that we can work transparently on the RFC.

I will leave this PR open for now, as there is some great work and discussion being done here.
