@adri1197
Contributor

@adri1197 adri1197 commented Jul 23, 2025

Implementation of: fluxcd/flux2#5321

Closes: #1097

Use Cases

 apiVersion: notification.toolkit.fluxcd.io/v1beta3
 kind: Provider
 metadata:
   name: jaeger
   namespace: flux-system
 spec:
   type: otel
   address: http://jaeger-collector.jaeger:4318/v1/traces
  • GitRepository -> Kustomization -> ConfigMap

     apiVersion: notification.toolkit.fluxcd.io/v1beta3
     kind: Alert
     metadata:
       name: alert-test-1
       namespace: flux-system
     spec:
       # summary: "Flagger impacted in us-east-2"
       providerRef:
         name: jaeger
       # eventSeverity: error
       eventSources:
         - kind: GitRepository
           name: '*'
           namespace: test-1
         - kind: Kustomization
           name: '*'
           namespace: test-1
       eventMetadata:
         env: staging
         cluster: cluster-1
         region: eu-west-1
    
     apiVersion: v1
     kind: Secret
     metadata:
       name: test-repo-secret
       namespace: test-1
     data:
       username: Z29ncw==
       password: Z29ncw==
     ---
     apiVersion: source.toolkit.fluxcd.io/v1
     kind: GitRepository
     metadata:
       name: test-repo
       namespace: test-1
     spec:
       interval: 5m0s
       url: http://gogs-svc.gogs:18080/gogs/test-1
       ref:
         branch: master
       secretRef:
         name: test-repo-secret
     ---
     apiVersion: kustomize.toolkit.fluxcd.io/v1
     kind: Kustomization
     metadata:
       name: podinfo
       namespace: test-1
       annotations:
         event.toolkit.fluxcd.io/summary: "Reconcile ConfigMaps: cm-1 & cm-2"
     spec:
       interval: 10m
       sourceRef:
         kind: GitRepository
         name: test-repo
       path: "."
       prune: true
       timeout: 1m
    
    [Screenshot: Jaeger trace for this use case]
  • OCIRepository -> HelmRelease

     apiVersion: notification.toolkit.fluxcd.io/v1beta3
     kind: Alert
     metadata:
       name: alert-test-2
       namespace: flux-system
     spec:
       providerRef:
         name: jaeger
       eventSources:
         - kind: OCIRepository
           name: '*'
           namespace: test-2
         - kind: HelmRelease
           name: '*'
           namespace: test-2
       eventMetadata:
         env: staging
         cluster: cluster-1
         region: us-east-1
    
     apiVersion: source.toolkit.fluxcd.io/v1
     kind: OCIRepository
     metadata:
       name: podinfo-repo
       namespace: test-2
     spec:
       interval: 5m0s
       url: oci://ghcr.io/stefanprodan/charts/podinfo
       ref:
         semver: ">= 6.0.0"
     ---
     apiVersion: helm.toolkit.fluxcd.io/v2
     kind: HelmRelease
     metadata:
       name: podinfo-1
       namespace: test-2
       annotations:
         event.toolkit.fluxcd.io/summary: "Test-2: staging env, cluster-1 & us-east-1"
         event.toolkit.fluxcd.io/deploymentID: podinfo-1
     spec:
       interval: 10m
       chartRef:
         kind: OCIRepository
         name: podinfo-repo
       values:
         replicaCount: 2
     ---
     apiVersion: helm.toolkit.fluxcd.io/v2
     kind: HelmRelease
     metadata:
       name: podinfo-2
       namespace: test-2
       annotations:
         event.toolkit.fluxcd.io/summary: "Test-2: staging env, cluster-1 & us-east-1"
         event.toolkit.fluxcd.io/deploymentID: podinfo-2
     spec:
       interval: 10m
       chartRef:
         kind: OCIRepository
         name: podinfo-repo
       values:
         replicaCount: 2
     ---
     apiVersion: helm.toolkit.fluxcd.io/v2
     kind: HelmRelease
     metadata:
       name: podinfo-3
       namespace: test-2
       annotations:
         event.toolkit.fluxcd.io/summary: "Test-2: staging env, cluster-1 & us-east-1"
         event.toolkit.fluxcd.io/deploymentID: podinfo-3
     spec:
       interval: 10m
       chartRef:
         kind: OCIRepository
         name: podinfo-repo
       values:
         replicaCount: 2
    
    [Screenshot: Jaeger trace for this use case]
    However, it looks like the `OCIRepository` and `HelmRelease` events are collected into separate traces, because the two kinds use different conventions to populate the revision:
    [Screenshots: Jaeger trace views]
    • OCIRepository
      - revision: 6.9.1@sha256:565d310746f1fa4be7f93ba7965bb393153a2d57a15cfe5befc909b790a73f8a
    • HelmRelease
      - revision: 6.9.1+565d310746f1
      - oci-digest: sha256:565d310746f1fa4be7f93ba7965bb393153a2d57a15cfe5befc909b790a73f8a
      - app-version: 6.9.1
  • HelmChart -> HelmReleases

    apiVersion: notification.toolkit.fluxcd.io/v1beta3
    kind: Alert
    metadata:
      name: alert-test-3
      namespace: flux-system
    spec:
      # summary: "Flagger impacted in us-east-2"
      providerRef:
        name: jaeger
      # eventSeverity: error
      eventSources:
        - kind: HelmChart
          name: '*'
          namespace: test-3
        - kind: HelmRelease
          name: '*'
          namespace: test-3
      eventMetadata:
        env: staging
        cluster: cluster-3
        region: us-west-1
    
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: HelmRepository
    metadata:
      name: podinfo
      namespace: test-3
    spec:
      interval: 5m0s
      url: https://stefanprodan.github.io/podinfo
    ---
    apiVersion: source.toolkit.fluxcd.io/v1
    kind: HelmChart
    metadata:
      name: podinfo
      namespace: test-3
    spec:
      interval: 5m0s
      chart: podinfo
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: podinfo
      version: '5.*'
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    metadata:
      name: podinfo-1
      namespace: test-3
    spec:
      interval: 10m
      timeout: 5m
      chartRef:
        kind: HelmChart
        name: podinfo
      install:
        remediation:
          retries: 3
      upgrade:
        remediation:
          retries: 3
      test:
        enable: true
    ---
    apiVersion: helm.toolkit.fluxcd.io/v2
    kind: HelmRelease
    metadata:
      name: podinfo-2
      namespace: test-3
    spec:
      interval: 10m
      timeout: 5m
      chartRef:
        kind: HelmChart
        name: podinfo
      install:
        remediation:
          retries: 3
      upgrade:
        remediation:
          retries: 3
      test:
        enable: true
    
    [Screenshot: Jaeger trace for this use case]

Part of: #1097
Part of: fluxcd/flux2#5321

@adri1197 adri1197 changed the title Otel [RFC-0011] - OpenTelemetry integration based on alerts Jul 23, 2025
@adri1197 adri1197 force-pushed the otel branch 2 times, most recently from 6e2dbbd to 7a793ec Compare July 24, 2025 08:14
@adri1197 adri1197 marked this pull request as ready for review July 24, 2025 08:17
@adri1197 adri1197 marked this pull request as draft July 24, 2025 13:20
@adri1197 adri1197 changed the title [RFC-0011] - OpenTelemetry integration based on alerts [RFC-0011] - OTEL integration based on alerts Jul 24, 2025
@adri1197 adri1197 force-pushed the otel branch 4 times, most recently from 261f89d to 95a19f7 Compare July 30, 2025 08:51
@stefanprodan stefanprodan added the enhancement, area/alerting, and experimental labels Jul 30, 2025
Member

@matheuscscp matheuscscp left a comment

First pass

@adri1197 adri1197 force-pushed the otel branch 2 times, most recently from 2721ea2 to 4d10e0d Compare August 11, 2025 18:08
@adri1197
Contributor Author

adri1197 commented Aug 18, 2025

All the use cases, with their respective screenshots, have been updated in the comment above.

Most of them went as expected. However, for OCIRepository -> HelmRelease, the two kinds use different conventions for the revision, which results in two separate traces. This could be addressed in code, but I'd like to hear your view before I start working on it.
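To make the revision mismatch concrete, here is a minimal Python sketch (not the controller's Go code) of one way the two formats could be mapped to a common key. It assumes, based only on the sample revisions in the use case above, that the HelmRelease form is `<version>+<first 12 hex characters of the OCI digest>`. The function name `normalize_revision` is hypothetical.

```python
def normalize_revision(revision: str) -> str:
    """Map an OCIRepository-style revision onto the HelmRelease-style form.

    Assumption: '<version>@sha256:<digest>' corresponds to
    '<version>+<first 12 hex chars of digest>', as the sample revisions suggest.
    Any other format is returned unchanged.
    """
    marker = "@sha256:"
    if marker in revision:
        version, digest = revision.split(marker, 1)
        return f"{version}+{digest[:12]}"
    return revision


oci = "6.9.1@sha256:565d310746f1fa4be7f93ba7965bb393153a2d57a15cfe5befc909b790a73f8a"
helm = "6.9.1+565d310746f1"
print(normalize_revision(oci) == normalize_revision(helm))
```

With a shared key like this, both kinds would feed the same revision string into whatever derives the trace identifier. Whether the short-digest form is the right canonical choice is an open question for the discussion.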

Additionally, in the traces list, the name shown (unknown_service:notification-controller) is not very descriptive. It could be replaced by setting the service name (OTEL terminology). What do you think the proper convention should be? I would suggest <alert-namespace>/<alert-name>.

On the other hand, the attribute limits (which apply to the information populated on every span) appear to be configurable (Attribute Limits). If a value exceeds the configured limit, its content is truncated, so I think the event message could be populated as well. For now I have removed it from the attributes, but it could easily be added back at any time.
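As a rough illustration of the truncation behaviour mentioned above, here is a small Python sketch that mimics a per-value length limit on string attributes. It is not the OTEL SDK itself. The helper names and the `message` attribute key are hypothetical, and the limit is passed in explicitly rather than read from SDK configuration.

```python
from typing import Dict, Optional


def truncate_attribute(value: str, limit: Optional[int]) -> str:
    """Keep at most `limit` characters; None means no limit is configured."""
    if limit is None:
        return value
    return value[:limit]


def span_attributes(message: str, metadata: Dict[str, str],
                    limit: Optional[int] = None) -> Dict[str, str]:
    """Build span attributes from event metadata plus the event message.

    Assumption: the message is stored under a 'message' key (hypothetical).
    """
    attrs = dict(metadata)
    attrs["message"] = truncate_attribute(message, limit)
    return attrs


attrs = span_attributes("Reconciliation finished in 1.2s", {"env": "staging"}, limit=14)
print(attrs["message"])
```

Because an over-long message would be shortened rather than rejected, adding the event message as an attribute should be safe as long as a limit is in place.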

@adri1197 adri1197 marked this pull request as ready for review August 19, 2025 08:17
@stealthybox
Member

stealthybox commented Aug 21, 2025

Setting otel service name to <alert-namespace>/<alert-name> makes sense to me since the spans are ultimately sourced from the Alert and its UID.

We could also explore the option of sourcing the service name entirely from Alert.spec.eventMetadata, or appending to it in the style of <alert-namespace>/<alert-name>/<alert.spec.eventMetadata.otel_service_name>. It feels bespoke but would be flexible 🤔
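The two naming options discussed here can be sketched in a few lines of Python (illustrative only, not the controller's Go code). The function name `otel_service_name` is hypothetical. The `otel_service_name` metadata key is the one floated in this comment, not an existing API field.

```python
from typing import Dict, Optional


def otel_service_name(alert_namespace: str, alert_name: str,
                      event_metadata: Optional[Dict[str, str]] = None) -> str:
    """Return '<alert-namespace>/<alert-name>', optionally extended with a
    suffix taken from the Alert's eventMetadata (key name is an assumption)."""
    base = f"{alert_namespace}/{alert_name}"
    suffix = (event_metadata or {}).get("otel_service_name")
    return f"{base}/{suffix}" if suffix else base


print(otel_service_name("flux-system", "alert-test-1"))
print(otel_service_name("flux-system", "alert-test-1",
                        {"env": "staging", "otel_service_name": "podinfo"}))
```

Falling back to the plain namespace/name form keeps existing Alerts working without any extra metadata.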

@adri1197
Contributor Author

adri1197 commented Aug 27, 2025

Resuming the discussion we had some time ago: this implementation does not support parent-child relationships across spans.

Currently, the code forces a specific value for the traceID (a hash based on the convention AlertUID:revision). This way, all the spans are grouped into a single trace. If either part of the convention changes (the UID or the revision), the traceID changes and a new trace is created (with all the subsequent spans underneath it).
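The idea can be illustrated with a minimal Python sketch (not the controller's Go code). The comment only says "hash", so SHA-256 truncated to 16 bytes is an assumption made here to obtain the 16-byte trace ID that OTEL expects. The function name and the sample UID are hypothetical.

```python
import hashlib


def derive_trace_id(alert_uid: str, revision: str) -> str:
    """Return a 32-hex-character (16-byte) trace ID for 'AlertUID:revision'.

    Assumption: SHA-256 truncated to 16 bytes; the actual hash may differ.
    """
    key = f"{alert_uid}:{revision}".encode("utf-8")
    return hashlib.sha256(key).digest()[:16].hex()


uid = "0b1c2d3e-hypothetical-alert-uid"
same_a = derive_trace_id(uid, "6.9.1+565d310746f1")
same_b = derive_trace_id(uid, "6.9.1+565d310746f1")
other = derive_trace_id(uid, "6.9.1@sha256:565d310746f1fa4be7f93ba7965bb393153a2d57a15cfe5befc909b790a73f8a")
print(same_a == same_b, same_a == other)
```

Events that share the Alert UID and the exact revision string land in the same trace. Any difference in the revision string, even for the same underlying artifact, produces a different trace ID.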

OTEL also allows forcing the spanID. However, the referenced spanID needs to be already present in the system (Jaeger or any other OTEL Collector), because a parent-child relationship is established by referring to an existing spanID.

Coming back to our use case, if we would like deeper levels in our spans, we need to think of a way to establish that relationship. For instance, following the first use case above, we could have a GitRepository as the root span (parent) and all the Kustomizations underneath it (children). To do so, we need a way to propagate a spanID that other objects can reuse. One possible solution I came across is propagating annotations that carry this information (the spanID).
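A rough Python sketch of the annotation idea follows, purely to frame the discussion. It uses plain data classes rather than the OTEL SDK. The annotation key `event.toolkit.fluxcd.io/parentSpanID` is hypothetical, as are the helper names and the sample trace ID.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical annotation key; nothing like it exists in Flux today.
PARENT_SPAN_ANNOTATION = "event.toolkit.fluxcd.io/parentSpanID"


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str] = None


def build_span(name: str, trace_id: str, annotations: Dict[str, str]) -> Span:
    """Create a span with a fresh 8-byte span ID; if the object carries the
    parent annotation, record it as the parent span ID."""
    return Span(
        name=name,
        trace_id=trace_id,
        span_id=secrets.token_hex(8),  # 8 bytes, i.e. 16 hex characters
        parent_span_id=annotations.get(PARENT_SPAN_ANNOTATION),
    )


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # sample value only
root = build_span("GitRepository/test-repo", trace_id, {})
# The source's span ID would be propagated to the Kustomization as an annotation.
child = build_span("Kustomization/podinfo", trace_id,
                   {PARENT_SPAN_ANNOTATION: root.span_id})
print(child.parent_span_id == root.span_id)
```

The open design question is which component would write the annotation onto downstream objects, and when.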

This does not necessarily need to be tackled in this PR; I just want to kick off the discussion a bit 😃.

Any thoughts on this? @stefanprodan @matheuscscp

Member

@matheuscscp matheuscscp left a comment

LGTM! 🚀

Member

@stefanprodan stefanprodan left a comment

Please add otel to the provider docs and explain what it does

@adri1197
Contributor Author

adri1197 commented Sep 3, 2025

Please add otel to the provider docs and explain what it does

Sorry, I forgot to add the documentation. I've just added it now 😄.

Signed-off-by: Adrian Fernandez De La Torre <[email protected]>
@matheuscscp
Member

Hey @adri1197, I have one last question for this PR:

The screenshots above show that, after notification-controller has sent all the events of a particular revision to Jaeger, Jaeger can correlate them and display all the spans sent from each controller as part of a single trace. Is that correct?

@adri1197
Contributor Author

adri1197 commented Sep 5, 2025

Hey @adri1197, I have one last question for this PR:

The screenshots above show that, after notification-controller has sent all the events of a particular revision to Jaeger, Jaeger can correlate them and display all the spans sent from each controller as part of a single trace. Is that correct?

Hi @matheuscscp! 😃

Yes, based on the use cases discussed. Perhaps the top screenshots do not show the whole picture properly; I would suggest taking a look at the others shared below. All the spans you see in each screenshot belong to a single trace (and therefore share a traceID - rootSpan).

Member

@stefanprodan stefanprodan left a comment

LGTM

Thanks @adri1197 🥇

@stefanprodan stefanprodan merged commit a7cac5f into fluxcd:main Sep 5, 2025
5 checks passed
@adri1197 adri1197 deleted the otel branch September 5, 2025 21:55
@matheuscscp matheuscscp changed the title [RFC-0011] - OTEL integration based on alerts [RFC-0011] OTEL integration based on alerts Sep 24, 2025
Labels

area/alerting (Alerting related issues and PRs), enhancement (New feature or request), experimental (Issues and pull requests related to experimental features)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenTelemetry integration

4 participants