diff --git a/docs/latest/modules/en/nav.adoc b/docs/latest/modules/en/nav.adoc index 9eb0a757..700858a0 100644 --- a/docs/latest/modules/en/nav.adoc +++ b/docs/latest/modules/en/nav.adoc @@ -15,6 +15,7 @@ * Monitors and alerts ** xref:use/alerting/k8s-monitors.adoc[Monitors] ** xref:use/alerting/kubernetes-monitors.adoc[Out of the box monitors for Kubernetes] +*** xref:use/alerting/k8s-override-monitor-arguments.adoc[Override monitor arguments] ** xref:use/alerting/notifications/README.adoc[Notifications] *** xref:use/alerting/notifications/configure.adoc[Configure notifications] *** xref:use/alerting/notifications/channels/README.adoc[Notification channels] @@ -23,19 +24,22 @@ **** xref:use/alerting/notifications/channels/webhook.adoc[Webhook] **** xref:use/alerting/notifications/channels/opsgenie.adoc[Opsgenie] *** xref:use/alerting/notifications/troubleshooting.adoc[Troubleshooting] +ifndef::ss-ff-stackpacks2_enabled[] ** Customize *** xref:use/alerting/k8s-add-monitors-cli.adoc[Add a monitor using the CLI] *** xref:use/alerting/k8s-derived-state-monitors.adoc[Derived State monitors] *** xref:use/alerting/k8s-dynamic-threshold-monitors.adoc[Dynamic Threshold monitors] -*** xref:use/alerting/k8s-override-monitor-arguments.adoc[Override monitor arguments] *** xref:use/alerting/k8s-write-remediation-guide.adoc[Write a remediation guide] +endif::ss-ff-stackpacks2_enabled[] * Metrics ** xref:use/metrics/k8sTs-explore-metrics.adoc[Explore Metrics] ** xref:use/metrics/k8sTs-metric-reference.adoc[Metrics references] +ifndef::ss-ff-stackpacks2_enabled[] ** Custom charts *** xref:use/metrics/k8s-add-charts.adoc[Adding custom charts to components] *** xref:use/metrics/k8s-writing-promql-for-charts.adoc[Writing PromQL queries for representative charts] *** xref:use/metrics/k8sTs-metrics-troubleshooting.adoc[Troubleshooting custom charts] +endif::ss-ff-stackpacks2_enabled[] ** Advanced Metrics *** xref:use/metrics/k8s-stackstate-grafana-datasource.adoc[Grafana Datasource] 
*** xref:use/metrics/k8s-prometheus-remote-write.adoc[Prometheus remote_write] @@ -93,6 +97,18 @@ ifdef::ss-ff-stackpacks2_enabled[] *** xref:setup/custom-integrations/otelmappings/getting-started.adoc[Getting Started] *** xref:setup/custom-integrations/otelmappings/schemas-ref.adoc[Schema Reference] *** xref:setup/custom-integrations/otelmappings/troubleshooting.adoc[Troubleshooting] +** xref:setup/custom-integrations/metric-bindings/index.adoc[Metric Bindings] +*** xref:setup/custom-integrations/metric-bindings/writing-promql.adoc[Writing PromQL] +*** xref:setup/custom-integrations/metric-bindings/schemas-ref.adoc[Schema Reference] +*** xref:setup/custom-integrations/metric-bindings/troubleshooting.adoc[Troubleshooting] +*** xref:setup/custom-integrations/metric-bindings/cli.adoc[CLI Support] +** xref:setup/custom-integrations/monitors/index.adoc[Monitors] +*** xref:setup/custom-integrations/monitors/remediation-guide.adoc[Write a remediation guide] +*** xref:setup/custom-integrations/monitors/schemas-ref.adoc[Schema Reference] +*** xref:setup/custom-integrations/monitors/troubleshooting.adoc[Troubleshooting] +*** xref:setup/custom-integrations/monitors/cli.adoc[CLI Support] +*** xref:setup/custom-integrations/monitors/derived-state-monitors.adoc[Derived State monitors] +*** xref:setup/custom-integrations/monitors/dynamic-threshold-monitors.adoc[Dynamic Threshold monitors] endif::ss-ff-stackpacks2_enabled[] * Open Telemetry ** xref:setup/otel/overview.adoc[Overview] diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/develop.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/develop.adoc index 6d88ee74..0a5d5be3 100644 --- a/docs/latest/modules/en/pages/setup/custom-integrations/develop.adoc +++ b/docs/latest/modules/en/pages/setup/custom-integrations/develop.adoc @@ -17,7 +17,7 @@ As a developer, you want to integrate your technology with {stackstate-product-n - Access to the xref:/setup/cli/cli-sts.adoc[StackPack CLI] - Feature flag 
enabled: - ++ [source,bash] ---- export STS_EXPERIMENTAL_STACKPACK=true @@ -26,10 +26,11 @@ export STS_EXPERIMENTAL_STACKPACK=true == How to Develop a Custom Integration (StackPack) -. *Scaffold a New StackPack* -+ +=== Scaffold a New StackPack + The `scaffold` subcommand creates a new StackPack project structure from configurable templates, streamlining the initial setup process for StackPack development. The command supports both local directory templates and GitHub-hosted templates. -Create a new StackPack project using the CLI: + +- Create a new StackPack project using the CLI: + [source,bash] ---- @@ -79,27 +80,26 @@ Next steps: ---- sts stackpack scaffold --name my-stackpack --template-local-dir ./templates --template-name webapp ---- -+ -. *Review and Customize* -+ +=== Review and Customize + - Review the generated files in the target directory. - Edit `stackpack.yaml` and other files to define your integration logic. * Integrating data. To map components and ingest metrics see xref:setup/custom-integrations/otelmappings/README.adoc[Adding Otel Telemetry Mappings] - * xref:use/alerting/k8s-add-monitors-cli.adoc[Adding Monitors] - * xref:use/metrics/k8s-add-charts.adoc[Adding Metric Bindings] + * xref:setup/custom-integrations/monitors/index.adoc[Adding Monitors] + * xref:setup/custom-integrations/metric-bindings/index.adoc[Adding Metric Bindings] [NOTE] ==== The default https://github.com/StackVista/stackpack-templates[stackvista/stackpack-templates] template is a great starting point, as it contains examples for most of the extension points such as Monitors and Metric Bindings. ==== -. *Test Your StackPack* -+ +=== Test Your StackPack + The `test` subcommand streamlines the StackPack development workflow by automating the package → upload → install/upgrade sequence with automatic version management for testing iterations.
-Rapidly test your StackPack in a pre-production environment (requires a running Suse Observability instance): +- Rapidly test your StackPack in a pre-production environment (requires a running SUSE Observability instance): + [source,bash] ---- @@ -139,11 +139,11 @@ ID | NAME | STATUS | VERSION | LAST UPDATED - The `test` subcommand packages, uploads, and installs your StackPack with a test version suffix. - Iterate on your StackPack using the `test` command for rapid feedback: make changes and observe that the install/upgrade process executes successfully. Review the ingested topology and telemetry in the SUSE Observability UI. -. *Package Your finished StackPack version* -+ -The `package` subcommand that creates zip files from stackpack directories. This command packages all required stackpack files and directories into a properly named zip archive for distribution and deployment. -Package your StackPack into a distributable zip: +=== Package Your finished StackPack version + +The `package` subcommand creates a zip file from StackPack directories. This command packages all required StackPack files and directories into a properly named zip archive for distribution and deployment. +- Package your StackPack into a distributable zip: + [source,bash] ---- @@ -161,9 +161,11 @@ Zip file: /Users/viliakov/Workspace/src/github/stackstate-cli/my-stackpack/my-st sts stackpack package -d ./my-stackpack -f my-custom-archive.zip ---- +=== Upload your StackPack +The `upload` subcommand pushes the zip archive to a running {stackstate-product-name} instance. -. *Upload your stackpack to your instance* +- Upload the archive to {stackstate-product-name}: + [source,bash] ---- @@ -171,6 +173,8 @@ sts stackpack upload -f ./my-stackpack-0.0.1.zip ✅ ✓ Stackpack uploaded successfully! ----
+=== Install the StackPack + +xref:/stackpacks/about-stackpacks.adoc#_install_or_uninstall_a_stackpack[Install or Upgrade your StackPack] via the {stackstate-product-name} stackpacks UI. endif::ss-ff-stackpacks2_enabled[] diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/cli.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/cli.adoc new file mode 100644 index 00000000..f6ce85de --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/cli.adoc @@ -0,0 +1,119 @@ += Using the CLI for metric bindings +:revdate: 2026-01-14 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +You can use the {stackstate-product-name} CLI to inspect and modify metric bindings. These are handled like other settings using the `sts settings` command. + +== Inspecting Metric Bindings + +=== Listing Metric Bindings + +The `sts settings` command can list all metric bindings: + +[,bash] +---- +sts settings list --type MetricBinding +TYPE | ID | IDENTIFIER | NAME | OWNED BY | LAST UPDATED +MetricBinding | 190567588459765 | urn:stackpack:kube | .NET GC Allocated | urn:stackpack:kube | Sun Jan 11 01:28:2 + | | rnetes-v2:shared:m | | rnetes-v2:shared | 8 2026 CET + | | etric-binding:pod: | | | + | | dotnet-gc-allocate | | | + | | d | | | +MetricBinding | 247972504900226 | urn:stackpack:kube | .NET GC Allocated | urn:stackpack:kube | Sun Jan 11 01:28:2 + | | rnetes-v2:shared:m | | rnetes-v2:shared | 8 2026 CET + | | etric-binding:depl | | | + | | oyment:dotnet-gc-a | | | + | | llocated | | | +MetricBinding | 109239589408271 | urn:stackpack:open | .NET GC Allocated | urn:stackpack:open | Wed Jan 7 00:20:48 + | | -telemetry:shared: | | -telemetry:shared | 2026 CET + | | metric-binding:ser | | | + | | vice:dotnet-gc-all | | | + | | ocated | | | +... 
+---- + +=== Describing Metric Bindings + +You can get the definition of an existing metric binding by using the `describe` command: + +[,bash] +---- +sts settings describe --ids 190567588459765 +_version: 1.0.93 +nodes: +- _type: MetricBinding + chartType: line + description: Bytes allocated to GC Heap + enabled: true + id: -1 + identifier: urn:stackpack:kubernetes-v2:shared:metric-binding:pod:dotnet-gc-allocated + layout: + metricPerspective: + section: GC + tab: .NET + weight: 3 + name: .NET GC Allocated + priority: high + queries: + - alias: allocated + expression: rate(process_runtime_dotnet_gc_allocations_size_bytes_total{k8s_cluster_name="${tags.cluster-name}", k8s_namespace_name="${tags.namespace}", k8s_pod_name="${name}"}[${__rate_interval}]) + scope: (label = "stackpack:kubernetes" and type = "pod") + unit: bytes(IEC) +timestamp: 2026-01-14T13:11:07.575662922Z[Etc/UTC] +---- + +== Modifying Metric Bindings + +[NOTE] +==== +The recommended way of working is to store metric bindings (and any other custom resources created in {stackstate-product-name}) as YAML files in xref:/setup/custom-integrations/overview.adoc[a StackPack]. From there changes can be manually applied or it can be fully automated by using the SUSE Observability CLI in a CI/CD system like GitHub actions or GitLab pipelines. 
+====
+
+=== Create/update a Metric Binding
+Create a file `metric-bindings.yaml` that looks like this:
+
+[source,yaml]
+----
+nodes:
+- _type: MetricBinding
+  chartType: line
+  enabled: true
+  tags: {}
+  unit: short
+  name: Replica counts
+  priority: MEDIUM
+  identifier: urn:stackpack:my-stackpack:metric-binding:my-deployment-replica-counts
+  queries:
+  - expression: max_over_time(kubernetes_state_deployment_replicas{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+    alias: Total replicas
+  scope: type = "deployment" and label = "stackpack:kubernetes"
+----
+
+Use the xref:/setup/cli/cli-sts.adoc[SUSE Observability CLI] to create the metric binding:
+
+[,bash]
+----
+sts settings apply -f metric-bindings.yaml
+----
+
+Verify the results in SUSE Observability by opening the metrics perspective for a deployment. If you're not happy with the result, simply change the metric binding in the YAML file and run the command again to update it. The list of nodes supports adding many metric bindings: simply add another metric binding entry to the YAML array using the same steps as before.
+
+[CAUTION]
+====
+The identifier is used as the unique key of a metric binding. Changing the identifier will create a new metric binding instead of updating the existing one.
+====
+
+=== Delete a Metric Binding
+
+Finally, to delete a metric binding, use:
+
+[,bash]
+----
+sts settings delete --ids <id>
+----
+
+The `<id>` in this command isn't the identifier but the number in the `ID` column of the `sts settings list` output.
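+
+For example, using the numeric ID from the listing earlier on this page (the ID will differ in your environment), a full inspect-and-remove cycle could look like this:
+
+[,bash]
+----
+# Find the numeric ID of the metric binding
+sts settings list --type MetricBinding
+
+# Inspect it to make sure it's the right one
+sts settings describe --ids 190567588459765
+
+# Delete it by numeric ID (not by its urn identifier)
+sts settings delete --ids 190567588459765
+----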
+
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/index.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/index.adoc
new file mode 100644
index 00000000..54a494c9
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/index.adoc
@@ -0,0 +1,210 @@
+ifdef::ss-ff-stackpacks2_enabled[]
+= Add custom charts to components
+:revdate: 2025-07-10
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Overview
+
+{stackstate-product-name} already provides many metric charts by default on most types of components that represent Kubernetes resources. Extra metric charts can be added to any set of components whenever needed. When adding metrics to components there are two options:
+
+. The metrics are already collected by {stackstate-product-name} but aren't visualized on a component by default
+. The metrics aren't collected by {stackstate-product-name} at all and therefore aren't available yet
+
+For option 1, the steps below will instruct you on how to create a metric binding, which will configure {stackstate-product-name} to add a specific metric to a specific set of components.
+
+For option 2, ensure the metrics are available in {stackstate-product-name} by sending them to {stackstate-product-name} using the xref:/use/metrics/k8s-prometheus-remote-write.adoc[Prometheus remote write protocol]. Continue by adding charts for the metrics to the components ONLY after ensuring the metrics are available.
+
+== Creating a metric binding
+
+. <<_create_an_outline_of_the_metric_binding,Create an outline of the metric binding>>
+. <<_select_the_components_to_bind_to,Select the components to bind to>>
+. <<_write_the_promql_query,Write the PromQL query for the desired metric>>
+. <<_bind_the_correct_time_series_to_each_component,Bind the correct time series to each component>>
+
+For example, the steps will add a metric binding for the `Replica counts` of Kubernetes deployments. This metric binding already exists in {stackstate-product-name} by default.
+
+=== Create an outline of the metric binding
+
+Open the `metricbindings.yaml` file in your favorite code editor to change it throughout this guide. You can use the CLI to xref:/setup/custom-integrations/develop.adoc#_test_your_stackpack[Test Your StackPack].
+
+----
+- _type: MetricBinding
+  chartType: line
+  enabled: true
+  tags: {}
+  unit: short
+  name: Replica counts
+  priority: MEDIUM
+  identifier: urn:stackpack:my-stackpack:metric-binding:my-deployment-replica-counts
+  queries: []
+  scope: ""
+----
+
+The queries and scope sections will be filled in the next steps. Note that the unit used is `short`, which will simply render a numeric value. In case you're not sure yet about the unit of the metric, you can leave it open and decide on the correct unit when writing the PromQL query.
+
+=== Select the components to bind to
+
+Save a view of the xref:/use/views/k8s-topology-perspective.adoc[Topology perspective] and use the filters (Filters -> Topology -> Switch to STQL) to query the components that need to show the new metric. The most common fields to select topology on for metric bindings are `type` for the component type and `label` for selecting on labels.
+For example, for deployments:
+
+----
+type = "deployment" and label = "stackpack:kubernetes"
+----
+
+The type filter selects all deployments, while the label filter selects only components created by the Kubernetes stackpack (label name is `stackpack` and label value is `kubernetes`). The latter can also be omitted to get the same result. All xref:/develop/reference/k8sTs-stql_reference.adoc#_component_filters[STQL query Component Filters] can be used for filtering.
+
+Switch to the advanced mode to copy the resulting topology query and put it in the `scope` field of the metric binding.
+
+[NOTE]
+====
+Metric bindings only support the query filters. Query functions like `withNeighborsOf` are not supported and cannot be used.
+====
+
+
+=== Write the PromQL query
+
+Go to the xref:/use/metrics/k8sTs-explore-metrics.adoc[metric explorer] of your {stackstate-product-name} instance, http://your-instance/#/metrics, and use it to query for the metric of interest. The explorer has auto-completion for metrics, labels, and label values, as well as PromQL functions and operators, to help you out. Start with a short time range of, for example, an hour to get the best results.
+
+For the total number of replicas, use the `kubernetes_state_deployment_replicas` metric. To show the metric charts of the time series data, extend the query to do an aggregation using the `+${__interval}+` parameter:
+
+----
+max_over_time(kubernetes_state_deployment_replicas[${__interval}])
+----
+
+In this specific case, use `max_over_time` to make sure the chart always shows the highest number of replicas at any given time. For longer time ranges, short dips in the replica count won't be shown. To emphasize the lowest number of replicas, use `min_over_time` instead.
+
+Copy the query into the `expression` property of the first entry in the `queries` field of the metric binding. Use `Total replicas` as the alias so it shows up in the chart legend.
+
+[NOTE]
+====
+In {stackstate-product-name}, the size of the metric chart automatically determines the granularity of the metric shown in the chart. PromQL queries can be adjusted to make optimal use of this behavior to get a representative chart for the metric. xref:/setup/custom-integrations/metric-bindings/writing-promql.adoc[Writing PromQL for charts] explains this in detail.
+====
+
+
+=== Bind the correct time series to each component
+
+The metric binding with all fields filled in:
+
+----
+_type: MetricBinding
+chartType: line
+enabled: true
+tags: {}
+unit: short
+name: Replica counts
+priority: MEDIUM
+identifier: urn:stackpack:my-stackpack:metric-binding:my-deployment-replica-counts
+queries:
+  - expression: max_over_time(kubernetes_state_deployment_replicas[${__interval}])
+    alias: Total replicas
+scope: type = "deployment" and label = "stackpack:kubernetes"
+----
+
+Creating it in {stackstate-product-name} and viewing the "Replica counts" chart on a deployment component gives an unexpected result. The chart shows the replica counts for all deployments. Logically one would expect only one time series: the replica count for this specific deployment.
+
+image::k8s/k8s-replica-counts-without-binding.png[The incorrect chart for a single deployment, it shows the replica count for all deployments]
+
+To fix this, make the PromQL query specific to a component using information from the component. Filter on enough metric labels to select only the specific time series for the component. This is the "binding" of the correct time series to the component. For anyone experienced in making Grafana dashboards, this is similar to a dashboard with parameters that are used in queries on the dashboard.
+Let's change the query in the metric binding to this:
+
+----
+max_over_time(kubernetes_state_deployment_replicas{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+----
+
+image::k8s/k8s-replica-counts-with-binding.png[After adding the parameterized filters the resulting chart looks as expected, only one time series for this component]
+
+The PromQL query now filters on three labels: `cluster_name`, `namespace` and `deployment`. Instead of specifying an actual value for these labels, a variable reference to fields of the component is used. In this case, the labels `cluster-name` and `namespace` are used, referenced using `${tags.cluster-name}` and `${tags.namespace}`. Furthermore, the component name is referenced with `+${name}+`.
+
+Supported variable references are:
+
+* Any component label, using `+${tags.<label-name>}+`
+* The component name, using `+${name}+`
+
+image::k8s/k8s-carts-highlights.png[Component Highlights page that shows the labels and component name (both highlighted in red)]
+
+[NOTE]
+====
+The cluster name, namespace and a combination of the component type and name are usually enough for selecting the metrics for a specific component from Kubernetes. These labels, or similar labels, are usually available on most metrics and components.
+====
+
+
+== Advanced
+
+=== More than one time series in a chart
+
+[NOTE]
+====
+There is only one unit for a metric binding (it gets plotted on the y-axis of the chart). As a result you should only combine queries that produce time series with the same unit in one metric binding. Sometimes it might be possible to convert the unit. For example, CPU usage might be reported in milli-cores or cores; milli-cores can be converted to cores by dividing by 1000, like this: `(<query>) / 1000`.
+====
+
+
+There are two ways to get more than one time series in a single metric binding and therefore in a single chart:
+
+. Write a PromQL query that returns multiple time series for a single component
+. Add more PromQL queries to the metric binding
+
+For the first option, an example is given in the xref:/setup/custom-integrations/metric-bindings/index.adoc#_using_metric_labels_in_aliases[next section]. The second option can be useful for comparing related metrics. Some typical use-cases:
+
+* Comparing total replicas vs desired and available
+* Resource usage: limits, requests and usage in a single chart
+
+To add more queries to a metric binding, simply repeat xref:/setup/custom-integrations/metric-bindings/index.adoc#_steps[steps] 3 and 4 and add the query as an extra entry in the list of queries. For the deployment replica counts there are several related metrics that can be included in the same chart:
+
+----
+- _type: MetricBinding
+  chartType: line
+  enabled: true
+  tags: {}
+  unit: short
+  name: Replica counts
+  priority: MEDIUM
+  identifier: urn:stackpack:my-stackpack:metric-binding:my-deployment-replica-counts
+  queries:
+    - expression: max_over_time(kubernetes_state_deployment_replicas{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+      alias: Total replicas
+    - expression: max_over_time(kubernetes_state_deployment_replicas_available{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+      alias: Available - ${cluster_name} - ${namespace} - ${deployment}
+    - expression: max_over_time(kubernetes_state_deployment_replicas_unavailable{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+      alias: Unavailable - ${cluster_name} - ${namespace} - ${deployment}
+    - expression: min_over_time(kubernetes_state_deployment_replicas_desired{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+      alias: Desired - ${cluster_name} - ${namespace} - ${deployment}
+  scope: type = "deployment" and label = "stackpack:kubernetes"
+----
+
+image::k8s/k8s-replica-counts-multiple-timeseries.png[Metric binding with multiple metrics]
+
+=== Using metric labels in aliases
+
+When a single query returns multiple time series per component, this will show as multiple lines in the chart, but in the legend they will all use the same alias. To be able to see the difference between the different time series, the alias can include references to the metric labels using the `+${<label-name>}+` syntax. For example, here is a metric binding for the "Container restarts" metric on a pod (note that a pod can have multiple containers):
+
+----
+_type: MetricBinding
+chartType: line
+enabled: true
+id: -1
+identifier: urn:stackpack:my-stackpack:metric-binding:my-pod-restart-count
+name: Container restarts
+priority: MEDIUM
+queries:
+- alias: Restarts - ${container}
+  expression: max by (cluster_name, namespace, pod_name, container) (kubernetes_state_container_restarts{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"})
+scope: (label = "stackpack:kubernetes" and type = "pod")
+unit: short
+----
+
+Note that the `alias` references the `container` label of the metric. Make sure the label is present on the query result; when the label is missing, `+${container}+` will be rendered as literal text to help with troubleshooting.
+
+=== Layouts
+
+Each component can be associated with various technologies or protocols such as k8s, networking, runtime environments (e.g., JVM), protocols (HTTP, AMQP), etc.
+Consequently, a multitude of different metrics can be displayed for each component. For easier readability, {stackstate-product-name} can organize these charts into tabs and sections.
+To display a chart (`MetricBinding`) within a specific tab or section, you need to configure the layout property.
+Any `MetricBinding` without a specified layout will be displayed in a tab and section named `Other`.
+
+endif::ss-ff-stackpacks2_enabled[]
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/schemas-ref.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/schemas-ref.adoc
new file mode 100644
index 00000000..e27cf482
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/schemas-ref.adoc
@@ -0,0 +1,79 @@
+ifdef::ss-ff-stackpacks2_enabled[]
+= Schema and reference for Metric Bindings
+:revdate: 2026-01-09
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Overview
+
+This page describes the schema for a `MetricBinding`, along with detailed explanations of constructs, expression syntax and semantics.
+
+== Schema for Metric Binding
+
+Each metric binding:
+
+* Selects components where it can be applied
+* Defines PromQL query templates for retrieving data
+* Specifies how the resulting data must be rendered
+* Includes layout hints for selecting an appropriate place in the UI to show the chart
+
+[,yaml]
+----
+_type: "MetricBinding"
+name: string
+chartType: "line" # "line" is the only type for now
+unit?: string
+scope: string # Topology scope - components to bind to
+enabled: boolean # default: true
+description?: string
+valuation?: "higher-is-better" | "lower-is-better"
+priority?: "HIGH" | "MEDIUM" | "LOW" | "NONE" # deprecated
+queries:
+  - _type: "MetricBindingQuery"
+    expression: string # promql query
+    alias: string # name in legend
+    componentIdentifierTemplate?: string # URN template for linking
+    primary?: boolean # is this query the primary one
+tags:
+  <key>: <value>
+layout?: # where should chart be shown
+  metricPerspective?: # the metrics perspective for a component
+    tab: string
+    section: string
+    weight?: integer
+  componentHighlight?: # highlight perspective of a component
+    section: string
+    weight?: integer
+  componentSummary?: # summary - supporting panel on the right
+    weight?: integer
+identifier?: string
+----
+
+* `_type`: {stackstate-product-name} needs to know this is a metric binding, so the value always needs to be `MetricBinding`
+* `name`: The name for the metric binding
+* `chartType`: {stackstate-product-name} will support different chart types (`line`, `bar`, etc.), currently only `line` is supported
+* `unit`: The unit of the values in the time series returned by the query or queries, used to render the Y-axis of the chart. See the xref:/develop/reference/k8sTs-chart-units.adoc[supported units] reference for all units
+* `scope`: The topology scope of the metric binding, a topology query that selects the components on which this metric binding will be shown
+* `enabled`: Set to `false` to keep the metric binding but not show it to users
+* `description`: Optional description, displayed on-hover of the name
+* `valuation`: Whether higher or lower values are "better"
+* `priority`: [Deprecated] One of `HIGH`, `MEDIUM`, or `LOW`. Main sort order for metrics on a component (in the order they're mentioned here), secondary sort order is the `name`.
+* `queries`: A list of queries to show in the chart for the metric binding (see xref:/setup/custom-integrations/metric-bindings/writing-promql.adoc[Writing PromQL queries])
+** `expression`: The (templated) PromQL query
+** `alias`: Name for the query in the legend
+** `componentIdentifierTemplate`: Template for the identifier of a related component, populated with labels of the time series resulting from the query
+** `primary`: Whether this query is the primary one
+* `tags`: Will be used to organize metrics in the user interface, can be left empty using `{}`
+* `layout`: How to group charts on the different perspective views, e.g. on the xref:/use/views/k8s-metrics-perspective.adoc[Metrics perspective]
+** `metricPerspective` - Defines metrics to display on the `Metrics Perspective`. Metrics are grouped into tabs and then sections.
+*** `tab` - Tab name. Tabs are sorted alphabetically
+*** `section` - Section name. Sections are sorted alphabetically
+*** `weight` - Metrics within a section are sorted primarily by weight (ascending) and secondarily by name (alphabetical)
+** `componentHighlight` - Defines metrics to display on `Component Highlight`. Metrics are grouped in sections.
+*** `section` - Section name. Sections are sorted alphabetically
+*** `weight` - Metrics within a section are sorted primarily by weight (ascending) and secondarily by name (alphabetical)
+** `componentSummary` - Specifies metrics to display in the `Components details` sidebar upon component selection. Charts appear only when this property is defined.
+*** `weight` - This represents the weight of the chart. Charts are sorted in ascending order by weight and then the first three charts are displayed.
+* `identifier`: A URN (universal resource identifier), used as the unique identifier of the metric binding. It must start with `urn:stackpack:<stackpack-name>:metric-binding:`, the remainder is free-format as long as it's unique amongst all metric bindings.
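+
+As an illustration, a complete metric binding combining most of these fields could look as follows (the tab, section and weight values are made up for this example; the query and identifier follow the examples from the guide):
+
+[,yaml]
+----
+_type: MetricBinding
+name: Replica counts
+chartType: line
+unit: short
+enabled: true
+priority: MEDIUM
+identifier: urn:stackpack:my-stackpack:metric-binding:my-deployment-replica-counts
+scope: type = "deployment" and label = "stackpack:kubernetes"
+tags: {}
+queries:
+  - expression: max_over_time(kubernetes_state_deployment_replicas{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", deployment="${name}"}[${__interval}])
+    alias: Total replicas
+layout:
+  metricPerspective:
+    tab: Workloads
+    section: Replicas
+    weight: 1
+----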
+
+endif::ss-ff-stackpacks2_enabled[]
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/troubleshooting.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/troubleshooting.adoc
new file mode 100644
index 00000000..686eff23
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/troubleshooting.adoc
@@ -0,0 +1,45 @@
+= Troubleshooting custom metric charts
+:revdate: 2025-07-10
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Overview
+
+* <<_the_metric_chart_doesnt_show_on_the_highlights_page_of_a_component,The metric chart doesn't show on the Highlights page of a component>>
+* <<_the_metric_chart_doesnt_show_on_the_metrics_perspective_of_a_component,The metric chart doesn't show on the metrics perspective of a component>>
+* <<_the_metric_chart_on_a_component_remains_empty_no_data,The metric chart on a component remains empty ("no data")>>
+
+== The metric chart doesn't show on the Highlights page of a component
+
+At the moment it is not possible to customize the metric charts that are shown on a component's Highlights page. The charts for custom metric bindings will be shown in the Metrics perspective only.
+
+== The metric chart doesn't show on the metrics perspective of a component
+
+The `scope` query on a metric binding is used to determine whether a component shows the metric binding. If a component doesn't show a metric binding, check that the topology query in the scope matches the component.
+
+First, check that the component indeed has the expected labels and/or component type on the component highlights page: name and type are at the top, the labels are in the "About" section. Make sure there are no spelling mistakes in label names or values.
+
+Check that the scope query has the correct syntax:
+
+. Open the explore view, via Views in the menu and the blue "Explore" button on the right. Or directly via the URL: `https://<your-instance>/#/views/explore`
+. Open the filters and select `switch to STQL`
+. Now copy/paste the query from the scope into the STQL field and run the query
+
+The overview now shows all components that match the query and that will get the chart.
+
+== The metric chart on a component remains empty ("no data")
+
+For the metric chart that has no data while data was expected, open the inspector (the icon in the top-right corner of the chart). Toggle the "Show query" button to show the queries.
+
+Make sure the query doesn't contain any of the parameters anymore, i.e. all values like `${tags.cluster-name}` or `+${name}+` have been replaced with the values for the component. If some parameters were left behind in the query, the labels were not available on this component. So cross-check the names used (in this example `cluster-name`) against the labels available on the component. Also make sure there are no typos in the names.
+
+If all parameters are filled in, there may be an issue with the PromQL query. To investigate that, copy the PromQL query and open the Metrics explorer (via the main menu of {stackstate-product-name}). Paste the query into the metric explorer and run it. This will most likely also give an empty result.
+
+Either the metric doesn't exist, it doesn't have one of the labels used, or the label does exist but there are no time series matching the value. The fastest method to resolve this is to rewrite the query to only its metric name and run that; if there are results, the metric exists (so no typos). The table result can also be used to verify that all the labels that are used exist. Make sure there are no typos here either.
+
+If there are results, but just not for a specific value of a label (for example for the `pod_name` label), the query is OK but there is no time series for this specific metric for this specific component. Things to check in this case:
+
+* Is the data collected for this component (either via the {stackstate-product-name} agent or some other means)?
+* Is the component even reporting the metric?
+
+How to do this depends on how data collection is configured.
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/writing-promql.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/writing-promql.adoc
new file mode 100644
index 00000000..20769c56
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/metric-bindings/writing-promql.adoc
@@ -0,0 +1,107 @@
+= Writing PromQL queries for representative charts
+:revdate: 2025-07-10
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Guidelines
+
+When {stackstate-product-name} shows data in a chart, it almost always needs to change the resolution of the stored data to make it fit into the available space for the chart. To get the most representative charts possible, follow these guidelines:
+
+* Don't query for the raw metric but always aggregate over time (using `*_over_time` or `rate` functions).
+* Use the `+${__interval}+` parameter as the range for aggregations over time; it will automatically adjust with the resolution of the chart.
+* Use the `+${__rate_interval}+` parameter as the range for `rate` aggregations; it will also automatically adjust with the resolution of the chart but takes into account specific behaviors of `rate`.
+* Project metrics to just the labels used by aggregating over different time series.
+
+Applying an aggregation often means that a trade-off is made to emphasize certain patterns in metrics more than others. For example, for large time windows `max_over_time` will show all peaks but won't show all troughs, while `min_over_time` does the exact opposite, and `avg_over_time` smooths out both peaks and troughs. To show this behavior, here is an example metric binding using the CPU usage of pods. 
To try it yourself, copy it to a YAML file and use the xref:/setup/custom-integrations/metric-bindings/index.adoc#_create_or_update_the_metric_binding_in_stackstate[CLI to apply it] in your own {stackstate-product-name} (you can remove it later). + +Projecting potentially multiple time-series to a subset of their labels folds time-series together that differ in an irrelevant detail. When creating a metric binding, only the labels that are used in the legend are relevant. Similarly, when creating monitors, only those labels that are needed for mapping to a (monitor status on a) component should be returned by the query. + +---- +- _type: MetricBinding + chartType: line + enabled: true + tags: {} + unit: short + name: CPU Usage (different aggregations and intervals) + priority: HIGH + identifier: urn:stackpack:my-stackpack:metric-binding:pod-cpu-usage-a + queries: + - expression: sum(max_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000 + alias: max_over_time dynamic interval + - expression: sum(min_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000 + alias: min_over_time dynamic interval + - expression: sum(avg_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000 + alias: avg_over_time dynamic interval + - expression: sum(last_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[${__interval}])) by (cluster_name, namespace, pod_name) /1000000000 + alias: last_over_time dynamic interval + - expression: sum(max_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", 
pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000 + alias: max_over_time 1m interval + - expression: sum(min_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000 + alias: min_over_time 1m interval + - expression: sum(avg_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000 + alias: avg_over_time 1m interval + - expression: sum(last_over_time(container_cpu_usage{cluster_name="${tags.cluster-name}", namespace="${tags.namespace}", pod_name="${name}"}[1m])) by (cluster_name, namespace, pod_name) /1000000000 + alias: last_over_time 1m interval + scope: (label = "stackpack:kubernetes" and type = "pod") +---- + +After applying it, open the metrics perspective for a pod in {stackstate-product-name} (preferably a pod with some spikes and troughs in CPU usage). Enlarge the chart using the icon in its top-right corner to get a better view. Now you can also change the time window to see what the effects are from the different aggregations (30 minutes vs 24 hours for example). + +[CAUTION] +==== +When the metric binding doesn't specify an aggregation {stackstate-product-name} will automatically use the `last_over_time` aggregation to reduce the number of data points for a chart. See also xref:/setup/custom-integrations/metric-bindings/writing-promql.adoc#_why_is_aggregation_necessary[Why is aggregation necessary?] for an explanation. +==== + + +image::k8s/metric-aggregation-differences-30m.png[The chart for this metric binding for the last 30m, there are only a few lines in the chart visible because most time series are on top of each other] +image::k8s/metric-aggregation-differences-24h.png[The same chart, same component and same end time, but now for the last 24h. 
It shows sometimes completely different results for the different aggregations]
+
+== Why is aggregation necessary?
+
+First of all, why should you use an aggregation? It doesn't make sense to retrieve more data points from the metric store than fit in the chart. Therefore {stackstate-product-name} automatically determines the step needed between two data points to get a good result. For short time windows (for example a chart showing only one hour of data) this results in a small step (around 10 seconds). Metrics are often only collected every 30 seconds, so for 10 second steps the same value will repeat for three steps before changing to the next value. Zooming out to a one week time window will require a much bigger step (around one hour, depending on the exact size of the chart on screen).
+
+When the steps become larger than the resolution of the collected data points, a decision needs to be made on how to summarize the data points of the one hour time range into a single value. When an aggregation over time is already specified in the query, it will be used to do that. However, if no aggregation is specified, or when the aggregation interval is smaller than the step, the `last_over_time` aggregation is used, with the `step` size as the interval. The result is that only the last data point for each hour is used to "summarize" all the data points in that hour.
+
+To summarize, when executing a PromQL query for a time range of one week with a step of one hour, this query:
+
+----
+container_cpu_usage /1000000000
+----
+
+is automatically converted to:
+
+----
+last_over_time(container_cpu_usage[1h]) /1000000000
+----
+
+Try it for yourself on the https://observability.suse.com/#/metrics?alias=Pod%20%24%7Bpod_name%7D&promql=last_over_time%28container_cpu_usage%7Bnamespace%3D%22sock-shop%22%2Cpod_name%3D~%22carts.%2A%22%7D%5B1h%5D%29%20%2F1000000000&timeRange=LAST_7_DAYS[{stackstate-product-name} playground]. 
+
+image::k8s/k8s-metric-queries-for-chart-last-over-time.png[Last over time]
+image::k8s/k8s-metric-queries-for-chart-max-over-time-fixed-range.png[Max over time with fixed range]
+image::k8s/k8s-metric-queries-for-chart-max-over-time-interval.png[Max over time with automatic range]
+
+Often this behavior isn't intended, and it's better to decide for yourself what kind of aggregation is needed. Using different aggregation functions, it's possible to emphasize certain behavior (at the cost of hiding other behavior). Is it more important to see peaks, troughs, a smooth chart, etc.? Then use the `+${__interval}+` parameter for the range, as it's automatically replaced with the `step` size used for the query. The result is that all the data points in the step are used.
+
+image::k8s/k8s-metric-queries-small-range.png[A fixed range, shorter than the data resolution]
+image::k8s/k8s-metric-queries-interval-for-range.png[Automatic range, based on step but with a lower limit]
+
+The `+${__interval}+` parameter prevents another issue. If the `step` size, and therefore the `+${__interval}+` value, would shrink to a smaller size than the resolution of the stored metric data, this would result in gaps in the chart.
+
+Therefore `+${__interval}+` will never shrink below 2 times the default scrape interval (30 seconds) of the {stackstate-product-name} agent.
+
+Finally, the `rate()` function requires at least two data points to be in the interval to calculate a rate at all. With fewer than two data points the rate won't have a value. Therefore `+${__rate_interval}+` is guaranteed to always be at least 4 times the scrape interval. This guarantees no unexpected gaps or other strange behavior in rate charts, unless data is missing. 
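+
+For example, a chart could combine the example CPU metric with a dynamic interval and a network counter with the rate interval:
+
+----
+sum(max_over_time(container_cpu_usage[${__interval}])) by (pod_name)
+
+sum(rate(container_network_receive_bytes_total[${__rate_interval}])) by (pod_name)
+----
+
+The first query follows the `+${__interval}+` guideline for gauge-like metrics, the second the `+${__rate_interval}+` guideline for counters. Note that `container_network_receive_bytes_total` is only an illustrative metric name here; substitute a counter metric that exists in your environment.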
+ +There are some excellent blog posts on the internet that explain this in more detail: + +* https://www.robustperception.io/step-and-query_range/[Step and query range] +* https://www.robustperception.io/what-range-should-i-use-with-rate/[What range should I use with rate()?] +* https://grafana.com/blog/2020/09/28/new-in-grafana-7.2-%5F%5Frate_interval-for-prometheus-rate-queries-that-just-work/[Introduction of __rate_interval in Grafana] + +== See also + +Some more resources on understanding PromQL queries: + +* https://promlabs.com/blog/2020/06/18/the-anatomy-of-a-promql-query/[Anatomy of a PromQL Query] +* https://promlabs.com/blog/2020/07/02/selecting-data-in-promql/[Selecting Data in PromQL] +* https://iximiuz.com/en/posts/prometheus-vector-matching/[How to join multiple metrics] +* https://iximiuz.com/en/posts/prometheus-functions-agg-over-time/[Aggregation over time] diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/cli.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/cli.adoc new file mode 100644 index 00000000..053b1187 --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/cli.adoc @@ -0,0 +1,201 @@ += Using the CLI for monitors +:revdate: 2026-01-14 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +You can use the {stackstate-product-name} CLI to inspect, run and modify monitors: + +[source,bash] +---- +sts monitor +Manage, test and develop monitors. + +Usage: + sts monitor [command] + +Available Commands: + apply Create or edit a monitor from STY + clone Clone a monitor + delete Delete a monitor + describe Describe a monitor in STY format + disable Disable a monitor + edit Edit a monitor + enable Enable a monitor + list List all monitors + run Run a monitor + status Get the status of a monitor + +Use "sts monitor [command] --help" for more information about a command. 
+---- + +== Debugging Monitors + +You can use various {stackstate-product-name} CLI commands to debug a monitor. + +=== Listing Monitors + +The `sts monitor` command can list all monitors: + +[,bash] +---- +sts monitor list +ID | STATUS | IDENTIFIER | NAME | FUNCTION ID | TAGS +68172545000282 | ENABLED | urn:stackpack:aad-v2:shared:monitor:aad-http- | AAD: HTTP 4xx error rate (req/s) | 208329585629344 | [] + | | 4xx-error-rate | | | +129505918833814 | DISABLED | urn:stackpack:aad-v2:shared:monitor:aad-http- | AAD: HTTP 4xx response time (s) (95th percent | 208329585629344 | [] + | | 4xx-response-time-95th-percentile | ile) | | +214616421668585 | ENABLED | urn:stackpack:aad-v2:shared:monitor:aad-http- | AAD: HTTP 5xx error rate (req/s) | 208329585629344 | [] + | | 5xx-error-rate | | | +---- + +=== Describing Monitors + +You can get the definition of an existing monitor by using the `describe` command: + +[,bash] +---- +sts monitor describe --id 68172545000282 +_version: 1.0.93 +nodes: +- _type: Monitor + arguments: + telemetryQuery: sum(rate(podparty_http_requests_count{code='4xx', direction='incoming', intra_pod!='true', local_pod_ns='${tags.namespace}', __multi__="${properties.local_pod_metric_selector__+}"}[${__rate_interval}])) + topologyQuery: (label = "stackpack:kubernetes" and type = "service") + description: Consumes health states from the AAD. + function: urn:stackpack:aad-v2:shared:monitor-function:aad + id: -6 + identifier: urn:stackpack:aad-v2:shared:monitor:aad-http-4xx-error-rate + intervalSeconds: 60 + name: 'AAD: HTTP 4xx error rate (req/s)' + remediationHint: It's complicated. 
+ status: ENABLED + tags: [] +timestamp: 2026-01-14T11:30:40.470998684Z[Etc/UTC] +---- + +=== Running Monitors + +Running a monitor gives you insight in the checkstates that are produced: + +[,bash] +---- +sts monitor run --id +appliedLimit: 100 +checkStates: + CheckStates: + checkStates: + - checkStateId: 91116839897294-preprod%1dev.preprod.stackstate.io-stackstate%1nightly-suse%1observability%1clickhouse%1shard0 + data: '{"displayTimeSeries":[{"name":"Metric and threshold","queries":[{"query":"sum by (cluster_name, namespace, statefulset) (increase(stackstate_clickhouse_backup_successful_uploads{kube_app_name=\"clickhouse\", cluster_name=\"preprod-dev.preprod.stackstate.io\", namespace=\"stackstate-nightly\", statefulset=\"suse-observability-clickhouse-shard0\"}[12h])) or sum by (cluster_name, namespace, statefulset) (stackstate_clickhouse_backup_number_backups_remote_expected{kube_app_name=\"clickhouse\", cluster_name=\"preprod-dev.preprod.stackstate.io\", namespace=\"stackstate-nightly\", statefulset=\"suse-observability-clickhouse-shard0\"}) * 0","alias":"Number of backups execution in 12h window","_type":"MonitorDisplayQuery"},{"query":"0.0","alias":"Threshold","_type":"MonitorDisplayQuery"}],"unit":"short","_type":"MonitorDisplayTimeSeries"}],"remediationHintTemplateData":{"componentUrnForUrl":"urn:kubernetes:%2Fpreprod-dev.preprod.stackstate.io:stackstate-nightly:statefulset%2Fsuse-observability-clickhouse-shard0","labels":{"cluster_name":"preprod-dev.preprod.stackstate.io","namespace":"stackstate-nightly","statefulset":"suse-observability-clickhouse-shard0"},"threshold":0.0},"_type":"MonitorSyncedCheckStateData"}' + health: CLEAR + name: Backup performed in the last 12 hours + topologyElementIdentifier: urn:kubernetes:/preprod-dev.preprod.stackstate.io:stackstate-nightly:statefulset/suse-observability-clickhouse-shard0 +... +---- + + +=== Enabling/disabling Monitors + +A monitor can be enabled or disabled. 
Enabled means the monitor will produce results, disabled means it will suppress all output. Use the following commands to enable/disable: + +[,bash] +---- +sts monitor disable --id 68172545000282 +✅ Monitor 68172545000282 has been disabled + +sts monitor enable --id 68172545000282 +✅ Monitor 68172545000282 has been enabled +---- + + +=== Monitor execution details + +Statistics of monitor runs and the latency in the processing of the resulting checkstates can be obtained with the `status` command. + +[,bash] +---- +sts monitor status --id 68172545000282 + +Monitor Health State count: 650 +Monitor Status: ENABLED +Monitor last run: 2026-01-14 12:16:25.979 +0000 UTC + +Monitor Stream errors: +No data to display. + +Monitor health states mapped to topology: +HEALTHSTATE | COUNT +CLEAR | 522 +DEVIATING | 0 +CRITICAL | 0 +UNKNOWN | 128 + +Monitor Stream metrics: +METRIC | VALUE BETWEEN NOW AND 300 SECONDS AGO | VALUE BETWEEN 300 AND 600 SECONDS AGO | VALUE BETWEEN 600 AND 900 SECONDS AGO +latency (Seconds) | 42.345412844036574 | 44.53073394495415 | 46.725688073394615 +messages processed (per second) | 10.833333333333 | 10.833333333333 | 10.833333333333 +monitor health states created (per second) | | | +monitor health states updated (per second) | 0.0033333333333333 | 0.0033333333333333 | 0.02 +monitor health states deleted (per second) | | | + +Monitor health states with identifier matching exactly 1 topology element: 650 +---- + +== Modifying Monitors + +[NOTE] +==== +The recommended way of working is to store monitors (and any other custom resources created in {stackstate-product-name}) as YAML files in xref:/setup/custom-integrations/overview.adoc[a StackPack]. From there changes can be manually applied or it can be fully automated by using the SUSE Observability CLI in a CI/CD system like GitHub actions or GitLab pipelines. +==== + +=== Create a Monitor + +You can create a monitor by applying a YAML file, say `monitor.yaml`. 
It must have the following outline:
+
+[source,yaml]
+----
+nodes:
+- _type: "Monitor"
+  ...
+----
+
+[NOTE]
+====
+Note the entrypoint `nodes:` on the first line. That's not needed for files in a StackPack, but it must be there when using the `apply` command.
+====
+
+Use the xref:/setup/cli/cli-sts.adoc[SUSE Observability CLI] to create or update the monitor:
+
+[,bash]
+----
+sts monitor apply -f monitor.yaml
+----
+
+You can xref:/setup/custom-integrations/monitors/index.adoc#_verifying_the_results_of_a_monitor[verify whether the monitor produces the expected results] on the monitor overview page.
+
+[CAUTION]
+====
+The identifier is used as the unique key of a monitor. Changing the identifier will create a new monitor instead of updating the existing one.
+====
+
+=== Delete a Monitor
+
+To delete a monitor, use:
+
+[,bash]
+----
+sts monitor delete --id
+----
+
+=== Live Edit a Monitor
+
+To edit a monitor, edit the original YAML file that was applied and apply it again. Alternatively, the `sts monitor edit` command edits the configured monitor in the {stackstate-product-name} instance directly:
+
+[,bash]
+----
+sts monitor edit --id
+----
+
+The `` in this command isn't the identifier but the number in the `Id` column of the `sts monitor list` output. 
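+
+As mentioned above, applying monitors from files can be automated in a CI/CD system. A minimal sketch of such a step (a hypothetical GitHub Actions fragment; checking out the repository and installing and configuring the CLI are assumed to happen in earlier steps):
+
+----
+# hypothetical pipeline step
+- name: Apply monitors
+  run: sts monitor apply -f monitor.yaml
+----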
+ diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/derived-state-monitors.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/derived-state-monitors.adoc new file mode 100644 index 00000000..3076a5dd --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/derived-state-monitors.adoc @@ -0,0 +1,41 @@ += Derived State Monitors +:revdate: 2025-07-10 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +In Observability scenarios where logical (business) components lack direct monitors but are affected by issues in their technical dependencies, you can use the derived-state-monitor function to derive a state from the connected technical components for the logical component. +This monitor traverses component dependencies and selects the most critical health state based on direct observations (e.g., from metrics), ignoring any already-derived states. It will apply the derived state to all components selected through the `componentTypes` parameter. +During traversal, only components with observed (non-derived) health states are considered for health derivation. Components with derived states are skipped in evaluation but still traversed to reach deeper dependencies--for example, logical components depending on other logical components. 
+
+== Derived Health State Monitor example
+
+A Monitor implemented using the `derived-state-monitor` function looks like:
+
+----
+ - _type: "Monitor"
+   name: "Aggregated health state of a Deployment, StatefulSet, ReplicaSet and DaemonSet"
+   tags:
+   - deployments
+   - replicasets
+   - statefulsets
+   - daemonsets
+   - derived
+   - propagated
+   identifier: "urn:stackpack:my-stackpack:monitor:my-aggregated-health-monitor"
+   status: "DISABLED"
+   description: "Description"
+   function: {{ get "urn:stackpack:common:monitor-function:derived-state-monitor" }}
+   arguments:
+     componentTypes: "deployment, replicaset, statefulset, daemonset"
+   intervalSeconds: 30
+   remediationHint: "Investigate component [{{ causeComponentName }}](/#/components/{{ causeComponentUrnForUrl }}) as it is causing the workload to be unhealthy."
+----
+
+* The function has a single argument `componentTypes`, where you can express the different component types as a single string of comma-separated values.
+* The function offers three values to use in the remediation guide:
+** `componentName` being the name of the logical component.
+** `causeComponentName` being the component name where the state is propagated from, and its `causeComponentUrnForUrl` to be able to create a link. 
+
+The monitor can be implemented using the guide at xref:/setup/custom-integrations/monitors/index.adoc[Add a monitor to components]
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/dynamic-threshold-monitors.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/dynamic-threshold-monitors.adoc
new file mode 100644
index 00000000..48703644
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/dynamic-threshold-monitors.adoc
@@ -0,0 +1,50 @@
+= Dynamic Threshold Monitors
+:revdate: 2025-07-10
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Overview
+
+For metrics that vary significantly over time and differ from service to service, a Dynamic Threshold monitor provides simple and performant anomaly detection. It uses data from 1, 2 or 3 weeks ago, in addition to the recent past, as context to compare current data to.
+
+Data from the "check window" is compared to that provided by the historic context using the Anderson-Darling test. This imposes very few assumptions on the data distribution. The test is particularly sensitive to outliers on the upper and lower ends of the distribution. The metric can be smooth, spiky or have a couple of "levels" - as data values are compared directly, without any model fitting, the Dynamic Threshold monitor is very robust.
+
+For metrics that vary smoothly over time (e.g. on a timescale of five minutes), the effective number of data points is smaller than the raw number. The Dynamic Threshold monitor compensates for this, so the same monitor can be used for a wide range of metrics without the need for adjusting its parameters.
+
+There are a couple of parameters that can be set for the monitor function:
+
+* `falsePositiveRate`: say `!!float 1e-8` - the sensitivity of the monitor to deviating behavior. A lower value suppresses more (false) positives but may also lead to false negatives (unnoticed anomalies). 
+* `checkWindowMinutes`: say `10` minutes - the check window needs to be balanced between quick alerting (small values) and correctly identified anomalies (high values). A handful of data points works well in practice.
+* `historicWindowMinutes`: say `120` (2 hours) - a window bracketed around the current time, but one or more weeks ago: for a value of `120` it runs from one hour before to one hour after the current time of day. The two hours directly before the check window are also used. The Dynamic Threshold monitor compares the distribution of this historic data with the data points in the check window.
+* `historySizeWeeks`: say `2` - the number of weeks that data is taken from for historic context. Can be `1`, `2` or `3`.
+* `removeTrend`: for metrics that have trend behavior (say, number of requests), such that the absolute value differs from week to week, this trend (the average value) can be accounted for.
+* `includePreviousDay`: typically `false` - for metrics that do not have a weekly but only a daily pattern, this allows the use of more recent data.
+
+== Dynamic Threshold Monitor example
+
+A Monitor implemented using the Dynamic Threshold function looks like:
+
+----
+ - _type: "Monitor"
+   name: ""
+   identifier: "urn:stackpack::monitor:"
+   status: "DISABLED"
+   description: ""
+   function: {{ get "urn:stackpack:aad-v2:shared:monitor-function:dt" }}
+   arguments:
+     telemetryQuery:
+       query: ""
+       unit: s
+       aliasTemplate: ""
+     topologyQuery:
+     falsePositiveRate:
+     checkWindowMinutes:
+     historicWindowMinutes:
+     historySizeWeeks:
+     includePreviousDay:
+     removeTrend:
+   intervalSeconds: 60
+   remediationHint: ""
+----
+
+The monitor can be implemented using the guide at xref:/setup/custom-integrations/monitors/index.adoc[Add a monitor to components]
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/index.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/index.adoc
new file mode 100644
index 00000000..7b0138ea
--- /dev/null
+++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/index.adoc
@@ -0,0 +1,120 @@
+= Add a monitor to components
+:revdate: 2025-07-10
+:page-revdate: {revdate}
+:description: SUSE Observability
+
+== Overview
+
+{stackstate-product-name} provides xref:/use/alerting/k8s-monitors.adoc[monitors out of the box], which cover common issues that can occur in a Kubernetes cluster. It's also possible to configure custom monitors for the metrics collected by {stackstate-product-name} or application metrics ingested xref:/use/metrics/k8s-prometheus-remote-write.adoc[from Prometheus].
+
+== Creating a monitor
+
+Steps to create a monitor:
+
+. <<_write_the_outline_of_the_monitor,Write the outline of the monitor>>
+. <<_bind_the_results_of_the_monitor_to_the_correct_components,Bind the results of the monitor to the correct component>>
+. <<_write_the_remediation_hint,Write the remediation hint>>
+
+As an example, these steps add a monitor for the replica count of Kubernetes deployments.
+
+=== Write the outline of the monitor
+
+Open the `monitors.yaml` YAML file of your StackPack in your favorite code editor to change it throughout this guide. You can use the CLI to xref:/setup/custom-integrations/develop.adoc#_test_your_stackpack[Test Your StackPack].
+
+For example, this could be the start for a monitor which monitors the available replicas of a deployment:
+
+----
+- _type: Monitor
+  arguments:
+    metric:
+      query: "kubernetes_state_deployment_replicas_available"
+      unit: "short"
+      aliasTemplate: "Deployment replicas"
+    comparator: "LTE"
+    threshold: 0.0
+    failureState: "DEVIATING"
+    urnTemplate:
+  description: "Monitor whether a deployment has replicas."
+ function: {{ get "urn:stackpack:kubernetes-v2:shared:monitor-function:threshold" }} + identifier: urn:stackpack:my-stackpack:monitor:deployment-has-replicas + intervalSeconds: 30 + name: Deployment has replicas + remediationHint: + status: "ENABLED" + tags: + - "deployments" +---- + +The `urnTemplate` and `remediationHint` will be filled in the next steps. + +=== Bind the results of the monitor to the correct components + +The results of a monitor need to be bound to components in {stackstate-product-name}, to be visible and usable. The result of a monitor is bound to a component using the component `identifiers`. Each component in {stackstate-product-name} has one or more identifiers that uniquely identify the component. To bind a result of a monitor to a component, it's required to provide the `urnTemplate`. The `urnTemplate` substitutes the labels in the time series of the monitor result into the template, producing an identifier matching a component. This is best illustrated with the example: + +The metric that's used in this example is the `kubernetes_state_deployment_replicas_available` metric. Run the metric in the metric explorer to observe what labels are available on the time series: + +image::k8s/available-replicas-metric-inspector.png[The available replicas in the metric explorer] + +In the above table it's shown the metric has labels like `cluster_name`, `namespace` and `deployment`. + +Because the metric is observed on deployments, it's most logical to bind the monitor results to deployment components. To do this, it's required to understand how the identifiers for deployments are constructed: + +. In the UI, navigate to the `deployments` view and select a single deployment. +. Open the `Topology` view, and click the deployment component. +. 
When expanding the `Properties` in the right panel of the screen, the identifiers will show after hovering as shown below: + +image::k8s/component-identifier.png[Finding a component identifier] + +The identifier is shown as `urn:kubernetes:/preprod-dev.preprod.stackstate.io:calico-system:deployment/calico-typha`. This shows that the identifier is constructed based on the cluster name, namespace and deployment name. Knowing this, it's now possible to construct the `urnTemplate`: + +---- + ... + urnTemplate: "urn:kubernetes:/${cluster_name}:${namespace}:deployment/${deployment}" + ... +---- + +<<_verifying_the_results_of_a_monitor,To verify>> whether the `urnTemplate` is correct, is explained further below. + +=== Write the remediation hint + +The remediation hint is there to help users find the cause of an issue when a monitor fires. The remediation hint is written in https://en.wikipedia.org/wiki/Markdown[markdown]. It's also possible to use the labels that are on the time series of the monitor result using a handlebars template, as in the following example: + +---- + ... + remediationHint: |- + To remedy this issue with the deployment {{ labels.deployment }}, consider taking the following steps: + + 1. Look at the logs of the pods created by the deployment + ... +---- + +To offer a remediation experience that conforms to that offered by the standard {stackstate-product-name} monitors, follow the xref:/setup/custom-integrations/monitors/remediation-guide.adoc#_guidelines[Remediation guide guidelines]. + +== Testing the monitor + +After you have made a monitor, validate whether it produces the expected results. The following steps can be taken: + +. <<_create_or_update_the_monitor_in_suse_observability,Create or update the monitor in {stackstate-product-name}>> +. 
<<_verifying_the_results_of_a_monitor,Verify that the monitor produces the expected result>> + +=== Create or update the monitor in {stackstate-product-name} + +- Use the xref:/setup/custom-integrations/develop.adoc#_test_your_stackpack[Test Your StackPack] command to deploy the StackPack. ++ +[source,bash] +---- +sts stackpack test -d ./my-stackpack --yes +---- + +=== Verifying the results of a monitor + +==== Verify the execution of the monitor + +Go to the monitor overview page (http://your-instance/#/monitors) and find your monitor. + +. Verify the `Status` column is in the `Enabled` state. If the monitor is in the `Disabled` state, xref:/setup/custom-integrations/monitors/cli.adoc#_enable_or_disable_the_monitor[enable it]. If the status is in the `Error` state, you can xref:/setup/custom-integrations/monitors/troubleshooting.adoc#_the_monitor_is_showing_an_error_in_the_monitor_status_overview[troubleshoot the error] using the CLI. +. Verify you see the expected number of states in the `Clear`/`Deviating`/`Critical` column. If this number is significantly lower or higher than the number of components you meant to monitor, the PromQL query might be returning too few or too many results. + +==== Verify the binding of the monitor + +Observe whether the monitor is producing a result on one of the components that it's meant to monitor. If the monitor doesn't show up, follow xref:/setup/custom-integrations/monitors/troubleshooting.adoc#_the_result_of_the_monitor_isnt_showing_on_a_component[these steps] to remedy this. 
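+ +Putting the fragments from the previous sections together, a complete Threshold monitor for the deployment replicas example could look as follows. This is a sketch: the query, unit, comparator, threshold, alias and title values are illustrative and should be adapted to your own StackPack and metric. + +[,yaml] +---- +_type: "Monitor" +name: Deployment has replicas +identifier: urn:stackpack:my-stackpack:monitor:deployment-has-replicas +function: {{ get "urn:stackpack:kubernetes-v2:shared:monitor-function:threshold" }} +intervalSeconds: 30 +status: "ENABLED" +tags: + - "deployments" +arguments: + metric: + query: "kubernetes_state_deployment_replicas_available" + unit: short + aliasTemplate: "Available replicas for ${deployment}" + comparator: LT + threshold: 1 + failureState: CRITICAL + urnTemplate: "urn:kubernetes:/${cluster_name}:${namespace}:deployment/${deployment}" + titleTemplate: "Deployment ${deployment} has replicas" +remediationHint: |- + To remedy this issue with the deployment {{ labels.deployment }}, consider taking the following steps: + + 1. Look at the logs of the pods created by the deployment +---- + +With `comparator: LT` and `threshold: 1`, time series whose value drops below 1 produce the `CRITICAL` state, and the result is bound to the deployment component via the `urnTemplate`. 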
diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/remediation-guide.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/remediation-guide.adoc new file mode 100644 index 00000000..3c7bbad1 --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/remediation-guide.adoc @@ -0,0 +1,99 @@ += Write a remediation guide to help users troubleshoot issues +:revdate: 2025-07-10 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +{stackstate-product-name} provides xref:/use/alerting/k8s-monitors.adoc[monitors out of the box], which monitor common issues that can occur in a Kubernetes cluster. These monitors also contain out-of-the-box remediation guides that are meant to guide users in accurately troubleshooting issues. They are created using best practices and community knowledge. Follow the guidelines on this page to learn how to write an effective remediation guide yourself. + +== Guidelines + +* Provide step-by-step instructions that guide a user through solving the issue detected by the monitor. +* Make sure the instructions are ordered by the most likely cause(s). +* If possible, include links to relevant data and/or resources to speed up the investigation. +* Keep it short and to the point: + ** Avoid over-explaining; link to supporting documentation instead. + ** Avoid using a table of contents and similar content blocks. + ** Avoid having a summary of the same content. +* Try to format the guide in a structured way. Use: + ** bullet points + ** numbering + ** short sentences + ** paragraphs + ** inline formatted examples +* If there are open ends (there might be different causes that are still unknown), provide guidance for escalating the issue, for example a support link or phone number. 
+ +== Remediation guide example + +---- +When a Kubernetes container has errors, it can enter a state called CrashLoopBackOff, where Kubernetes attempts to restart the container to resolve the issue. The container will continue to restart until the problem is resolved. Take the following steps to diagnose the problem: + +### Pod Events + +Check the pod events to identify any explicit errors or warnings. +1. Go to the "Events" section in the middle of the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}) +2. Check if there are events like "BackOff", "FailedScheduling", "FailedAttachVolume" or "OOMKilled" in the Alert Category by clicking on 'Alerts'. +3. You can view the details of an event (click on the event) to get more information about the issue. +4. If the 'Show related event' option is enabled, all events of resources related to this resource, like a deployment, will also show up and can give you a clue whether a change on them is causing this issue. You can see this by checking if there is a correlation between the time of a deployment and a change of behaviour seen in the metrics and events of this pod. +For easy correlation you can use 'shift'-'click' to add markers to the different graph, log and event widgets. + +### Container Logs +Check the container logs for any explicit errors or warnings. +Inspect the [Logs](/#/components/\{{ componentUrnForUrl \}}#logs) of all the containers in this pod. +Search for hints in the logs by: +1. Looking for changes in the logging pattern, by looking at the number of logs per time unit (the histogram bars). + In many cases the change in pattern will indicate what is going on. + You can click-drag on the histogram bars to narrow the logs displayed to that time-frame. +2. Searching for "Error" or "Fatal" in the search bar. +3. 
Looking at the logs around the time that the monitor triggered. + +### Recent Changes +Look at the pod age in the "About" section on the [Pod highlight page](/#/components/\{{ componentUrnForUrl \}}) to identify any recent deployments that might have caused the issue: +1. The "Age" is shown in the "About" section on the left side of the screen. +2. If the "Age" and the time that the monitor was triggered are in close proximity, then take a look at the most recent deployment by clicking on [Show last change](/#/components/\{{ componentUrnForUrl \}}#lastChange). +---- + +== Inserting links + +The syntax we use is different for "deep links" vs "in-page links". The "deep links" will redirect the user away from the current page, whilst the "in-page links" will keep the user on the same page. + +=== Deep links + +To link to any perspective (e.g. "highlights", "topology", "events", "metrics") of the current resource, use the following syntax: + +---- +[highlight page](/#/components/\{{ componentUrnForUrl \}}) +---- + +---- +[topology](/#/components/\{{ componentUrnForUrl \}}/topology) +---- + +---- +[events](/#/components/\{{ componentUrnForUrl \}}/events) +---- + +---- +[metrics](/#/components/\{{ componentUrnForUrl \}}/metrics) +---- + +=== In-page links + +To link to any additional data (e.g. 
"show logs", "show last change", "show status", "show configuration") on the current resource, use the following syntax: + +---- +[logs](/#/components/\{{ componentUrnForUrl \}}#logs) +---- + +---- +[last change](/#/components/\{{ componentUrnForUrl \}}#lastChange) +---- + +---- +[status](/#/components/\{{ componentUrnForUrl \}}#status) +---- + +---- +[configuration](/#/components/\{{ componentUrnForUrl \}}#configuration) +---- diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/schemas-ref.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/schemas-ref.adoc new file mode 100644 index 00000000..03e514d3 --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/schemas-ref.adoc @@ -0,0 +1,150 @@ +ifdef::ss-ff-stackpacks2_enabled[] += Schema and reference for Monitors +:revdate: 2026-01-09 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +This page describes the schemas for a `Monitor`, along with detailed explanations of constructs, expression syntax and semantics. + +== Schema for Monitor + +A monitor specifies how often a monitor function should run and with which arguments. + +[,yaml] +---- +_type: "Monitor" +name: string +description?: string +function: string +arguments: + +intervalSeconds: integer +remediationHint?: string +status?: "ENABLED" | "DISABLED" # defaults to "DISABLED" +tags: + : +identifier?: string +---- + +* `_type`: {stackstate-product-name} needs to know this is a monitor, so the value always needs to be `Monitor`. +* `name`: The name of the monitor. +* `description`: A description of the monitor. +* `function`: A reference to the monitor function that will execute the monitor. +* `intervalSeconds`: The interval at which the monitor executes. For regular real-time metrics, 30 seconds is advised. For longer-running analytical metric queries, a larger interval is recommended. +* `remediationHint`: A description of what the user can do when the monitor fails. 
The format is Markdown, optionally using handlebars variables to customize the hint based on time series or other data (xref:/setup/custom-integrations/monitors/index.adoc#_write_the_remediation_hint[more explanation below]). +* `status`: Either `"DISABLED"` or `"ENABLED"`. Determines whether the monitor will run or not. +* `tags`: Add tags to the monitor to help organize them in the monitors overview of your {stackstate-product-name} instance, http://your-instance/#/monitors. +* `identifier`: An identifier of the form `+urn:stackpack::monitor:....+` which uniquely identifies the monitor when updating its configuration. + +== Monitor Functions + +=== Threshold + +Triggers a health state when a given threshold is exceeded for a specified metric query. Different thresholds can be set on particular resources with the xref:/use/alerting/k8s-override-monitor-arguments.adoc[help of annotations]. + +[,yaml] +---- +function: {{ get "urn:stackpack:common:monitor-function:threshold" }} +arguments: + metric: + query: string + unit: string + aliasTemplate: string + comparator: GTE | GT | LTE | LT # how to compare metric value to threshold + threshold: double + failureState: CRITICAL | DEVIATING | UNKNOWN + urnTemplate: string + titleTemplate: string +---- + +* `query`: A PromQL query. Use the xref:/use/metrics/k8sTs-explore-metrics.adoc[metric explorer] of your {stackstate-product-name} instance, http://your-instance/#/metrics, to construct a query for the metric of interest. +* `unit`: The unit of the values in the time series returned by the query or queries, used to render the Y-axis of the chart. See the xref:/develop/reference/k8sTs-chart-units.adoc[supported units] reference for all units. +* `aliasTemplate`: An alias for time series in the metric chart. This is a template that can substitute labels from the time series using the `+${my_label}+` placeholder. +* `comparator`: Choose one of LTE/LT/GTE/GT to compare the threshold against the metric. 
Time series for which the comparison holds true will produce the failure state. +* `threshold`: A numeric threshold to compare against. +* `failureState`: Either "CRITICAL" or "DEVIATING". "CRITICAL" will show as red in {stackstate-product-name} and "DEVIATING" as orange, to denote different severity. +* `urnTemplate`: A template to construct the urn of the component a result of the monitor will be xref:/setup/custom-integrations/monitors/index.adoc#_bind_the_results_of_the_monitor_to_the_correct_components[bound to]. +* `titleTemplate`: A title for the result of a monitor. Because multiple monitor results can bind to the same component, it's possible to substitute time series labels using the `+${my_label}+` placeholder. + +=== Derived State + +Derives the state of a component from the health states of its dependencies, which are based on observations. It produces the most critical state of the top-most dependencies. +For details, see the xref:/setup/custom-integrations/monitors/derived-state-monitors.adoc[Derived State Monitors] page. + +[,yaml] +---- +function: {{ get "urn:stackpack:common:monitor-function:derived-state-monitor" }} +arguments: + componentTypes: string +---- + +* `componentTypes`: The component types that contribute to derived states. Specified as a single string of comma-separated values. + +=== Topological Threshold + +Triggers a health state when a given threshold is exceeded for a specified metric query. The metric query can reference the name, tags and properties of the components returned by the topology query. Different thresholds can be set on particular resources with the xref:/use/alerting/k8s-override-monitor-arguments.adoc[help of annotations]. 
+ +[,yaml] +---- +function: {{ get "urn:stackpack:common:monitor-function:topological-threshold" }} +arguments: + queries: + topologyQuery: string + promqlQuery: string + aliasTemplate: string + unit: string + comparator: GTE | GT | LTE | LT # how to compare metric value to threshold + threshold: double + failureState: CRITICAL | DEVIATING | UNKNOWN + titleTemplate: string +---- + +* `queries`: The queries to execute. +** `topologyQuery`: STQL query to select components. +** `promqlQuery`: PromQL query that can use labels and properties of components to select time series. +** `unit`: The unit of the values in the time series returned by the query or queries, used to render the Y-axis of the chart. See the xref:/develop/reference/k8sTs-chart-units.adoc[supported units] reference for all units. +** `aliasTemplate`: An alias for time series in the metric chart. This is a template that can substitute labels from the time series using the `+${my_label}+` placeholder. +* `comparator`: Choose one of LTE/LT/GTE/GT to compare the threshold against the metric. Time series for which the comparison holds true will produce the failure state. +* `threshold`: A numeric threshold to compare against. +* `failureState`: Either "CRITICAL" or "DEVIATING". "CRITICAL" will show as red in {stackstate-product-name} and "DEVIATING" as orange, to denote different severity. +* `titleTemplate`: A title for the result of a monitor. Because multiple monitor results can bind to the same component, it's possible to substitute time series labels using the `+${my_label}+` placeholder. + +=== Dynamic Threshold + +Alerts when the current value is outside the predicted baseline interval, which is dynamically calculated based on historical data, taking into account weekly and daily seasonal patterns. +This monitor function is only available when the Autonomous Anomaly Detector StackPack is installed. 
+ +For details, see the xref:/setup/custom-integrations/monitors/dynamic-threshold-monitors.adoc[Dynamic Threshold Monitors] page. + +[,yaml] +---- +function: {{ get "urn:stackpack:aad-v2:shared:monitor-function:dt" }} +arguments: + telemetryQuery: + query: string + unit: string + aliasTemplate: string + topologyQuery: string + falsePositiveRate: float + checkWindowMinutes: integer + historicWindowMinutes: integer + historySizeWeeks: 1 | 2 | 3 (integer) + includePreviousDay: boolean + removeTrend: boolean +---- + +* `telemetryQuery`: The telemetry to evaluate. +** `query`: PromQL query that is used for baselining and anomaly detection. +** `unit`: The unit of the values in the time series returned by the query or queries, used to render the Y-axis of the chart. See the xref:/develop/reference/k8sTs-chart-units.adoc[supported units] reference for all units. +** `aliasTemplate`: An alias for time series in the metric chart. This is a template that can substitute labels from the time series using the `+${my_label}+` placeholder. +* `topologyQuery`: STQL query to select components. +* `falsePositiveRate`: For example `!!float 1e-8`; the sensitivity of the monitor to deviating behavior. A lower value suppresses more (false) positives but may also lead to false negatives (unnoticed anomalies). +* `checkWindowMinutes`: For example `10` minutes; the check window needs to balance quick alerting (small values) against correctly identified anomalies (high values). A handful of data points works well in practice. +* `historicWindowMinutes`: For example `120` (2 hours); a window bracketed around the current time, but one or more weeks ago, so from 1 hour before the current time to 1 hour after. The 2 hours before the check window are also used. The dynamic threshold monitor compares the distribution of this historic data with the data points in the check window. +* `historySizeWeeks`: For example `2`; the number of weeks that data is taken from for historic context. Can be `1`, `2` or `3`. 
+* `removeTrend`: For metrics that have trend behavior (for example, the number of requests), such that the absolute value differs from week to week, this trend (the average value) can be accounted for. +* `includePreviousDay`: Typically `false`; for metrics that do not have a weekly but only a daily pattern, this allows the use of more recent data. + +endif::[] diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/monitors/troubleshooting.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/troubleshooting.adoc new file mode 100644 index 00000000..ab94027f --- /dev/null +++ b/docs/latest/modules/en/pages/setup/custom-integrations/monitors/troubleshooting.adoc @@ -0,0 +1,32 @@ += Troubleshooting monitors +:revdate: 2026-01-14 +:page-revdate: {revdate} +:description: SUSE Observability + +== Overview + +* <<_the_result_of_the_monitor_isnt_showing_on_a_component,The result of the monitor isn't showing on a component>> +* <<_the_monitor_is_showing_an_error_in_the_monitor_status_overview,The monitor is showing an error in the monitor status overview>> + +=== The result of the monitor isn't showing on a component + +First check if the monitor is actually xref:/setup/custom-integrations/monitors/index.adoc#_verify_the_execution_of_the_monitor[producing results]. If this is the case but the monitor results do not show up on the components, there might be a problem with the binding. Use the following command to verify: + +[,bash] +---- +sts monitor status --id +---- + +If the output has `+Monitor health states with identifier which has no matching topology element (): ....+`, this indicates that the `urnTemplate` may not generate an identifier matching the topology. To remedy this, xref:/setup/custom-integrations/monitors/index.adoc#_bind_the_results_of_the_monitor_to_the_correct_components[revisit your urnTemplate]. 
+ +=== The monitor is showing an error in the monitor status overview + +Get the status of the monitor through the CLI: + +[,bash] +---- +sts monitor status --id +---- + +The section `Monitor Stream errors:` will show the errors happening on the monitor and offer further help. + diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/schemas-ref.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/schemas-ref.adoc index 93b8ad29..707f671f 100644 --- a/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/schemas-ref.adoc +++ b/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/schemas-ref.adoc @@ -386,7 +386,7 @@ In regex-based mappings, capture groups are substituted into the target. * `` - needs to return a string; can be one of: ** string literal (e.g., "hello") ** string expression, wrapped in `${...}` (e.g., "${resource.attributes['service.name']}") -** string interpolation (e.g. "urn:opentelemetry:namespace/${resource.attributes['namespace']}:service/${resource.attributes['service.name']}") - note: for string interpolation, the entire expression is not wrapped in `${...}` +** string interpolation (e.g. 
"urn:opentelemetry:namespace/$\{resource.attributes['namespace']}:service/$\{resource.attributes['service.name']}") - note: for string interpolation, the entire expression is not wrapped in `$\{...}` * `cel-boolean` - needs to return a boolean; can be one of: ** boolean literal (e.g., "true") ** boolean expression (e.g., "'namespace' in resource.attributes") - note: boolean expressions are not wrapped in `${...}` diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/troubleshooting.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/troubleshooting.adoc index 4807ce11..6f07e5d4 100644 --- a/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/troubleshooting.adoc +++ b/docs/latest/modules/en/pages/setup/custom-integrations/otelmappings/troubleshooting.adoc @@ -60,7 +60,7 @@ OTel service instance provided-by service relation | urn:stackpack:open-telemetr === Show an open telemetry component or relation mappings status -The open telemetry component mapping status command returns the aggregated latency, throughput metrics ana amount of topology elements created. This is helpful when debugging why a particular part of the topology takes a long time to be synchronised. The output includes a section `Mapping errors` that will signal any issues occurring when applying the mapping rules to the open telemetry data. +The open telemetry component mapping status command returns the aggregated latency, throughput metrics and number of topology elements created. This is helpful when debugging why a particular part of the topology takes a long time to be synchronised. The output includes a section `Mapping errors` that will signal any issues occurring when applying the mapping rules to the open telemetry data. [,sh] ---- @@ -96,4 +96,4 @@ latency seconds | 43.404 | 43.404 | 39.978 Otel Relation Mapping Errors: No otel relation mapping errors found. 
---- -endif::ss-ff-stackpacks2_enabled[] \ No newline at end of file +endif::ss-ff-stackpacks2_enabled[] diff --git a/docs/latest/modules/en/pages/setup/custom-integrations/overview.adoc b/docs/latest/modules/en/pages/setup/custom-integrations/overview.adoc index c92c1350..ec66757e 100644 --- a/docs/latest/modules/en/pages/setup/custom-integrations/overview.adoc +++ b/docs/latest/modules/en/pages/setup/custom-integrations/overview.adoc @@ -56,4 +56,4 @@ helm upgrade --install \ ---- + -endif::ss-ff-stackpacks2_enabled[] \ No newline at end of file +endif::ss-ff-stackpacks2_enabled[] diff --git a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s-operator.adoc b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s-operator.adoc index 3077f262..400c785e 100644 --- a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s-operator.adoc +++ b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s-operator.adoc @@ -329,7 +329,13 @@ The Kubernetes Operator will inject these attributes by default into any telemet == Next steps +ifdef::ss-ff-stackpacks2_enabled[] +You can add new charts to components, for example, the service or service instance, for your application, by following xref:/setup/custom-integrations/metric-bindings/index.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +endif::[] + +ifndef::ss-ff-stackpacks2_enabled[] You can add new charts to components, for example the service or service instance, for your application, by following xref:/use/metrics/k8s-add-charts.adoc[our guide]. 
It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +endif::[] The operator, the `OpenTelemetryCollector`, and the `Instrumentation` custom resource, have more options that are documented in the https://github.com/open-telemetry/opentelemetry-operator[readme of the operator repository]. For example it is possible to install an optional https://github.com/open-telemetry/opentelemetry-operator?tab=readme-ov-file#_target_allocator[target allocator] via the `OpenTelemetryCollector` resource, it can be used to configure the Prometheus receiver of the collector. This is especially useful when you want to replace Prometheus operator and are using its `ServiceMonitor` and `PodMonitor` custom resources. diff --git a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s.adoc b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s.adoc index 21770c7b..2cff784e 100644 --- a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s.adoc +++ b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-k8s.adoc @@ -199,7 +199,13 @@ This can be achieved by a configuration such as above. There the `kubernetesAtt == Next steps +ifdef::ss-ff-stackpacks2_enabled[] +You can add new charts to components, for example, the service or service instance, for your application, by following xref:/setup/custom-integrations/metric-bindings/index.adoc[our guide]. It is also possible to create xref:/setup/custom-integrations/monitors/index.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. 
+endif::[] + +ifndef::ss-ff-stackpacks2_enabled[] You can add new charts to components, for example, the service or service instance, for your application, by following xref:/use/metrics/k8s-add-charts.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +endif::[] == More info diff --git a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-lambda.adoc b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-lambda.adoc index 50f9a99d..482f87dd 100644 --- a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-lambda.adoc +++ b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-lambda.adoc @@ -168,7 +168,13 @@ After a short while and if your Lambda function(s) are getting some traffic you == Next steps +ifdef::ss-ff-stackpacks2_enabled[] +You can add new charts to components, for example the service or service instance, for your application, by following xref:/setup/custom-integrations/metric-bindings/index.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +endif::[] + +ifndef::ss-ff-stackpacks2_enabled[] You can add new charts to components, for example the service or service instance, for your application, by following xref:/use/metrics/k8s-add-charts.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. 
+endif::[] == More info diff --git a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-linux.adoc b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-linux.adoc index d9db02fc..8e6b7e4b 100644 --- a/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-linux.adoc +++ b/docs/latest/modules/en/pages/setup/otel/getting-started/getting-started-linux.adoc @@ -199,7 +199,13 @@ After a short while and if your application is processing some traffic you shoul == Next steps -You can add new charts to components, for example the service or service instance, for your application, by following xref:/use/metrics/k8s-add-charts.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +ifdef::ss-ff-stackpacks2_enabled[] +You can add new charts to components, for example the service or service instance, for your application, by following xref:/setup/custom-integrations/metric-bindings/index.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. +endif::[] + +ifndef::ss-ff-stackpacks2_enabled[] +You can add new charts to components, for example, the service or service instance, for your application, by following xref:/use/metrics/k8s-add-charts.adoc[our guide]. It is also possible to create xref:/use/alerting/k8s-monitors.adoc[new monitors] using the metrics and setup xref:/use/alerting/notifications/configure.adoc[notifications] to get notified when your application is not available or having performance issues. 
+endif::[] == More info diff --git a/docs/latest/modules/en/pages/setup/otel/instrumentation/dot-net.adoc b/docs/latest/modules/en/pages/setup/otel/instrumentation/dot-net.adoc index 5d0df232..37493284 100644 --- a/docs/latest/modules/en/pages/setup/otel/instrumentation/dot-net.adoc +++ b/docs/latest/modules/en/pages/setup/otel/instrumentation/dot-net.adoc @@ -55,4 +55,10 @@ Make sure you use the OTLP exporter (this is the default) and https://openteleme == Metrics in SUSE Observability +ifdef::ss-ff-stackpacks2_enabled[] +For some .NET metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes,the charts are available on the pods. It is possible to xref:/setup/custom-integrations/metric-bindings/index.adoc[add charts for more metrics], this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation. +endif::[] + +ifndef::ss-ff-stackpacks2_enabled[] For some .NET metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes,the charts are available on the pods. It is possible to xref:/use/metrics/k8s-add-charts.adoc[add charts for more metrics], this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation. 
+endif::[]
diff --git a/docs/latest/modules/en/pages/setup/otel/instrumentation/java.adoc b/docs/latest/modules/en/pages/setup/otel/instrumentation/java.adoc
index 0ab37e53..1e0cfd62 100644
--- a/docs/latest/modules/en/pages/setup/otel/instrumentation/java.adoc
+++ b/docs/latest/modules/en/pages/setup/otel/instrumentation/java.adoc
@@ -39,4 +39,10 @@ Make sure you use the OTLP exporter (this is the default) and https://openteleme
 
 == Metrics in SUSE Observability
 
+ifdef::ss-ff-stackpacks2_enabled[]
+For some Java metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes, the charts are available on the pods. It is possible to xref:/setup/custom-integrations/metric-bindings/index.adoc[add charts for more metrics]; this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
+
+ifndef::ss-ff-stackpacks2_enabled[]
 For some Java metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes, the charts are available on the pods. It is possible to xref:/use/metrics/k8s-add-charts.adoc[add charts for more metrics], this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
diff --git a/docs/latest/modules/en/pages/setup/otel/languages/dot-net.adoc b/docs/latest/modules/en/pages/setup/otel/languages/dot-net.adoc
index 0407620e..b3fe4096 100644
--- a/docs/latest/modules/en/pages/setup/otel/languages/dot-net.adoc
+++ b/docs/latest/modules/en/pages/setup/otel/languages/dot-net.adoc
@@ -55,4 +55,12 @@ Make sure you use the OTLP exporter (this is the default) and https://openteleme
 
 == Metrics in SUSE Observability
 
+ifdef::ss-ff-stackpacks2_enabled[]
+For some .NET metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes, the charts are available on the pods. It is possible to xref:/setup/custom-integrations/metric-bindings/index.adoc[add charts for more metrics]; this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
+
+
+ifndef::ss-ff-stackpacks2_enabled[]
 For some .NET metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes,the charts are available on the pods. It is possible to xref:/use/metrics/k8s-add-charts.adoc[add charts for more metrics], this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
+
diff --git a/docs/latest/modules/en/pages/setup/otel/languages/java.adoc b/docs/latest/modules/en/pages/setup/otel/languages/java.adoc
index 67e553cc..eeb56865 100644
--- a/docs/latest/modules/en/pages/setup/otel/languages/java.adoc
+++ b/docs/latest/modules/en/pages/setup/otel/languages/java.adoc
@@ -39,4 +39,10 @@ Make sure you use the OTLP exporter (this is the default) and https://openteleme
 
 == Metrics in SUSE Observability
 
+ifdef::ss-ff-stackpacks2_enabled[]
+For some Java metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes, the charts are available on the pods. It is possible to xref:/setup/custom-integrations/metric-bindings/index.adoc[add charts for more metrics]; this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
+
+ifndef::ss-ff-stackpacks2_enabled[]
 For some Java metrics, for example, garbage collector metrics, SUSE Observability has defined charts on the related components. For Kubernetes, the charts are available on the pods. It is possible to xref:/use/metrics/k8s-add-charts.adoc[add charts for more metrics], this works for metrics from automatic instrumentation but also for application-specific metrics from manual instrumentation.
+endif::[]
diff --git a/docs/latest/modules/en/pages/setup/otel/sampling.adoc b/docs/latest/modules/en/pages/setup/otel/sampling.adoc
index ec655201..89bc1b95 100644
--- a/docs/latest/modules/en/pages/setup/otel/sampling.adoc
+++ b/docs/latest/modules/en/pages/setup/otel/sampling.adoc
@@ -92,7 +92,7 @@ The example samples:
 
 For more details on the configuration options and different policies use the https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor[tail sampling readme].
 
-It is, however, not completely set-it-and-forget-it. If its resource usage starts growing you need to scale out to use multiple collectors to handle the tail sampling which will then also require link:https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/routingconnector/README.md[routing] to route traffic based on trace ID.
+It is, however, not completely set-it-and-forget-it. If its resource usage starts growing, you need to scale out to use multiple collectors to handle the tail sampling, which will then also require https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/connector/routingconnector/README.md[routing] to route traffic based on trace ID.
 
 == Sampling traces in combination with span metrics
 
diff --git a/docs/latest/modules/en/pages/setup/security/rbac/rbac_permissions.adoc b/docs/latest/modules/en/pages/setup/security/rbac/rbac_permissions.adoc
index 5fec9f76..7bafa80f 100644
--- a/docs/latest/modules/en/pages/setup/security/rbac/rbac_permissions.adoc
+++ b/docs/latest/modules/en/pages/setup/security/rbac/rbac_permissions.adoc
@@ -41,7 +41,12 @@ The following permissions are available in SUSE Observability:
 
 |metric-bindings
 |get, create, update, delete
+ifdef::ss-ff-stackpacks2_enabled[]
+|xref:/setup/custom-integrations/metric-bindings/index.adoc[Bindings to visualize data in the metrics perspective].
+endif::[]
+ifndef::ss-ff-stackpacks2_enabled[]
 |xref:/use/metrics/k8s-add-charts.adoc[Bindings to visualize data in the metrics perspective].
+endif::[]
 
 |metrics
 |get^[1]^, update
diff --git a/docs/latest/modules/en/pages/use/metrics/k8sTs-explore-metrics.adoc b/docs/latest/modules/en/pages/use/metrics/k8sTs-explore-metrics.adoc
index 4d86625e..c26cba80 100644
--- a/docs/latest/modules/en/pages/use/metrics/k8sTs-explore-metrics.adoc
+++ b/docs/latest/modules/en/pages/use/metrics/k8sTs-explore-metrics.adoc
@@ -9,7 +9,13 @@ image::k8s/k8s-metrics-explorer.png[Metrics Explorer]
 
 == PromQL queries
 
+ifdef::ss-ff-stackpacks2_enabled[]
+The query input field has auto-suggestions for metric names, label names and values, and supported PromQL functions. See the Prometheus documentation for a complete https://prometheus.io/docs/prometheus/latest/querying/basics/[PromQL guide and reference]. SUSE Observability also adds 2 default parameters that can be used in any query: `+${__interval}+` and `+${__rate_interval}+`. They can be used to scale the aggregation interval automatically with the chart resolution (xref:/setup/custom-integrations/metric-bindings/writing-promql.adoc[more details]).
+endif::[]
+
+ifndef::ss-ff-stackpacks2_enabled[]
 The query input field has auto-suggestions for metric names, label names and values, and supported PromQL functions. See the Prometheus documentation for a complete https://prometheus.io/docs/prometheus/latest/querying/basics/[PromQL guide and reference]. SUSE Observability also adds 2 default parameters that can be used in any query: `+${__interval}+` and `+${__rate_interval}+`. They can be used to scale the aggregation interval automatically with the chart resolution (xref:/use/metrics/k8s-writing-promql-for-charts.adoc[more details]).
+endif::[]
 
 == See also
diff --git a/docs/latest/modules/en/pages/use/views/k8s-metrics-perspective.adoc b/docs/latest/modules/en/pages/use/views/k8s-metrics-perspective.adoc
index 169956d8..5a834308 100644
--- a/docs/latest/modules/en/pages/use/views/k8s-metrics-perspective.adoc
+++ b/docs/latest/modules/en/pages/use/views/k8s-metrics-perspective.adoc
@@ -13,4 +13,10 @@ Charts show metrics data for the selected components in near real-time - data is
 
 == Ordering
 
+ifdef::ss-ff-stackpacks2_enabled[]
+Metric charts are ordered on priority and name. Both are configured on the xref:/setup/custom-integrations/metric-bindings/index.adoc[metric binding].
+endif::[]
+
+ifndef::ss-ff-stackpacks2_enabled[]
 Metric charts are ordered on priority and name. Both are configured on the xref:/use/metrics/k8s-add-charts.adoc[metric binding].
+endif::[]