OTel working branch targeting OTel feature branch #4894
base: janan07/otel-feature-branch
Changes from 41 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| # API ML Provided Observability Signals and Attributes | ||
|
|
||
| **TODO: Dev to provide Actual Signals and Attributes** | ||
|
|
||
| <!-- This could be included in this topic. Please review --> | ||
|
|
||
| ## Custom Telemetry Template | ||
| Use this template when requesting or defining new custom telemetry signals (metrics, traces, or logs) for the API ML: | ||
|
|
||
| * **Signal Type**: (Metric / Trace / Log) | ||
| * **Name**: `zowe.apiml.[component].[functional_area]` | ||
| * **Description**: What does this signal represent? | ||
| * **Required Attributes**: | ||
| * `route.id`: Identifier of the routed service. | ||
| * `client.id`: (Optional) The ID of the consuming application. | ||
| * `zos.smf.id`: Automatically inherited from Resource. | ||
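The naming convention in the template can be checked mechanically. The following Python sketch is illustrative only: the regular expression is one assumed reading of `zowe.apiml.[component].[functional_area]` (lowercase segments made of letters, digits, and underscores), and the function name and sample signal names are hypothetical.

```python
import re

# Assumed interpretation of `zowe.apiml.[component].[functional_area]`:
# exactly four dot-separated segments, the last two in lowercase snake_case.
SIGNAL_NAME = re.compile(r"zowe\.apiml\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*")


def is_valid_signal_name(name: str) -> bool:
    """Return True if the name follows the template's naming convention."""
    return SIGNAL_NAME.fullmatch(name) is not None


# Hypothetical names, used only to exercise the check.
print(is_valid_signal_name("zowe.apiml.gateway.request_count"))
print(is_valid_signal_name("apiml.request.errors"))
```

A check of this kind could run when a new signal is proposed, so that names missing the `zowe.apiml` prefix are caught before review.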
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| # Configuring OpenTelemetry Deployment Attributes | ||
|
|
||
| Configure deployment-specific resource attributes for the Zowe API ML to categorize telemetry data by the lifecycle stage of the application, for example to distinguish between production, staging, and development environments. | ||
|
|
||
| Unlike z/OS attributes, which are often discovered automatically, deployment attributes are strictly informative and are typically defined manually. These attributes do not affect the unique identity of the service but are essential for filtering and grouping data within your observability backend. Explicitly labeling your environment helps prevent performance anomalies in a test environment from triggering false alerts in production monitoring views. | ||
|
||
|
|
||
| ## Deployment Attribute Reference | ||
|
|
||
| The following attribute describes the environment in which the API ML single-service deployment runs: | ||
|
|
||
| * **deployment.environment.name** | ||
| Specifies the name of the deployment environment (Example: dev, test, staging, or production). Configuration Source: zowe.yaml | ||
|
|
||
| ## Configuration Example in zowe.yaml | ||
|
|
||
| To set the deployment environment, add the `deployment.environment.name` key to the `resource.attributes` section of your zowe.yaml file. | ||
|
|
||
| ``` | ||
| zowe: | ||
|   observability: | ||
|     enabled: true | ||
|     resource: | ||
|       attributes: | ||
|         # Deployment Attribute (Manual Entry) | ||
|         deployment.environment.name: "production" | ||
| ``` | ||
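To illustrate why the environment label matters for filtering, the following Python sketch keeps only the records that carry a given `deployment.environment.name`. The record layout, the `latency_ms` name, and the sample values are assumptions made for this example and do not represent the API ML export format.

```python
ENV_KEY = "deployment.environment.name"


def records_for_environment(records, environment):
    """Keep only records whose resource is labeled with the given environment."""
    return [r for r in records if r.get("resource", {}).get(ENV_KEY) == environment]


# Hypothetical records: a slow test system and a healthy production system.
records = [
    {"name": "latency_ms", "value": 950, "resource": {ENV_KEY: "test"}},
    {"name": "latency_ms", "value": 40, "resource": {ENV_KEY: "production"}},
]

production_only = records_for_environment(records, "production")
print([r["value"] for r in production_only])
```

In this sketch the slow test record never reaches a production view, which is the effect the attribute is meant to achieve in an observability backend.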
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,96 @@ | ||
| # Configure OpenTelemetry Service Attributes | ||
|
|
||
| Services are identified via the `service.name`, `service.namespace`, and `service.instance.id` properties. Together, these attributes create a unique identity for API ML instances across your enterprise. | ||
|
||
|
|
||
| In complex mainframe environments, you may have multiple API ML installations across different Sysplexes or data centers. To monitor these effectively, you must balance Logical Grouping (viewing all API ML traffic as one functional unit) with Instance Differentiation (identifying exactly which specific Address Space is experiencing an issue). | ||
|
|
||
| ## The Hierarchy of Identification | ||
| OpenTelemetry uses a three-tier approach to define service identity: | ||
|
|
||
| * **service.name** (The Service) | ||
| Identifies the logical name of the service. This property value should be identical for all instances across your entire organization that perform the same function (e.g., zowe-apiml). The name is expected to be globally unique if `service.namespace` is not defined. | ||
|
|
||
| * **service.namespace** (The Environment/Site) | ||
| Groups services into logical sets. Use this property value to distinguish between different installations, such as sysplex-a vs. sysplex-b, or north-datacenter vs. south-datacenter. `service.name` is expected to be unique within the same `namespace`. | ||
|
|
||
| * **service.instance.id** (The Unique Instance) | ||
| Identifies a specific running process or Address Space. This value must be globally unique for every instance. Because multiple z/OS systems can run identical Job Names, combine the Job Name with a unique identifier (such as the LPAR name or a UUID) so that the instance can be isolated during troubleshooting. | ||
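The guidance on `service.instance.id` can be sketched in Python as follows. The helper names, the hyphen separator, and the `LPAR1`/`LPAR2` values are assumptions for illustration; the sketch only shows why a Job Name alone can collide while a combined value does not.

```python
def build_instance_id(job_name: str, lpar_name: str) -> str:
    """Combine the LPAR name and Job Name into one identifier (assumed format)."""
    return f"{lpar_name}-{job_name}"


def find_duplicate_ids(instance_ids):
    """Return the identifiers that appear more than once, sorted."""
    seen = set()
    duplicates = set()
    for instance_id in instance_ids:
        if instance_id in seen:
            duplicates.add(instance_id)
        seen.add(instance_id)
    return sorted(duplicates)


# Two systems running the same Job Name collide when the Job Name is used alone.
job_only = ["APIML01", "APIML01"]
combined = [
    build_instance_id("APIML01", "LPAR1"),
    build_instance_id("APIML01", "LPAR2"),
]

print(find_duplicate_ids(job_only))
print(find_duplicate_ids(combined))
```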
|
|
||
| <!-- Should we add service.version to this list of properties? --> | ||
|
|
||
| ## Configuration Examples | ||
|
|
||
| **Example 1: Single API ML Installation (High Availability)** | ||
|
|
||
| In this scenario, both instances share the same namespace because they belong to the same logical cluster on the same Sysplex. | ||
|
|
||
| | Attribute | Instance 1 | Instance 2 | | ||
| | :--- | :--- | :--- | | ||
| | **service.name** | `zowe-apiml` | `zowe-apiml` | | ||
| | **service.namespace** | `production-plex` | `production-plex` | | ||
| | **service.instance.id** | `APIML01` | `APIML02` | | ||
|
|
||
| **Instance 1 configuration** | ||
| ``` | ||
| zowe: | ||
|   components: | ||
|     api-mediation-layer: | ||
|       observability: | ||
|         enabled: true | ||
|         resource: | ||
|           attributes: | ||
|             service.name: "zowe-apiml" | ||
|             service.namespace: "production-plex" | ||
|
Contributor
I suggest we put only examples of complete configuration instead of partial ones to specific attribute groups. For instance, here I would expect the 'production' to be set as deployment attribute
Collaborator
Author
Are you suggesting then that we remove all configuration examples in this article or replace them with the complete configuration? If the latter, can you provide me with the complete configuration?
Contributor
If we provide only the partial samples, can we be sure that the users will be able to navigate the docs to follow the right full sample? The use-case from the sysprog's perspective is to enable observability, not configure just some subset of attributes. They need to do it either all or nothing. We do not have the configuration yet, as the implementation is still in progress.
Collaborator
Author
That's fine, we can remove all of the examples and have a sample when the implementation is completed. |
||
|             service.instance.id: "APIML01" | ||
| ``` | ||
| **Instance 2 configuration** | ||
| ``` | ||
| zowe: | ||
|   components: | ||
|     api-mediation-layer: | ||
|       observability: | ||
|         enabled: true | ||
|         resource: | ||
|           attributes: | ||
|             service.name: "zowe-apiml" | ||
|             service.namespace: "production-plex" | ||
|             service.instance.id: "APIML02" | ||
| ``` | ||
|
|
||
| **Example 2: Multi-Site Deployment** | ||
|
|
||
| In this scenario, instances are separated by namespace to represent their physical data center locations. | ||
|
|
||
| | Attribute | Site 1 (Instance A) | Site 1 (Instance B) | Site 2 (Instance C) | | ||
| | :--- | :--- | :--- | :--- | | ||
| | **service.name** | `zowe-apiml` | `zowe-apiml` | `zowe-apiml` | | ||
| | **service.namespace** | `east-coast` | `east-coast` | `west-coast` | | ||
| | **service.instance.id** | `ZOWE-E1` | `ZOWE-E2` | `ZOWE-W1` | | ||
|
|
||
| **Site 1 (East Coast) Configuration:** | ||
|
|
||
| ``` | ||
| zowe: | ||
|   components: | ||
|     api-mediation-layer: | ||
|       observability: | ||
|         enabled: true | ||
|         resource: | ||
|           attributes: | ||
|             service.name: "zowe-apiml" | ||
|             service.namespace: "east-coast" | ||
|             service.instance.id: "ZOWE-E1" | ||
| ``` | ||
| **Site 2 (West Coast) Configuration:** | ||
| ``` | ||
| zowe: | ||
|   components: | ||
|     api-mediation-layer: | ||
|       observability: | ||
|         enabled: true | ||
|         resource: | ||
|           attributes: | ||
|             service.name: "zowe-apiml" | ||
|             service.namespace: "west-coast" | ||
|             service.instance.id: "ZOWE-W1" | ||
| ``` | ||
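The multi-site table shows both logical grouping and instance differentiation. As a rough Python sketch, the three instances from the table can be grouped by `service.namespace` while each `service.instance.id` stays individually addressable. The tuple layout and the function name are assumptions for this example.

```python
from collections import defaultdict

# (service.name, service.namespace, service.instance.id) taken from the table.
instances = [
    ("zowe-apiml", "east-coast", "ZOWE-E1"),
    ("zowe-apiml", "east-coast", "ZOWE-E2"),
    ("zowe-apiml", "west-coast", "ZOWE-W1"),
]


def group_by_namespace(entries):
    """Map each namespace to the instance IDs that belong to it."""
    groups = defaultdict(list)
    for _name, namespace, instance_id in entries:
        groups[namespace].append(instance_id)
    return dict(groups)


print(group_by_namespace(instances))
```

All three entries share one `service.name`, so a backend can still present them as a single functional unit when the namespace is ignored.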
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| # Configure OpenTelemetry z/OS Attributes | ||
|
|
||
| <!-- VALIDATE THIS CONTENT AFTER SUPPORT IS IMPLEMENTED. --> | ||
|
|
||
| z/OS-specific resource attributes for API ML provide essential mainframe context to your telemetry data, allowing you to correlate metrics, traces, and logs with specific system identifiers such as SMF IDs, Sysplex names, and LPARs. By providing z/OS platform context, mainframe performance data can be integrated into distributed observability backends. | ||
|
|
||
| ## How system discovery works | ||
|
|
||
| The z/OS attributes are primarily populated through an automated System Discovery process that occurs during the initialization of the API ML service. The integrated OpenTelemetry SDK executes platform-specific calls to query z/OS control blocks and fields (such as the ECVT, or the CVTSNAME field of the CVT) and system variables. | ||
|
|
||
| ## z/OS Attribute Reference | ||
|
|
||
| The following attributes are captured during system discovery to describe the mainframe environment: | ||
|
Collaborator
A couple of things:
|
||
|
|
||
janan07 marked this conversation as resolved.
|
||
| * **zos.smf.id** | ||
| The System Management Facility (SMF) Identifier that uniquely identifies a z/OS system within a SYSPLEX. | ||
| Configuration Source: System discovery | ||
|
|
||
| * **zos.sysplex.name** | ||
| The name of the SYSPLEX to which the z/OS system belongs. | ||
| Configuration Source: System discovery | ||
|
|
||
| * **mainframe.lpar.name** | ||
| Name of the LPAR that hosts the z/OS system. | ||
| Configuration Source: System discovery | ||
|
|
||
| * **os.type** | ||
| The operating system type, set to `zos`. | ||
| Configuration Source: Static | ||
|
|
||
| * **os.version** | ||
| The version string of the operating system (e.g., the release returned by `D IPLINFO`). | ||
| Configuration Source: System discovery | ||
|
|
||
| * **process.command** | ||
| The command or JOB name used to launch the Zowe process. | ||
| Configuration Source: System discovery | ||
|
|
||
| * **process.pid** | ||
| The Process Identifier, which on z/OS is set to the Address Space Identifier (ASID). | ||
| Configuration Source: System discovery | ||
|
|
||
| ## Overriding Discovered Attributes in zowe.yaml | ||
|
|
||
| While the discovery process handles most identifiers automatically, you may occasionally need to provide a manual override (for example, in shared environments where you wish to report a custom logical LPAR name). Specify overrides in the `resource.attributes` section of your zowe.yaml: | ||
|
|
||
| ``` | ||
| zowe: | ||
|   observability: | ||
|     enabled: true | ||
|     resource: | ||
|       attributes: | ||
|         # Overriding discovered z/OS attributes | ||
|         zos.smf.id: "MVS1" | ||
|         zos.sysplex.name: "LOCALPLX" | ||
|         mainframe.lpar.name: "PRODLPAR" | ||
| ``` | ||
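The override behavior described above amounts to a merge in which manually configured values take precedence over discovered ones. The Python sketch below assumes that precedence and uses hypothetical discovered values; it is not the API ML implementation, which is still in progress.

```python
def effective_attributes(discovered, overrides):
    """Merge the two attribute sets; manual overrides win on key conflicts."""
    merged = dict(discovered)
    merged.update(overrides)
    return merged


# Hypothetical values standing in for the result of system discovery.
discovered = {
    "zos.smf.id": "SYS1",
    "zos.sysplex.name": "PLEX1",
    "mainframe.lpar.name": "LPAR1",
    "os.type": "zos",
}
# Manual entry from zowe.yaml, as in the example above.
overrides = {"mainframe.lpar.name": "PRODLPAR"}

print(effective_attributes(discovered, overrides))
```

Attributes without an override, such as `zos.smf.id` here, keep their discovered values.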
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| # Enabling API ML Observability in zowe.yaml | ||
|
|
||
| Review how to enable and configure the OpenTelemetry (OTel) integration within the Zowe API Mediation Layer (API ML) single-service deployment. Configure these parameters in `zowe.yaml` to enable API ML to export metrics, traces, and logs to an OpenTelemetry Collector. | ||
|
|
||
| ## Configuration Overview | ||
|
|
||
| The observability configuration is located in the API Mediation Layer section under `components` in zowe.yaml and has three properties: | ||
|
|
||
| * **enabled** | ||
| Set to `true` to initialize the OpenTelemetry (OTel) SDK. | ||
|
||
|
|
||
| * **exporter** | ||
| Defines where the data is sent. Sub-properties of `exporter` include the following: | ||
|
|
||
| * **exporter.otlp.endpoint** | ||
|
||
| The URL of your OTLP-compatible collector (e.g., z-Iris or Jaeger). | ||
|
||
|
|
||
| * **exporter.otlp.protocol** | ||
| The protocol is either `grpc` or `http/protobuf`. | ||
| **Default:** `grpc` | ||
|
||
|
|
||
| * **resource** | ||
| Defines the identity of the producer (Attributes). | ||
|
|
||
| * **resource.attributes** | ||
| A collection of key-value pairs used to identify the telemetry source. See the following sub-properties of `resource.attributes`: | ||
|
|
||
| * **service.name** | ||
|
Contributor
Here we duplicate the explanation of service attributes that we have in a separate md file just for them. Similarly for the deployment. It will be difficult to keep them in sync over time.
Collaborator
Author
If you think it's sufficient, I can link to the specific attribute article for these three resources. It just seems that if we have an article for enablement, the user should have an easy reference to what parms are being configured...
Contributor
You are right, but we do not have specific list now as some of the attributes are discovered automatically. The purpose of this PR is to create a skeleton to be updated once we get the issues implemented.
Collaborator
Author
Correct. This content is as a place-holder. I understand this could all change with the implementation. |
||
| Logical name of the service. Must be the same for all instances within the same HA deployment. Expected to be globally unique if `namespace` is not defined. | ||
|
|
||
| * **service.namespace** | ||
| The assigned value should help distinguish a group of services, such as the LPAR, or owner team. `service.name` is expected to be unique within the same `namespace`. | ||
|
|
||
| * **deployment.environment.name** | ||
| Specifies the name of the deployment environment (Example: dev, test, staging, or production). Configuration Source: zowe.yaml | ||
|
|
||
| To enable observability, configure the OpenTelemetry exporter and resource attributes within your `zowe.yaml` file with the following structure: | ||
|
|
||
| ``` | ||
| zowe: | ||
|   observability: | ||
|     enabled: true | ||
|     exporter: | ||
|       otlp: | ||
|         endpoint: "http://otel-collector.your.domain:4317" | ||
|         protocol: "grpc" | ||
janan07 marked this conversation as resolved.
|
||
|         timeout: 10000 | ||
|     resource: | ||
|       attributes: | ||
|         service.name: "zowe-apiml" | ||
|         service.namespace: "finance-production" | ||
|         deployment.environment.name: "production" | ||
| ``` | ||
|
|
||
| ## How the Export Works | ||
|
|
||
| When `enabled: true` is set, the API ML single-service deployment starts a background telemetry engine. This engine gathers all signals, bundles them with the Resource Attributes, and pushes the bundles through the OTLP Exporter to your specified endpoint. | ||
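The bundling step can be pictured with a small Python sketch. The dictionary layout is a simplified stand-in chosen for this example and is not the OTLP wire format; the function name and the sample signals are hypothetical. The point is only that the resource attributes are stated once and travel with every signal in the batch.

```python
def build_export_batch(signals, resource_attributes):
    """Attach one shared set of resource attributes to a batch of signals."""
    return {
        "resource": {"attributes": dict(resource_attributes)},
        "signals": list(signals),
    }


# Resource attributes as configured in the zowe.yaml example above.
resource_attributes = {
    "service.name": "zowe-apiml",
    "service.namespace": "finance-production",
    "deployment.environment.name": "production",
}
# Hypothetical signals gathered by the telemetry engine.
signals = [
    {"type": "metric", "name": "apiml.request.errors", "value": 3},
    {"type": "log", "body": "request failed"},
]

batch = build_export_batch(signals, resource_attributes)
print(batch["resource"]["attributes"]["service.name"], len(batch["signals"]))
```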
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| # Outline of API ML Observability Topics | ||
|
|
||
| The following files will be presented under Advanced server-side configuration under the **Install** tab: | ||
|
|
||
| * Configuring API ML Observability via OpenTelemetry | ||
| * [Configuring OpenTelemetry service attributes](configuring-otel-service-attributes.md) | ||
| * [Configuring OpenTelemetry deployment attributes](configuring-otel-deployment-attributes.md) | ||
| * [Configuring OpenTelemetry z/OS attributes](configuring-otel-zos-attributes.md) | ||
| * [Enabling Observability in zowe.yaml](enabling-observability-in-zowe.yaml.md) | ||
|
|
||
| The following files will be presented under Using Zowe API Mediation Layer under the **Use** tab: | ||
|
|
||
| * [Using your API ML OpenTelemetry metrics](using-your-otel-metrics.md) | ||
| * [API ML Provided Observability Signals and Attributes](apiml-provided-observability-signals-and-attributes.md) | ||
| * [Sample Output from API ML OpenTelemetry](sample-output-from-apiml-otel.md) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| # Sample Output from API ML OpenTelemetry | ||
|
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Using Your API ML OpenTelemetry Metrics | ||
|
|
||
| ## Examples of Using Telemetry Data in API ML | ||
janan07 marked this conversation as resolved.
Outdated
|
||
|
|
||
| How a system administrator interacts with this data depends on the visualization tool used (e.g., Grafana, Jaeger, or Broadcom WatchTower). | ||
|
|
||
| ### Example 1: High-Level Health Monitoring (Metrics) | ||
| A system administrator views a Grafana dashboard. The administrator notices a spike in **`apiml.request.errors`**. | ||
| * **The View**: A red line graph shows a sudden jump from 0% to 15% error rate. | ||
| * **The Insight**: By filtering the dashboard using the attribute **`zos.smf.id`**, the admin realizes the errors are only occurring on **LPAR1**, while **LPAR2** remains healthy. This suggests a local configuration or connectivity issue on a specific system rather than a global software bug. | ||
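The per-system breakdown in Example 1 can be approximated in Python. The request records and the `SYS1`/`SYS2` identifiers below are invented for illustration (they stand in for the systems on LPAR1 and LPAR2); a real dashboard would run an equivalent query in the observability backend.

```python
from collections import defaultdict


def error_rate_by_system(requests):
    """Return the fraction of failed requests for each zos.smf.id value."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for request in requests:
        system = request["zos.smf.id"]
        totals[system] += 1
        if request["error"]:
            errors[system] += 1
    return {system: errors[system] / totals[system] for system in totals}


# Hypothetical sample: 20 requests per system, 3 failures on SYS1 only.
requests = (
    [{"zos.smf.id": "SYS1", "error": i < 3} for i in range(20)]
    + [{"zos.smf.id": "SYS2", "error": False} for _ in range(20)]
)

rates = error_rate_by_system(requests)
print(rates)
```

Grouping by `zos.smf.id` is what separates a system-local problem from a failure that affects every instance.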
|
|
||
|
|
||
| ### Example 2: Latency Troubleshooting (Traces) | ||
| A user reports that a specific API is "timing out." The admin finds the relevant **`traceId`** in the logs and opens it in a trace viewer. | ||
| * **The View**: A "Gantt chart" style visualization of the request. | ||
| * **The Insight**: | ||
| * `apiml.gateway.total`: 2005ms | ||
| * `apiml.auth.check`: 5ms | ||
| * `apiml.backend.proxy`: 2000ms | ||
| * **The Action**: The admin sees that the Modulith itself only spent 5ms on logic, but waited 2 seconds for the backend mainframe service to respond. The admin can now confidently contact the specific backend service team. | ||
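The reasoning in Example 2 can be reproduced with the span durations listed above. This Python sketch uses those three numbers directly; the function name is hypothetical, and a trace viewer performs the same comparison visually.

```python
# Durations in milliseconds, taken from the example trace above.
total_ms = 2005
child_spans = {
    "apiml.auth.check": 5,
    "apiml.backend.proxy": 2000,
}


def slowest_span(spans):
    """Return the (name, duration) pair with the longest duration."""
    return max(spans.items(), key=lambda item: item[1])


name, duration = slowest_span(child_spans)
share = duration / total_ms
print(f"{name}: {duration}ms ({share:.1%} of total)")
```

Because nearly all of the elapsed time sits in the backend proxy span, the delay is attributed to the backend service rather than to API ML logic.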
|
||
|
|
||
|
|
||
I can't tell how relevant it is. Do we provide similar templates to request new feature also for other areas?