2 changes: 2 additions & 0 deletions .github/workflows/skywalking.yaml
@@ -664,6 +664,8 @@ jobs:
config: test/e2e-v2/cases/activemq/e2e.yaml
- name: Kong
config: test/e2e-v2/cases/kong/e2e.yaml
- name: Flink
config: test/e2e-v2/cases/flink/e2e.yaml

- name: UI Menu BanyanDB
config: test/e2e-v2/cases/menu/banyandb/e2e.yaml
1 change: 1 addition & 0 deletions docs/en/changes/changes.md
@@ -11,6 +11,7 @@
* Support `hot/warm/cold` stages TTL query in the status API.
* PromQL Service: traffic query support `limit` and regex match.
* Fix an edge case of HashCodeSelector(Integer#MIN_VALUE causes ArrayIndexOutOfBoundsException).
* Support Flink monitoring.

#### UI

2 changes: 1 addition & 1 deletion docs/en/setup/backend/backend-clickhouse-monitoring.md
@@ -21,7 +21,7 @@ the metrics to
.
2. Set up [OpenTelemetry Collector ](https://opentelemetry.io/docs/collector/getting-started/#docker). For details on
Prometheus Receiver in OpenTelemetry Collector, refer
to [here](../../../../test/e2e-v2/cases/mysql/prometheus-mysql-exporter/otel-collector-config.yaml).
to [here](../../../../test/e2e-v2/cases/clickhouse/clickhouse-prometheus-endpoint/otel-collector-config.yaml); a minimal sketch also follows this list.
3. Config SkyWalking [OpenTelemetry receiver](opentelemetry-receiver.md).
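The following is a minimal sketch of such a collector configuration, not the shipped e2e file: the job name, host names, and ports are assumptions and must be adapted, and the job name in particular has to match what the ClickHouse rule files filter on.

```yaml
# Minimal sketch only (host names, ports, and the job name are assumptions).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: clickhouse-monitoring        # must match the job name the ClickHouse rules filter on
          scrape_interval: 30s
          static_configs:
            - targets: ['clickhouse:9363']       # ClickHouse's embedded Prometheus endpoint

exporters:
  otlp:
    endpoint: oap:11800                          # SkyWalking OAP gRPC/OTLP port
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```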

### ClickHouse Monitoring
105 changes: 105 additions & 0 deletions docs/en/setup/backend/backend-flink-monitoring.md

Large diffs are not rendered by default.

5 changes: 4 additions & 1 deletion docs/en/setup/backend/opentelemetry-receiver.md
@@ -60,4 +60,7 @@ for identification of the metric data.
| Metrics of ClickHouse | otel-rules/clickhouse/clickhouse-service.yaml | ClickHouse(embedded prometheus endpoint) -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of RocketMQ | otel-rules/rocketmq/rocketmq-cluster.yaml | rocketmq-exporter -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of RocketMQ | otel-rules/rocketmq/rocketmq-broker.yaml | rocketmq-exporter -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of RocketMQ | otel-rules/rocketmq/rocketmq-topic.yaml | rocketmq-exporter -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of Flink | otel-rules/flink/flink-jobManager.yaml | flink jobManager -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of Flink | otel-rules/flink/flink-taskManager.yaml | flink taskManager -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
| Metrics of Flink | otel-rules/flink/flink-job.yaml | flink jobManager & flink taskManager -> OpenTelemetry Collector -- OTLP exporter --> SkyWalking OAP Server |
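The Flink rule files added below filter on the OpenTelemetry job names `flink-jobManager-monitoring` and `flink-taskManager-monitoring`. A hedged sketch of the matching scrape configuration for the collector's `prometheus` receiver follows; the target host names and the reporter port (9249 is the common default of Flink's `flink-metrics-prometheus` reporter) are assumptions:

```yaml
# Sketch only: targets assume Flink's Prometheus metrics reporter is enabled.
scrape_configs:
  - job_name: flink-jobManager-monitoring      # matched by flink-jobManager.yaml and flink-job.yaml
    static_configs:
      - targets: ['jobmanager:9249']
  - job_name: flink-taskManager-monitoring     # matched by flink-taskManager.yaml and flink-job.yaml
    static_configs:
      - targets: ['taskmanager:9249']
```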
50 changes: 25 additions & 25 deletions docs/en/swip/SWIP-9.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/menu.yml
@@ -152,6 +152,10 @@ catalog:
path: "/en/setup/backend/dashboards-so11y-java-agent"
- name: "SkyWalking Go Agent self telemetry"
path: "/en/setup/backend/dashboards-so11y-go-agent"
- name: "Data Processing Engine"
catalog:
- name: "Flink"
path: "/en/setup/backend/backend-flink-monitoring.md"
- name: "Configuration Vocabulary"
path: "/en/setup/backend/configuration-vocabulary"
- name: "Advanced Setup"
Layer.java
@@ -251,7 +251,12 @@ public enum Layer {
* The self observability of SkyWalking Go Agent,
* which provides the abilities to measure the tracing performance and error statistics of plugins.
*/
SO11Y_GO_AGENT(41, true);
SO11Y_GO_AGENT(41, true),

/**
* Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
*/
FLINK(42, true);

private final int value;
/**
UITemplateInitializer.java
@@ -79,6 +79,7 @@ public class UITemplateInitializer {
Layer.SO11Y_JAVA_AGENT.name(),
Layer.KONG.name(),
Layer.SO11Y_GO_AGENT.name(),
Layer.FLINK.name(),
"custom"
};
private final UITemplateManagementService uiTemplateManagementService;
application.yml
@@ -370,7 +370,7 @@ receiver-otel:
selector: ${SW_OTEL_RECEIVER:default}
default:
enabledHandlers: ${SW_OTEL_RECEIVER_ENABLED_HANDLERS:"otlp-metrics,otlp-logs"}
enabledOtelMetricsRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_METRICS_RULES:"apisix,nginx/*,k8s/*,istio-controlplane,vm,mysql/*,postgresql/*,oap,aws-eks/*,windows,aws-s3/*,aws-dynamodb/*,aws-gateway/*,redis/*,elasticsearch/*,rabbitmq/*,mongodb/*,kafka/*,pulsar/*,bookkeeper/*,rocketmq/*,clickhouse/*,activemq/*,kong/*"}
enabledOtelMetricsRules: ${SW_OTEL_RECEIVER_ENABLED_OTEL_METRICS_RULES:"apisix,nginx/*,k8s/*,istio-controlplane,vm,mysql/*,postgresql/*,oap,aws-eks/*,windows,aws-s3/*,aws-dynamodb/*,aws-gateway/*,redis/*,elasticsearch/*,rabbitmq/*,mongodb/*,kafka/*,pulsar/*,bookkeeper/*,rocketmq/*,clickhouse/*,activemq/*,kong/*,flink/*"}

receiver-zipkin:
selector: ${SW_RECEIVER_ZIPKIN:-}
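The `enabledOtelMetricsRules` default above now ends with `flink/*`, which enables the three Flink rule files by directory wildcard. When the variable is overridden at deployment time, the Flink entry has to be kept or added explicitly; a minimal docker-compose-style sketch, with the service name and image tag as assumptions:

```yaml
services:
  oap:
    image: apache/skywalking-oap-server:latest   # tag is an assumption
    environment:
      SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp-metrics"
      # flink/* enables flink-jobManager.yaml, flink-taskManager.yaml and flink-job.yaml
      SW_OTEL_RECEIVER_ENABLED_OTEL_METRICS_RULES: "flink/*"
```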
otel-rules/flink/flink-job.yaml
@@ -0,0 +1,72 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This will parse a textual representation of a duration. The formats
# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
# with days considered to be exactly 24 hours.
# <p>
# Examples:
# <pre>
# "PT20.345S" -- parses as "20.345 seconds"
# "PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
# "PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
# "P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
# "P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
# "P-6H3M" -- parses as "-6 hours and +3 minutes"
# "-P6H3M" -- parses as "-6 hours and -3 minutes"
# "-P-6H+3M" -- parses as "+6 hours and -3 minutes"
# </pre>
filter: "{ tags -> tags.job_name == 'flink-jobManager-monitoring' || tags.job_name == 'flink-taskManager-monitoring' }" # The OpenTelemetry job name
expSuffix: tag({tags -> tags.cluster = 'flink::' + tags.cluster}).endpoint(['cluster'], ['flink_job_name'], Layer.FLINK)
metricPrefix: meter_flink_job
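# Each rule below is stored as meter_flink_job_<name>; the expSuffix above attaches it
# to an endpoint named by flink_job_name under the 'flink::<cluster>' service.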

metricsRules:

- name: restart_number
exp: flink_jobmanager_job_numRestarts.sum(['cluster','flink_job_name'])
- name: runningTime
exp: flink_jobmanager_job_runningTime.sum(['cluster','flink_job_name'])
- name: restartingTime
exp: flink_jobmanager_job_restartingTime.sum(['cluster','flink_job_name'])
- name: cancellingTime
exp: flink_jobmanager_job_cancellingTime.sum(['cluster','flink_job_name'])

# checkpoints
- name: checkpoints_total
exp: flink_jobmanager_job_totalNumberOfCheckpoints.sum(['cluster','flink_job_name'])
- name: checkpoints_failed
exp: flink_jobmanager_job_numberOfFailedCheckpoints.sum(['cluster','flink_job_name'])
- name: checkpoints_completed
exp: flink_jobmanager_job_numberOfCompletedCheckpoints.sum(['cluster','flink_job_name'])
- name: checkpoints_inProgress
exp: flink_jobmanager_job_numberOfInProgressCheckpoints.sum(['cluster','flink_job_name'])
- name: lastCheckpointSize
exp: flink_jobmanager_job_lastCheckpointSize.sum(['cluster','flink_job_name'])
- name: lastCheckpointDuration
exp: flink_jobmanager_job_lastCheckpointDuration.sum(['cluster','flink_job_name'])

- name: currentEmitEventTimeLag
exp: flink_taskmanager_job_task_operator_currentEmitEventTimeLag.sum(['cluster','flink_job_name','operator_name'])

- name: numRecordsIn
exp: flink_taskmanager_job_task_operator_numRecordsIn.sum(['cluster','flink_job_name','operator_name'])
- name: numRecordsOut
exp: flink_taskmanager_job_task_operator_numRecordsOut.sum(['cluster','flink_job_name','operator_name'])
- name: numBytesInPerSecond
exp: flink_taskmanager_job_task_operator_numBytesInPerSecond.sum(['cluster','flink_job_name','operator_name'])
- name: numBytesOutPerSecond
exp: flink_taskmanager_job_task_operator_numBytesOutPerSecond.sum(['cluster','flink_job_name','operator_name'])


otel-rules/flink/flink-jobManager.yaml
@@ -0,0 +1,80 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This will parse a textual representation of a duration. The formats
# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
# with days considered to be exactly 24 hours.
# <p>
# Examples:
# <pre>
# "PT20.345S" -- parses as "20.345 seconds"
# "PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
# "PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
# "P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
# "P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
# "P-6H3M" -- parses as "-6 hours and +3 minutes"
# "-P6H3M" -- parses as "-6 hours and -3 minutes"
# "-P-6H+3M" -- parses as "+6 hours and -3 minutes"
# </pre>
filter: "{ tags -> tags.job_name == 'flink-jobManager-monitoring' }" # The OpenTelemetry job name
expSuffix: tag({tags -> tags.cluster = 'flink::' + tags.cluster}).service(['cluster'], Layer.FLINK)
metricPrefix: meter_flink_jobManager
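# Each rule below is stored as meter_flink_jobManager_<name> and aggregated at the
# service level, with the service named after the 'flink::'-prefixed cluster tag.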
metricsRules:

# job
- name: running_job_number
exp: flink_jobmanager_numRunningJobs.sum(['cluster'])

# task
- name: taskManagers_registered_number
exp: flink_jobmanager_numRegisteredTaskManagers.sum(['cluster'])
- name: taskManagers_slots_total
exp: flink_jobmanager_taskSlotsTotal.sum(['cluster'])
- name: taskManagers_slots_available
exp: flink_jobmanager_taskSlotsAvailable.sum(['cluster'])

#jvm
- name: jvm_cpu_load
exp: flink_jobmanager_Status_JVM_CPU_Load.sum(['cluster','jobManager_node'])*1000
- name: jvm_cpu_time
exp: flink_jobmanager_Status_JVM_CPU_Time.sum(['cluster','jobManager_node']).increase('PT1M')
- name: jvm_memory_heap_used
exp: flink_jobmanager_Status_JVM_Memory_Heap_Used.sum(['cluster','jobManager_node'])
- name: jvm_memory_heap_available
exp: flink_jobmanager_Status_JVM_Memory_Heap_Max.sum(['cluster','jobManager_node'])-flink_jobmanager_Status_JVM_Memory_Heap_Used.sum(['cluster','jobManager_node'])
- name: jvm_memory_nonHeap_used
exp: flink_jobmanager_Status_JVM_Memory_NonHeap_Used.sum(['cluster','jobManager_node'])
- name: jvm_memory_nonHeap_available
exp: flink_jobmanager_Status_JVM_Memory_NonHeap_Max.sum(['cluster','jobManager_node'])-flink_jobmanager_Status_JVM_Memory_NonHeap_Used.sum(['cluster','jobManager_node'])
- name: jvm_thread_count
exp: flink_jobmanager_Status_JVM_Threads_Count.sum(['cluster','jobManager_node'])
- name: jvm_memory_metaspace_used
exp: flink_jobmanager_Status_JVM_Memory_Metaspace_Used.sum(['cluster','jobManager_node'])
- name: jvm_memory_metaspace_available
exp: flink_jobmanager_Status_JVM_Memory_Metaspace_Max.sum(['cluster','jobManager_node'])-flink_jobmanager_Status_JVM_Memory_Metaspace_Used.sum(['cluster','jobManager_node'])

- name: jvm_g1_young_generation_count
exp: flink_jobmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Count.sum(['cluster','jobManager_node']).increase('PT1M')
- name: jvm_g1_old_generation_count
exp: flink_jobmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Count.sum(['cluster','jobManager_node']).increase('PT1M')
- name: jvm_g1_young_generation_time
exp: flink_jobmanager_Status_JVM_GarbageCollector_G1_Young_Generation_Time.sum(['cluster','jobManager_node']).increase('PT1M')
- name: jvm_g1_old_generation_time
exp: flink_jobmanager_Status_JVM_GarbageCollector_G1_Old_Generation_Time.sum(['cluster','jobManager_node']).increase('PT1M')

- name: jvm_all_garbageCollector_count
exp: flink_jobmanager_Status_JVM_GarbageCollector_All_Count.sum(['cluster','jobManager_node']).increase('PT1M')
- name: jvm_all_garbageCollector_time
exp: flink_jobmanager_Status_JVM_GarbageCollector_All_Time.sum(['cluster','jobManager_node']).increase('PT1M')
otel-rules/flink/flink-taskManager.yaml
@@ -0,0 +1,89 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This will parse a textual representation of a duration. The formats
# accepted are based on the ISO-8601 duration format {@code PnDTnHnMn.nS}
# with days considered to be exactly 24 hours.
# <p>
# Examples:
# <pre>
# "PT20.345S" -- parses as "20.345 seconds"
# "PT15M" -- parses as "15 minutes" (where a minute is 60 seconds)
# "PT10H" -- parses as "10 hours" (where an hour is 3600 seconds)
# "P2D" -- parses as "2 days" (where a day is 24 hours or 86400 seconds)
# "P2DT3H4M" -- parses as "2 days, 3 hours and 4 minutes"
# "P-6H3M" -- parses as "-6 hours and +3 minutes"
# "-P6H3M" -- parses as "-6 hours and -3 minutes"
# "-P-6H+3M" -- parses as "+6 hours and -3 minutes"
# </pre>
filter: "{ tags -> tags.job_name == 'flink-taskManager-monitoring' }" # The OpenTelemetry job name
expSuffix: tag({tags -> tags.cluster = 'flink::' + tags.cluster}).instance(['cluster'], ['taskManager_node'], Layer.FLINK)
metricPrefix: meter_flink_taskManager
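# Each rule below is stored as meter_flink_taskManager_<name> and recorded per
# taskManager instance (taskManager_node) within the cluster's service.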
metricsRules:

# jvm
- name: jvm_cpu_load
exp: flink_taskmanager_Status_JVM_CPU_Load.sum(['cluster','taskManager_node'])*1000
- name: jvm_cpu_time
exp: flink_taskmanager_Status_JVM_CPU_Time.sum(['cluster','taskManager_node']).increase('PT1M')
- name: jvm_memory_heap_used
exp: flink_taskmanager_Status_JVM_Memory_Heap_Used.sum(['cluster','taskManager_node'])
- name: jvm_memory_heap_available
exp: flink_taskmanager_Status_JVM_Memory_Heap_Max.sum(['cluster','taskManager_node'])-flink_taskmanager_Status_JVM_Memory_Heap_Used.sum(['cluster','taskManager_node'])
- name: jvm_thread_count
exp: flink_taskmanager_Status_JVM_Threads_Count.sum(['cluster','taskManager_node'])
- name: jvm_memory_metaspace_available
exp: flink_taskmanager_Status_JVM_Memory_Metaspace_Max.sum(['cluster','taskManager_node'])-flink_taskmanager_Status_JVM_Memory_Metaspace_Used.sum(['cluster','taskManager_node'])
- name: jvm_memory_metaspace_used
exp: flink_taskmanager_Status_JVM_Memory_Metaspace_Used.sum(['cluster','taskManager_node'])

- name: jvm_memory_nonHeap_used
exp: flink_taskmanager_Status_JVM_Memory_NonHeap_Used.sum(['cluster','taskManager_node'])
- name: jvm_memory_nonHeap_available
exp: flink_taskmanager_Status_JVM_Memory_NonHeap_Max.sum(['cluster','taskManager_node'])-flink_taskmanager_Status_JVM_Memory_NonHeap_Used.sum(['cluster','taskManager_node'])

# records
- name: numRecordsIn
exp: flink_taskmanager_job_task_numRecordsIn.sum(['cluster','taskManager_node','flink_job_name','task_name']).increase('PT1M')
- name: numRecordsOut
exp: flink_taskmanager_job_task_numRecordsOut.sum(['cluster','taskManager_node','flink_job_name','task_name']).increase('PT1M')
- name: numBytesInPerSecond
exp: flink_taskmanager_job_task_numBytesInPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])
- name: numBytesOutPerSecond
exp: flink_taskmanager_job_task_numBytesOutPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])
# network
- name: netty_usedMemory
exp: flink_taskmanager_Status_Shuffle_Netty_UsedMemory.sum(['cluster','taskManager_node'])
- name: netty_availableMemory
exp: flink_taskmanager_Status_Shuffle_Netty_AvailableMemory.sum(['cluster','taskManager_node'])
- name: inPoolUsage
exp: flink_taskmanager_job_task_Shuffle_Netty_Input_Buffers_inPoolUsage.sum(['cluster','taskManager_node','flink_job_name','task_name'])*100
- name: outPoolUsage
exp: flink_taskmanager_job_task_Shuffle_Netty_Output_Buffers_outPoolUsage.sum(['cluster','taskManager_node','flink_job_name','task_name'])*100

# backPressured
- name: isBackPressured
exp: flink_taskmanager_job_task_isBackPressured.sum(['cluster','taskManager_node','flink_job_name','task_name'])
- name: idleTimeMsPerSecond
exp: flink_taskmanager_job_task_idleTimeMsPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])
- name: busyTimeMsPerSecond
exp: flink_taskmanager_job_task_busyTimeMsPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])
- name: softBackPressuredTimeMsPerSecond
exp: flink_taskmanager_job_task_softBackPressuredTimeMsPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])
- name: hardBackPressuredTimeMsPerSecond
exp: flink_taskmanager_job_task_hardBackPressuredTimeMsPerSecond.sum(['cluster','taskManager_node','flink_job_name','task_name'])

