Commit 1e9766a

SWIP-9, support flink monitoring (#13167)
1 parent 2e8d6e3 commit 1e9766a

3 files changed, +97 -1 lines changed

docs/en/changes/changes.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@
 #### Documentation

 * BanyanDB: Add `Data Lifecycle Stages(Hot/Warm/Cold)` documentation.
+* Add `SWIP-9 Support flink monitoring`.

 All issues and pull requests are [here](https://github.com/apache/skywalking/milestone/230?closed=1)

docs/en/swip/SWIP-9.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Support Flink Monitoring
## Motivation
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Now that SkyWalking can monitor OpenTelemetry metrics, I want to add Flink monitoring via the OpenTelemetry Collector, which fetches metrics from the HTTP endpoint that Flink provides
to expose metrics data for Prometheus.

## Architecture Graph
There is no significant architecture-level change.

## Proposed Changes
Flink exposes its metrics via an HTTP endpoint to the OpenTelemetry Collector, and the SkyWalking OpenTelemetry receiver receives these metrics.
Monitoring is provided at the cluster, instance, and endpoint dimensions.
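
To make the data flow more concrete, the following sketch parses a small sample of the Prometheus text exposition format, which is the kind of payload the OpenTelemetry Collector would scrape from Flink's HTTP endpoint. The sample metric names, the `host` label, and the `parse_samples` helper are illustrative assumptions and not part of this proposal. The actual names depend on how the Flink metrics reporter is configured.

```python
import re

# One sample line of the Prometheus text exposition format looks like:
#   metric_name{label="value",...} 12.5
# Lines that carry a trailing timestamp are not handled by this sketch.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical scrape output; names and values are made up for illustration.
SCRAPE = """\
# TYPE flink_jobmanager_numRunningJobs gauge
flink_jobmanager_numRunningJobs{host="jm_1"} 2.0
flink_jobmanager_numRegisteredTaskManagers{host="jm_1"} 3.0
"""

for name, labels, value in parse_samples(SCRAPE):
    print(name, labels, value)
```

In a real deployment the Collector does this parsing itself and forwards the samples to the SkyWalking OpenTelemetry receiver, so no such code is needed on the user side.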

### Flink Cluster Supported Metrics

| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|------------------|------|-------------|-------------|-------------|
| Running Jobs | Count | meter_flink_jobManager_running_job_number | The number of running jobs. | Flink JobManager |
| TaskManagers | Count | meter_flink_jobManager_taskManagers_registered_number | The number of registered TaskManagers. | Flink JobManager |
| JVM CPU Load | % | meter_flink_jobManager_jvm_cpu_load | The JobManager JVM CPU load. | Flink JobManager |
| JVM Thread Count | Count | meter_flink_jobManager_jvm_thread_count | The total number of JobManager JVM threads. | Flink JobManager |
| JVM Memory Heap Used | MB | meter_flink_jobManager_jvm_memory_heap_used | The amount of JVM heap memory used by the JobManager. | Flink JobManager |
| JVM Memory NonHeap Used | MB | meter_flink_jobManager_jvm_memory_NonHeap_used | The amount of JVM non-heap memory used by the JobManager. | Flink JobManager |
| Task Managers Slots Total | Count | meter_flink_jobManager_taskManagers_slots_total | The total number of task slots. | Flink JobManager |
| Task Managers Slots Available | Count | meter_flink_jobManager_taskManagers_slots_available | The number of available task slots. | Flink JobManager |
| JVM CPU Time | ms | meter_flink_jobManager_jvm_cpu_time | The CPU time used by the JobManager JVM. | Flink JobManager |
| JVM Memory Heap Available | MB | meter_flink_jobManager_jvm_memory_heap_available | The amount of JVM heap memory available to the JobManager. | Flink JobManager |
| JVM Memory NonHeap Available | MB | meter_flink_jobManager_jvm_memory_nonHeap_available | The amount of JVM non-heap memory available to the JobManager. | Flink JobManager |
| JVM Memory Metaspace Used | MB | meter_flink_jobManager_jvm_memory_metaspace_used | The amount of JVM metaspace memory used by the JobManager. | Flink JobManager |
| JVM Metaspace Available | MB | meter_flink_jobManager_jvm_memory_metaspace_available | The amount of JVM metaspace memory available to the JobManager. | Flink JobManager |
| JVM G1 Young Generation Count | Count | meter_flink_jobManager_jvm_g1_young_generation_count | The number of G1 young generation collections in the JobManager JVM. | Flink JobManager |
| JVM G1 Old Generation Count | Count | meter_flink_jobManager_jvm_g1_old_generation_count | The number of G1 old generation collections in the JobManager JVM. | Flink JobManager |
| JVM G1 Young Generation Time | ms | meter_flink_jobManager_jvm_g1_young_generation_time | The time spent in G1 young generation collections in the JobManager JVM. | Flink JobManager |
| JVM G1 Old Generation Time | ms | meter_flink_jobManager_jvm_g1_old_generation_time | The time spent in G1 old generation collections in the JobManager JVM. | Flink JobManager |
| JVM All GarbageCollector Count | Count | meter_flink_jobManager_jvm_all_garbageCollector_count | The total number of collections across all garbage collectors in the JobManager JVM. | Flink JobManager |
| JVM All GarbageCollector Time | ms | meter_flink_jobManager_jvm_all_garbageCollector_time | The time spent performing garbage collection across all collectors in the JobManager JVM. | Flink JobManager |
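
As a small illustration of how the two slot metrics above can be combined, the sketch below derives a slot utilization percentage. The `slot_utilization` helper is hypothetical and is not a metric defined by this proposal. It only shows the arithmetic a dashboard expression could apply to `meter_flink_jobManager_taskManagers_slots_total` and `meter_flink_jobManager_taskManagers_slots_available`.

```python
def slot_utilization(slots_total, slots_available):
    """Return the percentage of task slots in use.

    Returns None when no slots are registered, so that an empty
    cluster is not reported as 0% utilized.
    """
    if slots_total <= 0:
        return None
    used = slots_total - slots_available
    return 100.0 * used / slots_total


# Example: 8 slots registered, 2 still free.
print(slot_utilization(8, 2))
```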

### Flink TaskManager Supported Metrics

| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|------------------|------|-------------|-------------|-------------|
| JVM CPU Load | % | meter_flink_taskManager_jvm_cpu_load | The JVM CPU load. | Flink TaskManager |
| JVM Thread Count | Count | meter_flink_taskManager_jvm_thread_count | The total number of JVM threads. | Flink TaskManager |
| JVM Memory Heap Used | MB | meter_flink_taskManager_jvm_memory_heap_used | The amount of JVM heap memory used. | Flink TaskManager |
| JVM Memory NonHeap Used | MB | meter_flink_taskManager_jvm_memory_nonHeap_used | The amount of JVM non-heap memory used. | Flink TaskManager |
| JVM CPU Time | ms | meter_flink_taskManager_jvm_cpu_time | The CPU time used by the JVM. | Flink TaskManager |
| JVM Memory Heap Available | MB | meter_flink_taskManager_jvm_memory_heap_available | The amount of available JVM heap memory. | Flink TaskManager |
| JVM Memory NonHeap Available | MB | meter_flink_taskManager_jvm_memory_nonHeap_available | The amount of available JVM non-heap memory. | Flink TaskManager |
| JVM Memory Metaspace Used | MB | meter_flink_taskManager_jvm_memory_metaspace_used | The amount of JVM metaspace memory used. | Flink TaskManager |
| JVM Metaspace Available | MB | meter_flink_taskManager_jvm_memory_metaspace_available | The amount of available JVM metaspace memory. | Flink TaskManager |
| NumRecordsIn | Count | meter_flink_taskManager_numRecordsIn | The total number of records this task has received. | Flink TaskManager |
| NumRecordsOut | Count | meter_flink_taskManager_numRecordsOut | The total number of records this task has emitted. | Flink TaskManager |
| NumBytesInPerSecond | Bytes/s | meter_flink_taskManager_numBytesInPerSecond | The number of bytes this task receives per second. | Flink TaskManager |
| NumBytesOutPerSecond | Bytes/s | meter_flink_taskManager_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager |
| Netty UsedMemory | MB | meter_flink_taskManager_netty_usedMemory | The amount of used Netty memory. | Flink TaskManager |
| Netty AvailableMemory | MB | meter_flink_taskManager_netty_availableMemory | The amount of available Netty memory. | Flink TaskManager |
| IsBackPressured | Count | meter_flink_taskManager_isBackPressured | Whether the task is back-pressured. | Flink TaskManager |
| InPoolUsage | % | meter_flink_taskManager_inPoolUsage | An estimate of the input buffers usage (ignores LocalInputChannels). | Flink TaskManager |
| OutPoolUsage | % | meter_flink_taskManager_outPoolUsage | An estimate of the output buffers usage. The pool usage can be > 100% if overdraft buffers are being used. | Flink TaskManager |
| SoftBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_softBackPressuredTimeMsPerSecond | The time this task is softly back pressured per second. A softly back pressured task is still responsive and capable of, for example, triggering unaligned checkpoints. | Flink TaskManager |
| HardBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_hardBackPressuredTimeMsPerSecond | The time this task is back pressured in a hard way per second. A hard back pressured task is completely blocked and unresponsive, which prevents, for example, unaligned checkpoints from triggering. | Flink TaskManager |
| IdleTimeMsPerSecond | ms | meter_flink_taskManager_idleTimeMsPerSecond | The time this task is idle (has no data to process) per second. Idle time excludes back pressured time, so if the task is back pressured it is not idle. | Flink TaskManager |
| BusyTimeMsPerSecond | ms | meter_flink_taskManager_busyTimeMsPerSecond | The time this task is busy (neither idle nor back pressured) per second. Can be NaN, if the value could not be calculated. | Flink TaskManager |
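
The four per-second time metrics above partition each second of a task's time into idle, back pressured (soft plus hard), and busy. The sketch below shows that relationship as percentages of a 1000 ms second. The `time_breakdown` helper is hypothetical, and filling in a NaN busy time from the other three values is an assumption made for illustration, not behavior defined by this proposal.

```python
import math


def time_breakdown(idle_ms, soft_bp_ms, hard_bp_ms, busy_ms):
    """Return the share of each second spent idle, back pressured, and busy.

    All inputs are in milliseconds per second. Results are percentages.
    """
    back_pressured_ms = soft_bp_ms + hard_bp_ms
    if math.isnan(busy_ms):
        # Assumption: when busy time could not be calculated, treat it as
        # whatever remains of the second after idle and back pressured time.
        busy_ms = max(0.0, 1000.0 - idle_ms - back_pressured_ms)
    return {
        "idle": idle_ms / 10.0,
        "backPressured": back_pressured_ms / 10.0,
        "busy": busy_ms / 10.0,
    }


# Example: 200 ms idle, 100 ms soft and 50 ms hard back pressure, busy unknown.
print(time_breakdown(200.0, 100.0, 50.0, float("nan")))
```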

### Flink Job Supported Metrics

| Monitoring Panel | Unit | Metric Name | Description | Data Source |
|------------------|------|-------------|-------------|-------------|
| Job RunningTime | min | meter_flink_job_runningTime | The job running time. | Flink JobManager |
| Job Restart Number | Count | meter_flink_job_restart_number | The number of job restarts. | Flink JobManager |
| Job RestartingTime | min | meter_flink_job_restartingTime | The job restarting time. | Flink JobManager |
| Job CancellingTime | min | meter_flink_job_cancellingTime | The job cancelling time. | Flink JobManager |
| Checkpoints Total | Count | meter_flink_job_checkpoints_total | The total number of checkpoints. | Flink JobManager |
| Checkpoints Failed | Count | meter_flink_job_checkpoints_failed | The number of failed checkpoints. | Flink JobManager |
| Checkpoints Completed | Count | meter_flink_job_checkpoints_completed | The number of completed checkpoints. | Flink JobManager |
| Checkpoints InProgress | Count | meter_flink_job_checkpoints_inProgress | The number of in-progress checkpoints. | Flink JobManager |
| CurrentEmitEventTimeLag | ms | meter_flink_job_currentEmitEventTimeLag | The latency between a data record's event time and its emission time from the source. | Flink TaskManager |
| NumRecordsIn | Count | meter_flink_job_numRecordsIn | The total number of records this operator/task has received. | Flink TaskManager |
| NumRecordsOut | Count | meter_flink_job_numRecordsOut | The total number of records this operator/task has emitted. | Flink TaskManager |
| NumBytesInPerSecond | Bytes/s | meter_flink_job_numBytesInPerSecond | The number of bytes this task receives per second. | Flink TaskManager |
| NumBytesOutPerSecond | Bytes/s | meter_flink_job_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager |
| LastCheckpointSize | Bytes | meter_flink_job_lastCheckpointSize | The checkpointed size of the last checkpoint (in bytes). This metric can differ from lastCheckpointFullSize if incremental checkpoint or changelog is enabled. | Flink JobManager |
| LastCheckpointDuration | ms | meter_flink_job_lastCheckpointDuration | The time it took to complete the last checkpoint. | Flink JobManager |
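
The checkpoint counters above lend themselves to a simple health indicator. The sketch below computes the share of finished checkpoints that failed. The `checkpoint_failure_ratio` helper is hypothetical and is not a metric defined by this proposal. It only illustrates one way the completed and failed counts could be read together.

```python
def checkpoint_failure_ratio(completed, failed):
    """Return failed / (completed + failed), or None if nothing has finished yet.

    In-progress checkpoints are left out because their outcome is unknown.
    """
    finished = completed + failed
    if finished == 0:
        return None
    return failed / finished


# Example: 9 completed checkpoints and 1 failed checkpoint.
print(checkpoint_failure_ratio(9, 1))
```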

## Imported Dependencies libs and their licenses.
No new dependency.

## Compatibility
No breaking changes.

## General usage docs

This feature works out of the box.

docs/en/swip/readme.md

Lines changed: 2 additions & 1 deletion
@@ -68,10 +68,11 @@ All accepted and proposed SWIPs can be found in [here](https://github.com/apache

 ## Known SWIPs

-Next SWIP Number: 9
+Next SWIP Number: 10

 ### Accepted SWIPs

+- [SWIP-9 Support Flink Monitoring](SWIP-9.md)
 - [SWIP-8 Support Kong Monitoring](SWIP-8.md)
 - [SWIP-6 Support ActiveMQ Monitoring](SWIP-6.md)
 - [SWIP-5 Support ClickHouse Monitoring](SWIP-5.md)
