|
| 1 | +# Support Flink Monitoring |
| 2 | +## Motivation |
| 3 | +Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Now that SkyWalking can monitor OpenTelemetry metrics, I want to add Flink monitoring via the OpenTelemetry Collector, which fetches metrics from the HTTP endpoint |
| 4 | +that Flink exposes for Prometheus. |
| 5 | + |
| 6 | +## Architecture Graph |
| 7 | +There is no significant architecture-level change. |
| 8 | + |
| 9 | +## Proposed Changes |
| 10 | +Flink exposes its metrics via an HTTP endpoint to the OpenTelemetry Collector, and the SkyWalking OpenTelemetry receiver receives these metrics. |
| 11 | +Monitoring is provided at the cluster, instance, and endpoint dimensions. |
| 12 | + |
| 13 | +### Flink Cluster Supported Metrics |
| 14 | + |
| 15 | +| Monitoring Panel | Unit | Metric Name | Description | Data Source | |
| 16 | +|-------------------------------|-------|-------------------------------------------------------|---------------------------------------------------------------------------------------------------|------------------| |
| 17 | +| Running Jobs | Count | meter_flink_jobManager_running_job_number | The number of running jobs. | Flink JobManager | |
| 18 | +| TaskManagers | Count | meter_flink_jobManager_taskManagers_registered_number | The number of taskManagers. | Flink JobManager | |
| 19 | +| JVM CPU Load | % | meter_flink_jobManager_jvm_cpu_load | The jobManager JVM CPU load. | Flink JobManager | |
| 20 | +| JVM Thread Count | Count | meter_flink_jobManager_jvm_thread_count | The total number of the jobManager JVM threads. | Flink JobManager | |
| 21 | +| JVM Memory Heap Used | MB | meter_flink_jobManager_jvm_memory_heap_used | The amount of the jobManager JVM memory heap used. | Flink JobManager | |
| 22 | +| JVM Memory NonHeap Used | MB | meter_flink_jobManager_jvm_memory_NonHeap_used | The amount of the jobManager JVM nonHeap memory used. | Flink JobManager | |
| 23 | +| Task Managers Slots Total | Count | meter_flink_jobManager_taskManagers_slots_total | The number of total slots. | Flink JobManager | |
| 24 | +| Task Managers Slots Available | Count | meter_flink_jobManager_taskManagers_slots_available | The number of available slots. | Flink JobManager | |
| 25 | +| JVM CPU Time | ms | meter_flink_jobManager_jvm_cpu_time | The jobManager CPU time used by the JVM. | Flink JobManager | |
| 26 | +| JVM Memory Heap Available | MB | meter_flink_jobManager_jvm_memory_heap_available | The amount of the jobManager available JVM heap memory. | Flink JobManager | |
| 27 | +| JVM Memory NonHeap Available | MB | meter_flink_jobManager_jvm_memory_nonHeap_available | The amount of the jobManager available JVM nonHeap memory. | Flink JobManager | |
| 28 | +| JVM Memory Metaspace Used | MB | meter_flink_jobManager_jvm_memory_metaspace_used | The amount of the jobManager used JVM metaspace memory. | Flink JobManager | |
| 29 | +| JVM Metaspace Available | MB | meter_flink_jobManager_jvm_memory_metaspace_available | The amount of the jobManager available JVM metaspace memory. | Flink JobManager | |
| 30 | +| JVM G1 Young Generation Count | Count | meter_flink_jobManager_jvm_g1_young_generation_count | The number of G1 young generation collections in the jobManager JVM. | Flink JobManager | |
| 31 | +| JVM G1 Old Generation Count | Count | meter_flink_jobManager_jvm_g1_old_generation_count | The number of G1 old generation collections in the jobManager JVM. | Flink JobManager | |
| 32 | +| JVM G1 Young Generation Time | ms | meter_flink_jobManager_jvm_g1_young_generation_time | The time spent in G1 young generation collections in the jobManager JVM. | Flink JobManager | |
| 33 | +| JVM G1 Old Generation Time | ms | meter_flink_jobManager_jvm_g1_old_generation_time | The time spent in G1 old generation collections in the jobManager JVM. | Flink JobManager | |
| 34 | +| JVM All GarbageCollector Count | Count | meter_flink_jobManager_jvm_all_garbageCollector_count | The total number of garbage collections across all collectors in the jobManager JVM. | Flink JobManager | |
| 35 | +| JVM All GarbageCollector Time | ms | meter_flink_jobManager_jvm_all_garbageCollector_time | The time spent performing garbage collection for the given (or all) collector for the jobManager. | Flink JobManager | |
| 36 | + |
| 37 | + |
| 38 | +### Flink TaskManager Supported Metrics |
| 39 | + |
| 40 | +| Monitoring Panel | Unit | Metric Name | Description | Data Source | |
| 41 | +|----------------------------------|---------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------| |
| 42 | +| JVM CPU Load | % | meter_flink_taskManager_jvm_cpu_load | The JVM CPU load. | Flink TaskManager | |
| 43 | +| JVM Thread Count | Count | meter_flink_taskManager_jvm_thread_count | The total number of JVM threads. | Flink TaskManager | |
| 44 | +| JVM Memory Heap Used | MB | meter_flink_taskManager_jvm_memory_heap_used | The amount of JVM memory heap used. | Flink TaskManager | |
| 45 | +| JVM Memory NonHeap Used | MB | meter_flink_taskManager_jvm_memory_nonHeap_used | The amount of JVM nonHeap memory used. | Flink TaskManager | |
| 46 | +| JVM CPU Time | ms | meter_flink_taskManager_jvm_cpu_time | The CPU time used by the JVM. | Flink TaskManager | |
| 47 | +| JVM Memory Heap Available | MB | meter_flink_taskManager_jvm_memory_heap_available | The amount of available JVM heap memory. | Flink TaskManager | |
| 48 | +| JVM Memory NonHeap Available | MB | meter_flink_taskManager_jvm_memory_nonHeap_available | The amount of available JVM nonHeap memory. | Flink TaskManager | |
| 49 | +| JVM Memory Metaspace Used | MB | meter_flink_taskManager_jvm_memory_metaspace_used | The amount of used JVM metaspace memory. | Flink TaskManager | |
| 50 | +| JVM Metaspace Available | MB | meter_flink_taskManager_jvm_memory_metaspace_available | The amount of available JVM metaspace memory. | Flink TaskManager | |
| 51 | +| NumRecordsIn | Count | meter_flink_taskManager_numRecordsIn | The total number of records this task has received. | Flink TaskManager | |
| 52 | +| NumRecordsOut | Count | meter_flink_taskManager_numRecordsOut | The total number of records this task has emitted. | Flink TaskManager | |
| 53 | +| NumBytesInPerSecond | Bytes/s | meter_flink_taskManager_numBytesInPerSecond | The number of bytes received per second. | Flink TaskManager | |
| 54 | +| NumBytesOutPerSecond | Bytes/s | meter_flink_taskManager_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager | |
| 55 | +| Netty UsedMemory | MB | meter_flink_taskManager_netty_usedMemory | The amount of used netty memory. | Flink TaskManager | |
| 56 | +| Netty AvailableMemory | MB | meter_flink_taskManager_netty_availableMemory | The amount of available netty memory. | Flink TaskManager | |
| 57 | +| IsBackPressured | Count | meter_flink_taskManager_isBackPressured | Whether the task is back-pressured. | Flink TaskManager | |
| 58 | +| InPoolUsage | % | meter_flink_taskManager_inPoolUsage | An estimate of the input buffers usage (ignores LocalInputChannels). | Flink TaskManager | |
| 59 | +| OutPoolUsage | % | meter_flink_taskManager_outPoolUsage | An estimate of the output buffers usage. The pool usage can be > 100% if overdraft buffers are being used. | Flink TaskManager | |
| 60 | +| SoftBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_softBackPressuredTimeMsPerSecond | The time this task is softly back pressured per second. A softly back pressured task is still responsive and capable of, for example, triggering unaligned checkpoints. | Flink TaskManager | |
| 61 | +| HardBackPressuredTimeMsPerSecond | ms | meter_flink_taskManager_hardBackPressuredTimeMsPerSecond | The time this task is back pressured in a hard way per second. During hard back pressure, the task is completely blocked and unresponsive, preventing, for example, unaligned checkpoints from triggering. | Flink TaskManager | |
| 62 | +| IdleTimeMsPerSecond | ms | meter_flink_taskManager_idleTimeMsPerSecond | The time this task is idle (has no data to process) per second. Idle time excludes back pressured time, so if the task is back pressured it is not idle. | Flink TaskManager | |
| 63 | +| BusyTimeMsPerSecond | ms | meter_flink_taskManager_busyTimeMsPerSecond | The time this task is busy (neither idle nor back pressured) per second. Can be NaN, if the value could not be calculated. | Flink TaskManager | |
| 64 | + |
| 65 | + |
| 66 | +### Flink Job Supported Metrics |
| 67 | + |
| 68 | +| Monitoring Panel | Unit | Metric Name | Description | Data Source | |
| 69 | +|-------------------------|---------|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------| |
| 70 | +| Job RunningTime | min | meter_flink_job_runningTime | The job running time. | Flink JobManager | |
| 71 | +| Job Restart Number | Count | meter_flink_job_restart_number | The number of job restarts. | Flink JobManager | |
| 72 | +| Job RestartingTime | min | meter_flink_job_restartingTime | The job restarting time. | Flink JobManager | |
| 73 | +| Job CancellingTime | min | meter_flink_job_cancellingTime | The job cancelling time. | Flink JobManager | |
| 74 | +| Checkpoints Total | Count | meter_flink_job_checkpoints_total | The total number of checkpoints. | Flink JobManager | |
| 75 | +| Checkpoints Failed | Count | meter_flink_job_checkpoints_failed | The number of failed checkpoints. | Flink JobManager | |
| 76 | +| Checkpoints Completed | Count | meter_flink_job_checkpoints_completed | The number of completed checkpoints. | Flink JobManager | |
| 77 | +| Checkpoints InProgress | Count | meter_flink_job_checkpoints_inProgress | The number of in-progress checkpoints. | Flink JobManager | |
| 78 | +| CurrentEmitEventTimeLag | ms | meter_flink_job_currentEmitEventTimeLag | The latency between a data record's event time and its emission time from the source. | Flink TaskManager | |
| 79 | +| NumRecordsIn | Count | meter_flink_job_numRecordsIn | The total number of records this operator/task has received. | Flink TaskManager | |
| 80 | +| NumRecordsOut | Count | meter_flink_job_numRecordsOut | The total number of records this operator/task has emitted. | Flink TaskManager | |
| 81 | +| NumBytesInPerSecond | Bytes/s | meter_flink_job_numBytesInPerSecond | The number of bytes this task received per second. | Flink TaskManager | |
| 82 | +| NumBytesOutPerSecond | Bytes/s | meter_flink_job_numBytesOutPerSecond | The number of bytes this task emits per second. | Flink TaskManager | |
| 83 | +| LastCheckpointSize | Bytes | meter_flink_job_lastCheckpointSize | The checkpointed size of the last checkpoint (in bytes). This metric can differ from lastCheckpointFullSize if incremental checkpointing or changelog is enabled. | Flink JobManager | |
| 84 | +| LastCheckpointDuration | ms | meter_flink_job_lastCheckpointDuration | The time it took to complete the last checkpoint. | Flink JobManager | |
| 85 | + |
| 86 | +## Imported Dependencies libs and their licenses. |
| 87 | +No new dependency. |
| 88 | + |
| 89 | +## Compatibility |
| 90 | +No breaking changes. |
| 91 | + |
| 92 | +## General usage docs |
| 93 | + |
| 94 | +This feature is out of the box. |
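
For reference, the collection path described under Proposed Changes (Flink HTTP endpoint, then OpenTelemetry Collector, then the SkyWalking OpenTelemetry receiver) could be sketched as a Collector configuration like the one below. This is a minimal sketch, not the configuration shipped with this proposal: the host names `flink-jobmanager`, `flink-taskmanager`, and `oap`, the scrape job names, and the scrape interval are all assumptions for illustration. Port 9249 is assumed to be the port of Flink's Prometheus reporter, and 11800 the gRPC port of the SkyWalking OAP server; adjust both to match the actual deployment.

```yaml
# Hypothetical OpenTelemetry Collector config; host names, job names,
# and the scrape interval are illustrative assumptions.
receivers:
  prometheus:
    config:
      scrape_configs:
        # Scrape the Prometheus-format endpoint exposed by the Flink JobManager.
        - job_name: "flink-jobManager-monitoring"
          scrape_interval: 30s
          static_configs:
            - targets: ["flink-jobmanager:9249"]
        # Scrape the Prometheus-format endpoint exposed by the Flink TaskManager.
        - job_name: "flink-taskManager-monitoring"
          scrape_interval: 30s
          static_configs:
            - targets: ["flink-taskmanager:9249"]

processors:
  # Batch metrics before export to reduce the number of outgoing requests.
  batch:

exporters:
  # Forward metrics over OTLP/gRPC to the SkyWalking OAP server.
  otlp:
    endpoint: "oap:11800"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
```

On the Flink side, this sketch assumes the Prometheus metrics reporter is enabled so that the JobManager and TaskManager each serve metrics over HTTP, and on the SkyWalking side that the OpenTelemetry receiver is enabled to accept OTLP metrics.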