
Commit 052f4b9

node: memmgr: add metrics information
Similar to cpu manager (e1d1af1) there are possible metrics that can be added to memory manager in order to improve observability.

The metrics are:
1. memory_manager_pinning_requests_total - increment the counter for every guaranteed memory assigned to workloads.
2. memory_manager_assignment_errors_total - increment the counter for every memory allocation failure.

Signed-off-by: Talor Itzhak <[email protected]>
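The commit message describes when each counter moves. The following is a minimal Go sketch of that bookkeeping, assuming a simplified allocator that must fit each request on a single NUMA node (the real memory manager is more involved). `counter`, `allocate`, and `errNoNUMAFit` are hypothetical names; the actual kubelet code is not shown here and is assumed to register real Prometheus counters rather than plain atomics.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// counter is a hypothetical stand-in for a Prometheus counter: it only ever
// goes up, and it is safe for concurrent use.
type counter struct{ v uint64 }

func (c *counter) Inc()          { atomic.AddUint64(&c.v, 1) }
func (c *counter) Value() uint64 { return atomic.LoadUint64(&c.v) }

// Hypothetical in-process counters named after the metrics in this commit.
var (
	pinningRequestsTotal  counter // memory_manager_pinning_requests_total
	assignmentErrorsTotal counter // memory_manager_assignment_errors_total
)

var errNoNUMAFit = errors.New("requested memory does not fit on a single NUMA node")

// allocate is a simplified, hypothetical allocation step for a guaranteed
// container: it places the request on the first NUMA node with enough free
// memory, counting a pinning request on success and an error on failure.
func allocate(requested uint64, freePerNUMA []uint64) (int, error) {
	for node, free := range freePerNUMA {
		if requested <= free {
			freePerNUMA[node] -= requested
			pinningRequestsTotal.Inc()
			return node, nil
		}
	}
	assignmentErrorsTotal.Inc()
	return -1, errNoNUMAFit
}

func main() {
	free := []uint64{4 << 30, 2 << 30} // two NUMA nodes with 4 GiB and 2 GiB free
	if node, err := allocate(3<<30, free); err == nil {
		fmt.Println("pinned to NUMA node", node)
	}
	if _, err := allocate(5<<30, free); err != nil {
		fmt.Println("admission error:", err)
	}
	fmt.Println(pinningRequestsTotal.Value(), assignmentErrorsTotal.Value())
}
```

In this sketch a failed allocation is what would surface to the user as a pod admission error, while the error counter gives operators the same signal without inspecting individual pod events.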
1 parent e5bdeb6 commit 052f4b9

2 files changed, +18 -9 lines changed

keps/sig-node/1769-memory-manager/README.md

Lines changed: 16 additions & 9 deletions
```diff
@@ -635,9 +635,9 @@ _This section must be completed when targeting beta graduation to a release._
 
 
 ###### What specific metrics should inform a rollback?
-
 The pod may fail with the admission error because the kubelet can not provide all resources aligned from the same NUMA node.
 You can see the error message under the pod events.
+"memory_manager_assignment_errors_total".
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 Tested it manually by replacing the kubelet binary on the node with the `Static` memory manager policy, but I failed
```
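The rollback answer names memory_manager_assignment_errors_total as the signal. One way to act on it, sketched here in Go under the assumption that the counter is sampled twice some interval apart, is to check how much it grew between the samples. A drop between samples is treated as a counter reset (for example, after a kubelet restart), the same way Prometheus treats any decrease of a counter as a reset. `counterIncrease` and `shouldInvestigateRollback` are hypothetical helpers, and the threshold is left to the operator.

```go
package main

import "fmt"

// counterIncrease returns how much a monotonically increasing counter grew
// between two samples. A smaller current value means the counter was reset,
// so the current value is taken as the increase since the reset.
func counterIncrease(previous, current float64) float64 {
	if current < previous {
		return current
	}
	return current - previous
}

// shouldInvestigateRollback is a hypothetical helper: it reports whether the
// assignment errors counter grew by more than maxNewErrors between two scrapes.
func shouldInvestigateRollback(prevErrors, curErrors, maxNewErrors float64) bool {
	return counterIncrease(prevErrors, curErrors) > maxNewErrors
}

func main() {
	fmt.Println(shouldInvestigateRollback(3, 3, 0)) // false: no new errors
	fmt.Println(shouldInvestigateRollback(3, 5, 0)) // true: two new errors
	fmt.Println(shouldInvestigateRollback(7, 1, 0)) // true: reset, then one error
}
```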
```diff
@@ -650,13 +650,17 @@ fields of API types, flags, etc.?**
 
 ### Monitoring Requirements
 
-_This section must be completed when targeting beta graduation to a release._
+Monitor the metrics
+- "memory_manager_pinning_requests_total"
+- "memory_manager_assignment_errors_total"
 
 ###### How can an operator determine if the feature is in use by workloads?
 In order for workloads to request exclusive memory allocation, the pod QoS must be "guaranteed".
 The memory manager data will be available under pod resources API.
 When it is configured with the static policy,
 you will see memory related data during call to the pod resources API List method under the container.
+In addition, in case the workload is guaranteed, the metric named memory_manager_pinning_requests_total should
+be incremented.
 
 ###### How can someone using this feature know that it is working for their instance?
 
```
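Per the answer above, a guaranteed workload should increment memory_manager_pinning_requests_total, so a nonzero value indicates the feature is in use. The kubelet serves its metrics in the Prometheus text format on its /metrics endpoint; the Go sketch below assumes that text has already been fetched and sums every sample of one metric name across label sets. The exposed name may carry a subsystem prefix (for example `kubelet_`), so pass the name exactly as it appears in the scrape. `sumMetric` is a hypothetical helper, not a Prometheus client-library API, and it handles only the plain text format.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric scans Prometheus text-format output and sums every sample of the
// named metric across all label sets. It returns false if no sample is found.
func sumMetric(exposition, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// Split "name{labels} value" or "name value" into name and the rest.
		var series, rest string
		if i := strings.IndexByte(line, '{'); i >= 0 {
			j := strings.LastIndexByte(line, '}')
			if j < i {
				continue // malformed label block
			}
			series, rest = line[:i], line[j+1:]
		} else if sp := strings.IndexAny(line, " \t"); sp >= 0 {
			series, rest = line[:sp], line[sp:]
		} else {
			continue // no value on this line
		}
		fields := strings.Fields(rest) // value, then an optional timestamp
		if series != name || len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}

func main() {
	// Made-up sample scrape for illustration only.
	sample := `# TYPE memory_manager_pinning_requests_total counter
memory_manager_pinning_requests_total 4
`
	v, ok := sumMetric(sample, "memory_manager_pinning_requests_total")
	fmt.Println(v, ok) // 4 true
}
```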
```diff
@@ -683,20 +687,23 @@ _This section must be completed when targeting beta graduation to a release._
 <!--
 Pick one more of these and delete the rest.
 -->
-`<TODO>`
-- [ ] Metrics
+
+- [X] Metrics
 - Metric name:
-- [Optional] Aggregation method:
-- Components exposing the metric:
-- [ ] Other (treat as last resort)
-- Details:
+- memory_manager_pinning_requests_total
+- memory_manager_assignment_errors_total
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 <!--
 Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
 implementation difficulties, etc.).
 -->
-Currently, for the pod author, it is impossible to know containers NUMA pinning without access to the node.
+- "memory_manager_pinning_requests_total"
+- "memory_manager_assignment_errors_total"
+
+The addition of these metrics will be done before moving to GA
+issue - `<TBD>`
+PR - `<TBD>`
 
 ### Dependencies
 
```
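The metrics checklist above lists both counters. One possible indicator built from them is the fraction of guaranteed memory assignments that failed over a window, given each counter's increase over that window. The Go sketch below assumes, following the commit message wording ("every guaranteed memory assigned"), that memory_manager_pinning_requests_total counts only successful assignments, so attempts are successes plus failures; if the implementation counts every attempt instead, the denominator should be the requests increase alone. `assignmentErrorRatio` is a hypothetical helper.

```go
package main

import "fmt"

// assignmentErrorRatio returns the fraction of memory assignment attempts that
// failed within a window, given each counter's increase over that window.
// Assumption: the requests counter counts only successful assignments, so the
// number of attempts is successes plus failures.
func assignmentErrorRatio(requestsIncrease, errorsIncrease float64) (float64, bool) {
	attempts := requestsIncrease + errorsIncrease
	if attempts == 0 {
		return 0, false // no guaranteed memory assignments were attempted
	}
	return errorsIncrease / attempts, true
}

func main() {
	ratio, ok := assignmentErrorRatio(9, 1)
	fmt.Println(ratio, ok) // 0.1 true
}
```

Returning a second boolean keeps an idle window (no guaranteed pods admitted) distinct from a healthy window with a ratio of zero.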
keps/sig-node/1769-memory-manager/kep.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -42,3 +42,5 @@ disable-supported: true
 
 # The following PRR answers are required at beta release
 metrics:
+- "memory_manager_pinning_requests_total"
+- "memory_manager_pinning_errors_total"
```
