
Commit e689946

Merge pull request #2770 from splunk/adasplunk-O11YDOCS-6517-new
[O11YDOCS-6517-new] More updates and review comments
2 parents c421a85 + 846e72e commit e689946

File tree

8 files changed: +412 −466 lines changed

private-preview/aopt/aopt-derived-metrics.rst

Lines changed: 0 additions & 430 deletions
This file was deleted.

private-preview/aopt/aopt-glossary.rst

Lines changed: 26 additions & 5 deletions
@@ -28,6 +28,12 @@ The ratio of how many days of information is available for a workload compared t
 Application Optimization calculates an overall confidence level by taking the lowest confidence level across all containers, where each container's confidence level is an average of the separate confidence levels for CPU and memory.
 
 
+Why the confidence level matters
+----------------------------------------------------------
+
+It's a good idea to match the confidence level to your workload's importance or criticality. In other words, if your workload is a test or you just need to preview the recommendations, a confidence level of :guilabel:`Low` is okay. But if your workload is a production or business-critical workload, it's best to wait for a confidence level of :guilabel:`High` before applying the recommendations.
+
+
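The aggregation rule described above (a container's confidence is the average of its CPU and memory confidence; the workload's overall level is the lowest across containers) can be sketched as follows. This is an illustrative sketch only, not Application Optimization's actual code; the function names and the 0.0-1.0 scale are assumptions.

```python
# Hypothetical sketch of the confidence-level aggregation described above.
# Confidence is represented here as a number between 0.0 and 1.0.

def container_confidence(cpu_confidence: float, memory_confidence: float) -> float:
    """A container's confidence level is the average of its separate
    CPU and memory confidence levels."""
    return (cpu_confidence + memory_confidence) / 2

def workload_confidence(containers: list[tuple[float, float]]) -> float:
    """The workload's overall confidence is the lowest confidence level
    across all of its containers."""
    return min(container_confidence(cpu, mem) for cpu, mem in containers)

# Example: two containers; the weaker one determines the overall level.
overall = workload_confidence([(0.9, 0.8), (0.5, 0.7)])
print(overall)  # 0.6
```

The `min` aggregation matches the glossary's rule that a single low-confidence container caps the whole workload's confidence.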
 .. _aopt-glossary-efficiency:
 
 
 Efficiency
@@ -37,9 +43,19 @@ The balance between over-provisioning and under-provisioning to optimize resourc
 
 Application Optimization is a powerful tool for achieving and maintaining efficiency. It calculates efficiency as the average of the pod-wide usage of a resource's ``request`` setting, capped at 100%. Its calculation only includes metrics within the analysis window, which is the lesser of 14 days and the time since the last resource change (or the initial deployment). Note that rather than finding the utilization (usage over requests) of each container within a pod, all of the containers' usage and requests are added up first. The averages for each CPU and memory ``request`` setting are then weight-averaged based on the assumed resource cost weights.
 
+Best practices call for resource utilization in the 60-80% range. Having efficiency above 70-80% presents resource starvation risks.
+
 When values are unset for a particular resource, this tool assumes those ``request`` settings to be at usage (in other words, 100% efficient) to more accurately weigh multi-container rates.
 
-When the main container has an unset resource, this tool considers the efficiency rate to be nullified.
+When the main container has an unset resource, this tool considers the efficiency rate to be undefined.
+
+
+.. _aopt-glossary-resource-footprint:
+
+Resource Footprint
+==========================================================
+
+A workload's resource footprint is the sum of its pods' ``request`` settings for that resource or its average usage if it exceeds its ``request`` settings. If the ``request`` value is not set, the footprint represents the sum of actual usage instead. This tile displays the sum of all resource footprints of all the pods of all your workloads.
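The efficiency and footprint rules above can be sketched in a few lines. This is a hypothetical illustration, not product code: the data shapes and the cost weights (0.7/0.3) are placeholder assumptions.

```python
# Illustrative sketch of the pod-level efficiency and footprint rules
# described above. All names and the cost weights are assumptions.

def pod_efficiency_pct(containers: list[dict]) -> float:
    """Pod-wide utilization of one resource, capped at 100%.

    Containers' usage and requests are summed first; an unset ``request``
    is assumed equal to usage (that is, 100% efficient)."""
    total_usage = sum(c["usage"] for c in containers)
    total_request = sum(
        c["usage"] if c.get("request") is None else c["request"]
        for c in containers
    )
    return min(total_usage / total_request, 1.0) * 100

def weighted_efficiency_pct(cpu_pct: float, mem_pct: float,
                            cpu_weight: float = 0.7,
                            mem_weight: float = 0.3) -> float:
    """CPU and memory efficiency are weight-averaged by assumed resource
    cost weights (the 0.7/0.3 split is a made-up placeholder)."""
    return (cpu_pct * cpu_weight + mem_pct * mem_weight) / (cpu_weight + mem_weight)

def resource_footprint(pods: list[dict]) -> float:
    """Each pod contributes its ``request``, or its average usage when usage
    exceeds the request; with no ``request`` set, it contributes usage."""
    total = 0.0
    for pod in pods:
        request, usage = pod.get("request"), pod["usage"]
        total += usage if request is None or usage > request else request
    return total

# Two containers: total usage 0.8 cores over total request 1.3 cores.
cpu = pod_efficiency_pct([{"usage": 0.5, "request": 1.0},
                          {"usage": 0.3, "request": None}])
print(round(cpu, 1))  # 61.5
```

Note how the unset request in the second container is counted at its usage value, which matches the "assumed to be at usage" rule above.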
 
 
 .. _aopt-glossary-starvation-risk:
@@ -49,12 +65,17 @@ Starvation risk
 
 A workload's average risk of running out of CPU or memory:
 
-* :guilabel:`High`: Any container in which usage is greater than or equal to 95% of its ``limit`` settings.
+* :guilabel:`High`: The workload has tried to use more resources than were available, so its performance and reliability have likely been impacted. Application Optimization marks any container in which usage is greater than or equal to 95% of its ``limit`` settings as :guilabel:`High`.
 
-* :guilabel:`Medium`: At least one resource (CPU or memory) of one container is not defined OR (all ``request`` settings are defined AND actual usage of at least one resource of one container exceeds its ``request`` setting for any time slot).
+* :guilabel:`Medium`: The workload has used more than its allocated resources (``request`` settings). While this may not have an impact on its performance and reliability due to Kubernetes bursting into additional resources, future occurrences of overusage may have an impact, since extra resources are not guaranteed to exist.
+
+  Application Optimization sets :guilabel:`Starvation risk` to :guilabel:`Medium` for any container in which either of these is true:
+
+  * At least one resource (CPU or memory) of the container is undefined.
+
+  * All ``request`` settings are defined and actual usage of at least one resource of the container exceeds its ``request`` setting for any time slot.
 
-* :guilabel:`Low`: For either CPU or memory, the recommendation is greater than the baseline value. For example, the usage is greater than target utilization (0.85).
+* :guilabel:`Low`: The workload hasn't exceeded its allocated resources but doesn't have enough headroom to absorb spikes or delays in scale-out when traffic increases. Application Optimization marks any container as :guilabel:`Low` in which, for either CPU or memory, the recommendation is greater than the baseline value. For example, the usage is greater than target utilization (0.85).
 
 * :guilabel:`Minimal`: None of the above conditions are detected. In other words, all containers have ``request`` settings for both CPU and memory, and neither of these resources has had usage exceeding its target utilization.
 
-
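The four-level risk ladder above can be expressed as a small classifier. The sketch below is a hypothetical simplification (a single resource per container, a dict-based data model) intended only to make the ordering of the checks concrete; it is not Application Optimization's implementation.

```python
# Hypothetical sketch of the starvation-risk ladder described above,
# collapsed to one resource per container for brevity.

TARGET_UTILIZATION = 0.85  # the target utilization cited in the Low example

def starvation_risk(containers: list[dict]) -> str:
    # High: any container's usage reached >= 95% of its ``limit``.
    if any(c.get("limit") is not None and c["peak_usage"] >= 0.95 * c["limit"]
           for c in containers):
        return "High"
    # Medium: a resource is undefined for some container...
    if any(c.get("request") is None for c in containers):
        return "Medium"
    # ...or usage exceeded a defined ``request`` in some time slot.
    if any(c["peak_usage"] > c["request"] for c in containers):
        return "Medium"
    # Low: within requests, but above target utilization (not enough headroom).
    if any(c["peak_usage"] > TARGET_UTILIZATION * c["request"] for c in containers):
        return "Low"
    # Minimal: requests defined everywhere and usage stayed under target.
    return "Minimal"

print(starvation_risk([{"peak_usage": 0.96, "limit": 1.0, "request": 0.5}]))  # High
```

The checks are ordered from most to least severe, mirroring how the bullets above each assume the previous condition did not fire.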
private-preview/aopt/aopt-intro.rst

Lines changed: 6 additions & 3 deletions
@@ -17,9 +17,12 @@ By using Application Optimization together with :new-page:`Splunk Infrastructure
 Key features
 ==========================================================
 
-* :guilabel:`Kubernetes Profiler` (the :guilabel:`Application Optimization` dashboard) provides comprehensive insights into the efficiency of your CPU and memory settings for your Kubernetes workloads.
+* :guilabel:`Kubernetes Profiler` (the :guilabel:`Application Optimization` dashboard) provides insights into the efficiency of your CPU and memory settings for your Kubernetes workloads. You can use the profiler to:
 
-* :guilabel:`Instant Recommendations` provide suggestions for CPU and memory settings based on historical utilization data (metrics you've sent to Splunk IM). You can apply these suggestions directly to your pods using the YAML snippets it provides.
+  * Obtain insight summaries by applying filters of your choice (environment, cluster, namespace, and so on).
+  * Find the workloads most in need of attention based on your priorities, which could be performance, reliability, or cost savings.
+
+* :guilabel:`Instant Recommendations` provide suggestions for CPU and memory settings based on historical utilization across all pods of a workload. Utilization data comes from the metrics you've sent to Splunk IM. You can apply these suggestions directly to your workloads using the YAML snippets it provides. For workloads that use autoscaling, it also suggests how to update the autoscaling configuration.
 
 
 Requirements
@@ -33,7 +36,7 @@ Requirements
 
 * All metrics that the :new-page:`Splunk IM Kubernetes cluster receiver collects by default <https://docs.splunk.com/observability/en/gdi/opentelemetry/collector-kubernetes/install-k8s.html#helm-chart-supported-distros>` must be present in your data. Since these metrics are enabled by default on your Kubernetes collector, you don't need to take any action unless you've disabled them.
 
-* Horizontal pod autoscaler (HPA) telemetry: Optional, but if you do have HPAs and you send :new-page:`k8s.hpa.* metrics <https://docs.splunk.com/observability/en/gdi/opentelemetry/components/kubernetes-cluster-receiver.html>` to Splunk IM, :guilabel:`Instant Recommendations` can help you to improve them.
+* Horizontal pod autoscaler (HPA) telemetry: Optional, but if you do have HPAs and you send :new-page:`k8s.hpa.* metrics <https://docs.splunk.com/observability/en/gdi/opentelemetry/components/kubernetes-cluster-receiver.html>` to Splunk IM, :guilabel:`Instant Recommendations` can help you to improve them. See :ref:`aopt-workload-hpa`.
 
 
 Enable Application Optimization
Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
:orphan:

.. _aopt-metric-reference:

.. include:: /private-preview/aopt/toc.rst
   :start-after: :orphan:

**********************************************************
Metric reference
**********************************************************

Application Optimization's workload analysis produces the following metrics. All metrics have at least the same dimensions as the workload metrics (for example, ``aws-region``) and use the same attribute names and values.

.. find out what the prefix is and add it to the metric name. ask daniel for the name.

All metric names have a prefix of either ``sf`` or ``o11y``.

.. note::
   Memory is specified in GiB. CPU is specified in cores.


.. list-table::
   :widths: 40 5 55
   :width: 100%
   :header-rows: 1

   -
     - **Metric**
     - **Scope\***
     - **Description**
   -
     - ``sf.report.available``
     - W
     - Synthetic metric. The value is ``0`` for failed, ``1`` for success. This metric may have additional attributes that represent the report outcome as a whole, including at least an ``aopt.profile_report.error_reason`` code.
   -
     - ``sf.report.window_days``
     - W
     - Number of days (possibly fractional) that were considered in the analysis. In general, this is the smaller of 14 and the number of days since the last resource configuration change for the workload. This is used to determine the validity and confidence level of the report.
   -
     - ``sf.report.coverage_ratio``
     - W
     - Window coverage with metrics: the ratio of the number of actual metric values found to the number of time slots in the window. This represents the worst-case value (in other words, the minimum of the coverage of each input time series used). This is used to determine the validity and confidence level of the report.
   -
     - ``sf.report.average_replicas``
     - W
     - Average number of replicas during the analysis window. Does not include pods that allocate resources, such as those scheduled but not started.
   -
     - ``sf.report.pod.qos_class``
     - W
     - Pod's quality of service (QoS) class, as defined in the Kubernetes documentation, encoded as an integer.
   -
     - ``sf.report.footprint.cpu_cores``
     - W
     - Number of allocated CPU cores for all replicas (averaged based on ``average_replicas``). Does not account for usage above request (bursting).
   -
     - ``sf.report.footprint.memory_gib``
     - W
     - GiB of allocated memory for all replicas (averaged based on ``average_replicas``). Does not account for usage above request (bursting).
   -
     - ``sf.report.efficiency_rate``
     - W
     - Resource efficiency rate, as a percentage. Weighted average of CPU and memory utilization, with CPU and memory weighted according to AWS on-demand cost. Capped at 100%, rounded to a whole percent.
   -
     - ``sf.report.starvation_risk``
     - W
     - Resource starvation risk: Minimal, Low, Medium, High (encoded as ``0``, ``1``, ``2``, ``3`` respectively). Risk levels, defined elsewhere:

       - Minimal: no starvation detected
       - Low: could benefit from more overhead
       - Medium: actually bursting but not being limited
       - High: CPU throttled and/or at resource limits
   -
     - ``sf.recommendation.available``
     - W
     - Indicates whether a recommendation is available for at least one container. This value is ``0`` or ``1``.
   -
     - ``sf.recommendation.confidence_level``
     - W
     - Recommendation's overall confidence level: Unknown, Low, Medium, High (encoded as ``0``, ``1``, ``2``, ``3`` respectively). Aggregated from ``container.confidence_level`` by taking the lowest confidence value (or the confidence value of the main or largest container).
   -
     - ``sf.recommendation.container.available``
     - C
     - Indicates whether a recommendation is available: ``0`` or ``1``. A recommendation that matches the baseline is considered available.
   -
     - ``sf.recommendation.container.confidence_level``
     - C
     - Recommendation confidence level: Unknown, Low, Medium, High (encoded as ``0``, ``1``, ``2``, ``3`` respectively).
   -
     - ``sf.recommendation.container.cpu_request``
     - C
     - Per-container recommendation.
   -
     - ``sf.recommendation.container.memory_request``
     - C
     - Per-container recommendation.
   -
     - ``sf.recommendation.container.cpu_limit``
     - C
     - Per-container recommendation.
   -
     - ``sf.recommendation.container.memory_limit``
     - C
     - Per-container recommendation.
   -
     - ``sf.recommendation.footprint.cpu_cores``
     - W
     - Total footprint of the recommendation.
   -
     - ``sf.recommendation.footprint.memory_gib``
     - W
     - Total footprint of the recommendation.
   -
     - ``sf.recommendation.footprint_change.cpu_cores``
     - W
     - Footprint change of CPU requests, assuming the CPU request recommendations are applied for all containers. May be ``0``, ``missing``, or ``NaN`` if requests are not defined.
   -
     - ``sf.recommendation.footprint_change.memory_gib``
     - W
     - Footprint change of memory requests, assuming the memory request recommendations are applied for all containers. May be ``0``, ``missing``, or ``NaN`` if requests are not defined.
   -
     - ``sf.baseline.pod.cpu_request``
     - W
     - Pod-level sum of the baseline for the configuration being analyzed. The ``request`` for a container is considered defined if its ``limit`` is defined, even if the ``request`` is reported as missing or ``0``.
   -
     - ``sf.baseline.pod.memory_request``
     - W
     - Pod-level sum of the baseline for the configuration being analyzed. The ``request`` for a container is considered defined if its ``limit`` is defined, even if the ``request`` is reported as missing or ``0``.
   -
     - ``sf.baseline.pod.cpu_limit``
     - W
     - Pod-level sum of the baseline for the configuration being analyzed. This value is ``0`` or ``NaN`` if at least one ``limit`` is missing; as a result, the whole pod doesn't have a ``limit`` for this resource.
   -
     - ``sf.baseline.pod.memory_limit``
     - W
     - Pod-level sum of the baseline for the configuration being analyzed. This value is ``0`` or ``NaN`` if at least one ``limit`` is missing; as a result, the whole pod doesn't have a ``limit`` for this resource.
   -
     - ``sf.baseline.container.cpu_request``
     - C
     - Per-container baseline for the configuration being analyzed.
   -
     - ``sf.baseline.container.memory_request``
     - C
     - Per-container baseline for the configuration being analyzed.
   -
     - ``sf.baseline.container.cpu_limit``
     - C
     - Per-container baseline for the configuration being analyzed.
   -
     - ``sf.baseline.container.memory_limit``
     - C
     - Per-container baseline for the configuration being analyzed.


\*Scope is W for workload and C for container. See :ref:`Dimensions <aopt-derived-metrics_dimensions>` for attributes that apply to each scope.
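Several of the metrics above encode categorical levels as small integers (``0``-``3``). When post-processing exported data, a helper like the following (hypothetical, not part of the product) can translate gauge values back into labels:

```python
# Hypothetical helper for decoding the integer-encoded levels listed above.
CONFIDENCE_LEVELS = {0: "Unknown", 1: "Low", 2: "Medium", 3: "High"}
STARVATION_RISK = {0: "Minimal", 1: "Low", 2: "Medium", 3: "High"}

def decode(value: float, mapping: dict[int, str]) -> str:
    """Gauge values typically arrive as floats; round, then look them up.
    Unrecognized values fall back to "Unknown"."""
    return mapping.get(int(round(value)), "Unknown")

print(decode(3.0, STARVATION_RISK))    # High
print(decode(1.0, CONFIDENCE_LEVELS))  # Low
```

Note that the two scales differ at ``0`` (``Minimal`` for starvation risk, ``Unknown`` for confidence), so one shared mapping would be incorrect.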
.. _aopt-derived-metrics_dimensions:

Dimensions
==========================================================


Workload-level attributes
----------------------------------------------------------

The following dimensions are applied to all metrics (both workload and container scope):

.. list-table::
   :widths: 40 60
   :width: 100%
   :header-rows: 1

   -
     - **Attribute name**
     - **Description**
   -
     - ``environment``
     - Splunk Observability Cloud-specific attribute.
   -
     - ``k8s.cluster.name``
     - Kubernetes cluster name.
   -
     - ``k8s.namespace.name``
     -
   -
     - ``k8s.workload.name``
     - This is our own generic workload info.
   -
     - ``k8s.workload.kind``
     - Kind of workload: ``deployment``, ``statefulset``, or ``daemonset``. This is our own generic workload info.
   -
     - ``k8s.workload.uid``
     - This is our own generic workload info.
   -
     - ``k8s.deployment.name``
     - Present only for ``workload.kind`` == ``deployment``. Same as ``k8s.workload.name``.
   -
     - ``k8s.deployment.uid``
     - Present only for ``workload.kind`` == ``deployment``. Same as ``k8s.object_uid``.
   -
     - ``k8s.statefulset.name``
     - Present only for ``workload.kind`` == ``statefulset``. Same as ``k8s.workload.name``.
   -
     - ``k8s.statefulset.uid``
     - Present only for ``workload.kind`` == ``statefulset``. Same as ``k8s.object_uid``.
   -
     - ``k8s.daemonset.name``
     - Present only for ``workload.kind`` == ``daemonset``. Same as ``k8s.workload.name``.
   -
     - ``k8s.daemonset.uid``
     - Present only for ``workload.kind`` == ``daemonset``. Same as ``k8s.object_uid``.
   -
     - ``k8s.pod.qos``
     - Pod-level QoS.
   -
     - ``aopt.profiler_report.success``
     - Whether the analysis was successful and a report is provided. Values: ``0`` or ``1``.
   -
     - ``aopt.instant_recommendation.present``
     - Whether there is a valid recommendation. Values: ``0`` or ``1``.


Container-level attributes
----------------------------------------------------------

The following additional dimensions are applied to per-container metrics (in other words, any metric named ``*.container.*``):

.. list-table::
   :widths: 40 60
   :width: 100%
   :header-rows: 1

   -
     - **Attribute name**
     - **Description**
   -
     - ``k8s.container.name``
     -
   -
     - ``k8s.container.pseudo_qos``
     - Container-level pseudo-QoS.

.. note::
   This set of additional attributes matches the set of additional attributes that per-container ``k8s`` metrics (such as memory and CPU utilization) provide on top of workload-level metrics (such as replica count). It excludes metadata attributes that are per pod instance (such as ``k8s.replica.set`` and ``k8s.pod.id``), since metrics are always aggregated across instances, as well as attributes that are per container instance (such as ``k8s.container.id``), for the same reason.
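To make the scoping rule concrete, here is a hypothetical sketch of how the dimension set for a container-scoped metric combines the workload-level attributes with the two container-level ones. The attribute values are made up for illustration.

```python
# Hypothetical: container-scoped metrics carry every workload-level
# dimension plus the two container-level attributes listed above.
workload_dims = {
    "environment": "prod",                 # made-up example values
    "k8s.cluster.name": "demo-cluster",
    "k8s.namespace.name": "checkout",
    "k8s.workload.name": "cart",
    "k8s.workload.kind": "deployment",
}

def container_metric_dims(workload: dict, container_name: str,
                          pseudo_qos: str) -> dict:
    """Build the dimension set for a metric named ``*.container.*``."""
    dims = dict(workload)  # all workload-level dimensions apply
    dims["k8s.container.name"] = container_name
    dims["k8s.container.pseudo_qos"] = pseudo_qos
    return dims

dims = container_metric_dims(workload_dims, "cart-server", "Burstable")
```

Per the note above, per-instance identifiers such as ``k8s.pod.id`` or ``k8s.container.id`` are deliberately absent, because values are aggregated across pod and container instances.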
