Merge pull request #2765 from splunk/adasplunk-O11YDOCS-6517

adasplunk · web-flow · commit 509c7a4a14e1 · 2025-05-05T08:45:05.000-07:00
[O11YDOCS-6517] Application Optimization preview docs
diff --git a/private-preview/aopt/aopt-dashboard-sorted-scaled.png b/private-preview/aopt/aopt-dashboard-sorted-scaled.png
diff --git a/private-preview/aopt/aopt-derived-metrics.rst b/private-preview/aopt/aopt-derived-metrics.rst
diff --git a/private-preview/aopt/aopt-glossary.rst b/private-preview/aopt/aopt-glossary.rst
@@ -0,0 +1,60 @@
+:orphan:
+
+.. _aopt-glossary:
+
+.. include:: /private-preview/aopt/toc.rst
+    :start-after: :orphan:
+
+**********************************************************
+Glossary
+**********************************************************
+
+.. _aopt-glossary-confidence-level:
+
+Confidence level
+==========================================================
+
+The ratio of the number of days of data available for a workload to the number of days of data that Application Optimization needs in order to analyze it (14 contiguous days). If your data spans 14 contiguous days, the confidence level is high. If your data spans fewer than 14 days, the confidence level is lower. This happens when the workload was initially deployed fewer than 14 days ago, or when you've made a configuration change within that period, such as a change to CPU or memory limits, an addition of a container, and so on. Confidence levels are defined as follows:
+
+* :guilabel:`High`: Greater than 90% of needed information is available.
+
+* :guilabel:`Medium`: 50-89% of needed information is available.
+
+* :guilabel:`Low`: Only 5-49% of needed information is available.
+
+* :guilabel:`Unknown`: Less than 5% of needed information is available.
+
+
+Application Optimization calculates an overall confidence level by taking the lowest confidence level across all containers, where each container's confidence level is an average of the separate confidence levels for CPU and memory.
+
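+The calculation described above could be sketched roughly as follows. This is an illustrative Python sketch, not Application Optimization's actual implementation: the function names, the handling of the boundaries between levels, and the choice to average CPU and memory ratios (rather than labels) are all assumptions.
+
```python
ANALYSIS_WINDOW_DAYS = 14  # days of contiguous data needed, per the definition above


def confidence_ratio(days_available: float) -> float:
    """Share of the needed data that is available, clamped to [0, 1]."""
    return max(0.0, min(days_available / ANALYSIS_WINDOW_DAYS, 1.0))


def confidence_label(ratio: float) -> str:
    """Map a ratio to a level. Exact boundary handling is an assumption."""
    if ratio >= 0.90:
        return "High"
    if ratio >= 0.50:
        return "Medium"
    if ratio >= 0.05:
        return "Low"
    return "Unknown"


def overall_confidence(containers: dict) -> str:
    """containers maps a container name to (cpu_days, memory_days).

    Each container's ratio is the average of its CPU and memory ratios;
    the workload takes the lowest ratio across its containers.
    """
    per_container = [
        (confidence_ratio(cpu_days) + confidence_ratio(mem_days)) / 2
        for cpu_days, mem_days in containers.values()
    ]
    if not per_container:
        return "Unknown"
    return confidence_label(min(per_container))
```
+
+In this sketch, a workload whose ``app`` container has 14 days of data and whose ``sidecar`` container has 7 days gets an overall level of Medium, because the lowest container ratio is 0.5.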
+
+.. _aopt-glossary-efficiency:
+
+Efficiency
+==========================================================
+
+The balance between over-provisioning and under-provisioning to optimize resource utilization without compromising performance or stability. Highly efficient workloads use resources in a way that aligns closely with their actual consumption, reducing waste and maximizing your cluster's capacity to run other workloads. 
+
+Application Optimization is a powerful tool for achieving and maintaining efficiency. It calculates efficiency as the average of the pod-wide usage of a resource's ``request`` setting, capped at 100%. Its calculation only includes metrics within the analysis window, which is the lesser of 14 days and the time since the last resource change (or the initial deployment). Note that rather than finding the utilization (usage over requests) of each container within a pod, all of the containers' usage and requests are added up first. The averages for each CPU and memory ``request`` setting are then weight-averaged based on the assumed resource cost weights.
+
+When values are unset for a particular resource, this tool assumes those ``request`` settings to be at usage (in other words, 100% efficient) to more accurately weigh multi-container rates.
+
+When the main container has an unset resource, this tool considers the efficiency rate to be nullified.
+
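+As a rough illustration of the rules above, the following Python sketch computes efficiency for a single point in time (the real calculation averages over the analysis window). The function names, the convention that the first container in the list is the main container, and the equal default cost weights are hypothetical; the actual cost weights aren't stated here.
+
```python
from typing import List, Optional, Tuple

# One (usage, request) pair per container; request is None when unset.
Container = Tuple[float, Optional[float]]


def resource_efficiency(containers: List[Container]) -> Optional[float]:
    """Pod-wide usage over pod-wide requests for one resource, capped at 1.0.

    Assumption: containers[0] is the main container. If its request is
    unset, the efficiency is treated as nullified (None).
    """
    if not containers or containers[0][1] is None:
        return None
    # Add up usage and requests across containers before dividing.
    total_usage = sum(usage for usage, _ in containers)
    # An unset request is assumed to equal usage (100% efficient).
    total_request = sum(
        usage if request is None else request for usage, request in containers
    )
    if total_request <= 0:
        return None
    return min(total_usage / total_request, 1.0)


def workload_efficiency(
    cpu_eff: Optional[float],
    mem_eff: Optional[float],
    cpu_weight: float = 0.5,  # hypothetical cost weight
    mem_weight: float = 0.5,  # hypothetical cost weight
) -> Optional[float]:
    """Weighted average of the CPU and memory efficiencies."""
    if cpu_eff is None or mem_eff is None:
        return None
    total_weight = cpu_weight + mem_weight
    return (cpu_eff * cpu_weight + mem_eff * mem_weight) / total_weight
```
+
+For example, a pod whose main container uses 0.5 CPU against a request of 1.0, and whose sidecar uses 0.25 CPU with no request set, has a CPU efficiency of 0.75 / 1.25, or 60%, in this sketch.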
+
+.. _aopt-glossary-starvation-risk:
+
+Starvation risk
+==========================================================
+
+A workload's average risk of running out of CPU or memory:
+
+* :guilabel:`High`: Any container in which usage is greater than or equal to 95% of its ``limit`` settings.
+
+* :guilabel:`Medium`: At least one resource (CPU or memory) of one container is not defined OR (all ``request`` settings are defined AND actual usage of at least one resource of one container exceeds its ``request`` setting for any time slot).
+
+* :guilabel:`Low`: For either CPU or memory, the recommendation is greater than the baseline value. For example, the usage is greater than target utilization (0.85).
+
+* :guilabel:`Minimal`: None of the above conditions are detected. In other words, all containers have ``request`` settings for both CPU and memory, and neither of these resources has had usage exceeding its target utilization. 
+
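+The levels above can be read as a cascade of checks. The following Python sketch is one possible interpretation, not the product's implementation. It assumes that the checks run from High to Minimal, that "not defined" refers to an unset ``request`` setting, that target utilization is measured against the ``request`` setting, and that peak usage over the analysis window stands in for "any time slot". The ``ResourceStats`` type is hypothetical.
+
```python
from typing import List, NamedTuple, Optional

TARGET_UTILIZATION = 0.85  # target utilization, per the Low definition above
LIMIT_THRESHOLD = 0.95     # share of the limit that counts as High risk


class ResourceStats(NamedTuple):
    """Hypothetical record for one resource (CPU or memory) of one container."""
    peak_usage: float
    request: Optional[float]  # None when unset
    limit: Optional[float]    # None when unset


def starvation_risk(resources: List[ResourceStats]) -> str:
    """Classify a workload from the stats of all its containers' resources."""
    # High: usage reaches at least 95% of a limit.
    if any(
        r.limit is not None and r.peak_usage >= LIMIT_THRESHOLD * r.limit
        for r in resources
    ):
        return "High"
    # Medium: a request is unset, or usage exceeded a request.
    if any(r.request is None or r.peak_usage > r.request for r in resources):
        return "Medium"
    # Low: usage exceeded the target utilization of a request.
    # Every request is set at this point, because unset ones return Medium.
    if any(r.peak_usage > TARGET_UTILIZATION * r.request for r in resources):
        return "Low"
    return "Minimal"
```
+
+Under these assumptions, a container that peaks at 0.9 CPU with a request of 1.0 and a limit of 2.0 is classified as Low: it stays under its request and well under its limit, but exceeds the 0.85 target utilization.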
+
diff --git a/private-preview/aopt/aopt-intro.rst b/private-preview/aopt/aopt-intro.rst
@@ -0,0 +1,42 @@
+:orphan:
+
+.. _aopt-intro:
+
+.. include:: /private-preview/aopt/toc.rst
+    :start-after: :orphan:
+
+**********************************************************
+What is Application Optimization?
+**********************************************************
+
+Application Optimization is a component within Splunk Observability Cloud. It provides insights into the way you're allocating cloud-native infrastructure :new-page:`resources <https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/>` to your cloud-based services. With Application Optimization, you can identify resource overprovisioning that may be costing your organization extra money and underprovisioning that may be causing performance or reliability problems and costing your organization lost revenue or customer loyalty. Along with insights in the form of metrics, Application Optimization offers simple, immediate actions which you can take to right-size your resource allocations and to improve the overall performance and reliability of your services.
+
+By using Application Optimization together with :new-page:`Splunk Infrastructure Monitoring (IM) <https://docs.splunk.com/observability/en/infrastructure/intro-to-infrastructure.html>`, FinOps teams, DevOps teams, and business leaders are empowered with a holistic view into the correlation between their cloud costs and their business objectives.
+
+
+Key features
+==========================================================
+
+* :guilabel:`Kubernetes Profiler` (the :guilabel:`Application Optimization` dashboard) provides comprehensive insights into the efficiency of your CPU and memory settings for your Kubernetes workloads.	
+
+* :guilabel:`Instant Recommendations` provide suggestions for CPU and memory settings based on historical utilization data (metrics you've sent to Splunk IM). You can apply these suggestions directly to your pods using the YAML snippets it provides. 
+
+
+Requirements
+==========================================================
+
+* Supported cloud platforms: Amazon Elastic Kubernetes Service (EKS). Since Application Optimization is a consumer of data you send to Splunk IM, see :new-page:`Splunk IM's statement on supported Amazon EKS versions <https://docs.splunk.com/observability/en/gdi/opentelemetry/collector-kubernetes/install-k8s.html#helm-chart-supported-distros>`.
+
+* Supported Kubernetes workload kinds: Deployment, StatefulSet, DaemonSet
+
+* Minimum amount of infrastructure metrics you must send to Splunk IM: 14 contiguous days. This isn't strictly a requirement, because instant recommendations are still generated from less data, but with lower confidence levels.
+
+* All metrics that the :new-page:`Splunk IM Kubernetes cluster receiver collects by default <https://docs.splunk.com/observability/en/gdi/opentelemetry/collector-kubernetes/install-k8s.html#helm-chart-supported-distros>` must be present in your data. Since these metrics are enabled by default on your Kubernetes collector you don't need to take any action unless you've disabled them. 
+
+* Horizontal pod autoscaler (HPA) telemetry: Optional, but if you do have HPAs and you send :new-page:`k8s.hpa.* metrics <https://docs.splunk.com/observability/en/gdi/opentelemetry/components/kubernetes-cluster-receiver.html>` to Splunk IM, :guilabel:`Instant Recommendations` can help you to improve them.
+
+
+Enable Application Optimization
+==========================================================
+
+For those participating in the private preview, Splunk will enable Application Optimization on your Splunk Observability Cloud account for you.
diff --git a/private-preview/aopt/aopt-scenarios.rst b/private-preview/aopt/aopt-scenarios.rst
@@ -0,0 +1,48 @@
+:orphan:
+
+.. _aopt-scenarios:
+
+.. include:: /private-preview/aopt/toc.rst
+    :start-after: :orphan:
+
+**********************************************************
+Scenarios
+**********************************************************
+
+Here are common use case scenarios and how to use Application Optimization for them.
+
+
+Which workloads need to be optimized?
+==========================================================
+
+To gain a comprehensive overview of all workloads and identify optimization opportunities by detecting over-provisioned or under-provisioned resources:
+
+#. Navigate to the :guilabel:`Application Optimization` dashboard.
+
+#. Take a quick look at the :guilabel:`Workloads by Starvation Risk` tile. If there are no workloads at medium or high risk of starvation, you don't need to take any action right now.
+
+#. Scroll down to the :guilabel:`Kubernetes Workloads` table and sort it by :guilabel:`Starvation Risk`.
+
+#. For each workload at high or medium starvation risk:
+
+   #. Select that workload to navigate to its :guilabel:`Workload Details` page.
+
+   #. On the :guilabel:`Workload Details` page, scroll down to :guilabel:`Instant Recommendations`.
+
+   #. If the :guilabel:`Confidence level` at the top of the page is high, copy and paste the YAML snippets in :guilabel:`Instant Recommendations` into that workload's container configuration.
+
+
+How can I improve efficiency and performance by right-sizing workloads based on actual usage metrics?
+=========================================================================================================
+
+To see actual usage metrics for an individual workload:
+
+#. Navigate to the :guilabel:`Application Optimization` dashboard.
+
+#. Find the target workload in the :guilabel:`Kubernetes Workloads` table. :ref:`Sort, search, or filter this table <aopt-workloads-sort-search>` as needed.  
+
+#. Select the target workload in the table to navigate to its :guilabel:`Workload Details` page. This page displays actual usage metrics for the target workload, divided into sections for each container in the workload.
+
+#. In the :guilabel:`Instant Recommendations` section, expand each container's section and apply the YAML snippets in the rightmost column to that container's configuration.
+
+
diff --git a/private-preview/aopt/aopt-workload-details-scaled.png b/private-preview/aopt/aopt-workload-details-scaled.png
diff --git a/private-preview/aopt/aopt-workload-details.rst b/private-preview/aopt/aopt-workload-details.rst
@@ -0,0 +1,57 @@
+:orphan:
+
+.. _aopt-workload-details:
+
+.. include:: /private-preview/aopt/toc.rst
+    :start-after: :orphan:
+
+**********************************************************
+Workload Details
+**********************************************************
+
+When you select a workload in the :guilabel:`Kubernetes Workloads` table, you navigate to its :guilabel:`Workload Details` page. This page displays the efficiency analysis and instant recommendations for the particular workload you selected.
+
+
+..  image:: /private-preview/aopt/aopt-workload-details-scaled.png
+    :width: 90%
+    :alt: Application Optimization details about a specific workload
+
+
+Efficiency Analysis
+==========================================================
+
+:guilabel:`Efficiency Analysis` is based on the workload's resource efficiency.
+
+* :guilabel:`Confidence level`: Look for the confidence level under the :guilabel:`Efficiency Analysis` label. If the confidence level is something other than high, this probably means that your cluster hasn't sent enough metrics to :new-page:`Splunk Infrastructure Monitoring (IM) <https://docs.splunk.com/observability/en/infrastructure/intro-to-infrastructure.html>` since you created the workload. In this case, for highly critical business workflows or those that have high variations, wait a few days for the confidence level to increase before you apply the recommendations. :ref:`See details on how this is calculated <aopt-glossary-confidence-level>`.
+
+* :guilabel:`Resource Starvation Risk`: This workload's average risk of running out of CPU or memory.
+
+* :guilabel:`Average Pod Count`: The number of pods (:new-page:`replicas <https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/>`) running for this workload averaged over the analysis period. 
+
+* :guilabel:`Resource Footprint`: The percentage of CPU and memory that this workload's pods requested, averaged over the analysis period. The footprint may exceed 100% when the pods use more than their requested values.
+
+* :guilabel:`Resource Efficiency`: The ratio of resource usage to resource allocation. This is a percentage relative to allocated resources. The higher the percentage, the better. :ref:`See details on how this is calculated <aopt-glossary-efficiency>`.
+
+
+Instant Recommendations
+==========================================================
+
+:guilabel:`Instant Recommendations` offers simple, actionable changes to a workload's pods which you can implement quickly and easily to improve its resource utilization. 
+
+
+Why are these recommendations given?
+----------------------------------------------------------
+
+If a workload has had a medium or high starvation risk over the past 14 days, ignoring spikes, :guilabel:`Instant Recommendations` suggests an increase in CPU or memory ``request`` or ``limit`` settings to mitigate that risk. 
+
+
+Workload Breakdown
+==========================================================
+
+Your workload is broken down into its containers, and within the section for each container, there are specific recommendations for CPU and memory adjustments, a chart visualizing its historical resource usage, and in the rightmost column (:guilabel:`Recommended K8s Spec`), YAML snippets you can copy to improve its settings. 
+
+
+HPA Recommendation
+==========================================================
+
+If you have a horizontal pod autoscaler (HPA) associated with this workload and you're sending HPA metrics to Splunk IM, this section will provide any recommended adjustments to your HPA.
diff --git a/private-preview/aopt/aopt-workloads.rst b/private-preview/aopt/aopt-workloads.rst
@@ -0,0 +1,64 @@
+:orphan:
+
+.. _aopt-workloads:
+
+.. include:: /private-preview/aopt/toc.rst
+    :start-after: :orphan:
+
+**********************************************************
+The Application Optimization dashboard
+**********************************************************
+
+To load the dashboard, select :guilabel:`Application Optimization` in the left navigation menu.
+
+
+..  image:: /private-preview/aopt/aopt-dashboard-sorted-scaled.png
+    :width: 90%
+    :alt: Application Optimization workloads
+
+
+The :guilabel:`Application Optimization` dashboard provides a high-level view of metrics from your Kubernetes infrastructure and a table of Kubernetes workloads. Metrics are grouped into tiles which are described below.
+
+
+Workloads
+==========================================================
+
+* :guilabel:`Total`: The total number of workloads, of all kinds, for which you're sending metrics to :new-page:`Splunk Infrastructure Monitoring (IM) <https://docs.splunk.com/observability/en/infrastructure/intro-to-infrastructure.html>`. 
+
+* :guilabel:`Processed`: The number of workloads that Application Optimization has processed and that are older than 24 hours. This number doesn't include: 
+
+  * Workload kinds that Application Optimization doesn't support (cronjobs and jobs).
+
+  * Workloads that Application Optimization had an error in processing.
+
+  * Workloads that you added less than 24 hours ago. Because Application Optimization processes data once a day, new workloads might have missed the processing window.
+
+
+Workloads by Starvation Risk
+==========================================================
+
+This is a good tile to check first to see if any of your workloads are at high risk of starvation and need immediate attention. You can also find starving workloads by sorting the :guilabel:`Kubernetes Workloads` table by :guilabel:`Starvation Risk`. :ref:`See details on how this risk is calculated <aopt-glossary-starvation-risk>`.
+
+
+Resource Footprint
+==========================================================
+
+A workload's resource footprint is the sum of its pods' ``request`` settings for that resource (or utilization if resources are unset or average usage exceeds requests) plus its actual overage utilization of that resource. This tile displays the sum of all resource footprints of all the pods of all your workloads. It then compares your current ``request`` settings for CPU and memory to recommended CPU and memory ``request`` settings based on data from the past 14 days. 
+
+.. note::
+    This tile aggregates data from all of your workloads, so you may not find a direct correlation to individual workloads in the :guilabel:`Kubernetes Workloads` table.
+
+
+.. _aopt-workloads-sort-search:
+
+Kubernetes Workloads
+==========================================================
+
+The :guilabel:`Kubernetes Workloads` table lists all workloads for which you're sending metrics to Splunk IM. Narrow this list by:
+
+* Searching: You can search this table by workload or cluster name.
+
+* Filtering: Select from the :guilabel:`Environment`, :guilabel:`Cluster`, :guilabel:`Namespace`, :guilabel:`Workload Kind`, or :guilabel:`Add filters` menus at the top of the page.
+
+* Sorting the table by any of its columns. To find workloads most in need of attention, sort by :guilabel:`Starvation Risk` or :guilabel:`Efficiency`.
+
diff --git a/private-preview/aopt/toc.rst b/private-preview/aopt/toc.rst
@@ -0,0 +1,25 @@
+:orphan:
+
+.. _toc:
+
+.. admonition:: Preview: Application Optimization
+
+    Preview features described in this document are provided by Splunk to you "as is" 
+    without any warranties, maintenance and support, or service-level commitments. 
+    Splunk makes this preview feature available in its sole discretion and may 
+    discontinue it at any time. These documents are not yet publicly available and 
+    we ask that you keep such information confidential. Use of preview features is 
+    subject to the Splunk Pre-Release Agreement for Hosted Services 
+    (https://www.splunk.com/en_us/legal/pre-release-agreement-for-hosted-services.html).
+
+
+.. admonition:: Table of contents
+
+    Use these links to navigate to topics within this private preview: 
+        * :ref:`What is Application Optimization? <aopt-intro>`
+        * :ref:`The Application Optimization dashboard <aopt-workloads>`
+        * :ref:`Workload Details <aopt-workload-details>`
+        * :ref:`Scenarios <aopt-scenarios>`
+        * :ref:`Glossary <aopt-glossary>`
+        * :ref:`Derived metrics <aopt-derived-metrics>`
+