
Commit dc945d5

Azure Monitor query optimization headers
1 parent e4534d2 commit dc945d5

File tree

1 file changed

+25
-8
lines changed


articles/azure-monitor/log-query/query-optimization.md

@@ -55,6 +55,8 @@ Query processing time is spent on:

Other than the time spent in the query processing nodes, additional time is spent by Azure Monitor Logs to authenticate the user and verify that they are permitted to access this data, locate the data store, parse the query, and allocate the query processing nodes. This time is not included in the query total CPU time.

### Early filtering of records before using high CPU functions

Some query commands and functions are heavy in their CPU consumption. This is especially true for commands that parse JSON and XML or extract complex regular expressions. Such parsing can happen explicitly via the [parse_json()](/azure/kusto/query/parsejsonfunction) or [parse_xml()](/azure/kusto/query/parse-xmlfunction) functions or implicitly when referring to dynamic columns.

These functions consume CPU in proportion to the number of rows they process. The most efficient optimization is to add where conditions early in the query that filter out as many records as possible before the CPU-intensive function is executed.
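For example, the following sketch applies a cheap filter on a built-in column before the expensive parsing step (the event ID and column names are illustrative):

```Kusto
SecurityEvent
| where EventID == 4688                   // cheap built-in column filter first
| extend Details = parse_json(EventData)  // CPU-heavy parsing now runs on far fewer rows
| project TimeGenerated, Computer, Details
```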
@@ -82,6 +84,8 @@ SecurityEvent
| where FileHash != "" // No need to filter out %SYSTEM32 here as it was removed before
```

### Avoid using evaluated where clauses

Queries that contain [where](/azure/kusto/query/whereoperator) clauses on an evaluated column rather than on columns that are physically present in the dataset lose efficiency. Filtering on evaluated columns prevents some system optimizations when large sets of data are handled.

For example, the following queries produce exactly the same result, but the second one is more efficient because the [where](/azure/kusto/query/whereoperator) condition refers to a built-in column:
8791

@@ -100,6 +104,8 @@ Heartbeat
| summarize count() by Computer
```

### Use effective aggregation commands and dimensions in summarize and join

While some aggregation commands like [max()](/azure/kusto/query/max-aggfunction), [sum()](/azure/kusto/query/sum-aggfunction), [count()](/azure/kusto/query/count-aggfunction), and [avg()](/azure/kusto/query/avg-aggfunction) have low CPU impact due to their logic, others are more complex and include heuristics and estimations that allow them to be executed efficiently. For example, [dcount()](/azure/kusto/query/dcount-aggfunction) uses the HyperLogLog algorithm to provide a close estimation of the distinct count of large sets of data without actually counting each value; the percentile functions make similar approximations using the nearest-rank percentile algorithm. Several of the commands include optional parameters to reduce their impact. For example, the [makeset()](/azure/kusto/query/makeset-aggfunction) function has an optional parameter to define the maximum set size, which significantly affects the CPU and memory.
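As a sketch of the set-size parameter (the table and column names are illustrative, and `make_set()` is the current name of the makeset() function):

```Kusto
Perf
| where TimeGenerated > ago(1h)
| summarize CounterNames = make_set(CounterName, 32) by Computer  // cap each set at 32 items
```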

[Join](/azure/kusto/query/joinoperator?pivots=azuremonitor) and [summarize](/azure/kusto/query/summarizeoperator) commands may cause high CPU utilization when they process a large set of data. Their complexity is directly related to the number of possible values, referred to as *cardinality*, of the columns used as the `by` keys in summarize or as the join attributes. For explanation and optimization of join and summarize, see their documentation articles and optimization tips.
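One way to lower cardinality is to group by a binned value rather than a raw high-cardinality column; a sketch (table and column names are illustrative):

```Kusto
Perf
| where TimeGenerated > ago(1d)
| summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h)  // 1h bins keep the by-key cardinality low
```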
@@ -150,9 +156,12 @@ Heartbeat

## Data used for processed query

A critical factor in the processing of the query is the volume of data that is scanned and used for the query processing. Azure Data Explorer uses aggressive optimizations that dramatically reduce the data volume compared to other data platforms. Still, there are critical factors in the query that can impact the data volume that is used.

In Azure Monitor Logs, the **TimeGenerated** column is used as a way to index the data. Restricting the **TimeGenerated** values to as narrow a range as possible will significantly improve query performance by limiting the amount of data that has to be processed.
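A minimal sketch of the pattern (the one-hour window is illustrative):

```Kusto
Heartbeat
| where TimeGenerated > ago(1h)   // a narrow time range limits the data that is scanned
| summarize count() by Computer
```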

### Avoid unnecessary use of search and union operators

Another factor that increases the data that is processed is the use of a large number of tables. This usually happens when `search *` and `union *` commands are used. These commands force the system to evaluate and scan data from all tables in the workspace. In some cases, there might be hundreds of tables in the workspace. Try to avoid as much as possible using "search *" or any search without scoping it to a specific table.

For example, the following queries produce exactly the same result but the last one is by far the most efficient:
158167

@@ -174,6 +183,8 @@ Perf
| summarize count(), avg(CounterValue) by Computer
```

### Add early filters to the query

Another method to reduce the data volume is to have [where](/azure/kusto/query/whereoperator) conditions early in the query. The Azure Data Explorer platform includes a cache that lets it know which partitions include data that is relevant for a specific where condition. For example, if a query contains `where EventID == 4624`, then it distributes the query only to nodes that handle partitions with matching events.

The following example queries produce exactly the same result, but the second one is more efficient:
@@ -190,7 +201,9 @@ SecurityEvent
| summarize LoginSessions = dcount(LogonGuid) by Account
```

### Reduce the number of columns that are retrieved

Since Azure Data Explorer is a columnar data store, retrieval of every column is independent of the others. The number of columns that are retrieved directly influences the overall data volume. You should only include the columns in the output that are needed, by [summarizing](/azure/kusto/query/summarizeoperator) the results or [projecting](/azure/kusto/query/projectoperator) the specific columns. Azure Data Explorer has several optimizations to reduce the number of retrieved columns. If it determines that a column isn't needed, for example if it's not referenced in the [summarize](/azure/kusto/query/summarizeoperator) command, it won't retrieve it.

For example, the second query may process three times more data since it needs to fetch not one column but three:
@@ -214,6 +227,8 @@ The time range can be set using the time range selector in the Log Analytics scr

An alternative method is to explicitly include a [where](/azure/kusto/query/whereoperator) condition on **TimeGenerated** in the query. You should use this method because it assures that the time span is fixed, even when the query is used from a different interface.

You should ensure that all parts of the query have **TimeGenerated** filters. When a query has sub-queries fetching data from various tables or the same table, each has to include its own [where](/azure/kusto/query/whereoperator) condition.

### Make sure all sub-queries have a TimeGenerated filter

For example, in the following query, while the **Perf** table will be scanned only for the last day, the **Heartbeat** table will be scanned for all of its history, which might be up to two years:

```Kusto
@@ -283,6 +298,8 @@ Heartbeat
| summarize min(TimeGenerated) by Computer
```

### Time span measurement limitations

The measurement is always larger than the actual time specified. For example, if the filter on the query is 7 days, the system might scan 7.5 or 8.1 days. This is because the system partitions the data into chunks of variable size. To assure that all relevant records are scanned, it scans the entire partition, which might cover several hours and even more than a day.

There are several cases where the system cannot provide an accurate measurement of the time range. This happens in most cases where the query spans less than a day or in multi-workspace queries.
@@ -298,9 +315,9 @@ While some queries require usage of old data, there are cases where old data is

Such cases can be, for example:

- Not setting the time range in Log Analytics with a sub-query that isn't limited. See the example above.
- Using the API without the optional time range parameters.
- Using a client that doesn't force a time range, such as the Power BI connector.

See the examples and notes in the previous section as they are also relevant in this case.

@@ -337,10 +354,10 @@ To efficiently execute a query, it is partitioned and distributed to compute nod

Query behaviors that can reduce parallelism include:

- Use of serialization and window functions such as the [serialize operator](/azure/kusto/query/serializeoperator), [next()](/azure/kusto/query/nextfunction), [prev()](/azure/kusto/query/prevfunction), and the [row](/azure/kusto/query/rowcumsumfunction) functions. Time series and user analytics functions can be used in some of these cases. Inefficient serialization may also happen if the following operators are used anywhere but at the end of the query: [range](/azure/kusto/query/rangeoperator), [sort](/azure/kusto/query/sortoperator), [order](/azure/kusto/query/orderoperator), [top](/azure/kusto/query/topoperator), [top-hitters](/azure/kusto/query/tophittersoperator), [getschema](/azure/kusto/query/getschemaoperator).
- Use of the [dcount()](/azure/kusto/query/dcount-aggfunction) aggregation function forces the system to keep a central copy of the distinct values. When the scale of data is high, consider using the dcount function's optional parameters to reduce accuracy.
- In many cases, the [join](/azure/kusto/query/joinoperator?pivots=azuremonitor) operator lowers overall parallelism. Examine shuffle join as an alternative when performance is problematic.
- In resource-scope queries, the pre-execution RBAC checks may linger in situations where there is a very large number of RBAC assignments. This may lead to longer checks that result in lower parallelism. For example, a query might be executed on a subscription where there are thousands of resources and each resource has many role assignments at the resource level, not on the subscription or resource group.
- If a query is processing small chunks of data, its parallelism will be low because the system will not spread it across many compute nodes.
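The shuffle strategy mentioned above can be requested with a query hint; a sketch (the tables and join key are illustrative):

```Kusto
SecurityEvent
| where TimeGenerated > ago(1d)
| join hint.strategy=shuffle (
    Heartbeat
    | where TimeGenerated > ago(1d)
  ) on Computer
| summarize count() by Computer
```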