This repository was archived by the owner on Nov 16, 2023. It is now read-only.

Commit f113daa

Merge pull request #186 from microsoft/mjmelone-patch-33
Create Episode 3 - Summarizing, Pivoting, and Joining.csl
2 parents f7627ba + 621a4b1 commit f113daa

1 file changed: 186 additions, 0 deletions

print Series = 'Tracking the Adversary with MTP Advanced Hunting', EpisodeNumber = 3, Topic = 'Summarizing, Pivoting, and Visualizing Data', Presenters = 'Michael Melone, Tali Ash', Company = 'Microsoft'

// summarize
// The summarize operator enables you to perform
// a variety of calculations on data.

// The output of summarize will be a table with one
// column for each value you grouped (pivoted) on, as well
// as one column for each aggregation you performed.

// In the following example, we will calculate the number of e-mails based on whether
// Office ATP identified them as malware.

// SQL Equivalent: SELECT MalwareFilterVerdict, Count(*) FROM EmailAttachmentInfo GROUP BY MalwareFilterVerdict

EmailAttachmentInfo
| summarize count() by MalwareFilterVerdict
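
// Side note: when the aggregation isn't given a name, KQL assigns a default column name -
// count() produces a column called count_. A minimal sketch of renaming it after the fact
// with project-rename (the name Attachments is just for illustration):

EmailAttachmentInfo
| summarize count() by MalwareFilterVerdict
| project-rename Attachments = count_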

// --------------------------------------------

// Summarize can also be used to create two-column pivots by simply adding another
// column name after the "by" clause. For example, we will now count the number
// of e-mails received by each sender and recipient combination.

// You will also notice in this example that the count_ column has been renamed to Emails
// to make the query easier to understand.

// SQL Equivalent: SELECT TOP 100 SenderFromAddress, RecipientEmailAddress, Emails = Count(*) FROM EmailEvents GROUP BY SenderFromAddress, RecipientEmailAddress ORDER BY Emails DESC

EmailEvents
| summarize Emails = count() by SenderFromAddress, RecipientEmailAddress
| top 100 by Emails desc

// --------------------------------------------

// min() - obtains the minimum value from the set
// max() - obtains the maximum value from the set

// SQL Equivalent:
// SELECT
//     Earliest = min(Timestamp)
//     , Latest = max(Timestamp)
//     , Count = count()
//     , AccountName
// FROM AlertEvidence
// WHERE AccountName LIKE '%'
// GROUP BY AccountName
// ORDER BY Count desc

AlertEvidence
| where isnotempty(AccountName)
| summarize Earliest = min(Timestamp), Latest = max(Timestamp), Count = count() by AccountName
| order by Count desc

// AlertEvidence
// Contains information on entities and evidence involved in an alert, such as devices, accounts, and emails
//--------------------------

// Now let's get a bit more advanced. Using the bin() function you can group events by a period of time.
// Let's take a look at some logon statistics on a daily basis.

// Using render we can automatically create a chart. Let's look at account logon activity over time on a
// daily basis by UPN.

IdentityLogonEvents
| where isnotempty(AccountUpn)
| summarize NumberOfLogons = count() by AccountUpn, bin(Timestamp, 1d)
| render timechart

// render - creates a chart

// We can also use this binned data to determine min, max, and average daily logons.

IdentityLogonEvents
| where isnotempty(AccountUpn)
| summarize NumberOfLogons = count()
    by AccountUpn
    , bin(Timestamp, 1d)
| summarize TotalLogons = sum(NumberOfLogons)
    , AverageDailyLogons = avg(NumberOfLogons)
    , FewestLogonsInADay = min(NumberOfLogons)
    , MostLogonsInADay = max(NumberOfLogons)
    by AccountUpn
| top 10 by TotalLogons desc
| render columnchart

// --------------------------

// You can also use summarize to get the latest event from each category.
// For example, let's say you want to get the latest check-in information
// for each device in your instance.

// The arg_max() function returns, for each group defined by the "by" parameter, the row
// with the maximum value of the specified argument. You can either specify the columns you
// want back as parameters, or just use * to get the entire row.

DeviceInfo
| where isnotempty(OSPlatform) // checkins can be partial or full - this filters out partials
| summarize arg_max(Timestamp, *) by DeviceId
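
// As a sketch of the alternative mentioned above: instead of *, you can list just the
// columns you want back alongside the maximized Timestamp (the columns picked here are
// just for illustration). The examples below continue with the * form.

DeviceInfo
| where isnotempty(OSPlatform)
| summarize arg_max(Timestamp, DeviceName, OSPlatform) by DeviceId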

// Let's say you now wanted to use this summarized list to create a report of devices
// by operating system - but you didn't want to lose the individual device names.
// Good news - we can also use summarize to build arrays!

DeviceInfo
| where isnotempty(OSPlatform) // checkins can be partial or full - this filters out partials
| summarize arg_max(Timestamp, *) by DeviceId
| summarize Devices = count(), DeviceList = make_set(DeviceName) by OSPlatform
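
// Related sketch: make_set() deduplicates the values it collects, while make_list()
// keeps every value it sees. If duplicates matter for your report, you could swap it in:

DeviceInfo
| where isnotempty(OSPlatform)
| summarize arg_max(Timestamp, *) by DeviceId
| summarize Devices = count(), DeviceList = make_list(DeviceName) by OSPlatform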

// -----------------------------

// Another way to perform aggregations is using make-series. The make-series
// operator is similar to summarize, except it is designed to calculate on
// a periodic basis, filling empty periods with zeros for consistency.

EmailEvents
| make-series count() on Timestamp from ago(30d) to now() step 1d by SenderFromDomain
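
// For contrast, here is a sketch of the closest summarize equivalent. summarize simply
// omits days on which a domain sent no mail, whereas make-series emits a 0 for those
// days - which is what the time series functions used below rely on.

EmailEvents
| where Timestamp > ago(30d)
| summarize count() by SenderFromDomain, bin(Timestamp, 1d)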

// With this make-series output we can identify outliers programmatically. Let's see if we can find
// any sudden increases or decreases in activity relating to mail from a specific
// domain using one of our time series analysis capabilities.

// geek stuff warning

EmailEvents
| make-series MailCount = count() on Timestamp from ago(30d) to now() step 1d by SenderFromDomain
| extend (flag, score, baseline) = series_decompose_anomalies(MailCount)
| project-reorder flag, score, baseline

// series_decompose_anomalies adds three new columns:
// - flag: is the datapoint normal (0), an abnormal increase (1), or an abnormal decrease (-1)?
// - score: how anomalous is this data point?
// - baseline: the forecasted value the algorithm expected

// Let's look for spikes in e-mail traffic from a domain. To do this, we need to expand
// the flag column. Expanding takes an array and creates one row for each value in it.
// For SQL people, this is like using CROSS APPLY.

EmailEvents
| make-series MailCount = count() on Timestamp from ago(30d) to now() step 1d by SenderFromDomain
| extend (flag, score, baseline) = series_decompose_anomalies(MailCount)
| project-reorder flag, score, baseline
| mv-expand flag
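
// If the make-series output feels abstract, here is a minimal, self-contained sketch of
// what mv-expand does, using a hand-built table (the values are made up for illustration):

datatable (Domain: string, Flags: dynamic)
[
    'contoso.com', dynamic([0, 1, 0]),
    'fabrikam.com', dynamic([0, 0, -1])
]
| mv-expand Flags
// each array element becomes its own row, still paired with its original Domain value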

// Now we can filter to only 1's. Note that our lists of values all look like strings. We need
// to tell KQL that we want these to be ints for accurate comparison.

EmailEvents
| make-series MailCount = count() on Timestamp from ago(30d) to now() step 1d by SenderFromDomain
| extend (flag, score, baseline) = series_decompose_anomalies(MailCount)
| mv-expand flag to typeof(int) // expand flag and tell KQL it needs to be an int
| where flag == 1 // filter to only rows that have a 1
| project-reorder flag, score, baseline

// Next, we'll look for the top 5 most anomalous domain spikes and graph the result.

let interval = 12h;
EmailEvents
| make-series MailCount = count() on Timestamp from ago(30d) to now() step interval by SenderFromDomain
| extend (flag, score, baseline) = series_decompose_anomalies(MailCount)
| mv-expand flag to typeof(int)
| where flag == 1 // filter to only incremental anomalies
| mv-expand score to typeof(double) // expand the score array to a double
| summarize MaxScore = max(score) by SenderFromDomain // get the max score value from each domain
| top 5 by MaxScore desc // Get the top 5 highest scoring domains
| join kind=rightsemi EmailEvents on SenderFromDomain // Filter EmailEvents to only these domains
| summarize count() by SenderFromDomain, bin(Timestamp, interval) // build a new summarization for the graph
| render timechart // graph it!
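
// A quick note on the join above: kind=rightsemi keeps only the rows of the right-hand
// table (EmailEvents) whose key matches a row on the left, and it returns only the right
// table's columns - in effect, a filter on EmailEvents. A minimal sketch with hand-built
// tables (names and values are made up for illustration):

let TopDomains = datatable (SenderFromDomain: string) ['contoso.com'];
let Mail = datatable (SenderFromDomain: string, Subject: string)
[
    'contoso.com', 'quarterly report',
    'fabrikam.com', 'newsletter'
];
TopDomains
| join kind=rightsemi Mail on SenderFromDomain // only the 'contoso.com' row from Mail is returned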

// Aha! I know someone out there sees my bug. Technically, one of these datasets can have both a spike and
// a valley, and the valley score could be what we're keying off of. Let's try again using logons, but this time
// we'll get the specific score associated with the spike instead of just assuming that they're the same.

let interval = 12h;
IdentityLogonEvents
| where isnotempty(AccountUpn)
| make-series LogonCount = count() on Timestamp from ago(30d) to now() step interval by AccountUpn
| extend (flag, score, baseline) = series_decompose_anomalies(LogonCount)
| mv-expand with_itemindex = FlagIndex flag to typeof(int) // Expand, but this time include the index in the array as FlagIndex
| where flag == 1 // Once again, filter only to spikes
| extend SpikeScore = todouble(score[FlagIndex]) // This will get the specific score associated with the detected spike
| summarize MaxScore = max(SpikeScore) by AccountUpn
| top 5 by MaxScore desc
| join kind=rightsemi IdentityLogonEvents on AccountUpn
| summarize count() by AccountUpn, bin(Timestamp, interval)
| render timechart
