content/en/scenarios/5-understand-impact/1-build-application.md (10 additions, 7 deletions)
@@ -11,7 +11,7 @@ weight: 1
For this workshop, we'll be using a microservices-based application. This application is for an online retailer and normally includes more than a dozen services. However, to keep the workshop simple, we'll be focusing on two services used by the retailer as part of their payment processing workflow: the credit check service and the credit processor service.

## Pre-requisites

- You will start with an EC2 environment that already has some useful components, but we will perform some [initial steps](#initial-steps) in order to get to the following state:
+ You will start with a t2.medium EC2 instance with 20 GB of disk storage, and perform some [initial steps](#initial-steps) in order to get to the following state:

* Install Kubernetes (k3s) and Docker
* Deploy the **Splunk distribution of the OpenTelemetry Collector**
* Build and deploy `creditcheckservice` and `creditprocessorservice`
@@ -33,6 +33,9 @@ cd observability-workshop/workshop/tagging
# Exit and ssh back to this instance

+ # return to the same directory as before
+ cd observability-workshop/workshop/tagging
+
./2-deploy-otel-collector.sh
./3-deploy-creditcheckservice.sh
./4-deploy-creditprocessorservice.sh
@@ -41,34 +44,34 @@ cd observability-workshop/workshop/tagging
## View your application in Splunk Observability Cloud

- Now that the setup is complete, let's confirm that it's sending data to **Splunk Observability Cloud**.
+ Now that the setup is complete, let's confirm that it's sending data to **Splunk Observability Cloud**. Note that when the application is deployed for the first time, it may take a few minutes for the data to appear.

Navigate to APM, then use the Environment dropdown to select your environment (i.e. `tagging-workshop-name`).

If everything was deployed correctly, you should see `creditprocessorservice` and `creditcheckservice` displayed in the list of services:



- Click on Explore on the right-hand side to view the service map. We can see that the `creditcheckservice` makes calls to the `creditprocessorservice`, with an average response time of around 3.5 seconds:
+ Click on **Explore** on the right-hand side to view the service map. We can see that the `creditcheckservice` makes calls to the `creditprocessorservice`, with an average response time of at least 3 seconds:



- Next, click on Traces on the right-hand side to see the traces captured for this application. You'll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.
+ Next, click on **Traces** on the right-hand side to see the traces captured for this application. You'll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.



- You'll also notice that some traces have errors:
+ If you toggle **Errors only** to `on`, you'll also notice that some traces have errors:



- Sort the traces by duration then click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the `/runCreditCheck` operation, which is part of the `creditprocessorservice`.
+ Toggle **Errors only** back to `off` and sort the traces by duration, then click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the `/runCreditCheck` operation, which is part of the `creditprocessorservice`.
Currently, we don't have enough details in our traces to understand why some requests finish in a few milliseconds, and others take several seconds. To provide the best possible customer experience, this will be critical for us to understand.

We also don't have enough information to understand why some requests result in errors, and others don't. For example, if we look at one of the error traces, we can see that the error occurs when the `creditprocessorservice` attempts to call another service named `otherservice`. But why do some requests result in a call to `otherservice`, and others don't?

- 
+ 

We'll explore these questions and more in the workshop.
- To understand why some requests have errors or slow performance, we'll need to add context to our traces. We'll do this by adding tags.
+ To understand why some requests have errors or slow performance, we'll need to add context to our traces. We'll do this by adding tags. But first, let's take a moment to discuss what tags are, and why they're so important for observability.
content/en/scenarios/5-understand-impact/3-capture-tags.md (15 additions, 11 deletions)
@@ -9,19 +9,23 @@ Let's add some tags to our traces, so we can find out why some customers receive
## Identify Useful Tags

- We'll start by reviewing the code for the `credit_check` function of `creditcheckservice` (which can be found in the `main.py` file):
+ We'll start by reviewing the code for the `credit_check` function of `creditcheckservice` (which can be found in the `/home/ubuntu/observability-workshop/workshop/tagging/creditcheckservice/main.py` file):
current_span.set_attribute("credit.check.result", checkResult) # <--- ADDED BY WORKSHOP
return checkResult
````
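For orientation, here's a rough sketch of how a request handler can attach tags using the OpenTelemetry Python API. The `credit.check.result` attribute matches the snippet above, but everything else (the function signature, the stub helper, and the other attribute names) is illustrative rather than the workshop's actual `main.py`:

````python
from opentelemetry import trace

def get_credit_score(customer_num: str) -> int:
    # Stub for illustration only; the real service looks the score up elsewhere
    return 640

def credit_check(customer_num: str) -> str:
    # Auto-instrumentation has already started a span for this request,
    # so we simply attach additional context (tags) to it
    current_span = trace.get_current_span()
    current_span.set_attribute("customer.num", customer_num)         # illustrative tag name
    credit_score = get_credit_score(customer_num)
    current_span.set_attribute("credit.score", credit_score)         # illustrative tag name
    check_result = "OK" if credit_score >= 300 else "FAILED"
    current_span.set_attribute("credit.check.result", check_result)  # as in the snippet above
    return check_result
````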
@@ -91,7 +95,7 @@ Once these changes are made, let's run the following script to rebuild the Docke
## Confirm Tag is Captured Successfully

- After a few minutes, return to **Splunk Observability Cloud** and load one of the traces to confirm that the tags were captured successfully:
+ After a few minutes, return to **Splunk Observability Cloud** and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by duration to find the latest traces):

****
- Now that we've captured several tags from our application, lets explore some of the trace data we've captured that include this additional context, and see if we can identify what's causing poor user experience in some cases.
+ Now that we've captured several tags from our application, let's explore some of the trace data we've captured that include this additional context, and see if we can identify what's causing poor user experience in some cases.

## Use Trace Analyzer
@@ -26,6 +26,8 @@ Let's remove the credit score filter and toggle **Errors only** to on, which res
Click on a few of these traces, and look at the tags we captured. Do you notice any patterns?

- If you found a pattern - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and remember what you saw in each one to see if you can identify a pattern.
+ Next, toggle **Errors only** to off, and sort traces by duration. Look at a few of the slowest running traces, and compare them to the fastest running traces. Do you notice any patterns?
+
+ If you found a pattern that explains the slow performance and errors - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and mentally keep track of what you saw, so you can identify a pattern.

Thankfully, **Splunk Observability Cloud** provides a more efficient way to do this, which we'll explore next.
content/en/scenarios/5-understand-impact/5-index-tags.md (10 additions, 10 deletions)
@@ -11,7 +11,7 @@ To use advanced features in **Splunk Observability Cloud** such as **Tag Spotlig
To do this, navigate to **Settings** -> **APM MetricSets**. Then click the **+ New MetricSet** button.

- Let's index the `credit.score.category` tag to start with by providing the following details:
+ Let's index the `credit.score.category` tag by entering the following details (**note**: since everyone in the workshop is using the same organization, the instructor will do this step on your behalf):
@@ -29,37 +29,37 @@ Once analysis is complete, click on the checkmark in the **Actions** column.
Why did we choose to index the `credit.score.category` tag and not the others?

- To understand this, let's review the primary use cases for attributes:
+ To understand this, let's review the primary use cases for tags:

* Filtering
* Grouping

### Filtering

- With the filtering use case, we can use the **Trace Analyzer** capability of **Splunk Observability Cloud** to filter on traces that match a particular attribute value.
+ With the filtering use case, we can use the **Trace Analyzer** capability of **Splunk Observability Cloud** to filter on traces that match a particular tag value.

We saw an example of this earlier, when we filtered on traces where the credit score started with "7".

- Or if a customer called in to complain about slow service, we could use **Trace Analyzer** to locate all traces with that particular customer number.
+ Or if a customer calls in to complain about slow service, we could use **Trace Analyzer** to locate all traces with that particular customer number.

- Attributes used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, **Splunk Observability Cloud** can handle an effectively infinite number of unique attribute values! Filtering using these attributes allows us to rapidly locate the traces of interest.
+ Tags used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, **Splunk Observability Cloud** can handle an effectively infinite number of unique tag values! Filtering using these tags allows us to rapidly locate the traces of interest.

Note that we aren't required to index tags to use them for filtering with **Trace Analyzer**.

### Grouping

- With the grouping use case, we can surface trends for attributes that we collect using the powerful **Tag Spotlight** feature in **Splunk Observability Cloud**, which we'll see in action shortly.
+ With the grouping use case, we can surface trends for tags that we collect using the powerful **Tag Spotlight** feature in **Splunk Observability Cloud**, which we'll see in action shortly.

- Attributes used for grouping use cases should be low to medium-cardinality, with hundreds of unique values.
+ Tags used for grouping use cases should be low to medium-cardinality, with hundreds of unique values.

- For custom attributes to be used with **Tag Spotlight**, they first need to be indexed.
+ For custom tags to be used with **Tag Spotlight**, they first need to be indexed.

We decided to index the `credit.score.category` tag because it has a few distinct values that would be useful for grouping. In contrast, the customer number and credit score tags have hundreds or thousands of unique values, and are more valuable for filtering use cases rather than grouping.
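To make the cardinality distinction concrete, here is a minimal sketch, assuming the OpenTelemetry Python API; the attribute names and values are illustrative:

````python
from opentelemetry import trace

span = trace.get_current_span()

# High-cardinality tag: a unique value per customer. Great for filtering in
# Trace Analyzer, but a poor grouping dimension, so we don't index it.
span.set_attribute("customer.num", "30134241")       # illustrative name and value

# Low-cardinality tag: only a handful of possible values. Ideal for grouping
# with Tag Spotlight, which is why this is the tag we chose to index.
span.set_attribute("credit.score.category", "poor")
````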
## Troubleshooting vs. Monitoring MetricSets

- You may have noticed that, to index this tag, we created something called a **Troubleshooting MetricSet**. It's named this was because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as **Tag Spotlight**.
+ You may have noticed that, to index this tag, we created something called a **Troubleshooting MetricSet**. It's named this way because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as **Tag Spotlight**.

- You may have also noticed that there's another option which we didn't choose called a **Monitoring MetricSet** (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We'll explore this later in the workshop.
+ You may have also noticed that there's another option which we didn't choose called a **Monitoring MetricSet** (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We'll explore this concept later in the workshop.
content/en/scenarios/5-understand-impact/6-use-tags.md (6 additions, 6 deletions)
@@ -20,19 +20,19 @@ With **Tag Spotlight**, we can see 100% of credit score requests that result in
This illustrates the power of **Tag Spotlight**! Finding this pattern would be time-consuming without it, as we'd have to manually look through hundreds of traces to identify the pattern (and even then, there's no guarantee we'd find it).

- We've looked at errors, but what about latency? Let's click on **Latency** near the top of the screen.
+ We've looked at errors, but what about latency? Let's click on **Latency** near the top of the screen to find out.

Here, we can see that requests with a `poor` credit score are running slowly, with P50, P90, and P99 times of around 3 seconds, which is too long for our users to wait, and much slower than other requests.

- We can also see that some requests with an `exceptional` credit score request are running slowly, with P99 times of around 5 seconds, though the P50 and P90 response times are relatively quick.
+ We can also see that some requests with an `exceptional` credit score are running slowly, with P99 times of around 5 seconds, though the P50 response time is relatively quick.

****

## Using Dynamic Service Maps

Now that we know the credit score category associated with the request can impact performance and error rates, let's explore another feature that utilizes indexed tags: **Dynamic Service Maps**.

- With Dynamic Service Maps, we can breakdown a particular service by an attribute. For example, let's click on **APM**, then click **Explore** to view the service map.
+ With Dynamic Service Maps, we can break down a particular service by a tag. For example, let's click on **APM**, then click **Explore** to view the service map.

Click on `creditcheckservice`. Then, on the right-hand menu, click on the drop-down that says **Breakdown**, and select the `credit.score.category` tag.
@@ -44,14 +44,14 @@ This view makes it clear that performance for `good` and `fair` credit scores is
## Summary

- **Tag Spotlight** has uncovered several interesting patterns that we need to explore further:
+ **Tag Spotlight** has uncovered several interesting patterns for the engineers that own this service to explore further:

* Why are all the `impossible` credit score requests resulting in error?
* Why are all the `poor` credit score requests running slowly?
* Why do some of the `exceptional` requests run slowly?

- As an SRE, passing this context to the service owner would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we only told them that the service was "sometimes slow".
+ As an SRE, passing this context to the engineering team would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we simply told them that the service was "sometimes slow".

If you're curious, have a look at the source code for the `creditprocessorservice`. You'll see that requests with impossible, poor, and exceptional credit scores are handled differently, thus resulting in the differences in error rates and latency that we uncovered.

- The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores.
+ The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores, or encounter higher error rates.
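To illustrate that last point, here's a highly simplified sketch of the kind of input-dependent branching described above. This is illustrative pseudologic only, not the actual `creditprocessorservice` source:

````python
import random
import time

def process_credit_check(category: str) -> str:
    # Different inputs take different code paths, which is what produces the
    # error-rate and latency differences we saw in Tag Spotlight
    if category == "impossible":
        raise ValueError("credit check failed")  # every one of these requests errors out
    if category == "poor":
        time.sleep(3)                            # every poor-score request takes a slow path
    elif category == "exceptional" and random.random() < 0.1:
        time.sleep(5)                            # only some exceptional requests hit a slow path
    return "OK"
````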