Commit fc7d02c

updated tagging workshop
1 parent 8952cda commit fc7d02c

File tree

10 files changed (+64, -43 lines)

content/en/scenarios/5-understand-impact/1-build-application.md

Lines changed: 10 additions & 7 deletions

@@ -11,7 +11,7 @@ weight: 1
 For this workshop, we'll be using a microservices-based application. This application is for an online retailer and normally includes more than a dozen services. However, to keep the workshop simple, we'll be focusing on two services used by the retailer as part of their payment processing workflow: the credit check service and the credit processor service.

 ## Pre-requisites
-You will start with an EC2 environment that already has some useful components, but we will perform some [initial steps](#initial-steps) in order to get to the following state:
+You will start with a t2.medium EC2 instance with 20 GB of disk storage, and perform some [initial steps](#initial-steps) in order to get to the following state:
 * Install Kubernetes (k3s) and Docker
 * Deploy the **Splunk distribution of the OpenTelemetry Collector**
 * Build and deploy `creditcheckservice` and `creditprocessorservice`

@@ -33,6 +33,9 @@ cd observability-workshop/workshop/tagging

 # Exit and ssh back to this instance

+# return to the same directory as before
+cd observability-workshop/workshop/tagging
+
 ./2-deploy-otel-collector.sh
 ./3-deploy-creditcheckservice.sh
 ./4-deploy-creditprocessorservice.sh

@@ -41,34 +44,34 @@ cd observability-workshop/workshop/tagging

 ## View your application in Splunk Observability Cloud

-Now that the setup is complete, let's confirm that it's sending data to **Splunk Observability Cloud**.
+Now that the setup is complete, let's confirm that it's sending data to **Splunk Observability Cloud**. Note that when the application is deployed for the first time, it may take a few minutes for the data to appear.

 Navigate to APM, then use the Environment dropdown to select your environment (i.e. `tagging-workshop-name`).

 If everything was deployed correctly, you should see `creditprocessorservice` and `creditcheckservice` displayed in the list of services:

 ![APM Overview](../images/apm_overview.png)

-Click on Explore on the right-hand side to view the service map. We can see that the `creditcheckservice` makes calls to the `creditprocessorservice`, with an average response time of around 3.5 seconds:
+Click on **Explore** on the right-hand side to view the service map. We can see that the `creditcheckservice` makes calls to the `creditprocessorservice`, with an average response time of at least 3 seconds:

 ![Service Map](../images/service_map.png)

-Next, click on Traces on the right-hand side to see the traces captured for this application. You'll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.
+Next, click on **Traces** on the right-hand side to see the traces captured for this application. You'll see that some traces run relatively fast (i.e. just a few milliseconds), whereas others take a few seconds.

 ![Traces](../images/traces.png)

-You'll also notice that some traces have errors:
+If you toggle **Errors only** to `on`, you'll also notice that some traces have errors:

 ![Traces](../images/traces_with_errors.png)

-Sort the traces by duration then click on one of the longer running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the `/runCreditCheck` operation, which is part of the `creditprocessorservice`.
+Toggle **Errors only** back to `off` and sort the traces by duration, then click on one of the longer-running traces. In this example, the trace took five seconds, and we can see that most of the time was spent calling the `/runCreditCheck` operation, which is part of the `creditprocessorservice`.

 ![Long Running Trace](../images/long_running_trace.png)

 Currently, we don't have enough details in our traces to understand why some requests finish in a few milliseconds, and others take several seconds. To provide the best possible customer experience, this will be critical for us to understand.

 We also don't have enough information to understand why some requests result in errors, and others don't. For example, if we look at one of the error traces, we can see that the error occurs when the `creditprocessorservice` attempts to call another service named `otherservice`. But why do some requests result in a call to `otherservice`, and others don't?

-![Long Running Trace](../images/error_trace.png)
+![Trace with Errors](../images/error_trace.png)

 We'll explore these questions and more in the workshop.

content/en/scenarios/5-understand-impact/2-what-are-tags.md

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@ weight: 2

 {{% badge icon="clock" style="primary" %}}3 minutes{{% /badge %}}

-To understand why some requests have errors or slow performance, we'll need to add context to our traces. We'll do this by adding tags.
+To understand why some requests have errors or slow performance, we'll need to add context to our traces. We'll do this by adding tags. But first, let's take a moment to discuss what tags are, and why they're so important for observability.

 ## What are tags?

content/en/scenarios/5-understand-impact/3-capture-tags.md

Lines changed: 15 additions & 11 deletions

@@ -9,19 +9,23 @@ Let's add some tags to our traces, so we can find out why some customers receive

 ## Identify Useful Tags

-We'll start by reviewing the code for the `credit_check` function of `creditcheckservice` (which can be found in the `main.py` file):
+We'll start by reviewing the code for the `credit_check` function of `creditcheckservice` (which can be found in the `/home/ubuntu/observability-workshop/workshop/tagging/creditcheckservice/main.py` file):

 ````
+@app.route('/check')
 def credit_check():
     customerNum = request.args.get('customernum')
-
+
     # Get Credit Score
     creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
+    creditScoreReq.raise_for_status()
     creditScore = int(creditScoreReq.text)
+
     creditScoreCategory = getCreditCategoryFromScore(creditScore)

     # Run Credit Check
     creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
+    creditCheckReq.raise_for_status()
     checkResult = str(creditCheckReq.text)

     return checkResult

@@ -41,42 +45,42 @@ We start by importing the trace module by adding an import statement to t
 import requests
 from flask import Flask, request
 from waitress import serve
-from opentelemetry import trace # <--- ADD THIS
+from opentelemetry import trace # <--- ADDED BY WORKSHOP
 ...
 ````

 Next, we need to get a reference to the current span so we can add an attribute (aka tag) to it:

 ````
 def credit_check():
-    current_span = trace.get_current_span()
+    current_span = trace.get_current_span() # <--- ADDED BY WORKSHOP
     customerNum = request.args.get('customernum')
-    current_span.set_attribute("customer.num", customerNum)
+    current_span.set_attribute("customer.num", customerNum) # <--- ADDED BY WORKSHOP
     ...
 ````

 That was pretty easy, right? Let's capture some more, with the final result looking like this:

 ````
 def credit_check():
-    current_span = trace.get_current_span()
+    current_span = trace.get_current_span() # <--- ADDED BY WORKSHOP
     customerNum = request.args.get('customernum')
-    current_span.set_attribute("customer.num", customerNum)
+    current_span.set_attribute("customer.num", customerNum) # <--- ADDED BY WORKSHOP

     # Get Credit Score
     creditScoreReq = requests.get("http://creditprocessorservice:8899/getScore?customernum=" + customerNum)
     creditScoreReq.raise_for_status()
     creditScore = int(creditScoreReq.text)
-    current_span.set_attribute("credit.score", creditScore)
+    current_span.set_attribute("credit.score", creditScore) # <--- ADDED BY WORKSHOP

     creditScoreCategory = getCreditCategoryFromScore(creditScore)
-    current_span.set_attribute("credit.score.category", creditScoreCategory)
+    current_span.set_attribute("credit.score.category", creditScoreCategory) # <--- ADDED BY WORKSHOP

     # Run Credit Check
     creditCheckReq = requests.get("http://creditprocessorservice:8899/runCreditCheck?customernum=" + str(customerNum) + "&score=" + str(creditScore))
     creditCheckReq.raise_for_status()
     checkResult = str(creditCheckReq.text)
-    current_span.set_attribute("credit.check.result", checkResult)
+    current_span.set_attribute("credit.check.result", checkResult) # <--- ADDED BY WORKSHOP

     return checkResult
 ````

@@ -91,7 +95,7 @@ Once these changes are made, let's run the following script to rebuild the Docke

 ## Confirm Tag is Captured Successfully

-After a few minutes, return to **Splunk Observability Cloud** and load one of the traces to confirm that the tags were captured successfully:
+After a few minutes, return to **Splunk Observability Cloud** and load one of the latest traces to confirm that the tags were captured successfully (hint: sort by duration to find the latest traces):

 **![Trace with Attributes](../images/trace_with_attributes.png)**
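
For reference, the tagging pattern above can also be tried outside the workshop environment. The following is a minimal, self-contained sketch (not the workshop's `main.py`): it assumes only `pip install opentelemetry-sdk`, and swaps the auto-instrumentation used in the workshop for a manually started span plus a console exporter, so the captured tags can be inspected locally. The category thresholds are illustrative stand-ins for `getCreditCategoryFromScore`.

````
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans (and their attributes) to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def get_credit_category(score: int) -> str:
    # Illustrative stand-in for the workshop's getCreditCategoryFromScore()
    return "poor" if score < 580 else "good"

def credit_check(customer_num: str, credit_score: int) -> str:
    # In the workshop, auto-instrumentation creates the span for each Flask
    # request; here we start one manually so the example runs standalone.
    with tracer.start_as_current_span("/check"):
        current_span = trace.get_current_span()
        current_span.set_attribute("customer.num", customer_num)
        current_span.set_attribute("credit.score", credit_score)
        current_span.set_attribute("credit.score.category", get_credit_category(credit_score))
        return "Credit check complete"

if __name__ == "__main__":
    credit_check("30134241", 550)  # the printed span includes all three tags
````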

content/en/scenarios/5-understand-impact/4-explore-trace-data.md

Lines changed: 4 additions & 2 deletions

@@ -6,7 +6,7 @@ weight: 4

 {{% badge icon="clock" style="primary" %}}5 minutes{{% /badge %}}

-Now that we've captured several tags from our application, lets explore some of the trace data we've captured that include this additional context, and see if we can identify what's causing poor user experience in some cases.
+Now that we've captured several tags from our application, let's explore some of the trace data we've captured that includes this additional context, and see if we can identify what's causing poor user experience in some cases.

 ## Use Trace Analyzer

@@ -26,6 +26,8 @@ Let's remove the credit score filter and toggle **Errors only** to on, which res

 Click on a few of these traces, and look at the tags we captured. Do you notice any patterns?

-If you found a pattern - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and remember what you saw in each one to see if you can identify a pattern.
+Next, toggle **Errors only** to off, and sort traces by duration. Look at a few of the slowest running traces, and compare them to the fastest running traces. Do you notice any patterns?
+
+If you found a pattern that explains the slow performance and errors - great job! But keep in mind that this is a difficult way to troubleshoot, as it requires you to look through many traces and mentally keep track of what you saw, so you can identify a pattern.

 Thankfully, **Splunk Observability Cloud** provides a more efficient way to do this, which we'll explore next.

content/en/scenarios/5-understand-impact/5-index-tags.md

Lines changed: 10 additions & 10 deletions

@@ -11,7 +11,7 @@ To use advanced features in **Splunk Observability Cloud** such as **Tag Spotlig

 To do this, navigate to **Settings** -> **APM MetricSets**. Then click the **+ New MetricSet** button.

-Let's index the `credit.score.category` tag to start with by providing the following details:
+Let's index the `credit.score.category` tag by entering the following details (**note**: since everyone in the workshop is using the same organization, the instructor will do this step on your behalf):

 ![Create Troubleshooting MetricSet](../images/create_troubleshooting_metric_set.png)

@@ -29,37 +29,37 @@ Once analysis is complete, click on the checkmark in the **Actions** column.

 Why did we choose to index the `credit.score.category` tag and not the others?

-To understand this, let's review the primary use cases for attributes:
+To understand this, let's review the primary use cases for tags:

 * Filtering
 * Grouping

 ### Filtering

-With the filtering use case, we can use the **Trace Analyzer** capability of **Splunk Observability Cloud** to filter on traces that match a particular attribute value.
+With the filtering use case, we can use the **Trace Analyzer** capability of **Splunk Observability Cloud** to filter on traces that match a particular tag value.

 We saw an example of this earlier, when we filtered on traces where the credit score started with "7".

-Or if a customer called in to complain about slow service, we could use **Trace Analyzer** to locate all traces with that particular customer number.
+Or if a customer calls in to complain about slow service, we could use **Trace Analyzer** to locate all traces with that particular customer number.

-Attributes used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, **Splunk Observability Cloud** can handle an effectively infinite number of unique attribute values! Filtering using these attributes allows us to rapidly locate the traces of interest.
+Tags used for filtering use cases are generally high-cardinality, meaning that there could be thousands or even hundreds of thousands of unique values. In fact, **Splunk Observability Cloud** can handle an effectively infinite number of unique tag values! Filtering using these tags allows us to rapidly locate the traces of interest.

 Note that we aren't required to index tags to use them for filtering with **Trace Analyzer**.

 ### Grouping

-With the grouping use case, we can surface trends for attributes that we collect using the powerful **Tag Spotlight** feature in **Splunk Observability Cloud**, which we'll see in action shortly.
+With the grouping use case, we can surface trends for tags that we collect using the powerful **Tag Spotlight** feature in **Splunk Observability Cloud**, which we'll see in action shortly.

-Attributes used for grouping use cases should be low to medium-cardinality, with hundreds of unique values.
+Tags used for grouping use cases should be low to medium-cardinality, with hundreds of unique values.

-For custom attributes to be used with **Tag Spotlight**, they first need to be indexed.
+For custom tags to be used with **Tag Spotlight**, they first need to be indexed.

 We decided to index the `credit.score.category` tag because it has a few distinct values that would be useful for grouping. In contrast, the customer number and credit score tags have hundreds or thousands of unique values, and are more valuable for filtering use cases rather than grouping.

 ## Troubleshooting vs. Monitoring MetricSets

-You may have noticed that, to index this tag, we created something called a **Troubleshooting MetricSet**. It's named this was because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as **Tag Spotlight**.
+You may have noticed that, to index this tag, we created something called a **Troubleshooting MetricSet**. It's named this way because a Troubleshooting MetricSet, or TMS, allows us to troubleshoot issues with this tag using features such as **Tag Spotlight**.

-You may have also noticed that there's another option which we didn't choose called a **Monitoring MetricSet** (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We'll explore this later in the workshop.
+You may have also noticed that there's another option which we didn't choose called a **Monitoring MetricSet** (or MMS). Monitoring MetricSets go beyond troubleshooting and allow us to use tags for alerting and dashboards. We'll explore this concept later in the workshop.
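
To make the filtering-vs.-grouping distinction concrete, here is a short hypothetical sketch using the same OpenTelemetry API as the workshop code. The attribute names are the ones captured earlier; the function itself is invented for illustration.

````
from opentelemetry import trace

def annotate_credit_request(customer_num: str, score: int, category: str) -> None:
    # Without a configured tracer this returns a no-op span, so the sketch
    # is safe to run standalone.
    span = trace.get_current_span()
    # High-cardinality tags (thousands+ of unique values): best suited for
    # *filtering* in Trace Analyzer; no indexing required.
    span.set_attribute("customer.num", customer_num)
    span.set_attribute("credit.score", score)
    # Low-cardinality tag (a handful of values): best suited for *grouping*
    # in Tag Spotlight, which requires indexing it first (the Troubleshooting
    # MetricSet created above).
    span.set_attribute("credit.score.category", category)
````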

content/en/scenarios/5-understand-impact/6-use-tags.md

Lines changed: 6 additions & 6 deletions

@@ -20,19 +20,19 @@ With **Tag Spotlight**, we can see 100% of credit score requests that result in

 This illustrates the power of **Tag Spotlight**! Finding this pattern would be time-consuming without it, as we'd have to manually look through hundreds of traces to identify the pattern (and even then, there's no guarantee we'd find it).

-We've looked at errors, but what about latency? Let's click on **Latency** near the top of the screen.
+We've looked at errors, but what about latency? Let's click on **Latency** near the top of the screen to find out.

 Here, we can see that requests with a `poor` credit score are running slowly, with P50, P90, and P99 times of around 3 seconds, which is too long for our users to wait, and much slower than other requests.

-We can also see that some requests with an `exceptional` credit score request are running slowly, with P99 times of around 5 seconds, though the P50 and P90 response times are relatively quick.
+We can also see that some requests with an `exceptional` credit score are running slowly, with P99 times of around 5 seconds, though the P50 response time is relatively quick.

 **![Tag Spotlight with Latency](../images/tag_spotlight_latency.png)**

 ## Using Dynamic Service Maps

 Now that we know the credit score category associated with the request can impact performance and error rates, let's explore another feature that utilizes indexed tags: **Dynamic Service Maps**.

-With Dynamic Service Maps, we can breakdown a particular service by an attribute. For example, let's click on **APM**, then click **Explore** to view the service map.
+With Dynamic Service Maps, we can break down a particular service by a tag. For example, let's click on **APM**, then click **Explore** to view the service map.

 Click on `creditcheckservice`. Then, on the right-hand menu, click on the drop-down that says **Breakdown**, and select the `credit.score.category` tag.

@@ -44,14 +44,14 @@ This view makes it clear that performance for `good` and `fair` credit scores is

 ## Summary

-**Tag Spotlight** has uncovered several interesting patterns that we need to explore further:
+**Tag Spotlight** has uncovered several interesting patterns for the engineers who own this service to explore further:

 * Why are all the `impossible` credit score requests resulting in error?
 * Why are all the `poor` credit score requests running slowly?
 * Why do some of the `exceptional` requests run slowly?

-As an SRE, passing this context to the service owner would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we only told them that the service was "sometimes slow".
+As an SRE, passing this context to the engineering team would be extremely helpful for their investigation, as it would allow them to track down the issue much more quickly than if we simply told them that the service was "sometimes slow".

 If you're curious, have a look at the source code for the `creditprocessorservice`. You'll see that requests with impossible, poor, and exceptional credit scores are handled differently, thus resulting in the differences in error rates and latency that we uncovered.

-The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores.
+The behavior we saw with our application is typical for modern cloud-native applications, where different inputs passed to a service lead to different code paths, some of which result in slower performance or errors. For example, in a real credit check service, requests resulting in low credit scores may be sent to another downstream service to further evaluate risk, and may perform more slowly than requests resulting in higher scores, or encounter higher error rates.
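
To illustrate that last point, the input-dependent code paths described above might look something like the following hypothetical sketch. The thresholds and behavior here are invented for illustration; see the actual `creditprocessorservice` source for the real logic.

````
import time

def categorize(score: int) -> str:
    # Hypothetical thresholds; the workshop code defines the real mapping.
    if score < 300 or score > 850:
        return "impossible"
    if score < 580:
        return "poor"
    if score >= 800:
        return "exceptional"
    return "good"

def run_credit_check(score: int) -> str:
    category = categorize(score)
    if category == "impossible":
        # Error path: e.g. a failing call to a downstream service.
        raise RuntimeError("credit check failed for out-of-range score")
    if category == "poor":
        time.sleep(3)  # slow path: extra risk evaluation
    return "Credit check complete: " + category
````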
