exercises/manual-instrumentation-metrics-java/initial/todobackend-springboot/src/main/java/io/novatec/todobackend/TodobackendApplication.java
exercises/manual-instrumentation-metrics-java/solution/todobackend-springboot/src/main/java/io/novatec/todobackend/TodobackendApplication.java
		.setDescription("How many times an endpoint has been invoked")
		.setUnit("requests")
		.build();
}
insights from the data. Therefore, it is essential to carefully consider the dimensions
to ensure that they are both informative and manageable within the constraints of the monitoring system.
### Instruments
We've used a simple counter instrument to generate a metric. The API, however, is capable of more.
Synchronous instruments are invoked in line with the application code, while asynchronous instruments register
a callback function that is invoked on demand. This allows for more efficient and flexible metric collection,
especially in scenarios where the metric value is expensive to compute or when the metric value changes infrequently.
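The distinction can be sketched in plain Java. Note that this is an illustrative model, not the OpenTelemetry API: a synchronous counter is bumped inline by the application code, while an asynchronous instrument only registers a supplier that a reader invokes when it collects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

public class InstrumentStyles {
    // Synchronous style: application code records the measurement inline.
    static final AtomicLong requestCounter = new AtomicLong();

    // Asynchronous style: a callback is registered once and only invoked
    // on demand, e.g. on each collection cycle of a metric reader.
    static final List<DoubleSupplier> callbacks = new ArrayList<>();

    static void handleRequest() {
        requestCounter.incrementAndGet(); // recorded in line with the request
    }

    static double[] collect() {
        // An expensive computation inside a callback runs only once per
        // collection, no matter how many requests were handled in between.
        return callbacks.stream().mapToDouble(DoubleSupplier::getAsDouble).toArray();
    }

    public static void main(String[] args) {
        callbacks.add(() -> 0.42); // e.g. a CPU load probe
        handleRequest();
        handleRequest();
        System.out.println(requestCounter.get()); // 2
        System.out.println(collect()[0]);         // 0.42
    }
}
```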
### Measure golden signals

{{< figure src="images/resource_workload_analysis.PNG" width=600 caption="workload and resource analysis" >}}
Now, let's put our understanding of the metrics signal to use.
Before we do that, we must address an important question: *What* do we measure?
Unfortunately, the answer is anything but simple.
Due to the vast number of events within a system and the many statistical measures we could calculate, there are nearly infinite things to measure.
A catch-all approach is cost-prohibitive in terms of computation and storage, increases noise, which makes it harder to find important signals, leads to alert fatigue, and so on.
The term metric refers to a statistic that we consciously choose to collect because we deem it to be *important*.
Important is a deliberately vague term, because it means different things to different people.
A system administrator typically approaches an investigation by looking at the utilization or saturation of physical system resources.
A developer is usually more interested in how the application responds, looking at the applied workload, response times, error rates, etc.
In contrast, a customer-centric role might look at more high-level indicators related to contractual obligations (e.g. as defined by an SLA), business outcomes, and so on.
The details of different monitoring perspectives and methodologies (such as [USE](https://www.brendangregg.com/usemethod.html) and RED) are beyond the scope of this lab.
However, the [four golden signals](https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals) of observability often provide a good starting point:
- **Traffic**: volume of requests handled by the system
- **Errors**: rate of failed requests
- **Latency**: the amount of time it takes to serve a request
- **Saturation**: how much of a resource is being consumed at a given time
We have already shown how to measure the total amount of traffic in the previous chapters.
Thus, let's continue with the remaining signals.

#### Error rate
As a next step, let's track the error rate of creating new todos. Create a separate Counter instrument.
Ultimately, the decision of what constitutes a failed request is up to us.
In this example, we'll simply refer to the name of the todo.

First, add another global variable to the class:
```java { title="TodobackendApplication.java" }
private LongCounter errorCounter;
```
Initialize it in the constructor of the class:
```java { title="TodobackendApplication.java" }
public TodobackendApplication(OpenTelemetry openTelemetry) {
	// ... existing meter and counter setup ...
	this.errorCounter = meter // assumes the Meter created earlier in this lab
		.counterBuilder("todobackend.requests.errors") // instrument name is illustrative
		.setDescription("How many times an error occurred")
		.setUnit("requests")
		.build();
}
```
Then include the `errorCounter` inside `someInternalMethod` like this:
```java { title="TodobackendApplication.java" }
String someInternalMethod(String todo) {

	//...

	if (todo.equals("fail")) {
		errorCounter.add(1);
		System.out.println("Failing ...");
		throw new RuntimeException();
	}

	return todo;
}
```
Restart the app. When sending a fail request, you will see the error counter in the log output:
```sh
curl -XPOST localhost:8080/todos/fail; echo
```
#### Latency
The time it takes a service to process a request is a crucial indicator of potential problems.
The tracing lab showed that spans contain timestamps that measure the duration of an operation.
Traces allow us to analyze latency in a specific transaction.
However, in practice, we often want to monitor the overall latency for a given service.
While it is possible to compute this from span metadata, converting between telemetry signals is not very practical.
For example, since capturing traces for every request is resource-intensive, we might want to use
sampling to reduce overhead.
Depending on the strategy, sampling may increase the probability that outlier events are missed.
Therefore, we typically analyze latency via a histogram.
Histograms are ideal for this because they represent a frequency distribution across many requests.
They allow us to divide a set of data points into percentage-based segments, commonly known as percentiles.
For example, the 95th percentile latency (P95) represents the value below which 95% of response times fall.
A significant gap between P50 and higher percentiles suggests that a small percentage of requests experience
longer delays.
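To make the percentile idea concrete, here is a small, self-contained sketch using the nearest-rank method; the class name and sample data are hypothetical, not part of the lab code.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    // Nearest-rank percentile: the smallest value such that at least
    // p percent of the measurements are less than or equal to it.
    static double percentile(double[] latencies, double p) {
        double[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // 19 fast responses and one slow outlier (milliseconds)
        double[] ms = new double[20];
        Arrays.fill(ms, 0, 19, 10.0);
        ms[19] = 2000.0;

        System.out.println(percentile(ms, 50)); // 10.0
        System.out.println(percentile(ms, 99)); // 2000.0 -- the gap reveals the outlier
    }
}
```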
A major challenge is that there is no unified definition of how to measure latency.
We could measure the time a service spends processing application code, the time it takes to get a response
from a remote service, and so on.
To interpret measurements correctly, it is vital to have information on what was measured.

Let's use the meter to create a Histogram instrument.
Refer to the semantic conventions for [HTTP Metrics](https://opentelemetry.io/docs/specs/semconv/http/http-metrics/) for an instrument name and preferred unit of measurement.
To measure the time it took to serve the request, we'll create a timestamp at the beginning of the method
and calculate the difference from the timestamp at the end of the method.
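As a plain-Java illustration of this timestamp pattern — the class and the static stand-in for the histogram are hypothetical; in the lab the computed difference would be passed to `requestDuration.record(...)`:

```java
public class DurationSketch {
    // Hypothetical stand-in for the histogram instrument.
    static long lastRecordedMs = -1;

    static String timedMethod(String todo) {
        long start = System.nanoTime();  // timestamp at the beginning of the method
        String result = todo.trim();     // stands in for the real request handling
        long durationMs = (System.nanoTime() - start) / 1_000_000;
        lastRecordedMs = durationMs;     // record the difference at the end of the method
        return result;
    }

    public static void main(String[] args) {
        timedMethod("  buy milk  ");
        System.out.println("recorded " + lastRecordedMs + " ms");
    }
}
```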
First, add a global variable:
```java { title="TodobackendApplication.java" }
private LongHistogram requestDuration;
```

Then initialize the instrument in the constructor:
```java { title="TodobackendApplication.java" }
public TodobackendApplication(OpenTelemetry openTelemetry) {
	// ... existing meter and counter setup ...
	this.requestDuration = meter // assumes the Meter created earlier in this lab
		.histogramBuilder("http.server.request.duration") // name from the HTTP semantic conventions
		.ofLongs()
		.setUnit("ms")
		.build();
}
```

Histograms do not store their values explicitly, but implicitly through aggregations (sum, count, min, max) and buckets.
Normally, we are not interested in the exact measured values, but the boundaries within which they lie.
Thus, we define buckets via these (upper) boundaries and count how many values are measured
within these buckets. The boundaries are inclusive.

In the example above, we can read that there was one request taking between 0 and 5 ms,
another request taking between 75 and 100 ms, and two requests taking between 1000 and 2500 ms.

**Note:** Since the first upper boundary is 0.0, the first bucket is reserved only for the value 0.
Furthermore, there are actually 16 buckets for 15 upper boundaries. There is always one implicit upper boundary for values
exceeding the last explicit boundary. In this case, values greater than 10000.
You could call this last bucket the +Inf bucket.
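The bucketing logic described in the note can be sketched as follows — a standalone illustration with default-style boundaries, not the SDK's internal code:

```java
import java.util.Arrays;

public class BucketDemo {
    // Default-style explicit upper boundaries (ms); each is an inclusive upper bound.
    static final double[] BOUNDS = {0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000};

    static int bucketIndex(double value) {
        for (int i = 0; i < BOUNDS.length; i++) {
            if (value <= BOUNDS[i]) return i; // boundaries are inclusive
        }
        return BOUNDS.length; // implicit +Inf bucket for values beyond the last boundary
    }

    public static void main(String[] args) {
        long[] counts = new long[BOUNDS.length + 1]; // 16 buckets for 15 boundaries
        for (double v : new double[]{3, 80, 1500, 2400}) {
            counts[bucketIndex(v)]++;
        }
        // one value lands in (0, 5], one in (75, 100], two in (1000, 2500]
        System.out.println(Arrays.toString(counts));
    }
}
```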
#### Saturation
All the previous metrics have been request-oriented.
For completeness, we'll also capture some resource-oriented metrics.
According to Google's SRE book, the fourth golden signal is called "[saturation](https://sre.google/sre-book/monitoring-distributed-systems/#saturation)".
Unfortunately, the terminology is not well-defined.
Brendan Gregg, a renowned expert in the field, defines saturation as the amount of work that a resource is unable to service.
In other words, saturation is a backlog of unprocessed work.
An example of a saturation metric would be the length of a queue.
In contrast, utilization refers to the average time that a resource was busy servicing work.
We usually measure utilization as a percentage over time.
For example, 100% utilization means no more work can be accepted.
If we go strictly by definition, both terms refer to separate concepts.
One can lead to the other, but it doesn't have to.
It would be perfectly possible for a resource to experience high utilization without any saturation.
However, Google's definition of saturation, confusingly, resembles utilization.
Let's put the matter of terminology aside.
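A toy simulation illustrates the difference (hypothetical code, not part of the lab): a worker services one unit of work per tick; utilization is the fraction of busy ticks, while saturation is the backlog left in the queue. A balanced load yields full utilization with zero saturation; an overload yields full utilization and a growing backlog.

```java
public class UtilizationVsSaturation {
    // Simulate a worker that services one unit of work per tick.
    // Returns {busyTicks, backlog}: utilization = busyTicks / ticks,
    // saturation = backlog (work left waiting in the queue).
    static int[] simulate(int arrivalsPerTick, int ticks) {
        int queued = 0, busyTicks = 0;
        for (int t = 0; t < ticks; t++) {
            queued += arrivalsPerTick;
            if (queued > 0) {
                queued--; // service one unit this tick
                busyTicks++;
            }
        }
        return new int[]{busyTicks, queued};
    }

    public static void main(String[] args) {
        int[] balanced = simulate(1, 10); // fully utilized, no backlog
        int[] overload = simulate(2, 10); // fully utilized AND saturated
        System.out.println("balanced: utilization=" + balanced[0] / 10.0 + " backlog=" + balanced[1]);
        System.out.println("overload: utilization=" + overload[0] / 10.0 + " backlog=" + overload[1]);
    }
}
```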
Let's measure the system CPU utilization. There is already a method `getCpuLoad` prepared,
which allows us to record values via a gauge.
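For reference, a typical `getCpuLoad` implementation might use the `com.sun.management` extension of the platform MXBean (available on standard JDKs since 14); the lab's prepared method may differ in detail.

```java
import java.lang.management.ManagementFactory;

public class CpuLoadProbe {
    // One possible implementation of a getCpuLoad helper. The cast assumes a
    // JVM whose platform bean implements the com.sun.management extension.
    static double getCpuLoad() {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        // Recent system CPU load in [0.0, 1.0]; negative if unavailable.
        return os.getCpuLoad();
    }

    public static void main(String[] args) {
        System.out.println(getCpuLoad());
    }
}
```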
Add another global variable:
```java { title="TodobackendApplication.java" }
private ObservableDoubleGauge cpuLoad;
```
Then initialize the instrument in the constructor. This time we will use a callback function to record values.
This callback function will be called every time the `MetricReader` observes the gauge instrument.
As mentioned above, we have already configured a reading interval of 10s.
```java { title="TodobackendApplication.java" }
public TodobackendApplication(OpenTelemetry openTelemetry) {
	// ... existing meter and instrument setup ...
	this.cpuLoad = meter // assumes the Meter created earlier in this lab
		.gaugeBuilder("system.cpu.utilization") // instrument name is illustrative
		.buildWithCallback(measurement -> measurement.record(getCpuLoad()));
}
```