Commit 97c47de

Baseline: Support query baseline with MQE and use in the Alarm Kernel. (#13024)
1 parent d6bcecf commit 97c47de

57 files changed, +1620 -112 lines changed

docs/en/api/metrics-query-expression.md

Lines changed: 29 additions & 1 deletion
@@ -432,7 +432,7 @@ then the expression result is:
 V(T1)-V(T1-2), V(T2)-V(T1-1), V(T3)-V(T1)
 ```

-**Note**:
+**Notice**
 * If the calculated metric value is empty, the result will be empty. Assume in the T3 point, the increase value = V(T3)-V(T1), If the metric V(T3) or V(T1) is empty, the result value in T3 will be empty.

 ### Result Type
@@ -494,6 +494,34 @@ metric{label1='a', label2='2c'}
 metric{label1='a', label2='2a'}
 ```

+### Baseline Operation
+Baseline Operation takes an expression and gets the baseline predicted values of the input metric.
+
+Expression:
+```text
+baseline(Expression, <baseline_type>)
+```
+
+- `baseline_type` is the type of the baseline predicted value. The type can be `value`, `upper`, or `lower`.
+
+For example:
+If we want to get the baseline predicted `upper` values of the `service_resp_time` metric, we can use the following expression:
+```text
+baseline(service_resp_time, upper)
+```
+
+**Notice**:
+- This feature requires enabling the `baseline module` and deploying a baseline service. The baseline service should implement the protocol defined in [baseline.proto](../../../oap-server/metrics-baseline/src/main/proto/baseline.proto).
+  Otherwise, the result will be empty.
+- The baseline operation requires the relevant metrics to be declared through the baseline service.
+  Otherwise, the result will be empty, which means there is no baseline or predicted value.
+- For now, predictions are made for each hour,
+  and the predicted values provided within this baseline are at a minute-level granularity.
+  As a result, for CPM (calls per minute), when the query step is `MINUTE` and the duration is a full hour, the returned values are the same in every minute of that hour.
+
+### Result Type
+TIME_SERIES_VALUES.
+
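The last notice in the Baseline Operation section (one prediction per hour, returned at minute granularity) can be pictured with a minimal Java sketch. It assumes minute time buckets encoded as `yyyyMMddHHmm` and hour buckets as `yyyyMMddHH`; the class and method names are hypothetical and are not part of the OAP code base.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the OAP implementation.
final class HourlyBaselineSketch {
    /**
     * Spreads per-hour predicted values over the minute buckets of a query.
     * Assumption: minute buckets look like yyyyMMddHHmm, hour buckets like yyyyMMddHH.
     */
    static Map<Long, Long> expandToMinutes(Map<Long, Long> predictedByHour,
                                           long startMinuteBucket,
                                           int minutes) {
        // Simple bucket arithmetic below is only valid inside a single hour.
        if (minutes < 0 || startMinuteBucket % 100 + minutes > 60) {
            throw new IllegalArgumentException("range must stay within one hour");
        }
        Map<Long, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < minutes; i++) {
            long minuteBucket = startMinuteBucket + i;
            long hourBucket = minuteBucket / 100; // drop the mm digits
            // A missing prediction stays null, mirroring the "empty result" notices.
            result.put(minuteBucket, predictedByHour.get(hourBucket));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Long> byHour = Map.of(2025010110L, 120L);
        // Every minute bucket of that hour carries the same predicted value.
        System.out.println(expandToMinutes(byHour, 202501011000L, 60));
    }
}
```
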
 ## Expression Query Example
 ### Labeled Value Metrics
 ```text

docs/en/changes/changes.md

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@
 * BaseLine: Support query baseline metrics names.
 * Add `Get Node List in the Cluster` API.
 * Add type descriptor when converting Envoy logs to JSON for persistence, to avoid conversion error.
+* Baseline: Support query baseline with MQE and use in the Alarm Rule.

 #### UI

docs/en/setup/backend/backend-alarm.md

Lines changed: 15 additions & 0 deletions
@@ -125,6 +125,21 @@ Currently, metrics from the **Service**, **Service Instance**, **Endpoint**, **S

 Submit an issue or a pull request if you want to support any other scopes in Alarm.

+### Use the Baseline Predicted Value to trigger the Alarm
+Since 10.2.0, SkyWalking supports using the baseline predicted value in the alarm rule expression.
+For the MQE syntax, refer to [Baseline Operation](../../api/metrics-query-expression.md#baseline-operation).
+
+For example, the following rule compares the service response time with the baseline predicted `upper` value in each time bucket, and
+triggers the alarm when the service response time is higher than the baseline predicted value in more than 3 of the last 10 minutes.
+
+```yaml
+rules:
+  service_resp_time_rule:
+    expression: sum(service_resp_time > baseline(service_resp_time, upper)) > 3
+    period: 10
+    message: Service {name} response time is higher than the baseline predicted value in more than 3 minutes of the last 10 minutes.
+```
+
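To make the arithmetic of the rule expression `sum(service_resp_time > baseline(service_resp_time, upper)) > 3` concrete, here is a minimal Java sketch. It is not the Alarm Kernel code: the class and method names are hypothetical, and skipping minutes that have no baseline value is an assumption made only for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the rule's arithmetic; not the Alarm Kernel.
final class BaselineAlarmSketch {
    /**
     * Counts the minutes whose observed value exceeds the predicted upper
     * baseline, then applies the "> threshold" check of the rule expression.
     * Assumption: a minute without a baseline value (null) is not counted.
     */
    static boolean shouldTrigger(List<Long> observed, List<Long> upperBaseline, int threshold) {
        if (observed.size() != upperBaseline.size()) {
            throw new IllegalArgumentException("both series must cover the same time buckets");
        }
        int breaches = 0;
        for (int i = 0; i < observed.size(); i++) {
            Long upper = upperBaseline.get(i);
            if (upper != null && observed.get(i) > upper) {
                breaches++; // the comparison contributes 1 for this minute
            }
        }
        return breaches > threshold;
    }

    public static void main(String[] args) {
        List<Long> upper = Collections.nCopies(10, 100L);
        List<Long> observed = Arrays.asList(150L, 90L, 120L, 80L, 130L, 70L, 110L, 60L, 50L, 40L);
        // Four of the ten minutes exceed the baseline, and 4 > 3.
        System.out.println(shouldTrigger(observed, upper, 3));
    }
}
```

Because the expression uses `> 3`, exactly three breaching minutes within the period do not satisfy the condition in this sketch.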
 ## Hooks
 Hooks are a way to send alarm messages to the outside world. SkyWalking supports multiple hooks of the same type, each hook can support different configurations.
 For example, you can configure two Slack hooks, one named `default` and set `is-default: true` means this hook will apply on all `Alarm Rules` **without config** `hooks`.

oap-server/metrics-baseline/src/main/java/org/apache/skywalking/oap/server/baseline/service/BaselineQueryService.java

Lines changed: 11 additions & 2 deletions
@@ -18,19 +18,28 @@

 package org.apache.skywalking.oap.server.baseline.service;

+import java.util.Map;
 import org.apache.skywalking.oap.server.library.module.Service;

 import java.util.List;

 public interface BaselineQueryService extends Service {
     /**
      * query supported query baseline metrics names
+     *
      * @return
      */
     List<String> querySupportedMetrics();

     /**
-     * query predict metrics
+     * query predicted metrics
      */
-    List<PredictServiceMetrics> queryPredictMetrics(List<ServiceMetrics> serviceMetrics, long startTimeBucket, long endTimeBucket);
+    List<PredictServiceMetrics> queryPredictMetrics(List<ServiceMetrics> serviceMetrics,
+                                                    long startTimeBucket,
+                                                    long endTimeBucket);
+
+    /**
+     * query predicted metrics from cache, return all predicted metrics for the given service name and time bucket hour
+     */
+    Map<String, PredictServiceMetrics.PredictMetricsValue> queryPredictMetricsFromCache(String serviceName, String timeBucketHour);
 }

oap-server/metrics-baseline/src/main/java/org/apache/skywalking/oap/server/baseline/service/BaselineQueryServiceImpl.java

Lines changed: 67 additions & 0 deletions
@@ -18,8 +18,13 @@

 package org.apache.skywalking.oap.server.baseline.service;

+import com.google.common.cache.Cache;
+import com.google.common.cache.CacheBuilder;
 import com.google.protobuf.Empty;
 import io.grpc.ManagedChannel;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
 import lombok.extern.slf4j.Slf4j;
 import org.apache.skywalking.apm.baseline.v3.AlarmBaselineMetricPrediction;
 import org.apache.skywalking.apm.baseline.v3.AlarmBaselineMetricsNames;
@@ -29,7 +34,11 @@
 import org.apache.skywalking.apm.baseline.v3.AlarmBaselineServiceMetricName;
 import org.apache.skywalking.apm.baseline.v3.KeyStringValuePair;
 import org.apache.skywalking.apm.baseline.v3.TimeBucketStep;
+import org.apache.skywalking.oap.server.core.Const;
+import org.apache.skywalking.oap.server.core.analysis.DownSampling;
+import org.apache.skywalking.oap.server.core.analysis.TimeBucket;
 import org.apache.skywalking.oap.server.library.client.grpc.GRPCClient;
+import org.apache.skywalking.oap.server.library.util.CollectionUtils;
 import org.apache.skywalking.oap.server.library.util.StringUtil;

 import java.util.ArrayList;
@@ -41,8 +50,12 @@
 @Slf4j
 public class BaselineQueryServiceImpl implements BaselineQueryService {
     private AlarmBaselineServiceGrpc.AlarmBaselineServiceBlockingStub stub;
+    private final Cache<String/*timeBucket,serviceName*/, Map<String/*metricName*/, PredictServiceMetrics.PredictMetricsValue>> baselineCache;

     public BaselineQueryServiceImpl(String addr, int port) {
+        this.baselineCache = CacheBuilder.newBuilder()
+                                         .expireAfterAccess(1, TimeUnit.HOURS)
+                                         .build();
         if (StringUtil.isEmpty(addr) || port <= 0) {
             return;
         }
@@ -55,6 +68,7 @@ public BaselineQueryServiceImpl(String addr, int port) {
     @Override
     public List<String> querySupportedMetrics() {
         if (stub == null) {
+            log.warn("Baseline service is not set up, return empty list.");
             return Collections.emptyList();
         }

@@ -66,6 +80,7 @@ public List<String> querySupportedMetrics() {

     public List<PredictServiceMetrics> queryPredictMetrics(List<ServiceMetrics> serviceMetrics, long startTimeBucket, long endTimeBucket) {
         if (stub == null) {
+            log.warn("Baseline service is not set up, return empty baseline values.");
             return Collections.emptyList();
         }

@@ -77,6 +92,58 @@ public List<PredictServiceMetrics> queryPredictMetrics(List<ServiceMetrics> serv
         return Collections.emptyList();
     }

+    public Map<String, PredictServiceMetrics.PredictMetricsValue> queryPredictMetricsFromCache(String serviceName,
+                                                                                               String timeBucketHour) {
+        if (stub == null) {
+            log.warn("Baseline service is not set up, return empty baseline values.");
+            return Collections.emptyMap();
+        }
+        String key = timeBucketHour + Const.COMMA + serviceName;
+        Map<String, PredictServiceMetrics.PredictMetricsValue> baselineValues = this.baselineCache.asMap().get(key);
+
+        if (CollectionUtils.isNotEmpty(baselineValues)) {
+            return baselineValues;
+        }
+        //reload all metrics and timeBucket baseline values for this service
+        List<String> metrics = querySupportedMetrics();
+        ServiceMetrics serviceMetrics = ServiceMetrics.builder()
+                                                      .serviceName(serviceName)
+                                                      .metricsNames(metrics)
+                                                      .build();
+        //todo: need config?
+        long startTimeBucket = TimeBucket.getTimeBucket(
+            System.currentTimeMillis() - TimeUnit.HOURS.toMillis(24), DownSampling.Hour);
+        long endTimeBucket = TimeBucket.getTimeBucket(
+            System.currentTimeMillis() + TimeUnit.HOURS.toMillis(24), DownSampling.Hour);
+        List<PredictServiceMetrics> predictServiceMetricsList = queryPredictMetrics(
+            Collections.singletonList(serviceMetrics), startTimeBucket, endTimeBucket);
+        if (CollectionUtils.isEmpty(predictServiceMetricsList)) {
+            return Collections.emptyMap();
+        }
+        for (String metricName : metrics) {
+            for (PredictServiceMetrics predictServiceMetrics : predictServiceMetricsList) {
+                List<PredictServiceMetrics.PredictMetricsValue> predictMetricsValues = predictServiceMetrics.getMetricsValues()
+                                                                                                            .get(
+                                                                                                                metricName);
+                if (CollectionUtils.isEmpty(predictMetricsValues)) {
+                    continue;
+                }
+                for (PredictServiceMetrics.PredictMetricsValue predictMetricsValue : predictMetricsValues) {
+                    if (predictMetricsValue == null) {
+                        continue;
+                    }
+                    this.baselineCache.asMap()
+                                      .computeIfAbsent(
+                                          predictMetricsValue.getTimeBucket() + Const.COMMA + serviceName,
+                                          k -> new HashMap<>()
+                                      )
+                                      .put(metricName, predictMetricsValue);
+                }
+            }
+        }
+        return this.baselineCache.asMap().getOrDefault(key, Collections.emptyMap());
+    }
+
     private List<PredictServiceMetrics> queryPredictMetrics0(List<ServiceMetrics> serviceMetrics, long startTimeBucket, long endTimeBucket) {
         // building request
         final AlarmBaselineRequest.Builder request = AlarmBaselineRequest.newBuilder();
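
The lookup flow of `queryPredictMetricsFromCache` (check the `timeBucket,serviceName` key, reload all predictions for the service on a miss, then read the key again) can be sketched in isolation. The sketch below is a simplified stand-in, not the committed code: it swaps the Guava `Cache` for a plain `ConcurrentHashMap` with no expiry, and `Prediction` and the `loader` function are hypothetical names introduced only for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the caching flow; no expiry is modeled.
final class BaselineCacheSketch {
    /** Stand-in for PredictServiceMetrics.PredictMetricsValue. */
    static final class Prediction {
        final String metricName;
        final long timeBucket;
        final long value;

        Prediction(String metricName, long timeBucket, long value) {
            this.metricName = metricName;
            this.timeBucket = timeBucket;
            this.value = value;
        }
    }

    // key: "timeBucket,serviceName" -> (metricName -> prediction)
    private final Map<String, Map<String, Prediction>> cache = new ConcurrentHashMap<>();
    // Assumed loader: returns every prediction available for a service name.
    private final Function<String, List<Prediction>> loader;

    BaselineCacheSketch(Function<String, List<Prediction>> loader) {
        this.loader = loader;
    }

    Map<String, Prediction> get(String serviceName, String timeBucketHour) {
        String key = timeBucketHour + "," + serviceName;
        Map<String, Prediction> hit = cache.get(key);
        if (hit != null && !hit.isEmpty()) {
            return hit;
        }
        // Miss: reload all predictions for this service and index them by bucket.
        for (Prediction p : loader.apply(serviceName)) {
            cache.computeIfAbsent(p.timeBucket + "," + serviceName, k -> new HashMap<>())
                 .put(p.metricName, p);
        }
        return cache.getOrDefault(key, Collections.emptyMap());
    }

    public static void main(String[] args) {
        BaselineCacheSketch sketch = new BaselineCacheSketch(
            service -> List.of(new Prediction("service_resp_time", 2025010110L, 120L)));
        System.out.println(sketch.get("demo-service", "2025010110").keySet());
    }
}
```

In this sketch, as in the diff, a key that has no prediction after the reload is not stored, so each lookup of such a key calls the loader again.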

oap-server/mqe-grammar/src/main/antlr4/org/apache/skywalking/mqe/rt/grammar/MQELexer.g4

Lines changed: 5 additions & 0 deletions
@@ -98,6 +98,11 @@ ATTR3: 'attr3';
 ATTR4: 'attr4';
 ATTR5: 'attr5';

+BASELINE: 'baseline';
+VALUE: 'value';
+UPPER: 'upper';
+LOWER: 'lower';
+
 // Literals
 INTEGER: Digit+;
 DECIMAL: Digit+ DOT Digit+;

oap-server/mqe-grammar/src/main/antlr4/org/apache/skywalking/mqe/rt/grammar/MQEParser.g4

Lines changed: 4 additions & 0 deletions
@@ -40,6 +40,7 @@ expression
     | aggregateLabels L_PAREN expression COMMA aggregateLabelsFunc R_PAREN #aggregateLabelsOp
     | sort_values L_PAREN expression (COMMA INTEGER)? COMMA order R_PAREN #sortValuesOP
     | sort_label_values L_PAREN expression COMMA order COMMA labelNameList R_PAREN #sortLabelValuesOP
+    | baseline L_PAREN metric COMMA baseline_type R_PAREN #baselineOP
     ;

 expressionList
@@ -110,3 +111,6 @@ attributeName:
     ATTR0 | ATTR1 | ATTR2 | ATTR3 | ATTR4 | ATTR5;
 attribute: attributeName (EQ | NEQ) VALUE_STRING;
 attributeList: attribute (COMMA attribute)*;
+
+baseline: BASELINE;
+baseline_type: VALUE | UPPER | LOWER;
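
The new parser rule `baseline L_PAREN metric COMMA baseline_type R_PAREN` takes a `metric` as its first argument and one of three keywords as its second. As a rough illustration of that shape only, the Java sketch below checks a string with a regular expression. It is not the ANTLR-generated parser: the class name is hypothetical, and treating a metric as a bare identifier (no labels, no nested expressions) is a simplifying assumption.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the rule's shape; not the generated MQE parser.
final class BaselineExpressionSketch {
    enum BaselineType { VALUE, UPPER, LOWER }

    // Rough textual equivalent of: baseline ( metric , baseline_type )
    // Assumption: the metric is a plain identifier without labels.
    private static final Pattern BASELINE_CALL = Pattern.compile(
        "\\s*baseline\\s*\\(\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*,\\s*(value|upper|lower)\\s*\\)\\s*");

    /** Returns the baseline type if the text has the expected shape, otherwise empty. */
    static Optional<BaselineType> parseType(String expression) {
        Matcher matcher = BASELINE_CALL.matcher(expression);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(BaselineType.valueOf(matcher.group(2).toUpperCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(parseType("baseline(service_resp_time, upper)"));
        System.out.println(parseType("baseline(service_resp_time, max)"));
    }
}
```
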

oap-server/mqe-rt/pom.xml

Lines changed: 5 additions & 0 deletions
@@ -44,5 +44,10 @@
             <artifactId>mqe-grammar</artifactId>
             <version>${project.version}</version>
         </dependency>
+        <dependency>
+            <groupId>org.apache.skywalking</groupId>
+            <artifactId>metrics-baseline</artifactId>
+            <version>${project.version}</version>
+        </dependency>
     </dependencies>
 </project>
