Commit 7f369ce
more examples
1 parent 384f0e3 commit 7f369ce

19 files changed: +592 -0 lines changed
docs/reference/data-analysis/aggregations/search-aggregations-bucket-geohexgrid-aggregation.md
Lines changed: 10 additions & 0 deletions

@@ -85,6 +85,8 @@ Response:
 }
 ```
 
+% TESTRESPONSE[s/\.\.\./"took": $body.took,"_shards": $body._shards,"hits":$body.hits,"timed_out":false,/]
+
 
 ## High-precision requests [geohexgrid-high-precision]
 
@@ -118,6 +120,8 @@ POST /museums/_search?size=0
 }
 ```
 
+% TEST[continued]
+
 Response:
 
 ```console-result
@@ -147,6 +151,8 @@ Response:
 }
 ```
 
+% TESTRESPONSE[s/\.\.\./"took": $body.took,"_shards": $body._shards,"hits":$body.hits,"timed_out":false,/]
+
 
 ## Requests with additional bounding box filtering [geohexgrid-addtl-bounding-box-filtering]
 
@@ -172,6 +178,8 @@ POST /museums/_search?size=0
 }
 ```
 
+% TEST[continued]
+
 Response:
 
 ```console-result
@@ -198,6 +206,8 @@ Response:
 }
 ```
 
+% TESTRESPONSE[s/\.\.\./"took": $body.took,"_shards": $body._shards,"hits":$body.hits,"timed_out":false,/]
+
 
 ### Aggregating `geo_shape` fields [geohexgrid-aggregating-geo-shape]
 
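The `% TEST[continued]` and `% TESTRESPONSE[...]` annotations added above attach to request and response listings that these hunks do not show. For orientation, a bounding-box-filtered, high-precision `geohex_grid` request of the kind those sections document generally has this shape; the `location` field name and the coordinates are placeholders, not values taken from the file:

```console
POST /museums/_search?size=0
{
  "aggregations": {
    "zoomed_in": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": "POINT (4.9 52.4)",
            "bottom_right": "POINT (5.0 52.3)"
          }
        }
      },
      "aggregations": {
        "high_precision_cells": {
          "geohex_grid": {
            "field": "location",
            "precision": 12
          }
        }
      }
    }
  }
}
```

Restricting the search with a `geo_bounding_box` filter is what keeps a high `precision` value from producing an unmanageable number of hex cells.
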
docs/reference/data-analysis/aggregations/search-aggregations-bucket-significantterms-aggregation.md
Lines changed: 69 additions & 0 deletions

@@ -17,6 +17,45 @@ An aggregation that returns interesting or unusual occurrences of terms in a set
 
 In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a *foreground* and *background* set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
 
+%
+% [source,console]
+% --------------------------------------------------
+% PUT /reports
+% {
+% "mappings": {
+% "properties": {
+% "force": {
+% "type": "keyword"
+% },
+% "crime_type": {
+% "type": "keyword"
+% }
+% }
+% }
+% }
+%
+% POST /reports/_bulk?refresh
+% {"index":{"_id":0}}
+% {"force": "British Transport Police", "crime_type": "Bicycle theft"}
+% {"index":{"_id":1}}
+% {"force": "British Transport Police", "crime_type": "Bicycle theft"}
+% {"index":{"_id":2}}
+% {"force": "British Transport Police", "crime_type": "Bicycle theft"}
+% {"index":{"_id":3}}
+% {"force": "British Transport Police", "crime_type": "Robbery"}
+% {"index":{"_id":4}}
+% {"force": "Metropolitan Police Service", "crime_type": "Robbery"}
+% {"index":{"_id":5}}
+% {"force": "Metropolitan Police Service", "crime_type": "Bicycle theft"}
+% {"index":{"_id":6}}
+% {"force": "Metropolitan Police Service", "crime_type": "Robbery"}
+% {"index":{"_id":7}}
+% {"force": "Metropolitan Police Service", "crime_type": "Robbery"}
+%
+% -------------------------------------------------
+% // TESTSETUP
+%
+
 ## Single-set analysis [_single_set_analysis]
 
 In the simplest case, the *foreground* set of interest is the search results matched by a query and the *background* set used for statistical comparisons is the index or indices from which the results were gathered.
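The "H5N1" sentence in the context above quantifies the swing only informally. Writing the foreground and background document frequencies as proportions p_fg and p_bg (notation used here for illustration, not taken from the file), the numbers work out to:

$$
p_{fg} = \frac{4}{100} = 4\%,\qquad
p_{bg} = \frac{5}{10{,}000{,}000} = 0.00005\%,\qquad
\frac{p_{fg}}{p_{bg}} = 80{,}000
$$

So the term is roughly 80,000 times more frequent in the foreground set than in the background set, which is why it is surfaced as significant.
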
@@ -39,6 +78,8 @@ GET /_search
 }
 ```
 
+% TEST[s/_search/_search\?filter_path=aggregations/]
+
 Response:
 
 ```console-result
@@ -62,6 +103,10 @@ Response:
 }
 ```
 
+% TESTRESPONSE[s/\.\.\.//]
+
+% TESTRESPONSE[s/: (0\.)?[0-9]+/: $body.$_path/]
+
 When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force stand out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only 1% of crimes (66799/5064554) but for the British Transport Police, who handle crime on railways and stations, 7% of crimes (3640/47347) is a bike theft. This is a significant seven-fold increase in frequency and so this anomaly was highlighted as the top crime type.
 
 The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.
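The percentages quoted in the paragraph above follow directly from the counts given in parentheses:

$$
\frac{3640}{47347}\approx 7.7\%
\qquad\text{vs.}\qquad
\frac{66799}{5064554}\approx 1.3\%
$$
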
@@ -93,6 +138,8 @@ GET /_search
 }
 ```
 
+% TEST[s/_search/_search\?filter_path=aggregations/]
+
 Response:
 
 ```console-result
@@ -143,6 +190,12 @@ Response:
 }
 ```
 
+% TESTRESPONSE[s/\.\.\.//]
+
+% TESTRESPONSE[s/: (0\.)?[0-9]+/: $body.$_path/]
+
+% TESTRESPONSE[s/: "[^"]*"/: $body.$_path/]
+
 Now we have anomaly detection for each of the police forces using a single request.
 
 We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic area to identify unusual hot-spots of a particular crime type:
@@ -257,6 +310,8 @@ The JLH score can be used as a significance score by adding the parameter
 }
 ```
 
+% NOTCONSOLE
+
 The scores are derived from the doc frequencies in *foreground* and *background* sets. The *absolute* change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the *relative* change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
 
 
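The paragraph above describes the JLH score as the product of the absolute and relative change in popularity. With p_fg and p_bg again standing for a term's foreground and background document frequencies (notation used here for illustration), that is:

$$
\text{JLH} = (p_{fg} - p_{bg})\times\frac{p_{fg}}{p_{bg}}
$$

The first factor is small for rare terms and the second is small for common ones, which is the precision/recall trade-off the paragraph refers to.
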
@@ -270,6 +325,8 @@ Mutual information as described in "Information Retrieval", Manning et al., Chap
 }
 ```
 
+% NOTCONSOLE
+
 Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequent in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, `include_negatives` can be set to `false`.
 
 Per default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set
@@ -278,6 +335,8 @@ Per default, the assumption is that the documents in the bucket are also contain
 "background_is_superset": false
 ```
 
+% NOTCONSOLE
+
 
 ### Chi square [_chi_square]
 
@@ -288,6 +347,8 @@ Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5
 }
 ```
 
+% NOTCONSOLE
+
 Chi square behaves like mutual information and can be configured with the same parameters `include_negatives` and `background_is_superset`.
 
 
@@ -300,6 +361,8 @@ Google normalized distance as described in ["The Google Similarity Distance", Ci
 }
 ```
 
+% NOTCONSOLE
+
 `gnd` also accepts the `background_is_superset` parameter.
 
 
@@ -383,6 +446,8 @@ GET /_search
 }
 ```
 
+% TEST[s/_search/_search?size=0/]
+
 
 
 ### Percentage [_percentage]
@@ -398,6 +463,8 @@ It would be hard for a seasoned boxer to win a championship if the prize was awa
 }
 ```
 
+% NOTCONSOLE
+
 
 ### Which one is best? [_which_one_is_best]
 
@@ -421,6 +488,8 @@ Customized scores can be implemented via a script:
 }
 ```
 
+% NOTCONSOLE
+
 Scripts can be inline (as in above example), indexed or stored on disk. For details on the options, see [script documentation](docs-content://explore-analyze/scripting.md).
 
 Available parameters in the script are

docs/reference/data-analysis/aggregations/search-aggregations-change-point-aggregation.md
Lines changed: 6 additions & 0 deletions

@@ -37,6 +37,8 @@ A `change_point` aggregation looks like this in isolation:
 }
 ```
 
+% NOTCONSOLE
+
 1. The buckets containing the values to test against.
 
 
@@ -99,6 +101,8 @@ GET kibana_sample_data_logs/_search
 }
 ```
 
+% NOTCONSOLE
+
 1. A date histogram aggregation that creates buckets with one day long interval.
 2. A sibling aggregation of the `date` aggregation that calculates the average value of the `bytes` field within every bucket.
 3. The change point detection aggregation configuration object.
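Putting the three callouts together, the request they describe has roughly this shape. The `bytes` field comes from the callout above, while the `timestamp` field name and the choice of `fixed_interval` follow the Kibana sample web logs data set and are assumptions here, not lines quoted from the file:

```console
GET kibana_sample_data_logs/_search
{
  "size": 0,
  "aggs": {
    "date": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggs": {
        "avg": {
          "avg": {
            "field": "bytes"
          }
        }
      }
    },
    "change_points_avg": {
      "change_point": {
        "buckets_path": "date>avg"
      }
    }
  }
}
```

`change_point` is a pipeline aggregation, so `buckets_path` points at the daily average series (`date>avg`) rather than at a document field.
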
@@ -125,6 +129,8 @@ The request returns a response that is similar to the following:
 }
 ```
 
+% NOTCONSOLE
+
 1. The bucket key that is the change point.
 2. The number of documents in that bucket.
 3. Aggregated values in the bucket.

docs/reference/data-analysis/aggregations/search-aggregations-pipeline-inference-bucket-aggregation.md
Lines changed: 4 additions & 0 deletions

@@ -32,6 +32,8 @@ A `inference` aggregation looks like this in isolation:
 }
 ```
 
+% NOTCONSOLE
+
 1. The unique identifier or alias for the trained model.
 2. The optional inference config which overrides the model’s default settings
 3. Map the value of `avg_agg` to the model’s input field `avg_cost`
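Read together, the three callouts imply an aggregation body roughly like the following; the model id and the `regression` config are placeholders, and only the `avg_agg` to `avg_cost` mapping is taken from the callouts:

```js
"inference_example": {
  "inference": {
    "model_id": "a_trained_model_or_alias",
    "inference_config": {
      "regression": {
        "results_field": "prediction"
      }
    },
    "buckets_path": {
      "avg_cost": "avg_agg"
    }
  }
}
```

`buckets_path` here is a map from the model's input field names to the sibling aggregations that supply them, which is what callout 3 describes.
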
@@ -158,6 +160,8 @@ GET kibana_sample_data_logs/_search
 }
 ```
 
+% TEST[skip:setup kibana sample data]
+
 1. A composite bucket aggregation that aggregates the data by `client_ip`.
 2. A series of metrics and bucket sub-aggregations.
 3. {{infer-cap}} bucket aggregation that specifies the trained model and maps the aggregation names to the model’s input fields.

docs/reference/data-analysis/text-analysis/analysis-cjk-bigram-tokenfilter.md
Lines changed: 78 additions & 0 deletions

@@ -30,6 +30,84 @@ The filter produces the following tokens:
 [ 東京, 京都, 都は, 日本, 本の, の首, 首都, 都で, であ, あり ]
 ```
 
+% [source,console-result]
+% --------------------------------------------------
+% {
+% "tokens" : [
+% {
+% "token" : "東京",
+% "start_offset" : 0,
+% "end_offset" : 2,
+% "type" : "<DOUBLE>",
+% "position" : 0
+% },
+% {
+% "token" : "京都",
+% "start_offset" : 1,
+% "end_offset" : 3,
+% "type" : "<DOUBLE>",
+% "position" : 1
+% },
+% {
+% "token" : "都は",
+% "start_offset" : 2,
+% "end_offset" : 4,
+% "type" : "<DOUBLE>",
+% "position" : 2
+% },
+% {
+% "token" : "日本",
+% "start_offset" : 5,
+% "end_offset" : 7,
+% "type" : "<DOUBLE>",
+% "position" : 3
+% },
+% {
+% "token" : "本の",
+% "start_offset" : 6,
+% "end_offset" : 8,
+% "type" : "<DOUBLE>",
+% "position" : 4
+% },
+% {
+% "token" : "の首",
+% "start_offset" : 7,
+% "end_offset" : 9,
+% "type" : "<DOUBLE>",
+% "position" : 5
+% },
+% {
+% "token" : "首都",
+% "start_offset" : 8,
+% "end_offset" : 10,
+% "type" : "<DOUBLE>",
+% "position" : 6
+% },
+% {
+% "token" : "都で",
+% "start_offset" : 9,
+% "end_offset" : 11,
+% "type" : "<DOUBLE>",
+% "position" : 7
+% },
+% {
+% "token" : "であ",
+% "start_offset" : 10,
+% "end_offset" : 12,
+% "type" : "<DOUBLE>",
+% "position" : 8
+% },
+% {
+% "token" : "あり",
+% "start_offset" : 11,
+% "end_offset" : 13,
+% "type" : "<DOUBLE>",
+% "position" : 9
+% }
+% ]
+% }
+% --------------------------------------------------
+
 
 ## Add to an analyzer [analysis-cjk-bigram-tokenfilter-analyzer-ex]
 
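The expected response added above belongs to an `_analyze` example that this hunk does not include. A request of that general shape, with a placeholder sentence rather than the file's original sample text, looks like:

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "cjk_bigram" ],
  "text": "東京都は日本の首都"
}
```

The `cjk_bigram` filter emits overlapping two-character tokens of type `<DOUBLE>`, which is why consecutive tokens in the response above share a character and their offsets overlap by one.
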
docs/reference/data-analysis/text-analysis/analysis-cjk-width-tokenfilter.md
Lines changed: 15 additions & 0 deletions

@@ -36,6 +36,21 @@ The filter produces the following token:
 シーサイドライナー
 ```
 
+% [source,console-result]
+% --------------------------------------------------
+% {
+% "tokens" : [
+% {
+% "token" : "シーサイドライナー",
+% "start_offset" : 0,
+% "end_offset" : 10,
+% "type" : "<KATAKANA>",
+% "position" : 0
+% }
+% ]
+% }
+% --------------------------------------------------
+
 
 ## Add to an analyzer [analysis-cjk-width-tokenfilter-analyzer-ex]
 
docs/reference/data-analysis/text-analysis/analysis-hunspell-tokenfilter.md
Lines changed: 36 additions & 0 deletions

@@ -70,6 +70,42 @@ The filter produces the following tokens:
 [ the, fox, jump, quick ]
 ```
 
+% [source,console-result]
+% ----
+% {
+% "tokens": [
+% {
+% "token": "the",
+% "start_offset": 0,
+% "end_offset": 3,
+% "type": "<ALPHANUM>",
+% "position": 0
+% },
+% {
+% "token": "fox",
+% "start_offset": 4,
+% "end_offset": 9,
+% "type": "<ALPHANUM>",
+% "position": 1
+% },
+% {
+% "token": "jump",
+% "start_offset": 10,
+% "end_offset": 17,
+% "type": "<ALPHANUM>",
+% "position": 2
+% },
+% {
+% "token": "quick",
+% "start_offset": 18,
+% "end_offset": 25,
+% "type": "<ALPHANUM>",
+% "position": 3
+% }
+% ]
+% }
+% ----
+
 
 ## Configurable parameters [analysis-hunspell-tokenfilter-configure-parms]
 