
Commit 4a89e6e

Term Stats documentation
1 parent c6f7827 commit 4a89e6e


4 files changed

+110
-22
lines changed


docs/reference/query-dsl/script-score-query.asciidoc

Lines changed: 10 additions & 3 deletions
@@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`.
6262
===== Use relevance scores in a script
6363

6464
Within a script, you can
65-
{ref}/modules-scripting-fields.html#scripting-score[access]
65+
{ref}/modules-scripting-fields.html#scripting-score[access]
6666
the `_score` variable which represents the current relevance score of a
6767
document.
6868

69+
[[script-score-access-term-statistics]]
70+
===== Use term statistics in a script
71+
72+
Within a script, you can
73+
{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
74+
the `_termStats` variable, which provides statistical information about the terms used in the child query of the `script_score` query.
75+
6976
[[script-score-predefined-functions]]
7077
===== Predefined functions
7178
You can use any of the available {painless}/painless-contexts.html[painless
@@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field.
147154

148155
[[decay-functions-numeric-fields]]
149156
====== Decay functions for numeric fields
150-
You can read more about decay functions
157+
You can read more about decay functions
151158
{ref}/query-dsl-function-score-query.html#function-decay[here].
152159

153160
* `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
@@ -233,7 +240,7 @@ The `script_score` query calculates the score for
233240
every matching document, or hit. There are faster alternative query types that
234241
can efficiently skip non-competitive hits:
235242

236-
* If you want to boost documents on some static fields, use the
243+
* If you want to boost documents on some static fields, use the
237244
<<query-dsl-rank-feature-query, `rank_feature`>> query.
238245
* If you want to boost documents closer to a date or geographic point, use the
239246
<<query-dsl-distance-feature-query, `distance_feature`>> query.

docs/reference/reranking/learning-to-rank-model-training.asciidoc

Lines changed: 25 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
3838
from eland.ml.ltr import QueryFeatureExtractor
3939
4040
feature_extractors=[
41-
# We want to use the score of the match query for the title field as a feature:
41+
# We want to use the BM25 score of the match query for the title field as a feature:
4242
QueryFeatureExtractor(
4343
feature_name="title_bm25",
4444
query={"match": {"title": "{{query}}"}}
4545
),
46+
# We want to use the number of matched terms in the title field as a feature:
47+
QueryFeatureExtractor(
48+
feature_name="title_matched_term_count",
49+
query={
50+
"script_score": {
51+
"query": {"match": {"title": "{{query}}"}},
52+
"script": {"source": "return _termStats.matchedTermsCount();"},
53+
}
54+
},
55+
),
4656
# We can use a script_score query to get the value
4757
# of the field rating directly as a feature:
4858
QueryFeatureExtractor(
@@ -54,26 +64,29 @@ feature_extractors=[
5464
}
5565
},
5666
),
57-
# We can execute a script on the value of the query
58-
# and use the return value as a feature:
59-
QueryFeatureExtractor(
60-
feature_name="query_length",
67+
# We extract the number of unique terms in the query as a feature:
68+
QueryFeatureExtractor(
69+
feature_name="query_term_count",
6170
query={
6271
"script_score": {
63-
"query": {"match_all": {}},
64-
"script": {
65-
"source": "return params['query'].splitOnToken(' ').length;",
66-
"params": {
67-
"query": "{{query}}",
68-
}
69-
},
72+
"query": {"match": {"title": "{{query}}"}},
73+
"script": {"source": "return _termStats.uniqueTermsCount();"},
7074
}
7175
},
7276
),
7377
]
7478
----
7579
// NOTCONSOLE
7680

81+
[NOTE]
82+
.Term statistics as features
83+
===================================================
84+
85+
It is very common for an LTR model to leverage raw term statistics as features.
86+
To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
87+
88+
===================================================
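
As a rough illustration of the note above, the sketch below generates one `script_score` query body per combination of statistic and aggregation method. It uses plain dictionaries only. The helper names (`term_stats_query`, `term_stats_features`) and the feature-naming scheme are hypothetical, and the script sources assume the `_termStats` functions and methods documented in the scripting reference.

```python
from itertools import product

# Statistic functions and aggregation methods named in the _termStats docs.
STATS = ("docFreq", "termFreq", "totalTermFreq")
AGGREGATIONS = ("getAverage", "getMin", "getMax", "getSum")


def term_stats_query(field, stat, aggregation):
    """Build a script_score query body exposing one aggregated term statistic."""
    return {
        "script_score": {
            # The child query determines the field and terms used by _termStats.
            "query": {"match": {field: "{{query}}"}},
            "script": {"source": f"return _termStats.{stat}().{aggregation}();"},
        }
    }


def term_stats_features(field):
    """Map hypothetical feature names to query bodies for every stat/aggregation pair."""
    return {
        f"{field}_{stat}_{aggregation}": term_stats_query(field, stat, aggregation)
        for stat, aggregation in product(STATS, AGGREGATIONS)
    }


features = term_stats_features("title")
```

Each value in `features` could then be passed as the `query` argument of a `QueryFeatureExtractor`, with the dictionary key as its `feature_name`, in the same way as the extractors defined above.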
89+
7790
Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
7891

7992
[source,python]

docs/reference/reranking/learning-to-rank-search-usage.asciidoc

Lines changed: 0 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
6161
====== Negative scores
6262

6363
Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
64-
65-
[discrete]
66-
[[learning-to-rank-rescorer-limitations-term-statistics]]
67-
====== Term statistics as features
68-
69-
We do not currently support term statistics as features, however future releases will introduce this capability.
70-

docs/reference/scripting/fields.asciidoc

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,81 @@ GET my-index-000001/_search
8080
}
8181
-------------------------------------
8282

83+
[discrete]
84+
[[scripting-term-statistics]]
85+
=== Accessing term statistics of a document within a script
86+
87+
Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
88+
89+
In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
90+
91+
[source,console]
92+
-------------------------------------
93+
PUT my-index-000001/_doc/1?refresh
94+
{
95+
"text": "quick brown fox"
96+
}
97+
98+
PUT my-index-000001/_doc/2?refresh
99+
{
100+
"text": "quick fox"
101+
}
102+
103+
GET my-index-000001/_search
104+
{
105+
"query": { <1>
106+
"function_score": {
107+
"query": {
108+
"match": {
109+
"text": "quick brown fox"
110+
}
111+
},
112+
"script_score": {
113+
"script": {
114+
"source": "_termStats.termFreq().getAverage()" <2>
115+
}
116+
}
117+
}
118+
}
119+
}
120+
-------------------------------------
121+
122+
<1> Child query used to infer the field and the terms considered in term statistics.
123+
124+
<2> The script calculates the average term frequency for the terms in the query using `_termStats`.
125+
126+
`_termStats` provides access to the following functions for working with term statistics:
127+
128+
- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
129+
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
130+
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
131+
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
132+
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
133+
134+
[NOTE]
135+
.Functions returning aggregated statistics
136+
===================================================
137+
138+
The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
139+
140+
The returned statistics objects support the following methods:
141+
142+
- `getAverage()`: Returns the average value of the metric.
143+
- `getMin()`: Returns the minimum value of the metric.
144+
- `getMax()`: Returns the maximum value of the metric.
145+
- `getSum()`: Returns the sum of the metric values.
146+
- `getCount()`: Returns the count of terms included in the metric calculation.
147+
148+
===================================================
149+
150+
151+
[NOTE]
152+
.Painless language required
153+
===================================================
154+
155+
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
156+
157+
===================================================
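
To make the definitions above concrete, here is a small pure-Python simulation of the statistics for the two example documents. It is only a sketch and not how Elasticsearch computes these values: it assumes lowercased, whitespace-split tokens in place of the real analyzer, and it assumes that a query term absent from a document contributes a term frequency of 0, which the text above does not state. The helper names (`tokenize`, `summarize`, `term_stats`) are hypothetical.

```python
from statistics import mean

# Toy corpus mirroring the two documents indexed in the example above.
DOCS = {"1": "quick brown fox", "2": "quick fox"}
QUERY = "quick brown fox"


def tokenize(text):
    # Assumption: lowercasing and whitespace splitting stand in for the analyzer.
    return text.lower().split()


def summarize(values):
    # Keys mirror the method names of the aggregated statistics objects.
    return {
        "getAverage": mean(values),
        "getMin": min(values),
        "getMax": max(values),
        "getSum": sum(values),
        "getCount": len(values),
    }


TOKENS = {doc_id: tokenize(text) for doc_id, text in DOCS.items()}
QUERY_TERMS = list(dict.fromkeys(tokenize(QUERY)))  # unique terms, order kept


def term_stats(doc_id):
    doc_tokens = TOKENS[doc_id]
    # Per-document: occurrences of each query term in this document
    # (assumed to be 0 when the term is absent).
    term_freq = [doc_tokens.count(t) for t in QUERY_TERMS]
    # Corpus-level: number of documents containing each query term.
    doc_freq = [sum(t in toks for toks in TOKENS.values()) for t in QUERY_TERMS]
    # Corpus-level: total occurrences of each query term across all documents.
    total_term_freq = [
        sum(toks.count(t) for toks in TOKENS.values()) for t in QUERY_TERMS
    ]
    return {
        "uniqueTermsCount": len(QUERY_TERMS),
        "matchedTermsCount": sum(tf > 0 for tf in term_freq),
        "docFreq": summarize(doc_freq),
        "totalTermFreq": summarize(total_term_freq),
        "termFreq": summarize(term_freq),
    }
```

In this toy model, `term_stats("1")` and `term_stats("2")` share the same `docFreq` and `totalTermFreq` summaries, while `matchedTermsCount` and `termFreq` differ per document, which matches the distinction drawn in the function list.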
83158

84159
[discrete]
85160
[[modules-scripting-doc-vals]]
