
Commit c4a1562

[DOCS] Combine highlighting docs files (#60849) (#60893)
1 parent 3b8f8ba commit c4a1562

File tree

2 files changed (+194, -195 lines)


docs/reference/search/request/highlighters-internal.asciidoc

Lines changed: 0 additions & 194 deletions
This file was deleted.

docs/reference/search/request/highlighting.asciidoc

Lines changed: 194 additions & 1 deletion
@@ -928,4 +928,197 @@ If the `number_of_fragments` option is set to `0`,
This is useful for highlighting the entire contents of a document or field.

-include::highlighters-internal.asciidoc[]

[discrete]
[[how-es-highlighters-work-internally]]
== How highlighters work internally

Given a query and a text (the content of a document field), the goal of a
highlighter is to find the best text fragments for the query, and highlight
the query terms in the found fragments. For this, a highlighter needs to
address several questions:

- How to break a text into fragments?
- How to find the best fragments among all fragments?
- How to highlight the query terms in a fragment?

[discrete]
=== How to break a text into fragments?
Relevant settings: `fragment_size`, `fragmenter`, `type` of highlighter,
`boundary_chars`, `boundary_max_scan`, `boundary_scanner`, `boundary_scanner_locale`.

The plain highlighter begins by analyzing the text with the given analyzer and
creating a token stream from it. It then uses a very simple algorithm to break
the token stream into fragments: it loops through the terms in the token stream,
and every time the current term's `end_offset` exceeds `fragment_size` multiplied
by the number of fragments created so far, a new fragment is created. A little
more computation is done when using the `span` fragmenter to avoid breaking up
text between highlighted terms. But overall, since the breaking is done only by
`fragment_size`, some fragments can be quite odd, e.g. beginning with a
punctuation mark.

The unified and FVH highlighters do a better job of breaking up a text into
fragments by utilizing Java's `BreakIterator`. This ensures that a fragment
is a valid sentence as long as `fragment_size` allows for this.
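
As a minimal sketch of these settings, assuming a hypothetical index `my_index`
with a `text` field `content`, a request can choose the highlighter `type` and
control how fragments are produced per field:

[source,js]
--------------------------------------------------
GET my_index/_search
{
  "query": {
    "match": { "content": "fox" }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "plain",
        "fragment_size": 150,
        "fragmenter": "span"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

Switching `type` to `unified` (or `fvh`) and setting `boundary_scanner` to
`sentence` would instead rely on the `BreakIterator`-based sentence breaking
described above.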

[discrete]
=== How to find the best fragments?
Relevant settings: `number_of_fragments`.

To find the best, most relevant fragments, a highlighter needs to score
each fragment with respect to the given query. The goal is to score only those
terms that participated in generating the 'hit' on the document.
For some complex queries, this is still a work in progress.

The plain highlighter creates an in-memory index from the current token stream
and re-runs the original query criteria through Lucene's query execution planner
to get access to low-level match information for the current text.
For more complex queries the original query may be converted to a span query,
as span queries can handle phrases more accurately. The obtained low-level match
information is then used to score each individual fragment. The scoring method of
the plain highlighter is quite simple: each fragment is scored by the number of
unique query terms found in it. The score of an individual term is equal to its
boost, which is 1 by default. Thus, by default, a fragment that contains one
unique query term gets a score of 1, a fragment that contains two unique query
terms gets a score of 2, and so on. The fragments are then sorted by their
scores, so the highest scoring fragments are output first.

The FVH doesn't need to analyze the text and build an in-memory index, as it uses
pre-indexed document term vectors and finds among them the terms that correspond
to the query. The FVH scores each fragment by the number of query terms found in
it. As with the plain highlighter, the score of an individual term is equal to
its boost value. In contrast to the plain highlighter, all query terms are
counted, not only unique terms.

The unified highlighter can use pre-indexed term vectors or pre-indexed term
offsets, if they are available. Otherwise, similar to the plain highlighter, it
has to create an in-memory index from the text. The unified highlighter uses the
BM25 scoring model to score fragments.
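
As a hedged sketch of the scoring-related settings, again on the hypothetical
`my_index`, the request below asks for at most two fragments and orders them by
score:

[source,js]
--------------------------------------------------
GET my_index/_search
{
  "query": {
    "match": { "content": "fox" }
  },
  "highlight": {
    "order": "score",
    "number_of_fragments": 2,
    "fields": {
      "content": { "type": "unified" }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

By default fragments are returned in the order they appear in the field;
`order: "score"` returns the best scoring fragments first.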

[discrete]
=== How to highlight the query terms in a fragment?
Relevant settings: `pre_tags`, `post_tags`.

The goal is to highlight only those terms that participated in generating the
'hit' on the document. For some complex boolean queries, this is still a work in
progress, as highlighters don't reflect the boolean logic of a query and only
extract leaf (term, phrase, prefix, etc.) queries.

The plain highlighter, given the token stream and the original text, recomposes
the original text to highlight only those terms from the token stream that are
contained in the low-level match information structure from the previous step.

The FVH and unified highlighters use intermediate data structures to represent
fragments in some raw form, and then populate them with the actual text.

A highlighter uses `pre_tags` and `post_tags` to encode highlighted terms.
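
For example, the default `<em>`/`</em>` tags can be replaced per request; the
following sketch, again on the hypothetical `my_index`, wraps matches in custom
tags:

[source,js]
--------------------------------------------------
GET my_index/_search
{
  "query": {
    "match": { "content": "fox" }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "fields": {
      "content": {}
    }
  }
}
--------------------------------------------------
// NOTCONSOLE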

[discrete]
=== An example of how the unified highlighter works

Let's look in more detail at how the unified highlighter works.

First, we create an index with a text field `content`, which will be indexed
using the `english` analyzer, without offsets or term vectors.

[source,js]
--------------------------------------------------
PUT test_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE
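
For comparison, if we wanted the unified or FVH highlighter to avoid
re-analyzing the text at highlight time, the field could instead be mapped with
offsets in the postings (usable by the unified highlighter) or with term vectors
(required by the FVH). A sketch of such an alternative mapping, using a
hypothetical index name, might look like this:

[source,js]
--------------------------------------------------
PUT test_index_with_offsets
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english",
        "index_options": "offsets",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
--------------------------------------------------
// NOTCONSOLE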

We put the following document into the index:

[source,js]
--------------------------------------------------
PUT test_index/_doc/doc1
{
  "content" : "For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other. You'll be the only boy in the world for me. I'll be the only fox in the world for you."
}
--------------------------------------------------
// NOTCONSOLE

And we run the following query with a highlight request:

[source,js]
--------------------------------------------------
GET test_index/_search
{
  "query": {
    "match_phrase" : { "content" : "only fox" }
  },
  "highlight": {
    "type" : "unified",
    "number_of_fragments" : 3,
    "fields": {
      "content": {}
    }
  }
}
--------------------------------------------------
// NOTCONSOLE

After `doc1` is found as a hit for this query, the hit will be passed to the
unified highlighter for highlighting the field `content` of the document.
Since the field `content` was not indexed with either offsets or term vectors,
its raw field value will be analyzed, and an in-memory index will be built from
the terms that match the query:

    {"token":"onli","start_offset":12,"end_offset":16,"position":3},
    {"token":"fox","start_offset":19,"end_offset":22,"position":5},
    {"token":"fox","start_offset":53,"end_offset":58,"position":11},
    {"token":"onli","start_offset":117,"end_offset":121,"position":24},
    {"token":"onli","start_offset":159,"end_offset":163,"position":34},
    {"token":"fox","start_offset":164,"end_offset":167,"position":35}

Our complex phrase query will be converted to the span query
`spanNear([content:onli, content:fox], 0, true)`, meaning that we are looking
for the terms "onli" and "fox" within 0 distance from each other, and in the
given order. The span query will be run against the in-memory index created
before, to find the following match:

    {"term":"onli", "start_offset":159, "end_offset":163},
    {"term":"fox", "start_offset":164, "end_offset":167}

In our example, we have got a single match, but there could be several matches.
Given the matches, the unified highlighter breaks the text of the field into
so-called "passages". Each passage must contain at least one match.
Using Java's `BreakIterator`, the unified highlighter ensures that each
passage represents a full sentence as long as it doesn't exceed `fragment_size`.
For our example, we have got a single passage with the following properties
(showing only a subset of the properties here):

    Passage:
        startOffset: 147
        endOffset: 189
        score: 3.7158387
        matchStarts: [159, 164]
        matchEnds: [163, 167]
        numMatches: 2

Notice how a passage has a score, calculated using the BM25 scoring formula
adapted for passages. Scores allow us to choose the best scoring passages
if more passages are available than the `number_of_fragments` requested by the
user. Scores also let us sort passages by `order: "score"` if requested by the
user.

As the final step, the unified highlighter will extract from the field's text
a string corresponding to each passage:

    "I'll be the only fox in the world for you."

and will format all matches in this string with the tags `<em>` and `</em>`,
using the passage's `matchStarts` and `matchEnds` information:

    I'll be the <em>only</em> <em>fox</em> in the world for you.

Such formatted strings are the final result of the highlighter, returned to
the user.
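
For reference, in the search response these strings appear under the `highlight`
key of each hit. An abbreviated sketch of the relevant part of the response for
our example (other response fields omitted) could look like this:

[source,js]
--------------------------------------------------
{
  "hits": {
    "hits": [
      {
        "_index": "test_index",
        "_id": "doc1",
        "highlight": {
          "content": [
            "I'll be the <em>only</em> <em>fox</em> in the world for you."
          ]
        }
      }
    ]
  }
}
--------------------------------------------------
// NOTCONSOLE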
