Commit aa3556b

Update UnigramTokenizer.java
1 parent c036769 commit aa3556b

1 file changed: +3 -1 lines changed

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java

Lines changed: 3 additions & 1 deletion
@@ -367,8 +367,10 @@ List<DelimitedToken.Encoded> tokenize(CharSequence inputSequence, IntToIntFuncti
                     new DelimitedToken.Encoded(
                         Strings.format("<0x%02X>", bytes[i]),
                         pieces[i],
+                        // even though we are changing the number of characters in the output, we don't
+                        // need to change the offsets. The offsets refer to the input characters
                         offsetCorrection.apply(node.startsAtCharPos),
-                        offsetCorrection.apply(node.startsAtCharPos + i)
+                        offsetCorrection.apply(endsAtChars)
                     )
                 );
             }
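
The change above can be illustrated with a small standalone sketch. It is not the Elasticsearch code: `ByteToken` and `byteFallback` are hypothetical stand-ins for `DelimitedToken.Encoded` and the byte-fallback loop, and offset correction is left out. It assumes the span being decomposed is encoded as UTF-8 bytes, and shows each emitted byte token carrying the same input span, where the removed line would have used `startsAtCharPos + i` as the end offset (so the first byte token would end where it starts).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteFallbackOffsets {
    /** Hypothetical stand-in for DelimitedToken.Encoded: token text plus offsets into the input. */
    record ByteToken(String text, int startOffset, int endOffset) {}

    /**
     * Decomposes input[startsAtCharPos, endsAtChars) into one token per UTF-8 byte.
     * Every byte token keeps the offsets of the original input span, because the
     * offsets describe input characters, not the characters of the output tokens.
     */
    static List<ByteToken> byteFallback(String input, int startsAtCharPos, int endsAtChars) {
        byte[] bytes = input.substring(startsAtCharPos, endsAtChars).getBytes(StandardCharsets.UTF_8);
        List<ByteToken> tokens = new ArrayList<>(bytes.length);
        for (byte b : bytes) {
            // Same "<0x..>" rendering as in the diff; %02X formats the byte as unsigned hex.
            tokens.add(new ByteToken(String.format("<0x%02X>", b), startsAtCharPos, endsAtChars));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) is one input character but two UTF-8 bytes.
        for (ByteToken token : byteFallback("a\u00e9", 1, 2)) {
            System.out.println(token);
        }
    }
}
```

In this sketch both byte tokens report the span 1 to 2, so a caller mapping tokens back to the input text recovers the whole original character from either of them.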
