Commit 21894d9

Author: Max Hniebergall
Was using byte position for end of offset, but it seems like using char position is correct
Parent: 759bb7f

1 file changed (+1, -1)

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/UnigramTokenizer.java

@@ -368,7 +368,7 @@ List<DelimitedToken.Encoded> tokenize(CharSequence inputSequence, IntToIntFuncti
         Strings.format("<0x%02X>", bytes[i]),
         pieces[i],
         offsetCorrection.apply(node.startsAtCharPos),
-        offsetCorrection.apply(startsAtBytes + i)
+        offsetCorrection.apply(node.startsAtCharPos + i)
     )
 );
 }
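
Why the two positions differ: a Java char index counts UTF-16 code units, while a UTF-8 byte index counts encoded bytes, so the two diverge after any non-ASCII character. The following is a minimal sketch of that mismatch only, not code from UnigramTokenizer; the class `ByteVsCharOffsets` and the helper `byteOffsetOf` are hypothetical names introduced for illustration.

```java
import java.nio.charset.StandardCharsets;

public class ByteVsCharOffsets {

    /**
     * Hypothetical helper: number of UTF-8 bytes that precede the char at charPos.
     * Assumes charPos does not split a surrogate pair.
     */
    public static int byteOffsetOf(String text, int charPos) {
        return text.substring(0, charPos).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "héllo wörld", written with escapes so the source file encoding does not matter.
        String text = "h\u00e9llo w\u00f6rld";

        int charPos = text.indexOf("w\u00f6rld");   // position in UTF-16 code units
        int bytePos = byteOffsetOf(text, charPos);  // position in UTF-8 bytes

        System.out.println("char offset = " + charPos + ", byte offset = " + bytePos);

        // String indexing uses char positions, so only charPos addresses the intended text.
        System.out.println(text.substring(charPos, charPos + 5));
    }
}
```

Here the two-byte `é` pushes the byte offset one past the char offset. An offset reported back to a caller who indexes the original `CharSequence` would therefore drift if it were computed from byte positions.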
