Skip to content

Conversation

@maxhniebergall
Copy link
Contributor

Was using byte position, but the other code uses char position, so we should probably do that.

@elasticsearchmachine
Copy link
Collaborator

Hi @maxhniebergall, I've created a changelog YAML for you.

@maxhniebergall
Copy link
Contributor Author

@elasticmachine merge upstream

@maxhniebergall
Copy link
Contributor Author

maxhniebergall commented Nov 19, 2024

Upon further investigation, I have determined that this initial bug fix was not causing the reported problem (although it is also a bug independently), and the problem doesn't seem to be with bytefallback. In fact, the huggingface tokenizer doesn't even produce any output for the problematic characters.

@maxhniebergall maxhniebergall marked this pull request as ready for review November 19, 2024 20:22
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 19, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@maxhniebergall
Copy link
Contributor Author

@elasticmachine merge upstream

@maxhniebergall maxhniebergall added the auto-backport Automatically create backport pull requests when merged label Nov 19, 2024
Copy link
Member

@dan-rubinstein dan-rubinstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maxhniebergall maxhniebergall merged commit 7705514 into main Nov 20, 2024
17 checks passed
@maxhniebergall maxhniebergall deleted the useCharPosInsteadOfBytePos branch November 20, 2024 20:08
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.x
8.16

maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this pull request Nov 20, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
maxhniebergall added a commit to maxhniebergall/elasticsearch that referenced this pull request Nov 20, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
* Was using byte position for end of offset, but it seems like using char position is correct

* Update docs/changelog/116358.yaml

* Update UnigramTokenizer.java

---------

Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged >bug :ml Machine learning Team:ML Meta label for the ML team v8.16.2 v8.17.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants