This repository was archived by the owner on Mar 9, 2023. It is now read-only.

Tokenizing Ellipsis creates empty tokens #120

@polm

Description

While working on the spaCy Japanese model support and integrating Sudachi, I ran into an issue where the one-character ellipsis (…) was causing errors. If you tokenize this ellipsis, SudachiPy returns three tokens, with surfaces like ['', '', '…'].

I assume this is a bug but wasn't able to track down where it's happening. I also checked ㍻, and while that is also normalized internally it seems to be output as a single character without issue.
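To illustrate why a single character might turn into several tokens, here is a minimal sketch using only the Python standard library. It assumes, without having confirmed it in the Sudachi source, that the internal normalization behaves roughly like Unicode NFKC. The helpers `nfkc_expansion` and `empty_surfaces` are hypothetical names made up for this example and are not part of SudachiPy.

```python
import unicodedata
from typing import List, Tuple


def nfkc_expansion(text: str) -> Tuple[str, int]:
    """Return the NFKC-normalized form of `text` and its length.

    Assumption: NFKC is used here as a stand-in for whatever
    normalization the tokenizer applies internally.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return normalized, len(normalized)


def empty_surfaces(surfaces: List[str]) -> List[int]:
    """Return the indices of tokens whose surface is the empty string."""
    return [i for i, surface in enumerate(surfaces) if surface == ""]


# U+2026 (one character) expands to three ASCII periods under NFKC.
print(nfkc_expansion("…"))
# U+337B (one character) expands to the two characters 平成 under NFKC.
print(nfkc_expansion("㍻"))
# Surfaces as reported above; in practice these would be collected
# from the tokenizer output, e.g. via each morpheme's surface().
print(empty_surfaces(["", "", "…"]))
```

If the assumption holds, both characters grow under normalization, so the expansion alone would not explain why only the ellipsis yields empty surfaces. The difference may lie in how the normalized characters are mapped back to the original text when they are split into separate tokens.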
