Positive Tokenization? #13383

dave-richards · 2024-03-15T22:24:04Z

I am new to NLU and spaCy, but I have been reading the docs and doing some testing. I would like to implement a custom tokenizer for Biblical Greek. My reading of the tokenizer docs is that the customizations are "negative": a token is whatever is not whitespace, not a prefix, not a suffix, and not an infix; everything else is a valid token. I would like to work the other way around: define exactly what counts as a token and continues down the pipeline, and skip over everything else. Is my understanding correct, and is it possible to invert the logic to work as I would like?
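
To make the "positive" idea concrete, here is a minimal sketch in plain Python, not spaCy's own tokenizer logic. The names `TOKEN_RE` and `positive_tokenize` are hypothetical, and the character ranges (the Greek and Greek Extended blocks plus combining diacritics) are only an assumption about what should count as a token; a real definition for Biblical Greek would likely need refining.

```python
import re
import unicodedata

# Hypothetical definition of a token: a run of characters from the Greek and
# Greek Extended Unicode blocks, plus combining diacritical marks.
TOKEN_RE = re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F]+")


def positive_tokenize(text):
    """Return (words, spaces); only spans matching TOKEN_RE become tokens."""
    # Normalize first so precomposed and decomposed forms are treated alike.
    text = unicodedata.normalize("NFC", text)
    matches = list(TOKEN_RE.finditer(text))
    words = [m.group() for m in matches]
    # spaces[i] is True when token i is directly followed by whitespace.
    spaces = [text[m.end():m.end() + 1].isspace() for m in matches]
    return words, spaces


words, spaces = positive_tokenize("εν αρχη, ην")
print(words, spaces)
```

The `(words, spaces)` pair is shaped so that it could be handed to spaCy's `Doc(nlp.vocab, words=words, spaces=spaces)` inside a custom callable assigned to `nlp.tokenizer`, which is how spaCy lets a tokenizer be replaced wholesale. Note that anything the pattern skips is dropped, so the resulting `Doc` text would no longer match the input exactly.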

svlandeg · 2024-03-19T10:43:12Z

Hi!

Just to be sure: are you aware that we support Ancient Greek with the language tag grc? Does this not work sufficiently well for you?
