If you think the tokenizer is hanging due to this change, we'd be interested in a bug report. (Maybe there's a bad regex combination that leads to this? In the past we had problems with the URL regex appearing to hang on extremely long tokens because of a slow lookbehind; this was first mentioned in #4362. My first attempt at fixing it was reverted for being too breaking, and we introduced url_match instead in #5121.)

I don't think that we will want to provide an option for this in the default tokenizer (although I can discuss it with the team), but you can use a custom tokenizer with the exact v3.1 behavior if you'd like. The main hassle in this is that the tokeniz…
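
As a rough sketch of what a custom tokenizer could look like (assuming spaCy v3.x): the example below rebuilds the `Tokenizer` from rule sets, using the current pipeline's defaults as stand-ins. For exact v3.1 behavior you would instead copy the v3.1 prefix/suffix/infix patterns and tokenizer exceptions from the v3.1 source. The pipeline name `en_core_web_sm` is just an example.

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

nlp = spacy.load("en_core_web_sm")  # example pipeline

# Stand-ins: for exact v3.1 behavior, replace these with the rule sets
# copied from the spaCy v3.1 source rather than the current defaults.
prefixes = nlp.Defaults.prefixes
suffixes = nlp.Defaults.suffixes
infixes = nlp.Defaults.infixes
exceptions = nlp.Defaults.tokenizer_exceptions

# Rebuild the tokenizer from the chosen rule sets, keeping the
# existing token_match/url_match callbacks from the loaded pipeline.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=exceptions,
    prefix_search=compile_prefix_regex(prefixes).search,
    suffix_search=compile_suffix_regex(suffixes).search,
    infix_finditer=compile_infix_regex(infixes).finditer,
    token_match=nlp.tokenizer.token_match,
    url_match=nlp.tokenizer.url_match,
)

print([t.text for t in nlp("Check https://example.com, it's useful.")])
```

If you want this baked into a trained pipeline rather than patched at runtime, the same construction can be registered with `@spacy.registry.tokenizers` and referenced from `[nlp.tokenizer]` in the config.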

Answer selected by polm