This repository was archived by the owner on Sep 10, 2025. It is now read-only.
  - The RE2 library does not support arbitrary lookahead or lookbehind assertions, nor does it support backreferences. Look at the `docs <https://swtch.com/~rsc/regexp/regexp3.html#caveats>`_ here for more info.
- - The final tokenization step always uses spaces as seperators. To split strings based on a specific regex pattern, similar to Python's `re.split <https://docs.python.org/3/library/re.html#re.split>`_, a tuple of ``('<regex_pattern>', ' ')`` can be provided.
+ - The final tokenization step always uses spaces as separators. To split strings based on a specific regex pattern, similar to Python's `re.split <https://docs.python.org/3/library/re.html#re.split>`_, a tuple of ``('<regex_pattern>', ' ')`` can be provided.

  Example
  Regex tokenization based on ``(patterns, replacements)`` list.
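
The behaviour described in this docstring (apply each ``(pattern, replacement)`` pair in order, then split on spaces) can be approximated with a minimal sketch. It uses Python's standard ``re`` module rather than RE2, so it accepts lookarounds and backreferences that the real tokenizer would reject. The function name ``regex_tokenize_sketch`` is hypothetical, and dropping empty tokens is an assumption, not something the diff confirms.

```python
import re


def regex_tokenize_sketch(line, patterns_list):
    """Apply (pattern, replacement) pairs in order, then split on spaces.

    Hypothetical helper for illustration only; it is not the library's API.
    """
    for pattern, replacement in patterns_list:
        line = re.sub(pattern, replacement, line)
    # The final step splits on single spaces; empty strings produced by
    # consecutive spaces are dropped (an assumption of this sketch).
    return [token for token in line.split(" ") if token]


# Emulate re.split on commas and semicolons by replacing them with a space.
tokens = regex_tokenize_sketch("a,b;c d", [(r"[,;]", " ")])
```

Mapping the delimiter pattern to ``' '`` is what lets the space-only final split behave like ``re.split`` on that pattern.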
@@ -998,7 +998,7 @@ def bytes_to_unicode():
  The reversible bpe codes work on unicode strings.
  This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
  When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
- This is a signficant percentage of your normal, say, 32K bpe vocab.
+ This is a significant percentage of your normal, say, 32K bpe vocab.
  To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
  And avoids mapping to whitespace/control characters the bpe code barfs on.
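
The body of ``bytes_to_unicode()`` is not part of this diff. As an illustration of the lookup table the docstring describes, here is a minimal sketch that follows the widely used GPT-2 byte-level BPE scheme. The name ``byte_to_unicode_table`` is hypothetical, and whether the repository's function matches this exactly is an assumption.

```python
def byte_to_unicode_table():
    """Map each of the 256 byte values to a distinct printable character.

    Hypothetical sketch modelled on the GPT-2 scheme; not taken from the diff.
    """
    # Bytes that are already printable and non-whitespace map to themselves:
    # "!" to "~", then U+00A1 to U+00AC, then U+00AE to U+00FF.
    byte_values = (
        list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    )
    code_points = byte_values[:]
    shift = 0
    for b in range(256):
        if b not in byte_values:
            # Whitespace and control bytes are moved to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(cp) for cp in code_points)))


table = byte_to_unicode_table()
inverse = {ch: b for b, ch in table.items()}

# Round trip: text -> utf-8 bytes -> mapped characters -> bytes -> text.
text = "héllo wörld"
mapped = "".join(table[b] for b in text.encode("utf-8"))
restored = bytes(inverse[ch] for ch in mapped).decode("utf-8")
```

Because the table is a bijection over all 256 byte values, any UTF-8 input can be represented without an UNK token, and the space byte is carried by a visible character instead of whitespace.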