Commit a7c4cef

Update to tokenizers 0.19 (#57)
1 parent 0c8f4b7 commit a7c4cef

10 files changed, +130 -80 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -14,6 +14,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Support for regular expressions to split pre-tokenizer. See
   `Tokenizers.PreTokenizer.split_regex/3`.

+### Removed
+
+- **(Breaking)** `:add_prefix_space` option in favour of `:prepend_scheme` for metaspace
+  decoder and pre-tokenizer
+
 ## [v0.4.0] - 2023-08-09

 ### Added

lib/tokenizers/decoder.ex

Lines changed: 5 additions & 2 deletions
@@ -74,8 +74,11 @@ defmodule Tokenizers.Decoder do
     * `replacement` - the replacement character. Defaults to `▁`
       (as char)

-    * `add_prefix_space` - whether to add a space to the first word.
-      Defaults to `true`
+    * `:prepend_scheme` - whether to add a space to the first word if there
+      isn't already one. This lets us treat "hello" exactly like "say hello".
+      Either of `:always`, `:never`, `:first`. `:first` means the space is
+      only added on the first token (relevant when special tokens are used
+      or other pre_tokenizer are used). Defaults to `:always`

   """
   @spec metaspace(keyword()) :: t()

lib/tokenizers/pre_tokenizer.ex

Lines changed: 5 additions & 3 deletions
@@ -103,9 +103,11 @@ defmodule Tokenizers.PreTokenizer do

     * `:replacement` - the replacement character to use. Defaults to `"▁"`

-    * `:add_prefix_space` - whether to add a space to the first word
-      if there isn’t already one. This lets us treat hello exactly
-      like say hello. Defaults to `true`
+    * `:prepend_scheme` - whether to add a space to the first word if there
+      isn't already one. This lets us treat "hello" exactly like "say hello".
+      Either of `:always`, `:never`, `:first`. `:first` means the space is
+      only added on the first token (relevant when special tokens are used
+      or other pre_tokenizer are used). Defaults to `:always`

   """
   @spec metaspace(keyword()) :: t()
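
Reviewer note: since `:add_prefix_space` is removed outright, callers that still pass it need to translate their options before calling either `metaspace/1`. The following is a minimal sketch of such a translation, not something this commit ships. The module name `PrependSchemeMigration` and the function `migrate/1` are hypothetical, and the mapping of `true` to `:always` and `false` to `:never` is an assumption based on the old and new defaults documented in the hunks above.

```elixir
defmodule PrependSchemeMigration do
  @moduledoc false
  # Hypothetical helper for illustration; not part of the Tokenizers library.

  # Rewrites a legacy metaspace option list so that it uses
  # `:prepend_scheme` instead of the removed `:add_prefix_space`.
  @spec migrate(keyword()) :: keyword()
  def migrate(opts) when is_list(opts) do
    case Keyword.pop(opts, :add_prefix_space) do
      # No legacy option present: return the options unchanged.
      {nil, rest} ->
        rest

      # Legacy option present: drop it and add the assumed equivalent,
      # unless the caller already set `:prepend_scheme` explicitly.
      {legacy, rest} ->
        Keyword.put_new(rest, :prepend_scheme, to_scheme(legacy))
    end
  end

  # Assumed mapping: the old default `true` corresponds to the new
  # default `:always`; `false` corresponds to `:never`.
  defp to_scheme(true), do: :always
  defp to_scheme(false), do: :never
end

# Usage: translate old-style options before building a pre-tokenizer or decoder.
opts = PrependSchemeMigration.migrate(replacement: "▁", add_prefix_space: false)
IO.inspect(Keyword.fetch!(opts, :prepend_scheme))
# prints :never
```

The translated list could then be passed to `Tokenizers.PreTokenizer.metaspace/1` or `Tokenizers.Decoder.metaspace/1`. Note that `:first` has no boolean counterpart, so it can only be chosen explicitly.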

native/ex_tokenizers/Cargo.lock

Lines changed: 57 additions & 59 deletions
Some generated files are not rendered by default.

native/ex_tokenizers/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -13,5 +13,5 @@ crate-type = ["cdylib"]
 anyhow = "1"
 rustler = "0.29.1"
 thiserror = "1"
-tokenizers = { version = "0.15.0", default-features = false, features = ["onig", "esaxx_fast"]}
+tokenizers = { version = "0.19.1", default-features = false, features = ["onig", "esaxx_fast"]}
 serde = { version = "1.0", features = [ "rc", "derive" ] }
