
Commit 777cf46

Update docs (#50)
1 parent 15b9cdf commit 777cf46

21 files changed (+688, −615 lines)

README.md

Lines changed: 11 additions & 16 deletions
````diff
@@ -4,15 +4,9 @@
 
 Elixir bindings for [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers).
 
-## Getting started
+## Installation
 
-In order to use `Tokenizers`, you will need Elixir installed. Then create an Elixir project via the `mix` build tool:
-
-```
-$ mix new my_app
-```
-
-Then you can add `Tokenizers` as dependency in your `mix.exs`.
+You can add `:tokenizers` as dependency in your `mix.exs`:
 
 ```elixir
 def deps do
@@ -30,26 +24,27 @@ Mix.install([
 ])
 ```
 
-## Quick example
+## Example
+
+You can use any pre-trained tokenizer from any model repo on Hugging Face Hub, such as [bert-base-cased](https://huggingface.co/bert-base-cased).
 
 ```elixir
-# Go get a tokenizer -- any from the Hugging Face models repo will do
 {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
 {:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
 Tokenizers.Encoding.get_tokens(encoding)
-# ["Hello", "there", "!"]
+#=> ["Hello", "there", "!"]
 Tokenizers.Encoding.get_ids(encoding)
-# [8667, 1175, 106]
+#=> [8667, 1175, 106]
 ```
 
 The [notebooks](./notebooks) directory has [an introductory Livebook](./notebooks/pretrained.livemd) to give you a feel for the API.
 
 ## Contributing
 
-Tokenizers uses Rust to call functionality from the Hugging Face Tokenizers library. While
-Rust is not necessary to use Tokenizers as a package, you need Rust tooling installed on
-your machine if you want to compile from source, which is the case when contributing to
-Tokenizers. In particular, you will need Rust Stable, which can be installed with
+Tokenizers uses Rust to call functionality from the Hugging Face Tokenizers library. While
+Rust is not necessary to use Tokenizers as a package, you need Rust tooling installed on
+your machine if you want to compile from source, which is the case when contributing to
+Tokenizers. In particular, you will need Rust Stable, which can be installed with
 [Rustup](https://rust-lang.github.io/rustup/installation/index.html).
 
 ## License
````
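
The hunks above cut off the actual dependency entry, but the `Mix.install([` context line shows the README also supports script-style usage. Below is a minimal, self-contained sketch of that path; it passes the bare `:tokenizers` app name to `Mix.install/1` (which resolves the latest published version) rather than guessing at the version constraint elided by the diff. The calls and outputs are the ones from the README's own example.

```elixir
# Standalone script / Livebook cell sketch. Mix.install/1 accepts bare
# app names, so no version constraint is hard-coded here.
Mix.install([:tokenizers])

{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

IO.inspect(Tokenizers.Encoding.get_tokens(encoding))
#=> ["Hello", "there", "!"]
IO.inspect(Tokenizers.Encoding.get_ids(encoding))
#=> [8667, 1175, 106]
```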

lib/tokenizers.ex

Lines changed: 11 additions & 8 deletions
````diff
@@ -4,16 +4,19 @@ defmodule Tokenizers do
 
   Hugging Face describes the Tokenizers library as:
 
-  > Fast State-of-the-art tokenizers, optimized for both research and production
+  > Fast State-of-the-art tokenizers, optimized for both research and
+  > production
   >
-  > 🤗 Tokenizers provides an implementation of today’s most used tokenizers, with a focus on performance and versatility. These tokenizers are also used in 🤗 Transformers.
+  > 🤗 Tokenizers provides an implementation of today’s most used
+  > tokenizers, with a focus on performance and versatility. These
+  > tokenizers are also used in 🤗 Transformers.
 
-  This library has bindings to use pretrained tokenizers. Support for building and training
-  a tokenizer from scratch is forthcoming.
+  A tokenizer is effectively a pipeline of transformations that take
+  a text input and return an encoded version of that text (`t:Tokenizers.Encoding.t/0`).
 
-  A tokenizer is effectively a pipeline of transforms to take some input text and return a
-  `Tokenizers.Encoding.t()`. The main entrypoint to this library is the `Tokenizers.Tokenizer`
-  module, which holds the `Tokenizers.Tokenizer.t()` struct, a container holding the constituent
-  parts of the pipeline. Most functionality is there.
+  The main entrypoint to this library is the `Tokenizers.Tokenizer`
+  module, which defines the `t:Tokenizers.Tokenizer.t/0` struct, a
+  container holding the constituent parts of the pipeline. Most
+  functionality is in that module.
   """
 end
````
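
As a quick illustration of the pipeline the new moduledoc describes, here is a hedged sketch using only functions that appear elsewhere in this commit; the pretrained model name is the one from the README, and the input sentence is arbitrary.

```elixir
# The %Tokenizers.Tokenizer{} struct is the pipeline container; encode/2
# runs the pipeline and produces a %Tokenizers.Encoding{}.
{:ok, %Tokenizers.Tokenizer{} = tokenizer} =
  Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

{:ok, %Tokenizers.Encoding{} = encoding} =
  Tokenizers.Tokenizer.encode(tokenizer, "Pipelines, not magic.")

# The Encoding accessors expose the pipeline's output.
Tokenizers.Encoding.get_tokens(encoding)
Tokenizers.Encoding.get_ids(encoding)
```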

lib/tokenizers/added_token.ex

Lines changed: 34 additions & 36 deletions
````diff
@@ -1,53 +1,51 @@
 defmodule Tokenizers.AddedToken do
   @moduledoc """
-  This struct represents AddedTokens
+  This struct represents a token added to tokenizer vocabulary.
   """
 
   @type t() :: %__MODULE__{resource: reference()}
   defstruct [:resource]
 
-  @typedoc """
-  Options for added token initialisation. All options can be ommited.
-  """
-  @type opts() :: [
-          special: boolean(),
-          single_word: boolean(),
-          lstrip: boolean(),
-          rstrip: boolean(),
-          normalized: boolean()
-        ]
-
   @doc """
-  Create a new AddedToken.
+  Builds a new added token.
+
+  ## Options
+
+    * `:special` - defines whether this token is a special token.
+      Defaults to `false`
 
-    * `:special` (default `false`) - defines whether this token is a special token.
+    * `:single_word` - defines whether this token should only match
+      single words. If `true`, this token will never match inside of a
+      word. For example the token `ing` would match on `tokenizing` if
+      this option is `false`. The notion of ”inside of a word” is
+      defined by the word boundaries pattern in regular expressions
+      (i.e. the token should start and end with word boundaries).
+      Defaults to `false`
 
-    * `:single_word` (default `false`) - defines whether this token should only match single words.
-      If `true`, this token will never match inside of a word. For example the token `ing` would
-      match on `tokenizing` if this option is `false`, but not if it is `true`.
-      The notion of ”inside of a word” is defined by the word boundaries pattern
-      in regular expressions (i.e. the token should start and end with word boundaries).
+    * `:lstrip` - defines whether this token should strip all potential
+      whitespace on its left side. If `true`, this token will greedily
+      match any whitespace on its left. For example if we try to match
+      the token `[MASK]` with `lstrip=true`, in the text `"I saw a [MASK]"`,
+      we would match on `" [MASK]"`. (Note the space on the left).
+      Defaults to `false`
 
-    * `:lstrip` (default `false`) - defines whether this token should strip all potential
-      whitespaces on its left side.
-      If `true`, this token will greedily match any whitespace on its left.
-      For example if we try to match the token `[MASK]` with `lstrip=true`,
-      in the text `"I saw a [MASK]"`, we would match on `" [MASK]"`. (Note the space on the left).
+    * `:rstrip` - defines whether this token should strip all potential
+      whitespaces on its right side. If `true`, this token will greedily
+      match any whitespace on its right. It works just like `:lstrip`,
+      but on the right. Defaults to `false`
 
-    * `:rstrip` (default `false`) - defines whether this token should strip all potential
-      whitespaces on its right side.
-      If `true`, this token will greedily match any whitespace on its right.
-      It works just like `lstrip` but on the right.
+    * `:normalized` - defines whether this token should match against
+      the normalized version of the input text. For example, with the
+      added token `"yesterday"`, and a normalizer in charge of
+      lowercasing the text, the token could be extracted from the input
+      `"I saw a lion Yesterday"`. If `true`, the token will be extracted
+      from the normalized input `"i saw a lion yesterday"`. If `false`,
+      the token will be extracted from the original input
+      `"I saw a lion Yesterday"`. Defaults to `false` for special tokens
+      and `true` otherwise
 
-    * `:normalized` (default `true` for not special tokens, `false` for special tokens) -
-      defines whether this token should match against the normalized version of the input text.
-      For example, with the added token `"yesterday"`,
-      and a normalizer in charge of lowercasing the text,
-      the token could be extract from the input `"I saw a lion Yesterday"`.
-      If `true`, the token will be extracted from the normalized input `"i saw a lion yesterday"`.
-      If `false`, the token will be extracted from the original input `"I saw a lion Yesterday"`.
   """
-  @spec new(token :: String.t(), opts :: opts()) :: t()
+  @spec new(token :: String.t(), keyword()) :: t()
   defdelegate new(token, opts \\ []), to: Tokenizers.Native, as: :added_token_new
 
   @doc """
````
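
To make the options above concrete, here is a small hedged sketch that only builds added tokens via `Tokenizers.AddedToken.new/2`; how the tokens are subsequently registered with a tokenizer is outside the scope of this diff.

```elixir
# A special token: :normalized therefore defaults to false, and :lstrip
# lets it swallow the whitespace to its left (as in the [MASK] example above).
mask = Tokenizers.AddedToken.new("[MASK]", special: true, lstrip: true)

# A regular added token: matches whole words only and, by default, is
# matched against the normalized (e.g. lowercased) input.
yesterday = Tokenizers.AddedToken.new("yesterday", single_word: true)

[mask, yesterday]
```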

lib/tokenizers/decoder.ex

Lines changed: 57 additions & 60 deletions
````diff
@@ -1,10 +1,16 @@
 defmodule Tokenizers.Decoder do
   @moduledoc """
-  The Decoder knows how to go from the IDs used by the Tokenizer, back to a readable piece of text.
-  Some Normalizer and PreTokenizer use special characters or identifiers that need to be reverted.
+  Decoders and decoding functions.
+
+  Decoder transforms a sequence of token ids back to a readable piece
+  of text.
+
+  Some normalizers and pre-tokenizers use special characters or
+  identifiers that need special logic to be reverted.
   """
 
   defstruct [:resource]
+
   @type t() :: %__MODULE__{resource: reference()}
 
   @doc """
@@ -13,113 +19,104 @@ defmodule Tokenizers.Decoder do
   @spec decode(t(), [String.t()]) :: {:ok, String.t()} | {:error, any()}
   defdelegate decode(decoder, tokens), to: Tokenizers.Native, as: :decoders_decode
 
-  @typedoc """
-  Options for BPE decoder initialization. All options can be ommited.
+  @doc """
+  Creates a BPE decoder.
 
-  * `suffix` - The suffix to add to the end of each word, defaults to `</w>`
-  """
-  @type bpe_options :: [suffix: String.t()]
+  ## Options
+
+    * `suffix` - the suffix to add to the end of each word. Defaults
+      to `</w>`
 
-  @doc """
-  Creates new BPE decoder
   """
-  @spec bpe(bpe_options :: bpe_options()) :: t()
-  defdelegate bpe(options \\ []), to: Tokenizers.Native, as: :decoders_bpe
+  @spec bpe(keyword()) :: t()
+  defdelegate bpe(opts \\ []), to: Tokenizers.Native, as: :decoders_bpe
 
   @doc """
-  Creates new ByteFallback decoder
+  Creates a ByteFallback decoder.
   """
   @spec byte_fallback() :: t()
   defdelegate byte_fallback(), to: Tokenizers.Native, as: :decoders_byte_fallback
 
   @doc """
-  Creates new ByteLevel decoder
+  Creates a ByteLevel decoder.
   """
   @spec byte_level() :: t()
   defdelegate byte_level(), to: Tokenizers.Native, as: :decoders_byte_level
 
-  @typedoc """
-  Options for CTC decoder initialization. All options can be ommited.
+  @doc """
+  Creates a CTC decoder.
 
-  * `pad_token` - The token used for padding, defaults to `<pad>`
-  * `word_delimiter_token` - The token used for word delimiter, defaults to `|`
-  * `cleanup` - Whether to cleanup tokenization artifacts, defaults to `true`
-  """
-  @type ctc_options :: [
-          pad_token: String.t(),
-          word_delimiter_token: String.t(),
-          cleanup: boolean()
-        ]
+  ## Options
+
+    * `pad_token` - the token used for padding. Defaults to `<pad>`
+
+    * `word_delimiter_token` - the token used for word delimiter.
+      Defaults to `|`
+
+    * `cleanup` - whether to cleanup tokenization artifacts, defaults
+      to `true`
 
-  @doc """
-  Creates new CTC decoder
   """
-  @spec ctc(ctc_options :: ctc_options()) :: t()
-  defdelegate ctc(options \\ []), to: Tokenizers.Native, as: :decoders_ctc
+  @spec ctc(keyword()) :: t()
+  defdelegate ctc(opts \\ []), to: Tokenizers.Native, as: :decoders_ctc
 
   @doc """
-  Creates new Fuse decoder
+  Creates a Fuse decoder.
   """
   @spec fuse :: t()
   defdelegate fuse(), to: Tokenizers.Native, as: :decoders_fuse
 
-  @typedoc """
-  Options for Metaspace decoder initialization. All options can be ommited.
+  @doc """
+  Creates a Metaspace decoder.
+
+  ## Options
 
-  * `replacement` - The replacement character, defaults to `▁` (as char)
-  * `add_prefix_space` - Whether to add a space to the first word, defaults to `true`
-  """
+    * `replacement` - the replacement character. Defaults to `▁`
+      (as char)
 
-  @type metaspace_options :: [
-          replacement: char(),
-          add_prefix_space: boolean()
-        ]
+    * `add_prefix_space` - whether to add a space to the first word.
+      Defaults to `true`
 
-  @doc """
-  Creates new Metaspace decoder
   """
-  @spec metaspace(metaspace_options :: metaspace_options()) :: t()
-  defdelegate metaspace(options \\ []),
+  @spec metaspace(keyword()) :: t()
+  defdelegate metaspace(opts \\ []),
     to: Tokenizers.Native,
     as: :decoders_metaspace
 
   @doc """
-  Creates new Replace decoder
+  Creates a Replace decoder.
   """
-  @spec replace(pattern :: String.t(), content :: String.t()) :: t()
+  @spec replace(String.t(), String.t()) :: t()
   defdelegate replace(pattern, content), to: Tokenizers.Native, as: :decoders_replace
 
   @doc """
-  Creates new Sequence decoder
+  Combines a list of decoders into a single sequential decoder.
   """
   @spec sequence(decoders :: [t()]) :: t()
   defdelegate sequence(decoders), to: Tokenizers.Native, as: :decoders_sequence
 
   @doc """
-  Creates new Strip decoder.
+  Creates a Strip decoder.
 
   It expects a character and the number of times to strip the
   character on `left` and `right` sides.
   """
-  @spec strip(content :: char(), left :: non_neg_integer(), right :: non_neg_integer()) :: t()
+  @spec strip(char(), non_neg_integer(), non_neg_integer()) :: t()
   defdelegate strip(content, left, right), to: Tokenizers.Native, as: :decoders_strip
 
-  @typedoc """
-  Options for WordPiece decoder initialization. All options can be ommited.
+  @doc """
+  Creates a WordPiece decoder.
 
-  * `prefix` - The prefix to use for subwords, defaults to `##`
-  * `cleanup` - Whether to cleanup tokenization artifacts, defaults to `true`
-  """
-  @type word_piece_options :: [
-          prefix: String.t(),
-          cleanup: boolean()
-        ]
+  ## Options
+
+    * `prefix` - The prefix to use for subwords. Defaults to `##`
+
+    * `cleanup` - Whether to cleanup tokenization artifacts. Defaults
+      to `true`
 
-  @doc """
-  Creates new WordPiece decoder
   """
-  @spec word_piece(word_piece_options :: word_piece_options()) :: t()
-  defdelegate word_piece(options \\ []),
+  @spec word_piece(keyword()) :: t()
+  defdelegate word_piece(opts \\ []),
     to: Tokenizers.Native,
     as: :decoders_wordpiece
 end
````
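
A brief hedged sketch of how the constructors above pair with `decode/2`; the token lists are invented for illustration, and the exact output text depends on each decoder's cleanup rules.

```elixir
# WordPiece decoding: strips the "##" continuation prefix and re-joins subwords.
decoder = Tokenizers.Decoder.word_piece(prefix: "##", cleanup: true)
{:ok, text} = Tokenizers.Decoder.decode(decoder, ["Token", "##izers", "are", "fun"])
# text is expected to read roughly "Tokenizers are fun"

# Decoders compose: a Replace step followed by Fuse, run as one sequence.
chained =
  Tokenizers.Decoder.sequence([
    Tokenizers.Decoder.replace("_", " "),
    Tokenizers.Decoder.fuse()
  ])

{:ok, _text} = Tokenizers.Decoder.decode(chained, ["Hello_there", "_friend"])
```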
