Commit ea2bca3

New nvtext::wordpiece_tokenizer APIs (rapidsai#17600)
Creates a new wordpiece-tokenizer which replaces the existing subword-tokenizer in nvtext. The subword-tokenizer logic is split out and specialized to perform basic tokenizing with the wordpiece logic only. The normalizing step is already a separate API. The output will be a lists column of tokens only.

The first change is that the new API uses `wordpiece` instead of `subword`. Here are the two C++ API declarations:

```cpp
std::unique_ptr<wordpiece_vocabulary> load_wordpiece_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

The vocabulary is loaded as a strings column, and the returned object can be used in multiple calls to the next API:

```cpp
std::unique_ptr<cudf::column> wordpiece_tokenize(
  cudf::strings_column_view const& input,
  wordpiece_vocabulary const& vocabulary,
  cudf::size_type max_words_per_row,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

This returns a lists column of integers which represent the tokens for each row. The `max_words_per_row` parameter stops the tokenizing for a row once that number of input words (characters delimited by spaces) has been reached. This means a row may yield more than `max_words_per_row` tokens if a single word produces multiple tokens. Note that this API expects the input strings to already be normalized, i.e. processed by the `nvtext::normalize_characters` API, which is also being reworked in rapidsai#17818.

The Python interface has the following pattern:

```python
from cudf.core.wordpiece_tokenize import WordPieceVocabulary

input_string = ...  # output of the normalizer
vocab_file = os.path.join(datadir, "bert_base_cased_sampled/vocab.txt")
vc = cudf.read_text(vocab_file, delimiter="\n", strip_delimiters=True)
wpt = WordPieceVocabulary(vc)
wpr = wpt.tokenize(input_string)
```

The output is a lists column of the tokens and no longer the tensor-data and metadata format.
If this format is needed, then we can consider a third API that converts the output here to that format.

Closes rapidsai#17507

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Shruti Shivakumar (https://github.com/shrshi)
- Basit Ayantunde (https://github.com/lamarrr)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)

URL: rapidsai#17600
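As a rough illustration of the tokenizing behavior described above, here is a minimal CPU sketch of greedy longest-match wordpiece tokenization. The function `wordpiece_tokenize_ref` is illustrative only and is not part of cudf; the actual implementation runs on the GPU. It uses the vocabulary and input from the header's doc examples:

```python
def wordpiece_tokenize_ref(text, vocab, max_words_per_row=0):
    """Reference sketch: greedy longest-match wordpiece over space-delimited words."""
    token_ids = {tok: i for i, tok in enumerate(vocab)}
    words = text.split()
    if max_words_per_row > 0:
        words = words[:max_words_per_row]  # limits input words, not output tokens
    out = []
    for word in words:
        pieces, start = [], 0
        while start < len(word):
            end = len(word)
            match = None
            while end > start:  # try the longest remaining piece first
                sub = ("##" if start > 0 else "") + word[start:end]
                if sub in token_ids:
                    match = token_ids[sub]
                    break
                end -= 1
            if match is None:  # no piece matched: the whole word becomes [UNK]
                pieces = [token_ids["[UNK]"]]
                break
            pieces.append(match)
            start = end
        out.extend(pieces)  # a single word may yield several tokens
    return out

vocab = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
print(wordpiece_tokenize_ref("I have a new GPU now !", vocab))     # [3, 2, 1, 4, 5, 6, 0, 7]
print(wordpiece_tokenize_ref("I have a new GPU now !", vocab, 5))  # [3, 2, 1, 4, 5, 6]
```

The second call shows why `max_words_per_row` bounds input words rather than output tokens: five words ("I have a new GPU") produce six tokens, because "GPU" splits into "GP" + "##U".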
1 parent 32bdfb0 commit ea2bca3

File tree

16 files changed: +1900, -10 lines


cpp/CMakeLists.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -757,6 +757,7 @@ add_library(
   src/text/subword/wordpiece_tokenizer.cu
   src/text/tokenize.cu
   src/text/vocabulary_tokenize.cu
+  src/text/wordpiece_tokenize.cu
   src/transform/bools_to_mask.cu
   src/transform/compute_column.cu
   src/transform/encode.cu
```

cpp/benchmarks/text/subword.cpp

Lines changed: 39 additions & 2 deletions

```diff
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2020-2024, NVIDIA CORPORATION.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION.
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -20,6 +20,7 @@
 #include <cudf/strings/strings_column_view.hpp>

 #include <nvtext/subword_tokenize.hpp>
+#include <nvtext/wordpiece_tokenize.hpp>

 #include <nvbench/nvbench.cuh>

@@ -57,7 +58,10 @@ static void bench_subword_tokenizer(nvbench::state& state)
 {
   auto const num_rows = static_cast<cudf::size_type>(state.get_int64("num_rows"));

-  std::vector<char const*> h_strings(num_rows, "This is a test ");
+  std::vector<char const*> h_strings(
+    num_rows,
+    "This is a test This is a test This is a test This is a test This is a test This is a test "
+    "This is a test This is a test ");
   cudf::test::strings_column_wrapper strings(h_strings.begin(), h_strings.end());
   static std::string hash_file = create_hash_vocab_file();
   std::vector<uint32_t> offsets{14};
@@ -83,3 +87,36 @@ static void bench_subword_tokenizer(nvbench::state& state)
 NVBENCH_BENCH(bench_subword_tokenizer)
   .set_name("subword_tokenize")
   .add_int64_axis("num_rows", {32768, 262144, 2097152});
+
+static void bench_wordpiece_tokenizer(nvbench::state& state)
+{
+  auto const num_rows  = static_cast<cudf::size_type>(state.get_int64("num_rows"));
+  auto const max_words = static_cast<cudf::size_type>(state.get_int64("max_words"));
+
+  auto const h_strings = std::vector<char const*>(
+    num_rows,
+    "This is a test This is a test This is a test This is a test This is a test This is a test "
+    "This is a test This is a test ");
+  auto const num_words = 32;  // "This is a test" * 8
+  auto const d_strings = cudf::test::strings_column_wrapper(h_strings.begin(), h_strings.end());
+  auto const input     = cudf::strings_column_view{d_strings};
+
+  auto const vocabulary =
+    cudf::test::strings_column_wrapper({"", "[UNK]", "This", "is", "a", "test"});
+  auto const vocab = nvtext::load_wordpiece_vocabulary(cudf::strings_column_view(vocabulary));
+
+  state.set_cuda_stream(nvbench::make_cuda_stream_view(cudf::get_default_stream().value()));
+  auto chars_size = input.chars_size(cudf::get_default_stream());
+  state.add_global_memory_reads<nvbench::int8_t>(chars_size);
+  auto out_size = num_rows * (max_words > 0 ? std::min(max_words, num_words) : num_words);
+  state.add_global_memory_writes<nvbench::int32_t>(out_size);
+
+  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
+    auto result = nvtext::wordpiece_tokenize(input, *vocab, max_words);
+  });
+}
+
+NVBENCH_BENCH(bench_wordpiece_tokenizer)
+  .set_name("wordpiece_tokenize")
+  .add_int64_axis("num_rows", {32768, 262144, 2097152})
+  .add_int64_axis("max_words", {0, 20, 40});
```
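The benchmark's global-memory write estimate assumes one output token per input word, capped by `max_words`. A quick sketch of that accounting (the helper name `expected_token_count` is illustrative, not part of the benchmark):

```python
def expected_token_count(num_rows, num_words, max_words):
    # mirrors: num_rows * (max_words > 0 ? min(max_words, num_words) : num_words)
    words = min(max_words, num_words) if max_words > 0 else num_words
    return num_rows * words

# Each benchmark row holds 32 words ("This is a test" repeated 8 times).
print(expected_token_count(32768, 32, 0))   # 1048576
print(expected_token_count(32768, 32, 20))  # 655360
print(expected_token_count(32768, 32, 40))  # 1048576
```

With `max_words = 40` the cap exceeds the 32 words per row, so it has no effect, matching the `max_words = 0` case.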
Lines changed: 122 additions & 0 deletions

New file:

```cpp
/*
 * Copyright (c) 2024-2025, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/export.hpp>
#include <cudf/utilities/memory_resource.hpp>

namespace CUDF_EXPORT nvtext {
/**
 * @addtogroup nvtext_tokenize
 * @{
 * @file
 */

/**
 * @brief Vocabulary object to be used with nvtext::wordpiece_tokenizer
 *
 * Use nvtext::load_wordpiece_vocabulary to create this object.
 */
struct wordpiece_vocabulary {
  /**
   * @brief Vocabulary object constructor
   *
   * Token ids are the row indices within the vocabulary column.
   * Each vocabulary entry is expected to be unique; otherwise the behavior is undefined.
   *
   * @throw std::invalid_argument if `input` contains nulls or is empty
   *
   * @param input Strings for the vocabulary
   * @param stream CUDA stream used for device memory operations and kernel launches
   * @param mr Device memory resource used to allocate the returned column's device memory
   */
  wordpiece_vocabulary(cudf::strings_column_view const& input,
                       rmm::cuda_stream_view stream      = cudf::get_default_stream(),
                       rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
  ~wordpiece_vocabulary();

  struct wordpiece_vocabulary_impl;
  std::unique_ptr<wordpiece_vocabulary_impl> _impl;
};

/**
 * @brief Create a wordpiece_vocabulary object from a strings column
 *
 * Token ids are the row indices within the vocabulary column.
 * Each vocabulary entry is expected to be unique; otherwise the behavior is undefined.
 *
 * @throw std::invalid_argument if `input` contains nulls or is empty
 *
 * @param input Strings for the vocabulary
 * @param stream CUDA stream used for device memory operations and kernel launches
 * @param mr Device memory resource used to allocate the returned column's device memory
 * @return Object to be used with nvtext::wordpiece_tokenize
 */
std::unique_ptr<wordpiece_vocabulary> load_wordpiece_vocabulary(
  cudf::strings_column_view const& input,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
 * @brief Returns the token ids for the input strings using the wordpiece tokenizer
 * algorithm with the given vocabulary
 *
 * Example:
 * @code{.pseudo}
 * vocabulary = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
 * v = load_wordpiece_vocabulary(vocabulary)
 * input = ["I have a new GPU now !"]
 * t = wordpiece_tokenize(input, v)
 * t is now [[3, 2, 1, 4, 5, 6, 0, 7]]
 * @endcode
 *
 * The `max_words_per_row` parameter optionally limits the output by only processing
 * a maximum number of words per row. Here a word is defined as a consecutive
 * sequence of characters delimited by space character(s).
 *
 * Example:
 * @code{.pseudo}
 * vocabulary = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
 * v = load_wordpiece_vocabulary(vocabulary)
 * input = ["I have a new GPU now !"]
 * t4 = wordpiece_tokenize(input, v, 4)
 * t4 is now [[3, 2, 1, 4]]
 * t5 = wordpiece_tokenize(input, v, 5)
 * t5 is now [[3, 2, 1, 4, 5, 6]]
 * @endcode
 *
 * Any null row entry results in a corresponding null entry in the output.
 *
 * @param input Strings column to tokenize
 * @param vocabulary Used to lookup tokens within `input`
 * @param max_words_per_row Maximum number of words to tokenize for each row.
 *        Default 0 tokenizes all words.
 * @param stream CUDA stream used for device memory operations and kernel launches
 * @param mr Device memory resource used to allocate the returned column's device memory
 * @return Lists column of token ids
 */
std::unique_ptr<cudf::column> wordpiece_tokenize(
  cudf::strings_column_view const& input,
  wordpiece_vocabulary const& vocabulary,
  cudf::size_type max_words_per_row = 0,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/** @} */  // end of tokenize group
}  // namespace CUDF_EXPORT nvtext
```
