Commit 44920d7
Generate Charsmaps During Build (#597)
* Generate Charsmaps During Build
  Use ICU package during build to generate CharsMaps with Sentencepiece builder. Don't include the ICU package in the Tokenizers distribution.
* Update Tests
  Delete tests for word level tokenizer - model is no longer on hub.
* Keep ICU dll during build
* Update Linux Build
* Add C++17 Requirement
* Add generated charsmaps to repo
* Remove ICU Debug Build
* Link sentencepiece-train only for charsmaps regeneration
* Remove redundant charsmaps
* Reuse one charsmap function for all node types
* Update Charsmaps
* Remove reinterpret_cast
* Update Benchmark to Report Warmup Metrics
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 5753b07 commit 44920d7
File tree
10 files changed, +115334 -401 lines changed
- benchmark
- cmake
- external
- modules
- src