
Commit 44920d7

apaniukov and Copilot authored

Generate Charsmaps During Build (#597)
* Generate Charsmaps During Build

  Use the ICU package during the build to generate CharsMaps with the Sentencepiece builder. Don't include the ICU package in the Tokenizers distribution.

* Update Tests

  Delete tests for the word-level tokenizer — model is no longer on the hub.

* Keep ICU dll during build
* Update Linux Build
* Add C++17 Requirement
* Add generated charsmaps to repo
* Remove ICU Debug Build
* Link sentencepiece-train only for charsmaps regeneration
* Remove redundant charsmaps
* Reuse one charsmap function for all node types
* Update Charsmaps
* Remove reinterpret_cast
* Update Benchmark to Report Warmup Metrics
* Apply suggestion from @Copilot
* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 5753b07 · commit 44920d7

File tree

10 files changed: +115334 −401 lines


README.md

Lines changed: 15 additions & 0 deletions

````diff
@@ -148,6 +148,21 @@ cmake -DCMAKE_BUILD_TYPE=Release ..
 make
 ```
 
+#### CMake Build Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `REGENERATE_PRECOMPILED_CHARSMAP` | `OFF` | Regenerate Unicode normalization tables. Requires ICU. |
+| `ENABLE_SYSTEM_ICU` | `OFF` | Use system ICU instead of building from source (only when regenerating). |
+
+The `precompiled_charsmap.hpp` header containing Unicode normalization tables is pre-generated and committed to the repository. Most users don't need to regenerate it. To update these tables (e.g., after updating SentencePiece):
+
+```bash
+cmake -DCMAKE_BUILD_TYPE=Release -DREGENERATE_PRECOMPILED_CHARSMAP=ON ..
+make update_precompiled_charsmap
+# Commit the updated src/precompiled_charsmap.hpp
+```
+
 After that, you can transfer all binaries from `build/src` to `<openvino_dir>` as described in the C++ installation instruction above.
 
 ## Usage
````
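For context on what these tables contain: SentencePiece's precompiled charsmap bakes Unicode normalization rules (NFKC-style, per SentencePiece's defaults) into a lookup table so no ICU dependency is needed at runtime. A minimal illustration of the kind of mapping such tables encode, using only Python's stdlib `unicodedata` (this is an analogy, not the actual charsmap format):

```python
# Illustrative only: the precompiled charsmap encodes Unicode normalization
# mappings like the NFKC examples below. The real table is a binary trie
# consumed by SentencePiece's normalizer, not unicodedata.
import unicodedata

samples = ["\u2460", "\ufb01", "\uff28\uff49"]  # CIRCLED DIGIT ONE, fi-ligature, fullwidth "Hi"
for s in samples:
    print(repr(s), "->", repr(unicodedata.normalize("NFKC", s)))
```

Each input collapses to its plain-ASCII equivalent, which is the sort of canonicalization the precompiled tables make cheap at tokenization time.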

benchmark/benchmark.py

Lines changed: 10 additions & 0 deletions

````diff
@@ -11,6 +11,7 @@
 import matplotlib.pyplot as plt
 import openvino as ov
 import pandas as pd
+import psutil
 import seaborn as sns
 from openvino import AsyncInferQueue, CompiledModel, InferRequest, ProfilingInfo, properties
 from openvino_tokenizers import convert_tokenizer
@@ -93,8 +94,17 @@ def benchmark_tokenizers(
     results = []
 
     # warmup
+    process = psutil.Process()
     for repeat in range(1, 2):
+        # print time of the first tokenization pass
+        print(f"Warmup iteration {repeat}")
+
+        mem_before_ov = process.memory_info().rss / 1024 / 1024  # MB
+        ov_start = perf_counter()
         ov_tokenizer(["test " * repeat])
+        ov_time = perf_counter() - ov_start
+        mem_after_ov = process.memory_info().rss / 1024 / 1024  # MB
+        print(f"OV warmup time: {ov_time:.6f} seconds, memory delta: {mem_after_ov - mem_before_ov:.2f} MB")
         hf_tokenizer(["test " * repeat])
 
     ov_input_ids = []
````
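The new warmup code times the first tokenization pass and reports the process RSS delta via `psutil`. A stdlib-only sketch of the same measurement pattern, with a hypothetical `dummy_tokenizer` standing in for `ov_tokenizer` and `tracemalloc` (which tracks Python heap allocations, not process RSS, so its numbers are smaller) standing in for `psutil`:

```python
# Sketch of the warmup-metrics pattern added in benchmark.py.
# dummy_tokenizer is a stand-in; tracemalloc replaces psutil so the
# example runs with the stdlib alone.
import tracemalloc
from time import perf_counter

def dummy_tokenizer(batch):
    # stand-in tokenizer: split each string on whitespace
    return [text.split() for text in batch]

tracemalloc.start()
mem_before, _ = tracemalloc.get_traced_memory()
start = perf_counter()
tokens = dummy_tokenizer(["test " * 1])
elapsed = perf_counter() - start
mem_after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"warmup time: {elapsed:.6f} s, "
      f"memory delta: {(mem_after - mem_before) / 1024:.2f} KiB")
```

Measuring memory before and after the first call isolates one-time allocation cost (lazy initialization, caches) from steady-state throughput, which is why the benchmark reports it separately from the timed iterations.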
