server : Add option to return token pieces in /tokenize endpoint (ggml-org#9108)
* server : added with_pieces functionality to /tokenize endpoint
* server : Add tokenize with pieces tests to server.feature
* Handle case if tokenizer splits along utf8 continuation bytes
* Add example of token splitting
* Remove trailing ws
* Fix trailing ws
* Maybe fix ci
* maybe this fix windows ci?
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
examples/server/README.md: 37 additions & 2 deletions
@@ -407,9 +407,44 @@ Notice that each `probs` is an array of length `n_probs`.

 *Options:*

-`content`: Set the text to tokenize.
+`content`: (Required) The text to tokenize.

-`add_special`: Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
+`add_special`: (Optional) Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
+
+`with_pieces`: (Optional) Boolean indicating whether to return token pieces along with IDs. Default: `false`
+
+**Response:**
+
+Returns a JSON object with a `tokens` field containing the tokenization result. The `tokens` array contains either just token IDs or objects with `id` and `piece` fields, depending on the `with_pieces` parameter. The piece field is a string if the piece is valid unicode or a list of bytes otherwise.
+
+
+If `with_pieces` is `false`:
+```json
+{
+  "tokens": [123, 456, 789]
+}
+```
+
+If `with_pieces` is `true`:
+```json
+{
+  "tokens": [
+    {"id": 123, "piece": "Hello"},
+    {"id": 456, "piece": " world"},
+    {"id": 789, "piece": "!"}
+  ]
+}
+```
+
+With input 'á' (utf8 hex: C3 A1) on tinyllama/stories260k
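
The following is not part of the commit. It is a minimal client-side sketch of how the request options and the two response shapes documented in this diff might be handled. The helper names (`build_tokenize_request`, `piece_to_bytes`, `detokenize`) are hypothetical, and the token IDs in the last example are made up for illustration; only the field names `content`, `add_special`, `with_pieces`, `tokens`, `id`, and `piece` come from the README text above. The sketch makes no network call, so how the body is sent to the server is left open.

```python
import json
from typing import Any, List, Optional, Tuple


def build_tokenize_request(content: str, add_special: bool = False,
                           with_pieces: bool = False) -> str:
    """Serialize a request body using the three documented options."""
    return json.dumps({
        "content": content,          # required
        "add_special": add_special,  # optional, documented default: false
        "with_pieces": with_pieces,  # optional, documented default: false
    })


def piece_to_bytes(piece: Any) -> bytes:
    """Normalize a piece to raw bytes.

    Per the README, a piece is a string when it is valid unicode and a
    list of byte values otherwise.
    """
    if isinstance(piece, str):
        return piece.encode("utf-8")
    return bytes(piece)


def detokenize(response: dict) -> Tuple[List[int], Optional[str]]:
    """Return (ids, text); text is None when the response has IDs only.

    An empty `tokens` list is treated as IDs-only.
    """
    tokens = response["tokens"]
    if not tokens or isinstance(tokens[0], int):
        return list(tokens), None
    ids = [t["id"] for t in tokens]
    # Join at the byte level so a character split across tokens is restored.
    raw = b"".join(piece_to_bytes(t["piece"]) for t in tokens)
    return ids, raw.decode("utf-8", errors="replace")


# Response shapes copied from the README examples above.
ids_only = {"tokens": [123, 456, 789]}
with_pieces = {"tokens": [
    {"id": 123, "piece": "Hello"},
    {"id": 456, "piece": " world"},
    {"id": 789, "piece": "!"},
]}
# Hypothetical response (IDs invented): the two bytes of 'á' (C3 A1)
# arrive as separate byte-list pieces.
split_char = {"tokens": [{"id": 1, "piece": [195]}, {"id": 2, "piece": [161]}]}
```

Concatenating pieces as bytes before decoding is the design choice that matters here: decoding each piece on its own would fail for a tokenizer that splits along UTF-8 continuation bytes, which is the case the commit message says it handles.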