Commit a4dbc27

bwade@scottlogic.com (Siteleaf) authored and committed
Updated Bigger Recolour Bert Cli and 5 other files
1 parent 693644b

File tree

6 files changed: +287 additions, −9 deletions

_drafts/token-triage.markdown

Lines changed: 11 additions & 9 deletions
@@ -1,13 +1,15 @@
 ---
-title: Token Triage
-date: 2025-12-15 11:02:00 Z
+title: Token Prism
+date: 2025-12-15 15:11:00 Z
 categories:
 - Artificial Intelligence
-summary: Making a simple CLI tool to visualise tokeniser output.
+summary: Visualising the hidden building blocks of LLM text
 author: jstrong
 ---

-Recently, I have been working on an agentic AI system. Tool calls and their results abound and the tokens mount up quite quickly. I had a need to see where all the tokens were coming from, what they consisted of, and if they were all necessary. In particular, I wanted to visualise the token output of [OpenAI](https://openai.com/) models. OpenAI already provides a [tokeniser website](https://platform.openai.com/tokenizer) for its models but given the sensitivity of the data I am working with, using this with any more than toy data would be inappropriate. Consequently, I set out to make my own, more secure, offline solution.
+# Token Prism
+
+When working with agentic AI, tool calls and their results abound and the tokens mount up quickly. As a result, I wanted to visualise the token output of [OpenAI](https://openai.com/) models. OpenAI already provides a [tokeniser website](https://platform.openai.com/tokenizer) for its models, but due to data sensitivity, using it with anything more than toy data would be inappropriate. Consequently, I set out to make my own, more secure, offline solution.

 ## Background

@@ -23,7 +25,7 @@ Whilst tokens may be more efficient, they are not without their drawbacks. An of

 The eagle-eyed among you will notice the answer is 3. However, even the most advanced LLMs of the day regularly claim otherwise - and this is almost all down to tokenisation. For example, GPT-4 does not see 'strawberry' as 'S-T-R-A-W-B-E-R-R-Y,' but instead 'STR-AW-BERRY.'

-![Strawberry tokenisation with images courtesy of Nano Banana Pro.](/uploads/tokenisation_diagram.png)
+![Strawberry tokenisation with images courtesy of Nano Banana Pro.](/uploads/no_robot_tokenisation_diagram.png)

 It cannot 'see' the letters individually, so it is difficult for it to count them correctly.
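The letter-counting failure above is easy to reproduce in miniature. This is a pure-Python illustration; the token split is the approximate 'STR-AW-BERRY' one from the post, not an exact tokeniser output:

```python
# Character view: what we see.
word = "strawberry"
print(word.count("r"))  # 3

# Token view: roughly what GPT-4 sees, per the post.
tokens = ["str", "aw", "berry"]
# The model receives one opaque id per token, so the letters inside each
# chunk are not directly visible to it - counting them requires the model
# to have effectively memorised each token's spelling.
print(len(tokens))  # 3 chunks - but that is a coincidence, not the letter count
```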

@@ -39,7 +41,7 @@ In particular, I wanted a CLI-based solution for ease of use that would support

 ## How

-[`tiktoken`](https://github.com/openai/tiktoken) is the Python package which allows access to tokenisers for OpenAI models such as GPT-4o, GPT-5 etc., so I decided to go ahead with Python as the language for my application.
+[`tiktoken`](https://github.com/openai/tiktoken) is the Python SDK which allows access to tokenisers for OpenAI models such as GPT-4o, GPT-5 etc. However, it is purely a library for encoding and decoding programmatically - its output is not readily human-readable. Therefore, I decided to use Python as the language for my application, wrapping the backend logic of `tiktoken` with a visual interface better suited for analysis by a person.

 ### Separating

@@ -77,19 +79,19 @@ For the CLI, I went with [`click`](https://github.com/pallets/click) to define t

 With the tokens separated and decoded, I applied a colour cycle to the output. The resulting CLI looks like this:

-![Piping file to CLI.](/uploads/cli_file_pipe.png)
+![Piping file to CLI.](/uploads/bigger_cli_file_pipe_recolour.png)
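A minimal sketch of such a colour cycle, assuming plain ANSI escape codes and `itertools.cycle`; the real tool may choose its colours differently:

```python
from itertools import cycle

# Rotating ANSI background colours make adjacent tokens distinguishable.
ANSI_BACKGROUNDS = ["\033[44m", "\033[41m", "\033[42m", "\033[45m"]
RESET = "\033[0m"

def colourise(pieces: list[str]) -> str:
    """Wrap each token piece in the next colour of the cycle."""
    colours = cycle(ANSI_BACKGROUNDS)
    return "".join(f"{c}{piece}{RESET}" for c, piece in zip(colours, pieces))

print(colourise(["STR", "AW", "BERRY"]))
```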

 This works as I had envisioned, so now it is time to move on to the reactive aspect. I decided to go with the [`textual`](https://github.com/Textualize/textual) TUI package to facilitate this. The API was straightforward and easy to use, and now when you pass `-i` or `--interactive` you see:

-![TUI shown via the interactive flag.](/uploads/tui_video.gif)
+![TUI shown via the interactive flag.](/uploads/tui_moving_recolour.svg)
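The post defines its flags with `click`; purely for illustration, the same `-i`/`--interactive` switch has this shape with stdlib `argparse` (the program name and help text are mine, not the tool's actual code):

```python
import argparse

parser = argparse.ArgumentParser(prog="tokenprism")  # hypothetical name
parser.add_argument(
    "-i", "--interactive",
    action="store_true",  # False unless the flag is present
    help="launch the TUI instead of printing coloured tokens once",
)

args = parser.parse_args(["--interactive"])
print(args.interactive)  # True
```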

 At present, only 3 'statistics' are displayed, but I have plans to add more which would aid in tokenised input analysis.

 ## Extension

 With these features, the application had reached MVP status. However, I saw an avenue for improving upon its capabilities: supporting any tokeniser available from [HuggingFace](https://huggingface.co/). The change to allow this was small, given the API for the [`tokenizers`](https://github.com/huggingface/tokenizers) library is relatively similar to that of `tiktoken`. This change expanded the horizons of the application massively and allowed for seeing how thousands of open-source models approach tokenisation, which is often very different to OpenAI's:

-![CLI with the Google model 'bert-base-cased', sourced from HuggingFace.](/uploads/bert_cli.png)
+![CLI with the Google model 'bert-base-cased', sourced from HuggingFace.](/uploads/bigger_recolour_bert_cli.png)
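Because the two libraries' APIs line up, a thin wrapper per backend is enough to keep the rest of the app backend-agnostic. This is my own illustrative sketch of that idea, not the post's code; `Tokenizer.encode(...).tokens` is the real `tokenizers` call shape:

```python
class TiktokenBackend:
    """Wraps a tiktoken Encoding; decodes each token id individually."""

    def __init__(self, encoding):
        self._enc = encoding  # a tiktoken Encoding instance

    def pieces(self, text: str) -> list[str]:
        return [
            self._enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
            for t in self._enc.encode(text)
        ]

class HuggingFaceBackend:
    """Wraps a tokenizers.Tokenizer; its Encoding already exposes .tokens."""

    def __init__(self, tokenizer):
        self._tok = tokenizer  # a tokenizers.Tokenizer instance

    def pieces(self, text: str) -> list[str]:
        return self._tok.encode(text).tokens

# Either backend can now feed the same colouring/statistics code via .pieces().
```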

 ## Conclusion

4 binary image files changed (35 KB, 12.6 KB, 103 KB, 23.6 KB); previews not shown.

_uploads/tui_moving_recolour.svg

Lines changed: 276 additions & 0 deletions
