---
title: Token Prism
date: 2025-12-15 15:11:00 Z
categories:
- Artificial Intelligence
summary: Visualising the hidden building blocks of LLM text
author: jstrong
---
# Token Prism

When working with agentic AI, tool calls and their results abound, and the tokens mount up quickly. As a result, I wanted to visualise the token output of [OpenAI](https://openai.com/) models. OpenAI already provides a [tokeniser website](https://platform.openai.com/tokenizer) for its models, but due to the sensitivity of the data I work with, using it on anything more than toy data would be inappropriate. Consequently, I set out to make my own, more secure, offline solution.
## Background
The eagle-eyed among you will notice the answer is 3. However, even the most advanced LLMs of the day regularly claim otherwise, and this is almost entirely down to tokenisation. For example, GPT-4 does not see 'strawberry' as 'S-T-R-A-W-B-E-R-R-Y', but instead as 'STR-AW-BERRY'.

It cannot 'see' the letters individually, so it is difficult for it to count them correctly.
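To make the effect concrete, here is a toy greedy longest-match tokeniser over a made-up three-entry vocabulary. Real BPE tokenisers build their vocabularies statistically from merge rules, so GPT-4's actual segmentation process differs, but the end result is similar: multi-character chunks rather than letters.

```python
def greedy_tokenise(text, vocab):
    """Greedily match the longest vocabulary entry at each position.

    Real BPE works by iteratively merging frequent pairs, but the outcome
    is comparable: the model receives chunks, not individual characters.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no match: fall back to one character
            i += 1
    return tokens

vocab = {"str", "aw", "berry"}  # hypothetical vocabulary entries
print(greedy_tokenise("strawberry", vocab))  # → ['str', 'aw', 'berry']
```

From the model's point of view, the three 'r's are spread invisibly across two opaque chunks, which is why letter-counting questions go wrong.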
## How
[`tiktoken`](https://github.com/openai/tiktoken) is the Python package that provides access to the tokenisers for OpenAI models such as GPT-4o, GPT-5, and so on. However, it is purely a library for encoding and decoding programmatically; its output is not readily human-readable. I therefore chose Python as the language for my application, wrapping the backend logic of `tiktoken` in a visual interface better suited to analysis by a person.
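The core of such a wrapper is small. As a sketch (the function name and structure here are mine, not the tool's actual code), each token id is decoded on its own so the token boundaries become visible:

```python
def split_tokens(text, encoding):
    """Decode each token id individually so token boundaries are visible.

    `encoding` is anything with tiktoken-style encode()/decode() methods,
    e.g. the object returned by tiktoken.encoding_for_model("gpt-4o").
    """
    return [encoding.decode([tid]) for tid in encoding.encode(text)]
```

Calling `encoding.decode` on a whole id list would reassemble the original string; decoding one id at a time is what exposes the individual chunks.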
### Separating
With the tokens separated and decoded, I applied a colour cycle to the output. The resulting CLI looks like this:
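A minimal version of that colour cycle can be sketched with ANSI escape codes; the specific colours below are illustrative, not the tool's actual palette:

```python
# Hypothetical sketch: cycle ANSI background colours over decoded tokens
# so adjacent tokens are visually distinct in the terminal.
ANSI_BACKGROUNDS = [41, 42, 44, 45]  # red, green, blue, magenta

def colourise(tokens):
    out = []
    for i, tok in enumerate(tokens):
        code = ANSI_BACKGROUNDS[i % len(ANSI_BACKGROUNDS)]
        out.append(f"\033[{code}m{tok}\033[0m")  # colour on, token, reset
    return "".join(out)

print(colourise(["STR", "AW", "BERRY"]))
```

Resetting with `\033[0m` after every token keeps a stray colour from bleeding into the rest of the terminal session.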

This works as I had envisioned, so it was time to move on to the reactive aspect. I decided to go with the [`textual`](https://github.com/Textualize/textual) TUI package to facilitate this. Its API was straightforward and easy to use, and now when you pass `-i` or `--interactive` you see:
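A hypothetical `click` entry point with such a flag might look like the following; the command name and the dispatch bodies are assumptions for illustration, not the tool's real code:

```python
import click

@click.command()
@click.argument("text")
@click.option("-i", "--interactive", is_flag=True,
              help="Launch the reactive TUI instead of printing once.")
def cli(text, interactive):
    # Hypothetical dispatch: hand off to the TUI when the flag is set,
    # otherwise print the colourised tokens once and exit.
    if interactive:
        click.echo(f"[interactive mode] {text}")
    else:
        click.echo(f"[one-shot mode] {text}")

if __name__ == "__main__":
    cli()
```

Using `is_flag=True` is what makes `-i` a boolean switch rather than an option that expects a value.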

At present, only three 'statistics' are displayed, but I have plans to add more that would aid in analysing tokenised input.
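The post does not name its three statistics, but counts along the following lines are cheap to compute once the tokens are in hand; the specific choices below are purely illustrative:

```python
def token_stats(text, tokens):
    """Illustrative per-input statistics (not necessarily the tool's three)."""
    return {
        "token_count": len(tokens),
        "char_count": len(text),
        # Average characters per token; higher means denser encoding.
        "chars_per_token": len(text) / len(tokens) if tokens else 0.0,
    }
```

Characters-per-token is a handy density measure: English prose with a matching tokeniser tends to sit well above one character per token, while unusual text falls towards single-character tokens.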
## Extension
With these features, the application had reached MVP status. However, I saw an avenue for improving its capabilities: supporting any tokeniser available from [HuggingFace](https://huggingface.co/). The change to allow this was small, given that the API of the [`tokenizers`](https://github.com/huggingface/tokenizers) library is relatively similar to that of `tiktoken`. It expanded the application's horizons massively, making it possible to see how thousands of open-source models approach tokenisation, often very differently from OpenAI:
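Supporting both backends behind one interface is straightforward because the APIs are close. A hypothetical adapter (mine, not the tool's actual code) needs to normalise only one visible difference: tiktoken's `encode` returns a list of ids directly, while a HuggingFace `Tokenizer.encode` returns an `Encoding` object carrying the ids in its `.ids` attribute:

```python
class UnifiedTokeniser:
    """Hypothetical adapter over tiktoken- and HuggingFace-style backends."""

    def __init__(self, backend):
        self.backend = backend

    def token_ids(self, text):
        result = self.backend.encode(text)
        # tiktoken returns list[int]; HuggingFace returns an Encoding
        # whose integer ids live in the .ids attribute.
        return result if isinstance(result, list) else result.ids

    def token_strings(self, text):
        # Both libraries decode a list of ids back to text, so decoding
        # one id at a time exposes the individual tokens.
        return [self.backend.decode([tid]) for tid in self.token_ids(text)]
```

With the ids normalised, the rest of the application (colouring, statistics, the TUI) never needs to know which library produced them.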
