
Commit fbfd55d

jackspirou and claude committed
fix: correct .gitignore to only exclude root tokenizer binary
- Change `tokenizer` to `/tokenizer` in .gitignore
- This prevents the cmd/tokenizer directory from being ignored
- Add previously ignored cmd/tokenizer files to git
- Fixes GoReleaser "couldn't find main file" error

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 34cc921 commit fbfd55d

File tree

7 files changed: +267 −2 lines changed


.github/workflows/release.yaml

Lines changed: 11 additions & 1 deletion

```diff
@@ -36,12 +36,22 @@ jobs:
         with:
           go-version: '1.24'

+      - name: Debug - List directory structure
+        run: |
+          echo "Current directory: $(pwd)"
+          echo "Directory contents:"
+          ls -la
+          echo "cmd directory:"
+          ls -la cmd/
+          echo "cmd/tokenizer directory:"
+          ls -la cmd/tokenizer/
+
       - name: Run GoReleaser
         uses: goreleaser/goreleaser-action@v6
         with:
           distribution: goreleaser
           version: '~> v2'
-          args: release --clean
+          args: release --clean --verbose
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

.gitignore

Lines changed: 1 addition & 1 deletion

```diff
@@ -143,7 +143,7 @@ build/
 dist/
 bin/
 cmd/example/example
-tokenizer
+/tokenizer

 # GoReleaser
 dist/
```
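The one-character fix works because a leading slash anchors a .gitignore pattern to the repository root, while an unanchored pattern matches a file or directory of that name at any depth (so bare `tokenizer` also ignored the `cmd/tokenizer` directory and everything under it). A quick sketch in a throwaway repo (directory names here are made up) shows the difference with `git check-ignore`, which prints a path and exits 0 when the path is ignored:

```shell
# Sketch: anchored vs. unanchored ignore patterns in a throwaway repo.
cd "$(mktemp -d)"
git init -q ignore-demo && cd ignore-demo
mkdir -p cmd/tokenizer
touch tokenizer cmd/tokenizer/main.go

# Unanchored pattern: also matches the cmd/tokenizer directory.
echo 'tokenizer' > .gitignore
git check-ignore cmd/tokenizer/main.go && echo "-> source ignored (the bug)"

# Anchored pattern: matches only the root-level binary.
echo '/tokenizer' > .gitignore
git check-ignore tokenizer && echo "-> root binary ignored"
git check-ignore cmd/tokenizer/main.go || echo "-> source tracked (the fix)"
```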

cmd/tokenizer/README.md

Lines changed: 162 additions & 0 deletions

# Tokenizer CLI

A command-line interface for tokenizing text using various language model tokenizers.

## Installation

```bash
go install github.com/agentstation/tokenizer/cmd/tokenizer@latest
```

Or build from source:

```bash
go build -o tokenizer ./cmd/tokenizer
```

## Usage

The tokenizer CLI uses a subcommand structure where each tokenizer implementation is a subcommand.

### Basic Commands

```bash
# Encode text to token IDs
tokenizer llama3 encode "Hello, world!"
# Output: 128000 9906 11 1917 0 128001

# Decode token IDs back to text
tokenizer llama3 decode 128000 9906 11 1917 0 128001
# Output: <|begin_of_text|>Hello, world!<|end_of_text|>

# Get tokenizer information
tokenizer llama3 info
```

### Encoding Options

```bash
# Encode without special tokens
tokenizer llama3 encode --bos=false --eos=false "Hello, world!"
# Output: 9906 11 1917 0

# Different output formats
tokenizer llama3 encode -o json "Hello, world!"
# Output: [128000,9906,11,1917,0,128001]

tokenizer llama3 encode -o newline "Hello, world!"
# Output: (one token per line)
# 128000
# 9906
# 11
# 1917
# 0
# 128001
```

### Piping and Streaming

```bash
# Pipe text to encode
echo "Hello, world!" | tokenizer llama3 encode

# Pipe tokens to decode
echo "128000 9906 11 1917 0 128001" | tokenizer llama3 decode

# Round-trip encoding and decoding
tokenizer llama3 encode "test" | tokenizer llama3 decode

# Stream large files
cat large_file.txt | tokenizer llama3 stream
```

### Streaming Mode

For processing large files or real-time input:

```bash
# Basic streaming
tokenizer llama3 stream < input.txt

# Custom buffer settings
tokenizer llama3 stream --buffer-size=8192 --max-buffer=2097152 < large_file.txt

# Stream without special tokens
tokenizer llama3 stream --bos=false --eos=false < input.txt
```

## Available Tokenizers

### llama3

Meta's Llama 3 tokenizer with 128,256 tokens (128,000 regular + 256 special tokens).

**Commands:**
- `encode` - Convert text to token IDs
- `decode` - Convert token IDs to text
- `stream` - Process text in streaming mode
- `info` - Display tokenizer information

## Examples

### Tokenize a file

```bash
# Tokenize entire file
tokenizer llama3 encode < document.txt > tokens.txt

# Count tokens in a file
tokenizer llama3 encode < document.txt | wc -w
```

### Batch processing

```bash
# Process multiple files
for file in *.txt; do
  echo "Tokenizing $file..."
  tokenizer llama3 encode < "$file" > "${file%.txt}.tokens"
done
```

### Integration with other tools

```bash
# Use with jq for JSON processing
tokenizer llama3 encode -o json "Hello" | jq length

# Extract specific tokens
tokenizer llama3 encode "Hello, world!" | awk '{print $2}'
```

## Future Tokenizers

The CLI is designed to support multiple tokenizers. Future additions may include:
- GPT-2/GPT-3 tokenizers
- BERT tokenizer
- SentencePiece tokenizers
- Custom tokenizers

Each tokenizer will follow the same subcommand pattern:

```bash
tokenizer [tokenizer-name] [command] [options]
```

<!-- gomarkdoc:embed:start -->

<!-- Code generated by gomarkdoc. DO NOT EDIT -->

# tokenizer

```go
import "github.com/agentstation/tokenizer/cmd/tokenizer"
```

## Index

Generated by [gomarkdoc](<https://github.com/princjef/gomarkdoc>)

<!-- gomarkdoc:embed:end -->

cmd/tokenizer/generate.go

Lines changed: 3 additions & 0 deletions

```go
package main

//go:generate gomarkdoc -o README.md -e . --embed --repository.url https://github.com/agentstation/tokenizer --repository.default-branch master --repository.path /cmd/tokenizer
```

cmd/tokenizer/main.go

Lines changed: 21 additions & 0 deletions

```go
package main

import (
	"fmt"
	"os"
)

var (
	// Version information (set by build flags).
	version   = "dev"
	commit    = "none"
	buildDate = "unknown"
	goVersion = "unknown"
)

func main() {
	if err := rootCmd.Execute(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

cmd/tokenizer/root.go

Lines changed: 69 additions & 0 deletions

```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"

	llama3cmd "github.com/agentstation/tokenizer/llama3/cmd/llama3"
)

// rootCmd represents the base command when called without any subcommands.
var rootCmd = &cobra.Command{
	Use:   "tokenizer",
	Short: "A multi-model tokenizer CLI tool",
	Long: `Tokenizer is a CLI tool for tokenizing text using various language models.

This tool provides a unified interface for working with different tokenizer
implementations. Each tokenizer is available as a subcommand with its own
set of operations.

Currently supported tokenizers:
  - llama3: Meta's Llama 3 tokenizer (128k vocabulary, byte-level BPE)

Common operations available for tokenizers:
  - encode: Convert text to token IDs
  - decode: Convert token IDs back to text
  - stream: Process large files in streaming mode
  - info: Display tokenizer information`,
	Example: `  # Encode text with Llama 3
  tokenizer llama3 encode "Hello, world!"

  # Decode tokens
  tokenizer llama3 decode 1234 5678

  # Stream a large file
  cat large_file.txt | tokenizer llama3 stream

  # Get tokenizer info
  tokenizer llama3 info`,
	SilenceUsage: true,
}

// versionCmd represents the version command.
var versionCmd = &cobra.Command{
	Use:   "version",
	Short: "Print version information",
	Run: func(cmd *cobra.Command, args []string) {
		fmt.Printf("tokenizer version %s\n", version)
		if commit != "none" {
			fmt.Printf("  commit: %s\n", commit)
		}
		if buildDate != "unknown" {
			fmt.Printf("  built: %s\n", buildDate)
		}
		if goVersion != "unknown" {
			fmt.Printf("  go version: %s\n", goVersion)
		}
	},
}

func init() {
	// Register commands
	rootCmd.AddCommand(versionCmd)
	rootCmd.AddCommand(llama3cmd.Command())

	// Future tokenizers can be added here:
	// rootCmd.AddCommand(gpt2cmd.Command())
	// rootCmd.AddCommand(bertcmd.Command())
}
```

test-tokenizer

-6.77 MB (binary file not shown)
