| 1 | +# CLAUDE.md |
| 2 | + |
| 3 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 4 | + |
| 5 | +## Project Overview |
| 6 | + |
| 7 | +High-performance .NET port of OpenAI's [tiktoken](https://github.com/openai/tiktoken) tokenizer, optimized for token counting speed. Published as [Tiktoken](https://www.nuget.org/packages/Tiktoken/) on NuGet. |
| 8 | + |
| 9 | +## Build Commands |
| 10 | + |
| 11 | +```bash |
| 12 | +# Build the solution |
| 13 | +dotnet build Tiktoken.sln |
| 14 | + |
| 15 | +# Build for release |
| 16 | +dotnet build Tiktoken.sln -c Release |
| 17 | + |
| 18 | +# Run unit tests |
| 19 | +dotnet test src/tests/Tiktoken.UnitTests/Tiktoken.UnitTests.csproj |
| 20 | + |
| 21 | +# Run all tests |
| 22 | +dotnet test Tiktoken.sln |
| 23 | + |
| 24 | +# Run benchmarks |
| 25 | +dotnet run -c Release --project src/benchmarks/Tiktoken.Benchmarks/Tiktoken.Benchmarks.csproj |
| 26 | +``` |
| 27 | + |
| 28 | +## Architecture |
| 29 | + |
| 30 | +### Project Layout |
| 31 | + |
| 32 | +| Project | Purpose | |
| 33 | +|---------|---------| |
| 34 | +| `src/libs/Tiktoken/` | Main convenience library -- bundles Core + cl100k + o200k encodings | |
| 35 | +| `src/libs/Tiktoken.Core/` | Core tokenizer engine (`Encoder`, `ModelToEncoder`, BPE logic) | |
| 36 | +| `src/libs/Tiktoken.Encodings.Abstractions/` | Base types for encoding definitions | |
| 37 | +| `src/libs/Tiktoken.Encodings.cl100k/` | `cl100k_base` encoding (GPT-3.5/GPT-4) | |
| 38 | +| `src/libs/Tiktoken.Encodings.o200k/` | `o200k_base` encoding (GPT-4o) | |
| 39 | +| `src/libs/Tiktoken.Encodings.p50k/` | `p50k_base` / `p50k_edit` encodings | |
| 40 | +| `src/libs/Tiktoken.Encodings.r50k/` | `r50k_base` encoding | |
| 41 | +| `src/tests/Tiktoken.UnitTests/` | Unit tests (MSTest + FluentAssertions + Verify) | |
| 42 | +| `src/benchmarks/Tiktoken.Benchmarks/` | BenchmarkDotNet performance benchmarks | |
| 43 | +| `benchmarks/` | Historical benchmark result reports (Markdown) | |
| 44 | + |
| 45 | +### Supported Encodings |
| 46 | + |
| 47 | +- `o200k_base` -- GPT-4o models |
| 48 | +- `cl100k_base` -- GPT-3.5-turbo, GPT-4 models |
| 49 | +- `r50k_base` -- older GPT-3 models |
| 50 | +- `p50k_base` / `p50k_edit` -- Codex models |
| 51 | + |
| 52 | +### Key API |
| 53 | + |
| 54 | +```csharp |
| 55 | +var encoder = ModelToEncoder.For("gpt-4o"); |
| 56 | +var tokens = encoder.Encode("hello world"); // [24912, 2375] |
| 57 | +var text = encoder.Decode(tokens); // "hello world" |
| 58 | +var count = encoder.CountTokens(text); // 2 |
| 59 | +var parts = encoder.Explore(text); // ["hello", " world"] |
| 60 | +``` |
| 61 | + |
| 62 | +### Build Configuration |
| 63 | + |
| 64 | +- **Target frameworks:** `net4.6.2`, `netstandard2.0`, `netstandard2.1`, `net8.0`, `net9.0` |
| 65 | +- **Language:** C# with nullable reference types |
| 66 | +- **Unsafe code:** Enabled in Core for performance |
| 67 | +- **Encoding data:** Embedded as `.tiktoken` resources in `Tiktoken.Core/Encodings/` |
| 68 | +- **Versioning:** Semantic versioning from git tags via MinVer |
| 69 | +- **Testing:** MSTest + FluentAssertions + Verify |
| 70 | + |
| 71 | +### CI/CD |
| 72 | + |
| 73 | +- Uses shared workflows from the `HavenDV/workflows` repo |
| 74 | +- Dependabot updates NuGet packages |