Commit 6ff5929

feat: cleanup, and publishing the 0.1.0 version
1 parent 006634d commit 6ff5929

17 files changed: +65438 -59 lines changed

.github/workflows/release.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+name: release
+on:
+  push:
+    tags:
+      - "v*"
+permissions:
+  contents: read
+  id-token: write
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    environment: release
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v4
+      - run: uv build --no-sources
+      - run: uv publish # No token needed!
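
The workflow above runs only when a pushed tag matches the filter `"v*"`. As a rough illustration of which tag names would qualify, here is a minimal Python sketch. It is an approximation, not GitHub's implementation: the helper name `triggers_release` is hypothetical, and the explicit slash check is an assumption added because `fnmatch`'s `*` matches `/` while GitHub's `*` does not.

```python
from fnmatch import fnmatchcase


def triggers_release(tag: str, pattern: str = "v*") -> bool:
    """Approximate the workflow's tag filter for a single pattern.

    Hypothetical helper for illustration only. GitHub's `*` matches zero or
    more characters but not `/`, so tags containing a slash are rejected
    explicitly before falling back to fnmatch.
    """
    return "/" not in tag and fnmatchcase(tag, pattern)


if __name__ == "__main__":
    for tag in ["v0.1.0", "0.1.0", "release/v0.1.0"]:
        print(tag, triggers_release(tag))
```

Under this sketch a tag such as `v0.1.0` would start the publish job, while an untagged push or a tag without the `v` prefix would not.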

.gitignore

Lines changed: 33 additions & 57 deletions
@@ -1,6 +1,6 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
-*.py[codz]
+*.py[cod]
 *$py.class
 
 # C extensions
@@ -46,7 +46,7 @@ htmlcov/
 nosetests.xml
 coverage.xml
 *.cover
-*.py.cover
+*.py,cover
 .hypothesis/
 .pytest_cache/
 cover/
@@ -94,35 +94,19 @@ ipython_config.py
 # install all needed dependencies.
 #Pipfile.lock
 
-# UV
-# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
-# This is especially recommended for binary packages to ensure reproducibility, and is more
-# commonly ignored for libraries.
-#uv.lock
-
 # poetry
 # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 # This is especially recommended for binary packages to ensure reproducibility, and is more
 # commonly ignored for libraries.
 # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
-#poetry.toml
 
 # pdm
 # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
-# https://pdm-project.org/en/latest/usage/project/#working-with-version-control
-#pdm.lock
-#pdm.toml
-.pdm-python
-.pdm-build/
-
-# pixi
-# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
-#pixi.lock
-# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
-# in the .venv directory. It is recommended not to include this directory in version control.
-.pixi
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/#use-with-ide
+.pdm.toml
 
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
@@ -136,7 +120,6 @@ celerybeat.pid
 
 # Environments
 .env
-.envrc
 .venv
 env/
 venv/
@@ -170,38 +153,31 @@ cython_debug/
 
 # PyCharm
 # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-# and can be added to the global gitignore or merged into this file. For a more nuclear
-# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
-
-# Abstra
-# Abstra is an AI-powered process automation framework.
-# Ignore directories containing user credentials, local state, and settings.
-# Learn more at https://abstra.io/docs
-.abstra/
-
-# Visual Studio Code
-# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore
-# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
-# and can be added to the global gitignore or merged into this file. However, if you prefer,
-# you could uncomment the following to ignore the entire vscode folder
-# .vscode/
-
-# Ruff stuff:
-.ruff_cache/
-
-# PyPI configuration file
+# be added to the global gitignore or merged into this project gitignore. For a PyCharm
+# project, it is recommended to include the following files:
+# .idea/
+# *.iml
+# *.ipr
+# *.iws
+
+# VS Code
+.vscode/
+
+# macOS
+.DS_Store
+
+# Windows
+Thumbs.db
+ehthumbs.db
+Desktop.ini
+
+# Project specific
+*.model
+*.vocab
+*.json
+!data/*.model
+!data/*.vocab
+!data/*.json
+
+# PyPI configuration (contains tokens)
 .pypirc
-
-# Cursor
-# Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
-# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
-# refer to https://docs.cursor.com/context/ignore-files
-.cursorignore
-.cursorindexingignore
-
-# Marimo
-marimo/_static/
-marimo/_lsp/
-__marimo__/
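
The "Project specific" block added in this diff ignores every `*.model`, `*.vocab`, and `*.json` file and then re-includes the ones directly under `data/` with `!` patterns. The following Python sketch illustrates the "last matching pattern wins" rule that makes this work. It is a deliberately simplified model, not git's matcher: it assumes no `**`, no directory-only patterns, and no ignored parent directories, and the helper names and example paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns copied from the "Project specific" block of the .gitignore diff.
PATTERNS = [
    "*.model",
    "*.vocab",
    "*.json",
    "!data/*.model",
    "!data/*.vocab",
    "!data/*.json",
]


def matches(pattern: str, path: str) -> bool:
    """Simplified gitignore match (assumption: no `**`, no trailing `/`)."""
    if "/" in pattern:
        # A pattern containing a slash is anchored at the repository root.
        # Compare segment by segment so `*` never crosses a `/`.
        p_parts, f_parts = pattern.split("/"), path.split("/")
        return len(p_parts) == len(f_parts) and all(
            fnmatchcase(f, p) for p, f in zip(p_parts, f_parts)
        )
    # A pattern without a slash is tested against the file name at any depth.
    return fnmatchcase(path.split("/")[-1], pattern)


def is_ignored(path: str, patterns=PATTERNS) -> bool:
    """Apply patterns in order; the last one that matches decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        if matches(pattern[1:] if negated else pattern, path):
            ignored = not negated
    return ignored


if __name__ == "__main__":
    for path in ["scratch/vocab.json", "data/vocab.json", "data/sub/x.json"]:
        print(path, is_ignored(path))
```

In this model, files directly inside `data/` stay tracked, while the same extensions anywhere else, including deeper subdirectories of `data/`, remain ignored.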

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+3.11

CHANGELOG.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+# changelog
+
+all notable changes to this project will be documented in this file.
+
+## [unreleased]
+
+### added
+- cli interface with rich output formatting
+- utility functions for text validation and vocab management
+- proper package structure with src layout
+- mit license
+- comprehensive documentation
+
+### changed
+- reorganized code into proper python package structure
+- updated build system to use hatchling
+- improved error handling and validation
+
+### fixed
+- package structure issues preventing proper builds
+- model path resolution for bundled data files
+
+## [0.1.0] - 2025-08-22
+
+### added
+- initial release of burmese tokenizer
+- sentencepiece-based tokenization for burmese text
+- pre-trained model for burmese language
+- basic encode/decode functionality
+- vocab management features
+
+## [0.1.1] - 2025-08-22
+
+### added
+- cleaned up

LICENSE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 janakhpon
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md

Lines changed: 62 additions & 2 deletions
@@ -1,2 +1,62 @@
-# burmese_tokenizer
-A tokenizer trained specifically for Burmese language using Unigram LM.
+# Burmese Tokenizer
+
+Tokenize Burmese text like a pro. No fancy stuff, just gets the job done.
+
+## Quick Start
+
+```bash
+# Using pip
+pip install burmese-tokenizer
+
+# Using uv (faster)
+uv add burmese-tokenizer
+```
+
+```python
+from burmese_tokenizer import BurmeseTokenizer
+
+tokenizer = BurmeseTokenizer()
+text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"
+
+# Tokenize
+result = tokenizer.encode(text)
+print(result["pieces"]) # ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']
+
+# Decode
+decoded = tokenizer.decode(result["pieces"])
+print(decoded) # မင်္ဂလာပါ။ နေကောင်းပါသလား။
+```
+
+## CLI
+
+```bash
+# Tokenize
+burmese-tokenizer "မင်္ဂလာပါ။"
+
+# Verbose mode (shows all the details)
+burmese-tokenizer -v "မင်္ဂလာပါ။"
+
+# Decode tokens back to text
+burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
+```
+
+## API
+
+- `encode(text)` - Chop text into tokens
+- `decode(pieces)` - Glue tokens back together
+- `decode_ids(ids)` - Convert IDs back to text
+- `get_vocab_size()` - How many tokens we know
+- `get_vocab()` - The whole vocabulary
+
+## Dev Setup
+
+```bash
+git clone git@github.com:Code-Yay-Mal/burmese_tokenizer.git
+cd burmese_tokenizer
+uv sync --dev
+uv run pytest
+```
+
+## License
+
+MIT - do whatever you want with it.
