Code-Yay-Mal
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
Lines changed: 17 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 33 additions & 57 deletions
diff --git a/.python-version b/.python-version
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 35 additions & 0 deletions
diff --git a/LICENSE b/LICENSE
Lines changed: 21 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 62 additions & 2 deletions
@@ -0,0 +1,17 @@
+name: release
+on:
+  push:
+    tags:
+      - "v*"
+permissions:
+  contents: read
+  id-token: write
+jobs:
+  publish:
+    runs-on: ubuntu-latest
+    environment: release
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v4
+      - run: uv build --no-sources
+      - run: uv publish  # Trusted publishing via OIDC (id-token: write), so no API token is needed
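A note on the `v*` tag filter in the release workflow above: in GitHub Actions filter patterns, `*` matches zero or more characters but does not match `/`, so the filter fires on any tag that starts with `v`, not only on version tags. The sketch below approximates that behaviour for patterns made of literal text and `*` only. The helper name `matches_simple_filter` and the sample tags are illustrative assumptions, and the rest of the filter syntax (`**`, `?`, `+`, `[]`, `!`) is deliberately left out.

```python
import re


def matches_simple_filter(ref_name: str, pattern: str) -> bool:
    """Approximate a tag filter built only from literal text and `*`.

    Assumption: `*` matches zero or more characters except `/`.
    Other filter syntax is not handled by this sketch.
    """
    # Translate each `*` to "any run of non-slash characters"; escape the rest.
    regex = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, ref_name) is not None


# Hypothetical tag names, checked against the workflow's "v*" filter.
for tag in ["v0.1.1", "v1", "vendor-update", "0.1.1", "release/v1"]:
    print(f"{tag!r:18} -> {matches_simple_filter(tag, 'v*')}")
```

Under these assumptions a tag such as `vendor-update` would also start the publish job, so a stricter pattern may be worth considering if non-release tags beginning with `v` are ever pushed.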
@@ -1,6 +1,6 @@
 # Byte-compiled / optimized / DLL files
 __pycache__/
-*.py[codz]
+*.py[cod]
 *$py.class
 
 # C extensions
@@ -46,7 +46,7 @@ htmlcov/
 nosetests.xml
 coverage.xml
 *.cover
-*.py.cover
+*.py,cover
 .hypothesis/
 .pytest_cache/
 cover/
@@ -94,35 +94,19 @@ ipython_config.py
 #   install all needed dependencies.
 #Pipfile.lock
 
-# UV
-#   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
-#   This is especially recommended for binary packages to ensure reproducibility, and is more
-#   commonly ignored for libraries.
-#uv.lock
-
 # poetry
 #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
 #   This is especially recommended for binary packages to ensure reproducibility, and is more
 #   commonly ignored for libraries.
 #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
 #poetry.lock
-#poetry.toml
 
 # pdm
 #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-#   pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python.
-#   https://pdm-project.org/en/latest/usage/project/#working-with-version-control
-#pdm.lock
-#pdm.toml
-.pdm-python
-.pdm-build/
-
-# pixi
-#   Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control.
-#pixi.lock
-#   Pixi creates a virtual environment in the .pixi directory, just like venv module creates one
-#   in the .venv directory. It is recommended not to include this directory in version control.
-.pixi
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
 
 # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
 __pypackages__/
@@ -136,7 +120,6 @@ celerybeat.pid
 
 # Environments
 .env
-.envrc
 .venv
 env/
 venv/
@@ -170,38 +153,31 @@ cython_debug/
 
 # PyCharm
 #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-#  and can be added to the global gitignore or merged into this file.  For a more nuclear
-#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
-
-# Abstra
-# Abstra is an AI-powered process automation framework.
-# Ignore directories containing user credentials, local state, and settings.
-# Learn more at https://abstra.io/docs
-.abstra/
-
-# Visual Studio Code
-#  Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore 
-#  that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore
-#  and can be added to the global gitignore or merged into this file. However, if you prefer, 
-#  you could uncomment the following to ignore the entire vscode folder
-# .vscode/
-
-# Ruff stuff:
-.ruff_cache/
-
-# PyPI configuration file
+#  be added to the global gitignore or merged into this project gitignore.  For a PyCharm
+#  project, it is recommended to include the following files:
+#  .idea/
+#  *.iml
+#  *.ipr
+#  *.iws
+
+# VS Code
+.vscode/
+
+# macOS
+.DS_Store
+
+# Windows
+Thumbs.db
+ehthumbs.db
+Desktop.ini
+
+# Project specific
+*.model
+*.vocab
+*.json
+!data/*.model
+!data/*.vocab
+!data/*.json
+
+# PyPI configuration (contains tokens)
 .pypirc
-
-# Cursor
-#  Cursor is an AI-powered code editor. `.cursorignore` specifies files/directories to
-#  exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
-#  refer to https://docs.cursor.com/context/ignore-files
-.cursorignore
-.cursorindexingignore
-
-# Marimo
-marimo/_static/
-marimo/_lsp/
-__marimo__/
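On the "Project specific" rules added to `.gitignore` above: the three ignore patterns (`*.model`, `*.vocab`, `*.json`) apply at any depth, while the three negations (`!data/*.model` and friends) contain a slash and are therefore anchored to the directory that holds the `.gitignore`. The sketch below is a simplified model of that resolution, assuming only `*` wildcards and "last matching rule wins". The function `is_ignored` and every example path are hypothetical, and real gitignore features such as `**`, trailing-slash directory rules, and excluded parent directories are not modelled.

```python
import re
from pathlib import PurePosixPath

# The project-specific rules from the hunk, in file order.
RULES = [
    "*.model", "*.vocab", "*.json",
    "!data/*.model", "!data/*.vocab", "!data/*.json",
]


def _glob_to_regex(glob: str) -> str:
    # `*` matches within a single path segment only.
    return "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in glob)


def is_ignored(path: str, rules=RULES) -> bool:
    """Simplified gitignore check: the last matching rule decides."""
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        glob = rule[1:] if negated else rule
        # A pattern with a slash is anchored to the .gitignore directory;
        # a pattern without one is matched against the file name at any depth.
        target = path if "/" in glob else PurePosixPath(path).name
        if re.fullmatch(_glob_to_regex(glob), target):
            ignored = not negated
    return ignored


# Hypothetical paths, relative to the repository root.
for p in ["tok.model", "data/tok.model",
          "src/burmese_tokenizer/data/tok.model", "README.md"]:
    print(f"{p:40} ignored={is_ignored(p)}")
```

In this model a file under a nested `data/` directory (the `src/...` path above is only an example) stays ignored, because the negations match `data/` at the repository root only. If the bundled model files live deeper in the tree, they would need their own negation rule.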
@@ -0,0 +1 @@
+3.11
@@ -0,0 +1,35 @@
+# changelog
+
+all notable changes to this project will be documented in this file.
+
+## [unreleased]
+
+### added
+- cli interface with rich output formatting
+- utility functions for text validation and vocab management
+- proper package structure with src layout
+- mit license
+- comprehensive documentation
+
+### changed
+- reorganized code into proper python package structure
+- updated build system to use hatchling
+- improved error handling and validation
+
+### fixed
+- package structure issues preventing proper builds
+- model path resolution for bundled data files
+
+## [0.1.1] - 2025-08-22
+
+### changed
+- cleaned up
+
+## [0.1.0] - 2025-08-22
+
+### added
+- initial release of burmese tokenizer
+- sentencepiece-based tokenization for burmese text
+- pre-trained model for burmese language
+- basic encode/decode functionality
+- vocab management features
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2024 janakhpon
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
@@ -1,2 +1,62 @@
-# burmese_tokenizer
-A tokenizer trained specifically for Burmese language using Unigram LM.
+# Burmese Tokenizer
+
+Tokenize Burmese text like a pro. No fancy stuff, just gets the job done.
+
+## Quick Start
+
+```bash
+# Using pip
+pip install burmese-tokenizer
+
+# Using uv (faster)
+uv add burmese-tokenizer
+```
+
+```python
+from burmese_tokenizer import BurmeseTokenizer
+
+tokenizer = BurmeseTokenizer()
+text = "မင်္ဂလာပါ။ နေကောင်းပါသလား။"
+
+# Tokenize
+result = tokenizer.encode(text)
+print(result["pieces"])  # ['▁မင်္ဂလာ', '▁ပါ', '။', '▁နေ', '▁ကောင်း', '▁ပါ', '▁သလား', '။']
+
+# Decode
+decoded = tokenizer.decode(result["pieces"])
+print(decoded)  # မင်္ဂလာပါ။ နေကောင်းပါသလား။
+```
+
+## CLI
+
+```bash
+# Tokenize
+burmese-tokenizer "မင်္ဂလာပါ။"
+
+# Verbose mode (shows all the details)
+burmese-tokenizer -v "မင်္ဂလာပါ။"
+
+# Decode tokens back to text
+burmese-tokenizer -d -t "▁မင်္ဂလာ,▁ပါ,။"
+```
+
+## API
+
+- `encode(text)` - Chop text into tokens (returns a dict; the token strings are under `"pieces"`)
+- `decode(pieces)` - Glue tokens back together
+- `decode_ids(ids)` - Convert IDs back to text
+- `get_vocab_size()` - How many tokens we know
+- `get_vocab()` - The whole vocabulary
+
+## Dev Setup
+
+```bash
+git clone git@github.com:Code-Yay-Mal/burmese_tokenizer.git
+cd burmese_tokenizer
+uv sync --dev
+uv run pytest
+```
+
+## License
+
+MIT - do whatever you want with it.
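The README's Quick Start implies a round-trip property: decoding the pieces returned by `encode` should give back the original text. Below is a minimal sketch of a check for that, written against the documented shape only (`encode(text)` returns a dict with a `"pieces"` list, `decode(pieces)` returns a string). `StubTokenizer` is a hypothetical stand-in so the snippet runs without the package or its model installed. It splits on spaces and is not how the real SentencePiece model segments text.

```python
class StubTokenizer:
    """Hypothetical stand-in mirroring the documented encode/decode shape."""

    def encode(self, text: str) -> dict:
        # The real pieces come from the trained model; this stub only splits
        # on spaces and marks each word start with "▁", as in the README output.
        return {"pieces": ["▁" + word for word in text.split(" ")]}

    def decode(self, pieces: list[str]) -> str:
        # Join the pieces and turn the "▁" markers back into spaces.
        return "".join(pieces).replace("▁", " ").strip()


def round_trips(tokenizer, text: str) -> bool:
    """Return True if decode(encode(text)["pieces"]) reproduces text."""
    pieces = tokenizer.encode(text)["pieces"]
    return tokenizer.decode(pieces) == text


print(round_trips(StubTokenizer(), "မင်္ဂလာပါ။ နေကောင်းပါသလား။"))
```

With the package installed, the same `round_trips` helper could presumably be called with `BurmeseTokenizer()` in place of the stub, for example from a pytest case.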