🧩 Chunklet: Multi-strategy, Context-aware, Multilingual Text Chunker


---

Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
Author: speedyk_005
Version: 1.4.0
License: MIT

For a detailed history of changes, please see the Changelog.

🤔 Why Chunklet?

| Feature | Why it’s elite |
| --- | --- |
| ⛓️ Hybrid Mode | Combines token + sentence limits with guaranteed overlap, which is rare even in commercial stacks. |
| 🌐 Multilingual Fallbacks | Pysbd > SentenceSplitter > Regex, with dynamic confidence detection. |
| Clause-Level Overlap | `overlap_percent` operates at the clause level, preserving semantic flow across chunks. |
| Parallel Batch Processing | Efficient parallel processing with ThreadPoolExecutor, optimized for low overhead on small batches. |
| ♻️ LRU Caching | Smart memoization via functools.lru_cache. |
| 🪄 Pluggable Token Counters | Swap in GPT-2, BPE, or your own tokenizer. |
| ✂️ Pluggable Sentence Splitters | Integrate custom splitters for more specific languages. |
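
To make three of the features above concrete (pluggable token counters, LRU caching, and thread-based batch processing), here is a minimal standard-library sketch. It is not Chunklet's implementation or API: `whitespace_counter`, `make_cached`, and `count_batch` are hypothetical names chosen for illustration, and the whitespace counter stands in for a real tokenizer such as a GPT-2 or BPE one.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Callable, List


def whitespace_counter(text: str) -> int:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def make_cached(counter: Callable[[str], int], maxsize: int = 1024) -> Callable[[str], int]:
    """Wrap any token counter in functools.lru_cache so repeated texts are counted once."""
    return lru_cache(maxsize=maxsize)(counter)


def count_batch(texts: List[str], counter: Callable[[str], int], max_workers: int = 4) -> List[int]:
    """Count tokens for a batch of texts with a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(counter, texts))


cached_counter = make_cached(whitespace_counter)
print(count_batch(["a b c", "hello world", "a b c"], cached_counter))  # [3, 2, 3]
```

Because the counter is just a callable taking a string and returning an integer, swapping in another tokenizer only means passing a different function to `make_cached`.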

🧩 Chunking Modes

Pick your flavor:

  • "sentence": chunk by sentence count only (the minimum max_sentences is 1).
  • "token": chunk by token count only (the minimum max_tokens is 10).
  • "hybrid": respects both sentence and token thresholds, with guaranteed overlap. Internally, the system estimates a residual capacity of 0-2 typical clauses per sentence to manage chunk boundaries effectively.
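
The three modes above can be illustrated with a small standard-library sketch. This is a simplified illustration, not Chunklet's actual algorithm: `chunk_text`, `overlap_tail`, and `word_count` are hypothetical names, sentences and clauses are split with naive regexes, the overlap is carried over as a prefix that does not count toward the limits, and the residual-capacity estimate mentioned for hybrid mode is not modeled.

```python
import re
from typing import Callable, List

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # naive sentence boundary
CLAUSE_END = re.compile(r"(?<=[,;:.!?])\s+")  # naive clause boundary


def word_count(text: str) -> int:
    """Hypothetical token counter: whitespace-separated words."""
    return len(text.split())


def overlap_tail(chunk: str, overlap_percent: float) -> str:
    """Return the trailing clauses of a chunk (at least one when overlap is requested)."""
    if overlap_percent <= 0:
        return ""
    clauses = CLAUSE_END.split(chunk)
    keep = max(1, int(len(clauses) * overlap_percent / 100))
    return " ".join(clauses[-keep:])


def chunk_text(
    text: str,
    mode: str = "hybrid",
    max_sentences: int = 3,
    max_tokens: int = 20,
    overlap_percent: float = 0,
    token_counter: Callable[[str], int] = word_count,
) -> List[str]:
    if mode not in ("sentence", "token", "hybrid"):
        raise ValueError(f"unknown mode: {mode!r}")
    limit_sentences = mode in ("sentence", "hybrid")
    limit_tokens = mode in ("token", "hybrid")

    chunks: List[str] = []
    prefix = ""               # clause-level overlap carried from the previous chunk
    current: List[str] = []   # sentences collected for the chunk being built
    tokens = 0
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        n = token_counter(sentence)
        full = (limit_sentences and len(current) >= max_sentences) or (
            limit_tokens and tokens + n > max_tokens
        )
        if current and full:
            body = " ".join(current)
            chunks.append(f"{prefix} {body}".strip())
            prefix = overlap_tail(body, overlap_percent)
            current, tokens = [], 0
        current.append(sentence)
        tokens += n
    if current:
        body = " ".join(current)
        chunks.append(f"{prefix} {body}".strip())
    return chunks


text = "One two three. Four five, six seven. Eight nine ten. Eleven twelve."
for chunk in chunk_text(text, mode="sentence", max_sentences=2, overlap_percent=50):
    print(chunk)
# One two three. Four five, six seven.
# six seven. Eight nine ten. Eleven twelve.
```

In this sketch a single sentence longer than `max_tokens` still becomes its own chunk; splitting oversized sentences would need extra logic.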

🌍 Language Support (36+)

  • Primary (Pysbd): Highly accurate sentence boundary detection for a wide range of languages (e.g., ar, pl, ja, da, zh, hy, my, ur, fr, it, fa, bg, el, mr, ru, nl, es, am, kk, en, hi, de). For more information, see Pysbd on PyPI.
  • Secondary (sentence_splitter): Covers additional languages not handled by Pysbd (e.g., pt, no, cs, sk, lv, ro, ca, sl, sv, fi, lt, tr, hu, is). For more information, see sentence_splitter on GitHub.
  • Fallback (Smart Regex): For any language not explicitly supported by the above, a smart regex-based splitter is used as a reliable fallback.
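
The three-tier selection above can be sketched as a simple dispatch on the language code. This is an assumption-laden illustration rather than Chunklet's code: `pick_splitter` and `regex_split` are hypothetical names, the regex is far simpler than a "smart" splitter, and the primary and secondary splitters are passed in as plain callables (in a real setup they could wrap `pysbd.Segmenter(language=lang).segment` and `SentenceSplitter(language=lang).split`). Language detection with confidence scoring is not modeled.

```python
import re
from typing import Callable, List, Tuple

Splitter = Callable[[str], List[str]]

# Language codes taken from the lists above.
PRIMARY_LANGS = {
    "ar", "pl", "ja", "da", "zh", "hy", "my", "ur", "fr", "it", "fa",
    "bg", "el", "mr", "ru", "nl", "es", "am", "kk", "en", "hi", "de",
}
SECONDARY_LANGS = {
    "pt", "no", "cs", "sk", "lv", "ro", "ca", "sl", "sv", "fi", "lt", "tr", "hu", "is",
}


def regex_split(text: str) -> List[str]:
    """Naive fallback: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pick_splitter(
    lang: str,
    primary: Splitter,
    secondary: Splitter,
    fallback: Splitter = regex_split,
) -> Tuple[str, Splitter]:
    """Return (tier name, splitter) for a language code, falling back to regex."""
    if lang in PRIMARY_LANGS:
        return "primary", primary
    if lang in SECONDARY_LANGS:
        return "secondary", secondary
    return "fallback", fallback


# Stand-in splitters so the sketch runs without third-party packages.
tier, splitter = pick_splitter("sw", primary=regex_split, secondary=regex_split)
print(tier, splitter("Habari. Karibu sana!"))  # fallback ['Habari.', 'Karibu sana!']
```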

📦 Installation

Install chunklet-py easily from PyPI:

pip install chunklet-py

To install from source for development:

git clone https://github.com/Speedyk-005/chunklet-py.git
cd chunklet-py
pip install -e .

Getting Started

See the Getting Started guide for first steps with Chunklet.

For the full documentation, visit our documentation site.

Benchmarks

See BENCHMARKS.md for detailed performance benchmarks, and the benchmark script for the code used to generate them.

Internal Workflow

See the Internal Workflow for a high-level overview of Chunklet's internal processing flow.

Configuration Models

For detailed definitions, refer to the Models documentation.

Troubleshooting & Reference

  • Exceptions and Warnings: Understand the various errors and warnings Chunklet might throw your way, and how to deal with them.

Changelog

See the Changelog for a history of changes.


🧪 Planned Features

  • CLI interface with --file, --mode, --overlap-percent, etc.
  • Document chunking with metadata.
  • Code chunking based on interest points.
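
Since the CLI is only planned, the following is a purely hypothetical sketch of how the listed flags might be parsed with `argparse`. The program name, defaults, and help texts are assumptions made for illustration and do not describe a released interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the planned flags; names beyond the flags are assumptions."""
    parser = argparse.ArgumentParser(prog="chunklet", description="Chunk a text file.")
    parser.add_argument("--file", required=True, help="path to the input text file")
    parser.add_argument(
        "--mode",
        choices=["sentence", "token", "hybrid"],
        default="hybrid",  # assumed default
        help="chunking mode",
    )
    parser.add_argument(
        "--overlap-percent",
        type=float,
        default=0.0,  # assumed default
        help="share of clauses repeated between neighboring chunks",
    )
    return parser


args = build_parser().parse_args(
    ["--file", "notes.txt", "--mode", "token", "--overlap-percent", "20"]
)
print(args.file, args.mode, args.overlap_percent)  # notes.txt token 20.0
```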

💡Projects that inspired me

| Tool | Description |
| --- | --- |
| Semchunk | Semantic-aware chunking using transformer embeddings. |
| CintraAI Code Chunker | AST-based code chunker for intelligent code splitting. |
| semantic-chunker | A strongly-typed semantic text chunking library that intelligently splits content while preserving structure and meaning. |

🤝 Contributing

Want to help make Chunklet even better?

  1. Fork this repo
  2. Create a new feature branch
  3. Code like a star
  4. Submit a pull request

Check out the issues or open a PR!

See our Contributing Guidelines for details.


🙌 Contributors & Thanks

Big thanks to the people who helped shape Chunklet:

  • @jmbernabotto — for spotting lots of bugs 🐞 and even convincing me to add some cool features 🚀

📜 License

See the LICENSE file for full details.

MIT License. Use freely, modify boldly, and credit the legend (me. Just kidding!)
