
Commit c371b11

Manul from Pathway, embe-pw, janchorowski, XGendre, and dxtrous committed
Release 0.7.7
Co-authored-by: Michał Bartoszkiewicz <embe@pathway.com>
Co-authored-by: Jan Chorowski <janek@pathway.com>
Co-authored-by: Xavier Gendre <xavier@pathway.com>
Co-authored-by: Adrian Kosowski <adrian@pathway.com>
Co-authored-by: Jakub Kowalski <kuba@pathway.com>
Co-authored-by: Sergey Kulik <sergey@pathway.com>
Co-authored-by: Mateusz Lewandowski <mateusz@pathway.com>
Co-authored-by: Mohamed Malhou <mohamed@pathway.com>
Co-authored-by: Krzysztof Nowicki <krzysiek@pathway.com>
Co-authored-by: Richard Pelgrim <richard.pelgrim@pathway.com>
Co-authored-by: Kamil Piechowiak <kamil@pathway.com>
Co-authored-by: Paweł Podhajski <pawel.podhajski@pathway.com>
Co-authored-by: Olivier Ruas <olivier@pathway.com>
Co-authored-by: Przemysław Uznański <przemek@pathway.com>
Co-authored-by: Sebastian Włudzik <sebastian.wludzik@pathway.com>
GitOrigin-RevId: 312344420a55f049c50addb049b77b403a5ce194
1 parent 06c1ad0 commit c371b11

File tree

4 files changed: +84, -2 lines changed


CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -5,6 +5,11 @@ All notable changes to this project will be documented in this file.
 This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 ## [Unreleased]
 
+## [0.7.7] - 2023-12-27
+
+### Added
+- pathway.xpacks.llm.splitter.TokenCountSplitter.
+
 ## [0.7.6] - 2023-12-22
 
 ## New Features
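The `TokenCountSplitter` added in this release windows text into chunks of at most `max_tokens` tokens and prefers to cut at punctuation once a chunk is long enough. A minimal sketch of that windowing logic, with a toy whitespace tokenizer standing in for `tiktoken` (the `split_by_token_count` function and the toy tokenizer are illustrative, not part of Pathway's API):

```python
from typing import Dict, List, Tuple

PUNCTUATION = [".", "?", "!", "\n"]


def encode(text: str) -> List[str]:
    # Toy "tokenizer": one token per whitespace-separated word.
    return text.split()


def decode(tokens: List[str]) -> str:
    return " ".join(tokens)


def split_by_token_count(
    text: str, min_tokens: int, max_tokens: int, chars_per_token: int = 3
) -> List[Tuple[str, Dict]]:
    tokens = encode(text)
    output: List[Tuple[str, Dict]] = []
    i = 0
    while i < len(tokens):
        # Take a window of at most max_tokens tokens.
        chunk = decode(tokens[i : i + max_tokens])
        # Prefer to end the chunk at punctuation, provided the chunk keeps
        # roughly min_tokens worth of characters.
        last_punct = max((chunk.rfind(p) for p in PUNCTUATION), default=-1)
        if last_punct != -1 and last_punct > chars_per_token * min_tokens:
            chunk = chunk[: last_punct + 1]
        # Advance by however many tokens the (possibly trimmed) chunk used.
        i += len(encode(chunk))
        output.append((chunk, {}))
    return output


chunks = split_by_token_count("one two. three four five.", min_tokens=1, max_tokens=3)
# -> [("one two.", {}), ("three four five.", {})]
```

The real splitter follows the same loop, but tokenizes with `tiktoken` and measures chunk length in encoder tokens rather than words.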

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "pathway"
-version = "0.7.6"
+version = "0.7.7"
 edition = "2021"
 publish = false
 rust-version = "1.72.0"

python/pathway/xpacks/llm/splitter.py

Lines changed: 77 additions & 0 deletions
@@ -2,6 +2,7 @@
 A library of text splitters - routines which split a long text into smaller chunks.
 """
 
+import unicodedata
 from typing import Dict, List, Tuple
 
 
@@ -17,3 +18,79 @@ def null_splitter(txt: str) -> List[Tuple[str, Dict]]:
     The null splitter always returns a list of length one containing the full text and empty metadata.
     """
     return [(txt, {})]
+
+
+def _normalize_unicode(text: str):
+    """
+    Get rid of ligatures.
+    """
+    return unicodedata.normalize("NFKC", text)
+
+
+class TokenCountSplitter:
+    """
+    Splits a given string or a list of strings into chunks based on token count.
+
+    This splitter tokenizes the input texts and splits them into smaller parts
+    ("chunks"), ensuring that each chunk has a token count between `min_tokens`
+    and `max_tokens`. It also attempts to break chunks at sensible points, such
+    as punctuation marks.
+
+    Arguments:
+        min_tokens: minimum number of tokens in a chunk of text.
+        max_tokens: maximum size of a chunk in tokens.
+        encoding_name: name of the encoding from `tiktoken`.
+
+    Example:
+
+    # >>> from pathway.xpacks.llm.splitter import TokenCountSplitter
+    # >>> import pathway as pw
+    # >>> t = pw.debug.table_from_markdown(
+    # ...     '''| text
+    # ... 1| cooltext'''
+    # ... )
+    # >>> splitter = TokenCountSplitter(min_tokens=1, max_tokens=1)
+    # >>> t += t.select(chunks = pw.apply(splitter, pw.this.text))
+    # >>> pw.debug.compute_and_print(t, include_id=False)
+    # text     | chunks
+    # cooltext | (('cool', pw.Json({})), ('text', pw.Json({})))
+    """
+
+    CHARS_PER_TOKEN = 3
+    PUNCTUATION = [".", "?", "!", "\n"]
+
+    def __init__(
+        self,
+        min_tokens: int = 50,
+        max_tokens: int = 500,
+        encoding_name: str = "cl100k_base",
+    ):
+        self.min_tokens = min_tokens
+        self.max_tokens = max_tokens
+        self.encoding_name = encoding_name
+
+    def __call__(self, txt: str) -> List[Tuple[str, Dict]]:
+        import tiktoken
+
+        tokenizer = tiktoken.get_encoding(self.encoding_name)
+        text = _normalize_unicode(txt)
+        tokens = tokenizer.encode_ordinary(text)
+        output: List[Tuple[str, Dict]] = []
+        i = 0
+        while i < len(tokens):
+            chunk_tokens = tokens[i : i + self.max_tokens]
+            chunk = tokenizer.decode(chunk_tokens)
+            last_punctuation = max(
+                [chunk.rfind(p) for p in self.PUNCTUATION], default=-1
+            )
+            if (
+                last_punctuation != -1
+                and last_punctuation > self.CHARS_PER_TOKEN * self.min_tokens
+            ):
+                chunk = chunk[: last_punctuation + 1]
+
+            i += len(tokenizer.encode_ordinary(chunk))
+
+            output.append((chunk, {}))
+        return output
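The `_normalize_unicode` helper in the diff above relies on Unicode NFKC ("compatibility composition") normalization, which replaces compatibility characters such as typographic ligatures with their plain equivalents before the text is tokenized. A quick standard-library demonstration (the sample string is illustrative):

```python
import unicodedata

# The single-codepoint ligature "ﬁ" (U+FB01) expands to the two letters "fi"
# under NFKC, so "ﬁle" becomes the ordinary string "file".
normalized = unicodedata.normalize("NFKC", "\ufb01le")
print(normalized)  # -> file
```

This matters for token-count splitting because ligatures would otherwise tokenize differently from their plain-letter spellings.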
