Commit bc1a4a1 ("v2.4.0")
1 parent: c03b3e2

File tree: 7 files changed (+250, -39 lines)

CHANGELOG.md (4 additions, 0 deletions)

@@ -2,6 +2,10 @@
 
 All notable changes to this project will be documented in this file.
 
+## [2.4.0] - 2024-12-13
+### ✨ Added
+- Added `sentenceit` function.
+
 ## [2.3.7] - 2024-11-25
 ### 📦 Updated
 - Update `string-segmenter` patch version

README.md (75 additions, 23 deletions)

@@ -4,7 +4,7 @@ NPM Package for Semantically creating chunks from large texts. Useful for workfl
 
 ### Maintained by
 <a href="https://www.equilllabs.com">
-  <img src="https://raw.githubusercontent.com/jparkerweb/eQuill-Labs/refs/heads/main/src/static/images/logo-text-outline.png" alt="eQuill Labs" height="40">
+  <img src="https://raw.githubusercontent.com/jparkerweb/eQuill-Labs/refs/heads/main/src/static/images/logo-text-outline.png" alt="eQuill Labs" height="32">
 </a>
 
 ## Features
@@ -260,6 +260,33 @@ The Semantic Chunking Web UI allows you to experiment with the chunking paramete
 
 There is an additional function you can import to just "cram" sentences together till they meet your target token size for when you just need quick, high density chunks.
 
+
+## Parameters
+
+`cramit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:
+
+- `documents`: array of documents. Each document is an object containing `document_name` and `document_text`.
+```
+documents = [
+    { document_name: "document1", document_text: "..." },
+    { document_name: "document2", document_text: "..." },
+    ...
+]
+```
+
+- **Cramit Options Object:**
+
+    - `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps.
+    - `maxTokenSize`: Integer (optional, default `500`) - Maximum token size for each chunk.
+    - `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings.
+    - `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`).
+    - `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
+    - `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
+    - `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
+    - `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
+    - `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
+    - `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.
+
 Basic usage:
 
 ```javascript
@@ -285,37 +312,62 @@ main();
 
 Look at the `example/example-cramit.js` file in the root of this project for a more complex example of using all the optional parameters.
 
-### Tuning
+---
 
-The behavior of the `chunkit` function can be finely tuned using several optional parameters in the options object. Understanding how each parameter affects the function can help you optimize the chunking process for your specific requirements.
+## `sentenceit` - ✂️ When you just need a Clean Split
 
-#### `logging`
+There is an additional function you can import to just split sentences.
 
-- **Type**: Boolean
-- **Default**: `false`
-- **Description**: Enables detailed debug output during the chunking process. Turning this on can help in diagnosing how chunks are formed or why certain chunks are combined.
 
-#### `maxTokenSize`
+## Parameters
 
-- **Type**: Integer
-- **Default**: `500`
-- **Description**: Sets the maximum number of tokens allowed in a single chunk. Smaller values result in smaller, more numerous chunks, while larger values can create fewer, larger chunks. It’s crucial for maintaining manageable chunk sizes when processing large texts.
+`sentenceit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:
+
+- `documents`: array of documents. Each document is an object containing `document_name` and `document_text`.
+```
+documents = [
+    { document_name: "document1", document_text: "..." },
+    { document_name: "document2", document_text: "..." },
+    ...
+]
+```
 
-#### `onnxEmbeddingModel`
+- **Sentenceit Options Object:**
+
+    - `logging`: Boolean (optional, default `false`) - Enables logging of detailed processing steps.
+    - `onnxEmbeddingModel`: String (optional, default `Xenova/all-MiniLM-L6-v2`) - ONNX model used for creating embeddings.
+    - `dtype`: String (optional, default `fp32`) - Precision of the embedding model (options: `fp32`, `fp16`, `q8`, `q4`).
+    - `localModelPath`: String (optional, default `null`) - Local path to save and load models (example: `./models`).
+    - `modelCacheDir`: String (optional, default `null`) - Directory to cache downloaded models (example: `./models`).
+    - `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
+    - `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include the token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
+    - `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
+    - `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to remove the prefix from the results while still maintaining the prefix for embedding calculations.
 
-- **Type**: String
-- **Default**: `Xenova/paraphrase-multilingual-MiniLM-L12-v2`
-- **Description**: Specifies the model used to generate sentence embeddings. Different models may yield different qualities of embeddings, affecting the chunking quality, especially in multilingual contexts.
-- **Resource Link**: [ONNX Embedding Models](https://huggingface.co/models?pipeline_tag=feature-extraction&library=onnx&sort=trending)
-  Link to a filtered list of embedding models converted to ONNX library format by Xenova.
-  Refer to the Model table below for a list of suggested models and their sizes (choose a multilingual model if you need to chunk text other than English).
+Basic usage:
 
-#### `dtype`
+```javascript
+import { sentenceit } from 'semantic-chunking';
 
-- **Type**: String
-- **Default**: `fp32`
-- **Description**: Indicates the precision of the embedding model. Options are `fp32`, `fp16`, `q8`, `q4`.
-  `fp32` is the highest precision but also the largest size and slowest to load. `q8` is a good compromise between size and speed if the model supports it. All models support `fp32`, but only some support `fp16`, `q8`, and `q4`.
+let duckText = "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\" The baker, amused, quickly wraps the loaf and hands it over. The duck takes a nibble, looks around, and then asks, \"Do you have any seeds to go with this?\" The baker, chuckling, replies, \"Sorry, we're all out of seeds today.\" The duck nods and continues nibbling on its bread, clearly unfazed by the lack of seed toppings. Just another day in the life of a bread-loving waterfowl! 🦆🍞";
+
+// initialize documents array and add the duck text to it
+let documents = [];
+documents.push({
+    document_name: "duck document",
+    document_text: duckText
+});
+
+// call the sentenceit function passing in the documents array and the options object
+async function main() {
+    let myDuckChunks = await sentenceit(documents, { returnEmbedding: true });
+    console.log("myDuckChunks", myDuckChunks);
+}
+main();
+
+```
+
+Look at the `example/example-sentenceit.js` file in the root of this project for a more complex example of using all the optional parameters.
 
 ---
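Editorial aside: the cramit options documented in the README hunk above map directly onto a call. Below is a minimal sketch (not part of this commit) exercising several of those options; the document text is made up, the model name is the documented default, and the exact `chunkPrefix` string format ("search_document" vs. "search_document: ") is an assumption based on the README's nomic-embed-text-v1.5 note.

```javascript
// Hypothetical usage sketch based on the cramit options documented above.
import { cramit } from 'semantic-chunking';

const documents = [
    { document_name: "doc1", document_text: "Some long text to cram into dense chunks..." }
];

const chunks = await cramit(documents, {
    maxTokenSize: 500,                              // cap tokens per chunk (the documented default)
    onnxEmbeddingModel: "Xenova/all-MiniLM-L6-v2",  // documented default embedding model
    dtype: "q8",                                    // smaller/faster precision, if the model supports it
    returnEmbedding: true,                          // attach an embedding vector to each chunk
    returnTokenLength: true,                        // attach token counts
    chunkPrefix: "search_document",                 // task prefix for prefix-trained models (assumed format)
    excludeChunkPrefixInResults: true               // strip the prefix from returned text
});

console.log(chunks.length, chunks[0]);
```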

chunkit.js (111 additions, 11 deletions)

@@ -39,7 +39,6 @@ export async function chunkit(
         combineChunks = DEFAULT_CONFIG.COMBINE_CHUNKS,
         combineChunksSimilarityThreshold = DEFAULT_CONFIG.COMBINE_CHUNKS_SIMILARITY_THRESHOLD,
         onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
-        onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version)
         dtype = DEFAULT_CONFIG.DTYPE,
         localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
         modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
@@ -56,9 +55,6 @@ export async function chunkit(
         throw new Error('Input must be an array of document objects');
     }
 
-    // if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8'
-    if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; }
-
     // Initialize embedding utilities and set optional paths
     const { modelName, dtype: usedDtype } = await initializeEmbeddingUtils(
         onnxEmbeddingModel,
@@ -193,7 +189,6 @@ export async function cramit(
         logging = DEFAULT_CONFIG.LOGGING,
         maxTokenSize = DEFAULT_CONFIG.MAX_TOKEN_SIZE,
         onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
-        onnxEmbeddingModelQuantized, // legacy boolean (remove in next major version)
         dtype = DEFAULT_CONFIG.DTYPE,
         localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
         modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
@@ -210,11 +205,8 @@ export async function cramit(
         throw new Error('Input must be an array of document objects');
     }
 
-    // if legacy boolean is used (onnxEmbeddingModelQuantized), set dtype (model precision) to 'q8'
-    if (onnxEmbeddingModelQuantized === true) { dtype = 'q8'; }
-
     // Initialize embedding utilities with paths
-    const { modelName, isQuantized } = await initializeEmbeddingUtils(
+    await initializeEmbeddingUtils(
         onnxEmbeddingModel,
         dtype,
         localModelPath,
@@ -259,8 +251,8 @@ export async function cramit(
                 document_name: documentName,
                 number_of_chunks: numberOfChunks,
                 chunk_number: index + 1,
-                model_name: modelName,
-                is_model_quantized: isQuantized,
+                model_name: onnxEmbeddingModel,
+                dtype: dtype,
                 text: prefixedChunk
             };
 
@@ -296,3 +288,111 @@ export async function cramit(
     // Flatten the results array since we're processing multiple documents
     return allResults.flat();
 }
+
+
+// ------------------------------
+// -- Main sentenceit function --
+// ------------------------------
+export async function sentenceit(
+    documents,
+    {
+        logging = DEFAULT_CONFIG.LOGGING,
+        onnxEmbeddingModel = DEFAULT_CONFIG.ONNX_EMBEDDING_MODEL,
+        dtype = DEFAULT_CONFIG.DTYPE,
+        localModelPath = DEFAULT_CONFIG.LOCAL_MODEL_PATH,
+        modelCacheDir = DEFAULT_CONFIG.MODEL_CACHE_DIR,
+        returnEmbedding = DEFAULT_CONFIG.RETURN_EMBEDDING,
+        returnTokenLength = DEFAULT_CONFIG.RETURN_TOKEN_LENGTH,
+        chunkPrefix = DEFAULT_CONFIG.CHUNK_PREFIX,
+        excludeChunkPrefixInResults = false,
+    } = {}) {
+
+    if (logging) { printVersion(); }
+
+    // Input validation
+    if (!Array.isArray(documents)) {
+        throw new Error('Input must be an array of document objects');
+    }
+
+    if (returnEmbedding) {
+        // Initialize embedding utilities with paths
+        await initializeEmbeddingUtils(
+            onnxEmbeddingModel,
+            dtype,
+            localModelPath,
+            modelCacheDir
+        );
+    }
+
+    // Process each document
+    const allResults = await Promise.all(documents.map(async (doc) => {
+        if (!doc.document_text) {
+            throw new Error('Each document must have a document_text property');
+        }
+
+        // Split the text into sentences
+        const chunks = [];
+        for (const { segment } of splitBySentence(doc.document_text)) {
+            chunks.push(segment.trim());
+        }
+
+        if (logging) {
+            console.log('\nSENTENCEIT');
+            console.log('=============\nSentences\n=============');
+            chunks.forEach((chunk, index) => {
+                console.log("\n");
+                console.log(`--------------`);
+                console.log(`-- Sentence ${(index + 1)} --`);
+                console.log(`--------------`);
+                console.log(chunk.substring(0, 50) + '...');
+            });
+        }
+
+        const documentName = doc.document_name || ""; // Normalize document_name
+        const documentId = Date.now();
+        const numberOfChunks = chunks.length;
+
+        return Promise.all(chunks.map(async (chunk, index) => {
+            const prefixedChunk = chunkPrefix ? applyPrefixToChunk(chunkPrefix, chunk) : chunk;
+            const result = {
+                document_id: documentId,
+                document_name: documentName,
+                number_of_sentences: numberOfChunks,
+                sentence_number: index + 1,
+                text: prefixedChunk
+            };
+
+            if (returnEmbedding) {
+                result.model_name = onnxEmbeddingModel;
+                result.dtype = dtype;
+                result.embedding = await createEmbedding(prefixedChunk);
+
+                if (returnTokenLength) {
+                    try {
+                        const encoded = await tokenizer(prefixedChunk, { padding: true });
+                        if (encoded && encoded.input_ids) {
+                            result.token_length = encoded.input_ids.size;
+                        } else {
+                            console.error('Tokenizer returned unexpected format:', encoded);
+                            result.token_length = 0;
+                        }
+                    } catch (error) {
+                        console.error('Error during tokenization:', error);
+                        result.token_length = 0;
+                    }
+                }
+
+                // Remove prefix if requested (after embedding calculation)
+                if (excludeChunkPrefixInResults && chunkPrefix && chunkPrefix.trim()) {
+                    const prefixPattern = new RegExp(`^${chunkPrefix}:\\s*`);
+                    result.text = result.text.replace(prefixPattern, '');
+                }
+            }
+
+            return result;
+        }));
+    }));
+
+    // Flatten the results array since we're processing multiple documents
+    return allResults.flat();
+}
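Editorial aside: the earlier hunks in this file delete the legacy `onnxEmbeddingModelQuantized` boolean which, per the removed comment, forced `dtype` to `'q8'`. A minimal migration sketch for callers who relied on that flag (not part of the commit; document contents are placeholders):

```javascript
// Hypothetical migration example, assuming a caller of the public chunkit API.
import { chunkit } from 'semantic-chunking';

const documents = [{ document_name: "doc1", document_text: "..." }];

// Before (v2.3.x): the legacy boolean implicitly selected q8 precision.
// const chunks = await chunkit(documents, { onnxEmbeddingModelQuantized: true });

// After (v2.4.0): request the precision explicitly via dtype.
const chunks = await chunkit(documents, { dtype: 'q8' });
```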

example/example-chunkit.js (2 additions, 2 deletions)

@@ -54,7 +54,7 @@ let trackedTimeSeconds = (endTime - startTime) / 1000;
 trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));
 
 console.log("\n\n");
-// console.log("myTestChunks:");
-// console.log(myTestChunks);
+console.log("myTestChunks:");
+console.log(myTestChunks);
 console.log("length: " + myTestChunks.length);
 console.log("trackedTimeSeconds: " + trackedTimeSeconds);

example/example-sentenceit.js (55 additions, 0 deletions)

@@ -0,0 +1,55 @@
+// -----------------------
+// -- example-sentenceit.js --
+// --------------------------------------------------------------------------------
+// this is an example of how to use the sentenceit function
+// first we import the sentenceit function
+// then we setup the documents array with a text
+// then we call the sentenceit function with the text and an options object
+// the options object is optional
+//
+// the sentenceit function is faster than the chunkit function, but it is less accurate
+// useful for quickly splitting text into sentences, but not for semantic chunking
+// --------------------------------------------------------------------------------
+
+import { sentenceit } from '../chunkit.js'; // this is typically just "import { sentenceit } from 'semantic-chunking';", but this is a local test
+import fs from 'fs';
+
+// initialize documents array
+let documents = [];
+let textFiles = ['./example3.txt'];
+
+// read each text file and add it to the documents array
+for (const textFile of textFiles) {
+    documents.push({
+        document_name: textFile,
+        document_text: await fs.promises.readFile(textFile, 'utf8')
+    });
+}
+
+// start timing
+const startTime = performance.now();
+
+let myTestSentences = await sentenceit(
+    documents,
+    {
+        logging: false,
+        onnxEmbeddingModel: "Xenova/all-MiniLM-L6-v2",
+        dtype: 'fp32',
+        localModelPath: "../models",
+        modelCacheDir: "../models",
+        returnEmbedding: true,
+    }
+);
+
+// end timing
+const endTime = performance.now();
+
+// calculate tracked time in seconds
+let trackedTimeSeconds = (endTime - startTime) / 1000;
+trackedTimeSeconds = parseFloat(trackedTimeSeconds.toFixed(2));
+
+console.log("\n\n\n");
+console.log("myTestSentences:");
+console.log(myTestSentences);
+console.log("length: " + myTestSentences.length);
+console.log("trackedTimeSeconds: " + trackedTimeSeconds);

package-lock.json (2 additions, 2 deletions)

Generated file; diff not rendered by default.

package.json (1 addition, 1 deletion)

@@ -1,6 +1,6 @@
 {
   "name": "semantic-chunking",
-  "version": "2.3.9",
+  "version": "2.4.0",
   "description": "Semantically create chunks from large texts. Useful for workflows involving large language models (LLMs).",
   "homepage": "https://www.equilllabs.com/projects/semantic-chunking",
   "repository": {
