There is an additional function you can import to just "cram" sentences together until they meet your target token size, for when you need quick, high-density chunks.

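As a mental model, cramming can be sketched as a greedy loop that keeps appending sentences until a chunk reaches the target token size. This is an illustrative sketch only, not the library's implementation, and the whitespace-based `countTokens` below is a stand-in for the real tokenizer:

```javascript
// Naive token counter used only for this sketch (the library uses a real
// tokenizer tied to the selected embedding model).
function countTokens(text) {
    return text.split(/\s+/).filter(Boolean).length;
}

// Greedy "cramming": pack sentences into the current chunk until adding the
// next one would exceed maxTokenSize, then start a new chunk. A single
// oversized sentence still becomes its own (oversized) chunk.
function cramSentences(sentences, maxTokenSize) {
    const chunks = [];
    let current = "";
    for (const sentence of sentences) {
        const candidate = current ? current + " " + sentence : sentence;
        if (countTokens(candidate) > maxTokenSize && current) {
            chunks.push(current); // current chunk is full; start a new one
            current = sentence;
        } else {
            current = candidate;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}

console.log(cramSentences(["One two three.", "Four five.", "Six seven eight nine."], 6));
```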
## Parameters
`cramit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

- `documents`: Array of documents. Each document is an object containing `document_name` and `document_text`.
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include its token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to strip the prefix from the output while still using it for the embedding calculations.

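The two prefix options work together: the prefix is present while embeddings and token lengths are computed, and is then optionally stripped from the returned text. The helper below is a hypothetical illustration of that interplay, not the library's actual code:

```javascript
// Illustrative sketch of how chunkPrefix and excludeChunkPrefixInResults
// interact (NOT the library's implementation). The prefixed text is what an
// embedding model would see; the returned text may have the prefix stripped.
function prepareChunk(text, { chunkPrefix = null, excludeChunkPrefixInResults = false } = {}) {
    const prefixed = chunkPrefix ? chunkPrefix + text : text;
    // embeddings / token lengths would be computed on `prefixed` here
    const resultText = excludeChunkPrefixInResults && chunkPrefix
        ? prefixed.slice(chunkPrefix.length)
        : prefixed;
    return { embeddingInput: prefixed, text: resultText };
}

const out = prepareChunk("ducks love bread", {
    chunkPrefix: "search_document: ",
    excludeChunkPrefixInResults: true
});
console.log(out.embeddingInput); // "search_document: ducks love bread"
console.log(out.text);           // "ducks love bread"
```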
Basic usage:

```javascript
import { cramit } from 'semantic-chunking';

let frogText = "A frog hops into a bank and asks the teller, \"Do you have any free lily pads?\" The teller smiles and says, \"No, but we do offer low-interest pond loans.\"";

// initialize documents array and add the frog text to it
let documents = [];
documents.push({
    document_name: "frog document",
    document_text: frogText
});

// call the cramit function passing in the documents array and the options object
async function main() {
    let myFrogChunks = await cramit(documents, { maxTokenSize: 300 });
    console.log("myFrogChunks", myFrogChunks);
}

main();
```

Look at the `example\example-cramit.js` file in the root of this project for a more complex example of using all the optional parameters.

---

## `sentenceit` - ✂️ When you just need a Clean Split
There is an additional function you can import to just split sentences.
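Conceptually, a sentence split can be as simple as breaking on end-of-sentence punctuation. The snippet below is a naive illustrative sketch; the library's splitter handles abbreviations, quotes, and other edge cases far more robustly:

```javascript
// Naive illustrative sentence splitter (NOT what the library does internally):
// split on sentence-ending punctuation followed by whitespace.
function naiveSentenceSplit(text) {
    return text
        .split(/(?<=[.!?])\s+/)
        .map(s => s.trim())
        .filter(Boolean);
}

console.log(naiveSentenceSplit("Hello there. How are you? Fine!"));
```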
## Parameters
`sentenceit` accepts an array of document objects and an optional configuration object. Here are the details for each parameter:

- `documents`: Array of documents. Each document is an object containing `document_name` and `document_text`.
- `returnEmbedding`: Boolean (optional, default `false`) - If set to `true`, each chunk will include an embedding vector. This is useful for applications that require semantic understanding of the chunks. The embedding model will be the same as the one specified in `onnxEmbeddingModel`.
- `returnTokenLength`: Boolean (optional, default `false`) - If set to `true`, each chunk will include its token length. This can be useful for understanding the size of each chunk in terms of tokens, which is important for token-based processing limits. The token length is calculated using the tokenizer specified in `onnxEmbeddingModel`.
- `chunkPrefix`: String (optional, default `null`) - A prefix to add to each chunk (e.g., "search_document: "). This is particularly useful when using embedding models that are trained with specific task prefixes, like the nomic-embed-text-v1.5 model. The prefix is added before calculating embeddings or token lengths.
- `excludeChunkPrefixInResults`: Boolean (optional, default `false`) - If set to `true`, the chunk prefix will be removed from the results. This is useful when you want to strip the prefix from the output while still using it for the embedding calculations.

Basic usage:
```javascript
import { sentenceit } from 'semantic-chunking';

let duckText = "A duck waddles into a bakery and quacks to the baker, \"I'll have a loaf of bread, please.\" The baker, amused, quickly wraps the loaf and hands it over. The duck takes a nibble, looks around, and then asks, \"Do you have any seeds to go with this?\" The baker, chuckling, replies, \"Sorry, we're all out of seeds today.\" The duck nods and continues nibbling on its bread, clearly unfazed by the lack of seed toppings. Just another day in the life of a bread-loving waterfowl! 🦆🍞";

// initialize documents array and add the duck text to it
let documents = [];
documents.push({
    document_name: "duck document",
    document_text: duckText
});

// call the sentenceit function passing in the documents array and the options object
async function main() {
    let myDuckChunks = await sentenceit(documents, { returnEmbedding: true });
    console.log("myDuckChunks", myDuckChunks);
}

main();
```

Look at the `example\example-sentenceit.js` file in the root of this project for a more complex example of using all the optional parameters.