Support token-based chunking in chunk_by_title using tiktoken

Currently, `chunk_by_title` in `unstructured.chunking.title` sizes chunks by character count via `max_characters`. I need it to support token-based chunking (e.g., 512 tokens per chunk) using `tiktoken`.

The current implementation isn't ideal for models like OpenAI GPT, whose limits are measured in tokens rather than characters. How can we modify `chunk_by_title` to chunk by tokens instead of characters?
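To make the request concrete, here is a rough sketch of a possible workaround, not a proposal for the final API. It assumes a two-pass approach: let `chunk_by_title` build title-aware chunks with a character budget, then re-split any chunk that still exceeds the token limit. The names `split_by_tokens`, `token_capped_title_chunks`, and `chars_per_token` are hypothetical, the 4-characters-per-token ratio is only a rough guess for English text, and `cl100k_base` is just one of the encodings `tiktoken` ships.

```python
from typing import Callable, List, Sequence


def split_by_tokens(
    text: str,
    max_tokens: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> List[str]:
    """Split text into pieces of at most max_tokens tokens each.

    encode/decode are injected so any tokenizer can be used.
    Caveat: cutting at an arbitrary token boundary can split a word
    (or, with byte-level BPE, a multi-byte character).
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = encode(text)
    return [
        decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def token_capped_title_chunks(
    elements,
    max_tokens: int = 512,
    encoding_name: str = "cl100k_base",
    chars_per_token: int = 4,  # assumption: rough heuristic, not a guarantee
) -> List[str]:
    """Hypothetical helper: title-aware chunking with a token cap on top."""
    # Imported lazily so split_by_tokens stays usable without these packages.
    import tiktoken
    from unstructured.chunking.title import chunk_by_title

    enc = tiktoken.get_encoding(encoding_name)

    # First pass: existing character-based chunking, sized to land
    # roughly near the token budget.
    chunks = chunk_by_title(elements, max_characters=max_tokens * chars_per_token)

    # Second pass: enforce the token limit on each chunk's text.
    # encode_ordinary treats special-token strings as plain text.
    texts: List[str] = []
    for chunk in chunks:
        texts.extend(
            split_by_tokens(chunk.text, max_tokens, enc.encode_ordinary, enc.decode)
        )
    return texts
```

This returns plain strings, so element metadata is lost, and the second pass ignores section boundaries inside an oversized chunk. That is why native support in `chunk_by_title` would be preferable.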

Any guidance or suggestions to solve this would be appreciated!

Support token-based chunking in chunk_by_title using tiktoken #4127
