2024-03-06: Token-based text splitting for data ingestion

@pamelafox released this 06 Mar 19:03

The highlight of this release is a new token-based text splitter, used by the prepdocs script when splitting content into chunks for the search index. The previous algorithm was based solely on character count, so it handled non-English documents poorly, along with any document that produces a higher-than-usual number of tokens per character. If you experience any regression in splitting quality as a result of this change, please file an issue.
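
To illustrate the core idea (measuring chunk size in tokens rather than characters), here is a minimal sketch of token-based splitting. It is not the prepdocs implementation, which is more sophisticated; the `split_by_tokens` helper name, the chunk size, and the overlap values are illustrative, and `tiktoken` with the `cl100k_base` encoding is assumed as the tokenizer.

```python
import tiktoken


def split_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with overlapping context.

    Hypothetical helper for illustration only; values are not from the release.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for recent OpenAI models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        # Decode the token slice back into text for this chunk
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # step back so adjacent chunks share some context
    return chunks
```

Because chunk boundaries are computed on the token sequence, a chunk holds the same number of tokens whether the source text is English prose or a script like CJK where a single character can map to several tokens, which is exactly the case the old character-count approach mishandled.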

What's Changed

New Contributors

Full Changelog: 2024-03-01...2024-03-06