2024-03-06: Token-based text splitting for data ingestion

@pamelafox released this 06 Mar 19:03

The highlight of this release is a new token-based text splitter, used by the prepdocs script when splitting content into chunks for the search index. The previous algorithm was based solely on character count, so it handled non-English documents poorly, along with any document that produces a higher-than-usual number of tokens per character. If you experience any regression in splitting quality as a result of this change, please file an issue.
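
To illustrate the core idea (measuring chunk size in tokens rather than characters), here is a minimal sketch of token-based splitting. It is not the prepdocs implementation, which is more sophisticated; the `split_by_tokens` helper name, the chunk size, and the overlap values are illustrative, and `tiktoken` with the `cl100k_base` encoding is assumed as the tokenizer.

```python
import tiktoken


def split_by_tokens(text: str, max_tokens: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, with overlapping context.

    Hypothetical helper for illustration only; values are not from the release.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer for recent OpenAI models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        # Decode the token slice back into text for this chunk
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break
        start = end - overlap  # step back so adjacent chunks share some context
    return chunks
```

Because chunk boundaries are computed on the token sequence, a chunk holds the same number of tokens whether the source text is English prose or a script like CJK where a single character can map to several tokens, which is exactly the case the old character-count approach mishandled.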

What's Changed

New Contributors

Full Changelog: 2024-03-01...2024-03-06