2024-03-06: Token-based text splitting for data ingestion
The highlight of this release is a new token-based text splitter, used by the prepdocs script when splitting content into chunks for the search index. The previous algorithm was based solely on character count, so prepdocs did not work well for non-English documents or any documents that produced a higher-than-usual number of tokens per character. If you experience any regression in splitting quality as a result of this change, please file an issue.
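To illustrate the idea, here is a minimal sketch of token-based chunking, not the actual prepdocs implementation: `split_by_tokens` and the whitespace stand-in tokenizer are hypothetical, and a real pipeline would use a model tokenizer (e.g. tiktoken) so that chunk sizes match what the embedding model actually sees.

```python
def split_by_tokens(text, encode, decode, max_tokens=500, overlap=50):
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens so that sentences cut at a
    boundary still appear whole in at least one chunk. Counting tokens
    rather than characters keeps chunks a consistent size for languages
    and scripts with very different tokens-per-character ratios.
    """
    tokens = encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        chunk_tokens = tokens[start:start + max_tokens]
        chunks.append(decode(chunk_tokens))
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Stand-in tokenizer for the sketch: whitespace-separated words as "tokens".
encode = lambda s: s.split()
decode = lambda toks: " ".join(toks)

text = " ".join(f"word{i}" for i in range(12))
chunks = split_by_tokens(text, encode, decode, max_tokens=5, overlap=1)
```

With a real tokenizer, only `encode`/`decode` change; the chunking logic stays the same.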
What's Changed
- Improve text splitter for non-English documents by @tonybaloney in #1326
- Restrict GitHub workflows run by @john0isaac in #1366
- Improvements to load balancer setup script by @pamelafox in #1348
- Update productionizing.md with link to search service size guide by @pamelafox in #1354
- Update README.md to delete old links by @pamelafox in #1372
- Update deploy_features.md link by @pamelafox in #1373
- Add suggestion to use `azd auth login` in the free low-cost deploy tutorial by @elbruno in #1214
- Bump the python-requirements group with 18 updates by @dependabot in #1368
Full Changelog: 2024-03-01...2024-03-06