Fix issue #939 - tokens batch_type may exceed max_batch_size (#1948)
* 1. Fix the batching logic in the BatchReader::get_next method to include padding tokens in the batch size increment. rebatch_input always passes batch_size_increment_is_fixed=true; since rebatch_input sorts the input by length in descending order, the first example in every batch is the longest, so the batch size increment is fixed at that example's length. This solves issue #939. Because batch_size_increment_is_fixed is false by default, the change does not affect the prefetching logic, as noted in the revert of the previous PR addressing this issue: #1314. A sketch of this logic follows the list below.
2. Fix the same issue in the _batch_iterator method of the CTranslate2/python/ctranslate2/extensions.py module.
3. Add tests for both changes
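
A minimal sketch of the padding-aware counting described in point 1, assuming the input is already sorted by length in descending order (as rebatch_input guarantees); the function and variable names here are illustrative, not CTranslate2's actual API:

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Example = std::vector<std::string>;

// Group examples into batches whose token count includes padding: because
// the input is sorted by length (descending), the first example in a batch
// is the longest, so once padded every example in that batch effectively
// occupies `increment` tokens.
std::vector<std::vector<Example>>
batch_by_tokens(const std::vector<Example>& examples, size_t max_batch_size) {
  std::vector<std::vector<Example>> batches;
  std::vector<Example> batch;
  size_t increment = 0;   // fixed per batch: length of its first example
  size_t cur_tokens = 0;  // padded size of the current batch, in tokens
  for (const auto& example : examples) {
    if (batch.empty()) {
      increment = example.size();
      cur_tokens = 0;
    } else if (cur_tokens + increment > max_batch_size) {
      // Adding another padded row would exceed the budget: flush the batch.
      batches.emplace_back(std::move(batch));
      batch.clear();
      increment = example.size();
      cur_tokens = 0;
    }
    batch.push_back(example);
    cur_tokens += increment;
  }
  if (!batch.empty())
    batches.emplace_back(std::move(batch));
  return batches;
}
```

With descending-sorted input, n examples padded to the first example's length occupy exactly n * increment tokens, so the check above keeps the padded batch within max_batch_size (except for a single example that exceeds it on its own, which forms a batch by itself).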
* Add a comment
* 1. Improve the get_next method implementation by allowing it to work with unsorted example inputs; see the sketch after this list.
2. Fix memory over-allocation when batch_type=tokens: reserving max_batch_size elements for the batch vector typically over-allocates, so shrink_to_fit is needed before returning the batch.
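
A sketch covering both points, with illustrative names rather than the actual get_next signature: tracking a running maximum handles unsorted input (padding widens every row to the longest example seen so far), and shrink_to_fit releases the worst-case reservation:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Example = std::vector<std::string>;

// Build one token-counted batch starting at `index`, advancing it past the
// consumed examples. The input may be unsorted, so the padded width of the
// batch is the longest example seen so far, not the first one.
std::vector<Example>
get_next_batch(const std::vector<Example>& examples,
               size_t& index,
               size_t max_batch_size) {
  std::vector<Example> batch;
  batch.reserve(max_batch_size);  // worst case: one token per example
  size_t max_length = 0;
  while (index < examples.size()) {
    const Example& example = examples[index];
    const size_t new_max = std::max(max_length, example.size());
    // After padding, the batch would occupy new_max tokens per example.
    if (!batch.empty() && new_max * (batch.size() + 1) > max_batch_size)
      break;
    batch.push_back(example);
    max_length = new_max;
    ++index;
  }
  // With batch_type=tokens the batch usually ends far smaller than the
  // reservation, so request that the unused capacity be released.
  batch.shrink_to_fit();
  return batch;
}
```

Without the shrink, every returned batch would keep capacity for max_batch_size examples even when it holds only a handful of long ones.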
* 1. Rename the batch_size_increment_is_fixed variable to consider_padding
2. Update documentation