Conversation

@myungjin (Contributor)

Description

The `map` function used to generate the language dataset takes `batch_size` as an argument, which defaults to 1000. This causes a problem during tokenization: the tokenizer adds pad tokens incorrectly when examples are padded batch by batch. This change sets the batch size correctly.
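
For context, here is a minimal sketch of the behavior, assuming the Hugging Face `datasets` and `transformers` APIs; the checkpoint name and example data are illustrative, not taken from this PR:

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
dataset = Dataset.from_dict({"text": ["short", "a much longer example sentence"]})

def tokenize(batch):
    # padding=True pads to the longest sequence *within the current batch*,
    # so an example's padded length depends on which other examples happen
    # to share its batch.
    return tokenizer(batch["text"], padding=True)

# With map's default batch_size=1000, a dataset larger than 1000 rows is
# split into chunks that get padded to different lengths. Passing an
# explicit batch_size (here, the full dataset) keeps padding consistent.
dataset = dataset.map(tokenize, batched=True, batch_size=len(dataset))
```

Padding to a fixed `max_length` is another common way to make padding independent of batch composition; the approach in this PR instead controls the batch size itself.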

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass
