Conversation

@myungjin (Contributor)

Description

The `map` function used to generate the language dataset takes `batch_size` as an argument, which defaults to 1000. This causes a problem during tokenization: the tokenizer adds pad tokens incorrectly when examples are padded batch by batch. This change sets the batch size correctly.
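
For context, here is a minimal sketch of the behavior, assuming the Hugging Face `datasets` and `transformers` APIs; the checkpoint name and example data are illustrative, not taken from this PR:

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
dataset = Dataset.from_dict({"text": ["short", "a much longer example sentence"]})

def tokenize(batch):
    # padding=True pads to the longest sequence *within the current batch*,
    # so an example's padded length depends on which other examples happen
    # to share its batch.
    return tokenizer(batch["text"], padding=True)

# With map's default batch_size=1000, a dataset larger than 1000 rows is
# split into chunks that get padded to different lengths. Passing an
# explicit batch_size (here, the full dataset) keeps padding consistent.
dataset = dataset.map(tokenize, batched=True, batch_size=len(dataset))
```

Padding to a fixed `max_length` is another common way to make padding independent of batch composition; the approach in this PR instead controls the batch size itself.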

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass
