Add UL2 data sampling and pretraining #358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Status: Open
janEbert wants to merge 122 commits into bigscience-workshop:main from janEbert:ul2
Commits (122, all by janEbert)
b2fc665  Fix `PretrainedFromHF` tokenizer with T5 training
13becf1  Allow passing existing causal attention masks
7f50532  Refactor masked LM sampling style selection
d8db189  Add more masked LM sampling styles
006c4e9  Allow Prefix-LM style masked LM
f802317  Add UL2 pretraining for T5 model
deed87f  Refactor span merging
728e076  Support UL2 for decoder-only models
42ece6b  Unconditionally use safe maximum sequence length
d18f84e  Add custom exceptions
fa5aa68  Error out on too long sequences
c7d8a8b  Remove additional sequence truncation
c722516  Prefer array-from-list creation
69f6e70  Remove redundant imports
f08a104  Fix not inserting prefixes
d2fd03e  Do not insert `extra_id` tokens for PrefixLM task
daf52cc  Document `max_seq_length_dec` argument
04be590  Skip redundant computations
7bc5a87  Fix PrefixLM mean location
775e99d  Pad decoder-only inputs to same length
538c30b  Fix decoder-only attention mask shape
ba4476c  Document index set selection for PrefixLM masking
678fbdc  Fix `max_ngrams` for normal sampling style
00479e5  Do not limit `max_predictions_per_seq`
795caef  Calculate and use amount of filtered tokens
689e15f  Document normal sampling style
e44d0e4  Fix PrefixLM possible spans calculation
075f05f  Use binary search for PrefixLM first tail index
6bc7471  Calculate n-gram indices lazily
a105f32  Fix code style
f0fe282  Prefer list comprehensions
11bd6db  Allow recognizing when UL2 is used
43eee93  Support UL2 tokens for all tokenizers
6686f04  Support `<extra_id>` tokens for GPT tokenizer
f6128c6  Fix tokenizer vocab access
8f48763  Revert inheriting from `T5Dataset`
7f99a12  Fix GPT tokenizer special token handling
535a306  Do inherit from `torch.utils.data.Dataset`
db623b3  Add whitespace
ef72280  Allow selectively disabling denoiser token
001b50c  Allow not replacing masks with sentinel tokens
23c052f  Support not adding mask tokens in span corruption
0f4fd3f  Fix expected number of added tokens
da1f4e9  Fix non-masked data
55320ea  Fix unclear wording
5d27b27  Adjust code style
23181ab  Fix covered index skipping
6032cc6  Prepend objective token before truncating
c9c336f  Automatically truncate sequences for decoder-only
b8003cb  Fix covered span skipping fix
e3d91a6  Make `build_index_mappings` public
e61e78f  Refactor getting sample
c3b0a55  Add sample packing to T5 dataset
c4d748b  Add sample packing to UL2 dataset
689b57e  Fix typo and comment placement
af204e7  Fix not supplying `--pack-samples` argument
78eb035  Add support for UL2R-style implementation
c03eed4  Fix T5 dataset packing
9e84f06  Refactor `get_sample` to return a list
5e2b4f5  Fix T5 sample packing
e2a0c36  Fix UL2 sample packing
c2884c8  Refactor samples dict creation
7eb7923  Fix desired seq length
dd4c0d0  Fix padding removal
58148f8  Allow repeating UL2 prompt token when packing
c41fecd  Allow packing different denoisers together
057bb47  Refactor sample packing functions
e2062b7  Repeat prompt by default when packing UL2
d31b89f  Support pipelining for decoder-only model
17dca4f  Fix GPT tokenizer vocab size query
bf9b1eb  Handle possibly empty list
c4aa4cd  Fix no newline at EOF
8d7a0df  Allow full prefix Prefix-LM attention sampling
9bd6e1e  Support PrefixLM models
ba4ab49  Allow setting number of few-shot examples
9f53171  Update task/dataset name
5b63d0b  Do not remove last token
639b71d  Fix PrefixLM contexts
127d1e4  Fix module refactor
1bb788d  Fix possible `TypeError`
cf5965a  Optionally add prefix tokens
a538238  Automatically add UL2 tokens
3a8bc35  Fix context lengths batch chunking
6f0e33a  Allow different models to be loaded
9c4c718  Fix context batch size padding
754cf21  Add xPos embeddings
08b0eaf  Add optional UL2 normal distribution scaling
15622d2  Allow evaluating encoder-decoder models
e5a6169  Fix not passing `scale_normal_std`
d583fe9  Add T5-style GLU layers
ad7de7e  Rename xPos embedding class
81a68f7  Integrate xPos embedding
46e145d  Handle xPos embedding
482f0ea  Do not use bias for 2nd MLP layer if using T5 GLU
4385f7b  Fix T5 GLU constructor arguments
2d24b13  Refactor samples dict creation
bd461f5  Move callees under caller
35b2956  Handle empty context
f0171e0  Handle more possible model types
92158d8  Fix fully truncated contexts with prefix tokens
3b7692f  Make T5 GLU checks safer
b37d3ee  Improve import code style
5959e89  Refactor dummy barriers
ce8c1a5  Refactor file name creation
3e52966  Allow packing only full documents
23efa88  Use full-doc packing for T5-style datasets
88eb98a  Fix trying to all-reduce non-existent bias
59e8451  Fix truncating packed sequences without padding
24d46ff  Speed up packed dataset indexing
600542d  Try to exit padding removal early
58831d2  Fix xPos embedding
fe45cea  Fix padding loss mask
15e7b98  Handle failure mode regarding non-DS checkpoints
ae45a9e  Fix decoder-only and no-mask-tokens seq lengths
0c91b96  Omit second objective token if without mask tokens
0c246c4  Fix NumPy deprecations
7ce8635  Fix supplied arguments
7290181  Do not add separator if S-denoising
628d847  Fix caching error
9c727e7  Fix number of labels calculation for decoder-only
4ffa951  Do not automatically add <EOS> token when packing
ff5787e  Allow silently ignoring causal attention mask
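Taken together, the commit titles above describe the PR's core mechanism: each training sample is assigned one of UL2's denoisers (R-, S-, or X-denoising), the matching objective token is prepended before any truncation, and the sample is then masked according to that denoiser's corruption settings. The following is a minimal sketch of that selection step; the `DENOISER_CONFIGS` table, the `build_ul2_sample` name, the `[R]`/`[S]`/`[X]` tokens, and all hyperparameters are illustrative assumptions loosely based on the UL2 paper (Tay et al., 2022), not this PR's actual code.

```python
import random

# Illustrative denoiser table in the spirit of UL2's mixture-of-denoisers:
# R-denoising (short spans, light corruption), S-denoising (Prefix-LM-style
# suffix corruption), X-denoising (long spans or heavy corruption).
# Tokens and numbers are assumptions, not the PR's actual configuration.
DENOISER_CONFIGS = [
    {"token": "[R]", "mask_ratio": 0.15, "mean_span": 3.0},
    {"token": "[S]", "mask_ratio": 0.25, "mean_span": None},  # Prefix-LM style
    {"token": "[X]", "mask_ratio": 0.15, "mean_span": 32.0},
    {"token": "[X]", "mask_ratio": 0.50, "mean_span": 3.0},
]


def build_ul2_sample(tokens, rng):
    """Pick a denoiser uniformly at random, prepend its objective token,
    and compute the corruption budget for the masking step."""
    config = rng.choice(DENOISER_CONFIGS)
    # Prepend the objective token before any truncation, in the spirit of
    # commit 6032cc6 ("Prepend objective token before truncating").
    sample = [config["token"]] + list(tokens)
    num_to_mask = max(1, round(config["mask_ratio"] * len(tokens)))
    return sample, config, num_to_mask


rng = random.Random(0)
sample, config, num_to_mask = build_ul2_sample("the cat sat on the mat".split(), rng)
```

The PR itself goes much further (sample packing, sentinel-token handling, decoder-only and pipeline support), but this is the rough shape of the sampling entry point the commits revolve around.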
Conversations
Review comment: `normal_mean` is not used, it seems.
Reply: It's used here. :)
d8db189#diff-e1d14be32f4489a01cb8d571804fbba003f7f90715ef3cb3a27d9099e0245d6fR298
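For context: the "normal" sampling style added in d8db189 draws masked-span lengths from a normal distribution, so `normal_mean` sets that distribution's mean. A minimal sketch of how such a parameter is plausibly consumed; the function name, signature, and standard-deviation choice are assumptions, not the code at the linked line.

```python
import numpy as np


def sample_span_length(rng, normal_mean, scale_normal_std=False):
    """Draw one masked-span length from a normal distribution.

    `normal_mean` sets the distribution's mean; `scale_normal_std` ties the
    standard deviation to the mean (an assumption echoing this PR's later
    `scale_normal_std` commits, 08b0eaf and e5a6169).
    """
    std = normal_mean / 2.0 if scale_normal_std else 1.0
    length = int(round(rng.normal(loc=normal_mean, scale=std)))
    return max(1, length)  # a span covers at least one token


rng = np.random.default_rng(42)
lengths = [sample_span_length(rng, normal_mean=3.0) for _ in range(5)]
```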