Merged
Changes from 177 commits
Commits
180 commits
d2c35fc
added train script but with prefix manually declared
May 7, 2022
f977b85
made new dataset
May 9, 2022
fcfbf17
minor adjustments
May 9, 2022
870dfd8
added capabilities for padding and prefix lm index
lintangsutawika May 9, 2022
791bbd0
added finetune script
lintangsutawika May 9, 2022
0f44b92
removed script
lintangsutawika May 9, 2022
2ff0815
added adjustments and new dataset
May 9, 2022
f0a79f6
try mlm dataset
May 9, 2022
eb416c7
minor changes
May 9, 2022
c0bc21b
minor addition of import packages
May 9, 2022
82e824c
minor error fix
May 9, 2022
7bb17ec
minor error fix
May 9, 2022
9929766
samples follow how gpt dataset is loaded
May 9, 2022
861c41f
added masked_lm_prob
May 9, 2022
fe95115
fixed tokenizer abstractions for HF tokenizer
May 9, 2022
8ea5943
added mask id
May 9, 2022
aa0d146
added mask id
May 9, 2022
215e8cc
added mask id
May 9, 2022
b6eef43
added mask id
May 9, 2022
bfc73a5
added fix
May 9, 2022
1890f87
added bos and eos token id
May 9, 2022
01392a9
no need for sentinal token
May 9, 2022
923decb
add aux functions
May 9, 2022
4611d67
add aux functions
May 9, 2022
4356de3
add aux functions
May 9, 2022
f31c686
add pad_id
May 9, 2022
a3951e8
changed lm predictions to t5
May 18, 2022
97b9a92
changed lm predictions to t5
May 18, 2022
fe73a73
changed lm predictions to t5
May 18, 2022
6a9cb75
changed lm predictions to t5
May 18, 2022
469848f
changed lm predictions to t5
May 18, 2022
e68283f
tokenizer add mask, cls, sep tokens
May 18, 2022
476ae94
commit latest changes
May 21, 2022
72ff575
commit latest changes
May 21, 2022
3647291
added sentinal tokens
May 21, 2022
fcdc987
added sentinal tokens
May 21, 2022
d6fbe78
added sentinal tokens
May 21, 2022
c44daba
added additional_special_tokens
May 21, 2022
a2725d8
added additional_special_tokens
May 21, 2022
0e94245
check t5_input and output
May 21, 2022
b599ab6
check decoder in and decoder out
May 21, 2022
626b0ae
made into input and output tokens
May 22, 2022
6008937
made into input and output tokens
May 22, 2022
c1524db
made into input and output tokens
May 22, 2022
c59c061
made into input and output tokens
May 22, 2022
e677e16
made into input and output tokens
May 22, 2022
9ffaeb9
made into input and output tokens
May 22, 2022
d0a6a2f
made into input and output tokens
May 22, 2022
47fd987
made into input and output tokens
May 23, 2022
4f377e8
made into input and output tokens
May 23, 2022
5c0bf76
added eos
May 23, 2022
7c63e4b
added eos
May 23, 2022
871124c
test text_token
May 24, 2022
55a593d
test text_token
May 24, 2022
adb59ca
test text_token
May 24, 2022
d71afb4
test text_token
May 24, 2022
7b99bb7
test text_token
May 24, 2022
922b09d
assigned array
May 24, 2022
469a02d
assigned array
May 24, 2022
15cb6a0
assigned array
May 24, 2022
5b0bc17
hardcoded sequence length
May 24, 2022
0671c79
check again
May 28, 2022
6db5c9b
show sentinal tokens
lintangsutawika May 28, 2022
8a58007
show sentinal tokens
lintangsutawika May 28, 2022
8b0bbc2
show sentinal tokens
lintangsutawika May 28, 2022
3d1b256
show sentinal tokens
lintangsutawika May 28, 2022
ce00fd9
add more special tokens
lintangsutawika May 28, 2022
3bcc50c
changed how mlm data is loaded
lintangsutawika May 28, 2022
76960f7
changed how mlm data is loaded
lintangsutawika May 28, 2022
229d661
changed how mlm data is loaded
lintangsutawika May 28, 2022
55e3df7
changed how mlm data is loaded
lintangsutawika May 28, 2022
05dea6d
changed how mlm data is loaded
lintangsutawika May 28, 2022
661c8bb
added new script
lintangsutawika May 28, 2022
97d3810
added new script
lintangsutawika May 28, 2022
71388ee
added new script
lintangsutawika May 28, 2022
b0f04d5
try t5 dataset
lintangsutawika May 28, 2022
cd43a54
try t5 dataset
lintangsutawika May 28, 2022
e0dc666
try t5 dataset
lintangsutawika May 28, 2022
866cee1
try t5 dataset
lintangsutawika May 28, 2022
0b56a7d
try t5 dataset
lintangsutawika May 28, 2022
5bb512b
try t5 dataset
lintangsutawika May 28, 2022
31d844f
try t5 dataset
lintangsutawika May 28, 2022
1d21963
try t5 dataset
lintangsutawika May 28, 2022
1429645
try t5 dataset
lintangsutawika May 28, 2022
f5341f8
try t5 dataset
lintangsutawika May 28, 2022
b05b175
try t5 dataset
lintangsutawika May 28, 2022
59a6e32
try t5 dataset
lintangsutawika May 28, 2022
ab76d49
developing
lintangsutawika May 28, 2022
0d8dfac
developing
lintangsutawika May 28, 2022
e629224
developing
lintangsutawika May 28, 2022
efcf50f
developing
lintangsutawika May 28, 2022
e5eb615
developing
lintangsutawika May 28, 2022
2eee807
developing
lintangsutawika May 28, 2022
5840a11
developing
lintangsutawika May 28, 2022
6d38f73
test to see output of get_ltor_masks_and_position_ids
lintangsutawika May 29, 2022
430fa6f
test to see output of get_ltor_masks_and_position_ids
lintangsutawika May 29, 2022
444314f
add new script
May 29, 2022
26c837d
add new script
May 29, 2022
feb023c
add new script
May 29, 2022
f30b9b1
changed settings
May 30, 2022
0a9203a
changed settings
May 30, 2022
672a866
tidy up
May 31, 2022
3780e61
changed tokenizer and position embedding
May 31, 2022
2130c31
modifying mlm to reflect original implementation
Jun 2, 2022
26afe43
minor fix
Jun 2, 2022
c1b9816
minor fix
Jun 2, 2022
453822f
minor fix
Jun 2, 2022
a62266a
minor fix
Jun 2, 2022
02dda79
minor fix
Jun 2, 2022
80331cb
minor fix
Jun 2, 2022
350227d
minor fix
Jun 2, 2022
d0eecd4
minor fix
Jun 2, 2022
243cebe
minor fix
Jun 2, 2022
da22e0b
minor fix
Jun 2, 2022
083dce7
minor fix
Jun 2, 2022
541e9d6
minor fix
Jun 2, 2022
86bfc8a
minor fix
Jun 2, 2022
e21a448
minor fix
Jun 2, 2022
f47d678
minor fix
Jun 2, 2022
415b8bc
minor fix
Jun 2, 2022
79bd6f8
minor fix
Jun 2, 2022
ba19fdf
minor fix
Jun 2, 2022
d200f4d
minor fix
Jun 2, 2022
102a461
minor fix
Jun 2, 2022
e530440
minor fix
Jun 2, 2022
2568039
minor fix
Jun 2, 2022
e6b4120
minor fix
Jun 2, 2022
fd7fe97
minor fix
Jun 2, 2022
861fc7b
minor fix
Jun 2, 2022
21c1984
minor fix
Jun 2, 2022
14e8d0f
minor fix
Jun 2, 2022
920343f
minor fix
Jun 2, 2022
a68873d
minor fix
Jun 2, 2022
5d43986
minor fix
Jun 2, 2022
79e8c1a
set correct seq len
Jun 2, 2022
786d252
refined sampling method
Jun 8, 2022
9110520
refined sampling method
Jun 8, 2022
7db34b9
refined sampling method
Jun 8, 2022
d946515
refined sampling method
Jun 8, 2022
bb4e656
refined sampling method
Jun 8, 2022
2e7161d
refined sampling method
Jun 8, 2022
00473e4
first commit, adding non causal mlm dataset
Jun 8, 2022
5992776
fixed mlm dataset
Jun 8, 2022
83f5dee
fixed mlm dataset
Jun 8, 2022
3235c2d
fixed mlm dataset
Jun 8, 2022
5449978
fixed mlm dataset
Jun 8, 2022
95c9851
fixed mlm dataset
Jun 8, 2022
9ff6172
Merge branch 'bigscience-workshop:main' into mt0
Jun 12, 2022
451318f
minor changes
Jun 14, 2022
edfaa19
Merge branch 'mt0' of https://github.com/lintangsutawika/Megatron-Dee…
Jun 14, 2022
5657083
removed multitask finetuning related scripts
Jun 22, 2022
1cee345
Merge branch 'bigscience-workshop:main' into mlm-adaptation
Jun 22, 2022
b4b87fc
remove any unrelated to dataset, revert arguments.py
Jun 22, 2022
5e80cc1
revert tokenizer
Jun 22, 2022
253e81f
Improve MLM
thomasw21 Jun 23, 2022
1d8a5c0
Woops
thomasw21 Jun 23, 2022
e6036a0
Remove a bunch of attributes
thomasw21 Jun 23, 2022
408f16a
Fix naming
thomasw21 Jun 23, 2022
ae87552
Woops
thomasw21 Jun 23, 2022
e79c9a2
Use GPTDataset as underlying implementation
thomasw21 Jun 23, 2022
62ee550
Fix sep tokens
thomasw21 Jun 23, 2022
a2e9ba8
Change attribute naming
thomasw21 Jun 23, 2022
64334a4
GPT Dataset doesn't handle slicing
thomasw21 Jun 23, 2022
7a872c2
Remove tokenizer
thomasw21 Jun 23, 2022
b6f02c5
WIP
thomasw21 Jun 23, 2022
86680bc
WIP
thomasw21 Jun 23, 2022
4b2d840
WIP
thomasw21 Jun 23, 2022
b935b85
WIP
thomasw21 Jun 23, 2022
9a74d69
WIP
thomasw21 Jun 23, 2022
64b1515
WIP
thomasw21 Jun 23, 2022
b210364
WIP
thomasw21 Jun 23, 2022
6398d1d
MLM
thomasw21 Jun 23, 2022
e0f7c92
Cleanup
thomasw21 Jun 23, 2022
6b92958
Update megatron/data/mlm_dataset.py
Jun 26, 2022
faf0b9e
Cleanup + fix off by one issue
thomasw21 Jun 27, 2022
0e3ee15
Missing vocab extra ids
thomasw21 Jun 27, 2022
92070ce
Woops
thomasw21 Jun 27, 2022
ea69602
Understanding off by one isse
thomasw21 Jun 27, 2022
4dbe448
Woops
thomasw21 Jun 27, 2022
8f42790
Add
thomasw21 Jun 27, 2022
3 changes: 3 additions & 0 deletions megatron/arguments.py
@@ -925,6 +925,9 @@ def __call__(self, parser, args, values, option_string=None):
'specific positions. This option tries to un-bias the loss by reweighting loss on specific '
'positions based on how frequently we train on that position.'
'This is mostly used for prefix_lm training')
group.add_argument("--noise_density", type=float, default=None, help="Span corruption noise density")
group.add_argument("--mean_noise_span_length", type=int, default=None, help="Span corruption mean noise span length")


return parser

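A minimal sketch (not part of this diff) of how the two new flags are expected to reach the MLM dataset builder added below, assuming Megatron's usual get_args() plumbing and standard flags such as --data-path, --split, --seq-length and --mmap-warmup; the provider function itself is hypothetical:

from megatron import get_args
from megatron.data.mlm_dataset import build_train_valid_test_datasets

def train_valid_test_datasets_provider(train_val_test_num_samples):
    # Hypothetical wiring, for illustration only.
    # Note: the tokenizer must also be launched with --vocab-extra-ids (e.g. 100)
    # so that sentinel tokens exist (see the assert in MLMDataset below).
    args = get_args()
    return build_train_valid_test_datasets(
        data_prefix=args.data_path,
        data_impl=args.data_impl,
        splits_string=args.split,
        train_valid_test_num_samples=train_val_test_num_samples,
        sequence_length=args.seq_length,
        noise_density=args.noise_density,
        mean_noise_span_length=args.mean_noise_span_length,
        seed=args.seed,
        skip_warmup=(not args.mmap_warmup),
    )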
2 changes: 1 addition & 1 deletion megatron/data/gpt_dataset.py
@@ -35,7 +35,7 @@ def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,

# Single dataset.
if len(data_prefix) == 1:
all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
all_train_datasets, all_valid_datasets, all_test_datasets = _build_train_valid_test_datasets(data_prefix[0],
data_impl, splits_string,
train_valid_test_num_samples,
seq_length, seed, skip_warmup)
372 changes: 372 additions & 0 deletions megatron/data/mlm_dataset.py
@@ -0,0 +1,372 @@
"""Non-Causal Mask Language Model Finetune Style dataset."""

import numpy as np
import torch

from megatron import print_rank_0, get_tokenizer
from megatron.data.blendable_dataset import BlendableDataset
from megatron.data.dataset_utils import get_datasets_weights_and_num_samples
from megatron.data.dataset_utils import get_train_valid_test_split_, get_indexed_dataset_
from megatron.data.gpt_dataset import GPTDataset


def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
train_valid_test_num_samples,
sequence_length,
noise_density,
mean_noise_span_length,
seed,
skip_warmup
):
assert noise_density is not None
assert mean_noise_span_length is not None

if len(data_prefix) == 1:
return _build_train_valid_test_datasets(
data_prefix=data_prefix[0],
data_impl=data_impl,
splits_string=splits_string,
train_valid_test_num_samples=train_valid_test_num_samples,
sequence_length=sequence_length,
noise_density=noise_density,
mean_noise_span_length=mean_noise_span_length,
seed=seed,
skip_warmup=skip_warmup
)
# Blending dataset.
# Parse the values.
output = get_datasets_weights_and_num_samples(data_prefix,
train_valid_test_num_samples)
prefixes, weights, datasets_train_valid_test_num_samples = output

# Build individual datasets.
train_datasets = []
valid_datasets = []
test_datasets = []
for i in range(len(prefixes)):
train_ds, valid_ds, test_ds = _build_train_valid_test_datasets(
data_prefix=prefixes[i],
data_impl=data_impl,
splits_string=splits_string,
train_valid_test_num_samples=datasets_train_valid_test_num_samples[i],
sequence_length=sequence_length,
noise_density=noise_density,
mean_noise_span_length=mean_noise_span_length,
seed=seed,
skip_warmup=skip_warmup
)
if train_ds:
train_datasets.append(train_ds)
if valid_ds:
valid_datasets.append(valid_ds)
if test_ds:
test_datasets.append(test_ds)

# Blend.
blending_train_dataset = None
if train_datasets:
blending_train_dataset = BlendableDataset(train_datasets, weights)
blending_valid_dataset = None
if valid_datasets:
blending_valid_dataset = BlendableDataset(valid_datasets, weights)
blending_test_dataset = None
if test_datasets:
blending_test_dataset = BlendableDataset(test_datasets, weights)

return (blending_train_dataset, blending_valid_dataset,
blending_test_dataset)


def _build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
train_valid_test_num_samples,
sequence_length,
noise_density,
mean_noise_span_length,
seed,
skip_warmup):
"""Build train, valid, and test datasets."""


# Indexed dataset.
indexed_dataset = get_indexed_dataset_(data_prefix,
data_impl,
skip_warmup)

total_num_of_documents = indexed_dataset.sizes.shape[0] - 1
splits = get_train_valid_test_split_(splits_string, total_num_of_documents)
# Print stats about the splits.
print_rank_0(' > dataset split:')

def print_split_stats(name, index):
print_rank_0(' {}:'.format(name))
print_rank_0(' document indices in [{}, {}) total of {} '
'documents'.format(splits[index], splits[index + 1],
splits[index + 1] - splits[index]))
start_index = indexed_dataset.doc_idx[splits[index]]
end_index = indexed_dataset.doc_idx[splits[index + 1]]
print_rank_0(' sentence indices in [{}, {}) total of {} '
'sentences'.format(start_index, end_index,
end_index - start_index))
print_split_stats('train', 0)
print_split_stats('validation', 1)
print_split_stats('test', 2)

def build_dataset(index, name):
dataset = None
if splits[index + 1] > splits[index]:
# Build the dataset accordingly.
documents = np.arange(start=splits[index], stop=splits[index + 1],
step=1, dtype=np.int32)
dataset = MLMDataset(
indexed_dataset=indexed_dataset,
documents=documents,
noise_density=noise_density,
mean_noise_span_length=mean_noise_span_length,
name=name,
data_prefix=data_prefix,
sequence_length=sequence_length,
num_samples=train_valid_test_num_samples[index],
seed=seed,
)
return dataset

train_dataset = build_dataset(0, 'train')
valid_dataset = build_dataset(1, 'valid')
test_dataset = build_dataset(2, 'test')

return (train_dataset, valid_dataset, test_dataset)


class MLMDataset(torch.utils.data.Dataset):

def __init__(
self,
name,
indexed_dataset,
documents,
data_prefix,
sequence_length,
num_samples,
seed,
noise_density=0.15,
mean_noise_span_length=3
):

# Params to store.
self.name = name
self.seed = seed
self.sequence_length = sequence_length

# Dataset.
self.indexed_dataset = indexed_dataset

self.noise_density = noise_density
self.mean_noise_span_length = mean_noise_span_length
# T5-like span masked language modeling will fuse consecutively masked tokens into a single sentinel token.
# To ensure that the input length is `sequence_length`, we need to increase the maximum length
# according to `noise_density` and `mean_noise_span_length`. We can also define the label length accordingly.
number_of_raw_tokens, inputs_length, targets_length, num_noise_spans = compute_input_and_target_lengths(
# +1 so that, after the one-token shift required by autoregressive training, we still end up with `sequence_length` tokens.
sequence_length=self.sequence_length + 1,
noise_density=self.noise_density,
mean_noise_span_length=self.mean_noise_span_length
)
self.number_of_raw_tokens = number_of_raw_tokens
self.inputs_length = inputs_length
self.targets_length = targets_length
self.num_noise_spans = num_noise_spans

# Build the samples mapping.
self._gpt_dataset = GPTDataset(
name=self.name,
data_prefix=data_prefix,
documents=documents,
indexed_dataset=self.indexed_dataset,
num_samples=num_samples,
seq_length=number_of_raw_tokens,
seed=seed
)

# Vocab stuff.
tokenizer = get_tokenizer()
self.sep_id = tokenizer.sep
self.sentinel_token_ids = tokenizer.additional_special_tokens_ids
assert len(self.sentinel_token_ids) > 0, "Provide the argument --vocab-extra-ids 100 to the script"
assert len(self.sentinel_token_ids) >= self.num_noise_spans, "Not enough sentinel tokens, please add more"

def __len__(self):
return len(self._gpt_dataset)

def __getitem__(self, idx):
if isinstance(idx, slice):
raise NotImplementedError

sample = self._gpt_dataset[idx]["text"]

return build_training_sample(
sample=sample,
inputs_length=self.inputs_length,
targets_length=self.targets_length,
num_noise_spans=self.num_noise_spans,
sep_id=self.sep_id,
all_sentinel_token_ids=self.sentinel_token_ids,
)


def build_training_sample(
sample,
inputs_length,
targets_length,
num_noise_spans,
sep_id,
all_sentinel_token_ids,
):
"""Build training sample.

Arguments:
sample: int32 tensor
inputs_length: integer
targets_length: integer
num_noise_spans: integer
sep_id: integer
all_sentinel_token_ids: List[int]
Returns:
Dict with the following keys:
- `input_tokens`: int32 tensor of length `inputs_length`,
- `target_tokens`: int32 tensor of length `targets_length + 1`,
"""

spans_start, mask_indices = random_spans_noise_mask(
inputs_length=inputs_length,
targets_length=targets_length,
num_noise_spans=num_noise_spans,
)
spans_end = np.concatenate([
spans_start[1:], np.full((1,), len(sample), dtype=np.int32)]
)

sentinel_token_ids = all_sentinel_token_ids[:num_noise_spans]
Review comment (Collaborator):
Given num_noise_spans is always the same, it might be slightly faster to store sentinel_token_ids as a class attribute of MLMDataset and feed it as an argument to the function. I also wonder if it wouldn't be better to make num_noise_spans probabilistic instead of deterministic.

Reply (Member):
I also have a strong intuition that we should want to change those values. But the idea is to have T5 MLM here and rely on their numbers.

input_token_ids = np.concatenate(
[
elt
for start, end, sentinel_token in zip(spans_start[::2], spans_end[::2], sentinel_token_ids)
for elt in [sample[start: end], np.full((1,), sentinel_token, dtype=np.int32)]
] +
[np.full((1,), sep_id, dtype=np.int32)]
)
target_token_ids = np.concatenate(
[
elt
for start, end, sentinel_token in zip(spans_start[1::2], spans_end[1::2], sentinel_token_ids)
for elt in [np.full((1,), sentinel_token, dtype=np.int32), sample[start: end]]
] +
[np.full((1,), sep_id, dtype=np.int32)]
)

return {
'input_tokens': input_token_ids,
'target_tokens': target_token_ids
}


def compute_input_and_target_lengths(sequence_length, noise_density, mean_noise_span_length):
"""This function is copy of `random_spans_helper <https://github.com/google-research/text-to-text-transfer-transformer/blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2466>`__ .
Training parameters to avoid padding with random_spans_noise_mask.
When training a model with random_spans_noise_mask, we would like to set the other
training hyperparmeters in a way that avoids padding.
This function helps us compute these hyperparameters.
The number of noise tokens and the number of noise spans and non-noise spans
are determined deterministically as follows:
num_noise_tokens = round(length * noise_density)
num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length)
We assume that each noise span in the input is replaced by extra_tokens_per_span_inputs sentinel tokens,
and each non-noise span in the targets is replaced by extra_tokens_per_span_targets sentinel tokens.
This function tells us the required number of tokens in the raw example (for split_tokens())
as well as the length of the encoded targets. Note that this function assumes
the inputs and targets will have SEP appended and includes that in the reported length.
Args:
inputs_length: an integer - desired length of the tokenized inputs sequence
noise_density: a float
mean_noise_span_length: a float
Returns:
tokens_length: length of original text in tokens
targets_length: an integer - length in tokens of encoded targets sequence
"""

def _tokens_length_to_inputs_length_targets_length(_tokens_length):
num_noise_tokens = int(round(_tokens_length * noise_density))
num_nonnoise_tokens = _tokens_length - num_noise_tokens
_num_noise_spans = int(round(num_noise_tokens / mean_noise_span_length))
# inputs contain all nonnoise tokens, sentinels for all noise spans and one SEP token.
_input_length = num_nonnoise_tokens + _num_noise_spans + 1
_output_length = num_noise_tokens + _num_noise_spans + 1
return _input_length, _output_length, _num_noise_spans

tokens_length = sequence_length
inputs_length, targets_length, num_noise_spans = _tokens_length_to_inputs_length_targets_length(tokens_length)
while inputs_length + targets_length > sequence_length:
tokens_length -= 1
inputs_length, targets_length, num_noise_spans = _tokens_length_to_inputs_length_targets_length(tokens_length)

# tokens_length is the number of raw tokens we need to get
# inputs_length will be the input
# targets_length will be the target
# num_noise_spans is the number of spans we have to replace
return tokens_length, inputs_length, targets_length, num_noise_spans
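
For intuition (not part of the file), tracing compute_input_and_target_lengths with typical T5-style settings shows how the raw-token budget shrinks so that inputs and targets fit the requested length; the 512/0.15/3 settings below are just an example:

# Illustrative trace; the values follow from stepping through the loop above.
tokens_length, inputs_length, targets_length, num_noise_spans = compute_input_and_target_lengths(
    sequence_length=512, noise_density=0.15, mean_noise_span_length=3)
# tokens_length  == 464  -> raw tokens to fetch from the underlying GPTDataset
# inputs_length  == 418  -> non-noise tokens + one sentinel per span + SEP
# targets_length == 94   -> noise tokens + one sentinel per span + SEP
# num_noise_spans == 23
# inputs_length + targets_length == 512, i.e. the budget is filled exactly.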


def random_spans_noise_mask(
inputs_length,
targets_length,
num_noise_spans,
):

"""This function is inspired from `random_spans_noise_mask <https://github.com/google-research/text-to-text-transfer-transformer/blob/84f8bcc14b5f2c03de51bd3587609ba8f6bbd1cd/t5/data/preprocessors.py#L2682>`__ .
Noise mask consisting of random spans of noise tokens.
Spans alternate between non-noise and noise, beginning with non-noise.
Args:
inputs_length: int32 scalar
targets_length: int32 scalar
num_noise_spans: int32 scalar
Returns:
a int8 tensor with shape [num_noise_spans]
a boolean tensor with shape [length]
"""
# # pick the lengths of the noise spans and the non-noise spans
num_noise_tokens = targets_length - num_noise_spans - 1
num_nonnoise_tokens = inputs_length - num_noise_spans - 1
number_of_raw_tokens = num_noise_tokens + num_nonnoise_tokens

def _random_segmentation(num_items, num_segments):
"""Partition a sequence of items randomly into non-empty segments.
Args:
num_items: an integer scalar > 0
num_segments: an integer scalar in [1, num_items]
Returns:
a Tensor with shape [num_segments] containing positive integers that add
up to num_items
"""
mask_indices = np.arange(num_items - 1) < (num_segments - 1)
# TODO @thomasw21 handle random state correctly, ie synchronized across TP.
Review comment (@TevenLeScao, Jun 26, 2022):
This scares me a bit because TP random-state issues are hard to debug, but to be honest we should just test ASAP to see if the loss goes down at the expected rate.

Reply (Member):
Ah yes, I need to double-check that. I can have a go at it; I had forgotten about this TODO.
# we might not care as get_batch_pipe broadcasts data to all devices.
np.random.shuffle(mask_indices)
first_in_segment = np.pad(mask_indices, [[1, 0]], constant_values=0)
segment_id = np.cumsum(first_in_segment)
# count length of sub segments assuming that list is sorted
_, segment_length = np.unique(segment_id, return_counts=True)
return segment_length

noise_span_lengths = _random_segmentation(num_noise_tokens, num_noise_spans)
nonnoise_span_lengths = _random_segmentation(num_nonnoise_tokens, num_noise_spans)

interleaved_span_lengths = np.reshape(
np.stack([nonnoise_span_lengths, noise_span_lengths], axis=1), [num_noise_spans * 2]
)
span_starts = np.concatenate([np.full((1,), 0, dtype=np.int32), np.cumsum(interleaved_span_lengths)[:-1]])
span_start_indicator = np.zeros((number_of_raw_tokens,), dtype=np.int8)
span_start_indicator[span_starts] = True
span_num = np.cumsum(span_start_indicator)
is_noise = np.equal(span_num % 2, 1)

return span_starts, is_noise
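
A small self-contained illustration (not part of the diff) of build_training_sample with made-up token and sentinel ids; the exact span boundaries vary across calls because _random_segmentation shuffles, but the layout of the two outputs is fixed:

import numpy as np
from megatron.data.mlm_dataset import build_training_sample

# Toy sizes: 2 noise spans, 4 noise tokens, 10 non-noise tokens,
# so inputs_length = 10 + 2 + 1 = 13 and targets_length = 4 + 2 + 1 = 7.
sample = np.arange(100, 114, dtype=np.int32)        # 14 made-up token ids
out = build_training_sample(
    sample=sample,
    inputs_length=13,
    targets_length=7,
    num_noise_spans=2,
    sep_id=1,                                        # made-up SEP id
    all_sentinel_token_ids=[32000, 32001],           # made-up sentinel ids
)
# out["input_tokens"]  (13 ids): [non-noise span, 32000, non-noise span, 32001, 1]
# out["target_tokens"] (7 ids):  [32000, noise span, 32001, noise span, 1]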