Size mismatch error during training #942
-
We are using Whisper to train on an Arabic dataset. All of our steps are in this notebook: https://colab.research.google.com/drive/1msFlRKDXnZsaAZhvFlYVyxPsUA6qGQFo?usp=sharing

But when we try to train, we get the following error:

```
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:346: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set
```
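As an aside, the `FutureWarning` quoted above is only a deprecation notice about the optimizer, not the size-mismatch error itself. A hedged way to switch to the PyTorch `AdamW` and silence it, assuming the `Seq2SeqTrainingArguments` setup from the notebook (the `output_dir` below is just a placeholder), would be:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-ar",  # hypothetical output path; keep whatever the notebook uses
    optim="adamw_torch",              # use torch.optim.AdamW instead of the deprecated transformers AdamW
)
```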
-
Hi, it appears that the text data is somehow not padded correctly to match the text decoder's context size (448). I'm not too familiar with the preprocessing pipelines in Hugging Face Transformers, so this might be better answered by @sanchit-gandhi, who wrote the notebook. (Thanks!)
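To confirm that over-long transcriptions are the cause, one quick check (a sketch; `tokenizer`, `common_voice`, and the `"sentence"` column are the names used in the linked notebook) is to count how many targets tokenize to more than 448 label ids:

```python
# count target transcriptions that exceed the decoder's context size / model.config.max_length
max_label_length = 448
n_too_long = sum(
    len(tokenizer(sentence).input_ids) > max_label_length
    for sentence in common_voice["train"]["sentence"]
)
print(f"{n_too_long} training examples exceed {max_label_length} labels")
```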
-
Hey @marthafikry! Cool to see that you're fine-tuning Whisper for Arabic!

The issue is with your target label sequences: some of them exceed the model's maximum generation length. These must be very long sequences, as the maximum generation length is 448; this is the longest sequence the model is configured to handle (`model.config.max_length`). We've got two options here: filter out the label sequences that are too long, or increase the model's max length.
What we can do is compute the label length of each target sequence:

```python
def prepare_dataset(batch):
    # load and resample audio data from 48 to 16kHz
    audio = batch["audio"]

    # compute input length in samples (used below to filter by audio duration)
    batch["input_length"] = len(audio["array"])

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids

    # compute labels length
    batch["labels_length"] = len(batch["labels"])
    return batch
```

And then filter out those that exceed the model's maximum length:

```python
MAX_DURATION_IN_SECONDS = 30.0
max_input_length = MAX_DURATION_IN_SECONDS * 16000

def filter_inputs(input_length):
    """Filter inputs with zero input length or longer than 30s"""
    return 0 < input_length < max_input_length

max_label_length = model.config.max_length

def filter_labels(labels_length):
    """Filter label sequences longer than max length (448)"""
    return labels_length < max_label_length
```

You can then apply these functions to the dataset:

```python
# pre-process
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])

# filter by audio length
common_voice = common_voice.filter(filter_inputs, input_columns=["input_length"], remove_columns=["input_length"])

# filter by label length
common_voice = common_voice.filter(filter_labels, input_columns=["labels_length"], remove_columns=["labels_length"])
```

That should pre-process the dataset and remove any label sequences that are too long for the model.

Alternatively, we can change the model's max length to any value we want:

```python
model.config.max_length = 500
```

This will update the max length to 500 tokens. Make sure to do this before you filter for it to take effect:

```python
max_label_length = model.config.max_length = 500

def filter_labels(labels_length):
    """Filter label sequences longer than the new max length (500)"""
    return labels_length < max_label_length
```
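As a quick sanity check (a sketch reusing the objects above; the `labels` column is kept by the filtering steps, so it can be inspected directly), you can confirm that no remaining label sequence exceeds the limit:

```python
longest_label = max(len(labels) for labels in common_voice["train"]["labels"])
print(f"longest label sequence: {longest_label} (limit: {model.config.max_length})")
assert longest_label < model.config.max_length
```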