
Dataset too large #6

@NHendrickson9616

Description


I am using the run_mlm.py file, but with my own copy, because I changed where the tokenizer is loaded from: it lives at a different path from the model, which is local.

While initially testing this method, I used only the first two lines of my dataset and it worked just fine, but now that I have expanded the input, I am getting this error:

IndexError                                Traceback (most recent call last)
Cell In[58], line 5
      3 scorer = MaskedLM('/data/user/home/nchendri/LongRun/')
      4 text =  dsMap['test']['text']
----> 5 ppl = scorer.get_perplexity(text, batch=32)
      6 print(ppl)
      7 print(list(zip(text, ppl)))

Cell In[57], line 162, in MaskedLM.get_perplexity(self, input_texts, batch)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 162, in <listcomp>(.0)
    159     return _e
    161 if self.max_length is not None:
--> 162     data.append([encode_mask(i) for i in range(min(self.max_length - len(self.sp_token_prefix), len(x)))])
    163 else:
    164     data.append([encode_mask(i) for i in range(len(x))])

Cell In[57], line 157, in MaskedLM.get_perplexity.<locals>.encode_mask(mask_position)
    155 # add the correct token id as the label
    156 label = [PAD_TOKEN_LABEL_ID] * _e['input_ids'].shape[1]
--> 157 label[mask_position + len(self.sp_token_prefix)] = masked_token_id
    158 _e['labels'] = torch.tensor([label], dtype=torch.long)
    159 return _e

IndexError: list assignment index out of range
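For what it's worth, the traceback suggests the label index `mask_position + len(self.sp_token_prefix)` can run past the end of the encoded sequence once longer inputs get truncated to `max_length`: the mask loop bounds itself by `min(max_length - prefix_len, len(tokens))`, but the actual encoding (`_e['input_ids'].shape[1]`) can be shorter than that once suffix/special tokens are accounted for. A minimal sketch of a bounds guard (hypothetical helper, not the library's actual code) would be:

```python
def safe_mask_positions(num_tokens, encoded_len, prefix_len, max_length):
    """Yield only mask positions whose label index stays inside the
    encoded sequence, guarding against truncation-induced mismatches.

    num_tokens   -- number of tokens in the raw (pre-truncation) text
    encoded_len  -- _e['input_ids'].shape[1], i.e. the real encoded length
    prefix_len   -- len(self.sp_token_prefix)
    max_length   -- the model's maximum sequence length
    """
    upper = min(max_length - prefix_len, num_tokens)
    for i in range(upper):
        # Skip positions whose label index would be out of range.
        if i + prefix_len < encoded_len:
            yield i


# Example: 10 raw tokens, but the encoding was truncated to 8 positions
# with a 1-token prefix, so only positions 0..6 are safely maskable.
positions = list(safe_mask_positions(10, 8, 1, 12))
```

As a workaround, pre-truncating the input texts (or lowering `max_length`) before calling `get_perplexity` may also avoid the crash, assuming the mismatch is indeed truncation-related.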
