Add possibility to input multiple sentences #4
Conversation
simonepri left a comment
I left some comments; more are coming.
Also, if you don't manage to run the code formatter on your computer, you can copy-paste the code here: https://black.now.sh/
# Conflicts:
#   lm_scorer/models/abc/base.py
#   lm_scorer/models/gpt2.py
I have taken into account all your comments. There is still one typing issue; however, I am not sure how to resolve this one. Am I supposed to use a # type: ignore, or is there another possibility?
No, it is correctly pointing out a bug in the code.
I will also add some more tests for this new batch feature in test_gpt2.
Would you mind if I ask you to split this PR in two? It would be convenient to have a first PR (we can use this one) with the API changes + GPT2 implemented as just a for loop on the old single sentence code. Then the second PR will actually optimize GPT2 using batching. |
simonepri left a comment
Some small changes and we are almost ready to merge it.
Thanks for the work!!
Co-authored-by: Simone Primarosa <simonepri@outlook.com>
@dldk-gael Thanks!
No problem, and thank you for your patience. I was not familiar with all those tools and good practices; I learned a lot from your code.
In order to fully benefit from parallelization, I propose adding the possibility to feed sentences into the transformer models in batches.
The work is not finished yet and I still have to pass the CI tests (I will proceed as you suggest in #3), but it is working and I would like to have your opinion before going further.
In order to make minimal changes to your code, for now tokens_score returns a list of log_probs, ids, and tokens (one item for each sentence). However, it would be much more efficient to perform the reduction while we still have the log_prob scores as a tensor. One possibility would be for _tokens_log_prob to return a tensor of shape (number of sentences, max sentence length). What do you think of that? A rough sketch of what I mean is below.
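To make the idea concrete, here is a self-contained sketch of the kind of batched scoring and tensor-level reduction I have in mind. The function name batched_tokens_log_prob is just for illustration, it assumes a recent transformers version, and it is not the current lm_scorer code (which also handles BOS/EOS differently):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def batched_tokens_log_prob(sentences):
    # Tokenize the whole batch at once; padding makes the tensors rectangular.
    encoded = tokenizer(sentences, return_tensors="pt", padding=True)
    input_ids = encoded["input_ids"]            # (batch, max_len)
    attention_mask = encoded["attention_mask"]  # 1 for real tokens, 0 for padding
    with torch.no_grad():
        logits = model(input_ids, attention_mask=attention_mask).logits
    # Log-probability of each token given its left context (shift by one).
    log_probs = logits[:, :-1].log_softmax(dim=-1)
    token_log_probs = log_probs.gather(2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    pad_mask = attention_mask[:, 1:].eq(0)      # True where padding was added
    # Shape (batch, max_len - 1): one row per sentence, padding zeroed out.
    return token_log_probs.masked_fill(pad_mask, 0.0), pad_mask


scores, pad_mask = batched_tokens_log_prob(["Hello world.", "A longer example sentence."])
sentence_log_probs = scores.sum(dim=1)  # one log-probability per sentence, padding ignored
```

The point is that the reduction (the final sum) happens directly on the batched tensor instead of looping over per-sentence lists.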
Also, I did not use the pad_sequence function from torch.nn.utils.rnn, but recoded it so that it also returns a mask (with 1 at the positions of the padding values), which is useful later in the code to drop the scores of the padding values; something along the lines of the sketch below. I think this function should not live in the GPT2LMScorer class but somewhere else.
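A minimal sketch of that helper (the name pad_with_mask and its exact signature are just illustrative, not what is currently in the repo):

```python
from typing import List, Tuple

import torch


def pad_with_mask(
    sequences: List[torch.Tensor], padding_value: int = 0
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Like pad_sequence(batch_first=True), but also returns a padding mask."""
    batch_size = len(sequences)
    max_len = max(seq.size(0) for seq in sequences)
    padded = sequences[0].new_full((batch_size, max_len), padding_value)
    mask = torch.ones(batch_size, max_len, dtype=torch.bool)  # 1 (True) = padding
    for i, seq in enumerate(sequences):
        padded[i, : seq.size(0)] = seq
        mask[i, : seq.size(0)] = False  # 0 (False) = real token
    return padded, mask


# Usage: zero out the scores at padded positions before summing per sentence.
# scores.masked_fill(mask, 0.0).sum(dim=1)
```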