Experimental: Allow shortlists in marian-scorer (browsermt) #3
Draft
graemenail wants to merge 12 commits into browsermt-master from
Conversation
The cross_entropy_shortlist operation implements cross-entropy with a modified softmax stage. This modified softmax uses the shortlist indices to define the subset over which the softmax is normalised. The motivation is for shortlist entries to receive the result they would have in the absence of non-shortlist entries. This should allow some comparison between results in scoring and decoding modes.
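A minimal NumPy sketch of the operation described above (the function name mirrors the node, but the code is illustrative, not Marian's implementation):

```python
import numpy as np

def cross_entropy_shortlist(logits, shortlist, target):
    """Cross-entropy with the softmax normalised over the shortlist only.

    logits:    (vocab,) unnormalised scores over the full vocabulary
    shortlist: indices of the shortlist candidates
    target:    index of the reference token
    """
    sub = logits[shortlist]
    # Numerically stable log-partition computed from shortlist entries only.
    m = sub.max()
    log_z = m + np.log(np.exp(sub - m).sum())
    # Log-probabilities normalised w.r.t. the shortlist subset.
    log_probs = logits - log_z
    return -log_probs[target]
```

Because the normaliser is built from the shortlist subset alone, the probabilities of shortlist entries sum to one, as they would if non-shortlist entries did not exist.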
The default behaviour of the shortlist performs an index select to retain entries corresponding to shortlist candidates. This is desired for decoding, where the reduction in tensor size improves performance. The new behaviour retains the full size of the tensors and is designed to be used with marian-scorer; the list of shortlist candidate indices can then be used later, during loss computation.
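The contrast between the two behaviours might be sketched like this (NumPy, purely illustrative; `W` stands in for an output-layer tensor, not Marian's actual data structures):

```python
import numpy as np

vocab, dim = 6, 4
W = np.arange(vocab * dim, dtype=float).reshape(vocab, dim)  # stand-in output-layer tensor
shortlist = np.array([1, 3, 4])

# Decoding-style (default): index-select reduces the tensor to shortlist
# rows, so all later computation runs on the smaller tensor.
W_decode = W[shortlist]        # shape (3, 4)

# Scoring-style (this PR): the tensor keeps its full size; the shortlist
# indices are only consulted later, during loss computation.
W_score = W                    # shape (6, 4), unchanged
```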
Description
NOTE! This is a port of PR #2 onto the browsermt fork of Marian. This is necessary to use models that require intgemm.
Caveats
--gemm-precision int8shiftAlphaAll currently doesn't work with marian-scorer; int8 and int8shift do, however. The problem is related to the loading of the precomputed alphas; work is ongoing here.
This PR adds the possibility of using a shortlist during (re)scoring in Marian Scorer. Its aim is to obtain word scores from marian-scorer which are comparable to those obtained during decoding.
Motivation
During decoding, tensor indices corresponding to non-shortlist tokens are discarded. This reduction in tensor size reduces the cost of later computations and improves decoder performance. As such, the softmax + cross-entropy operation only ever sees shortlisted tokens. To imitate this in marian-scorer, we perform a modified softmax whose normalisation factor is calculated from the sum over the subset defined by the shortlist. The sum over shortlist entries alone is correctly normalised to unity, while the sum over the full vocabulary is greater than (or equal to) unity. Consequently, when scoring encounters tokens that are not in the shortlist, their scores are not bounded above by 0 and may therefore be positive.
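A small numeric check of the normalisation property described above (NumPy, not Marian code): shortlist probabilities sum to one, the full-vocabulary sum exceeds one, and a non-shortlist "log-probability" can come out positive.

```python
import numpy as np

logits = np.array([1.0, 0.0, 3.0])   # token 2 is NOT in the shortlist
shortlist = np.array([0, 1])

sub = logits[shortlist]
log_z = np.log(np.exp(sub).sum())    # normaliser from shortlist entries only
log_probs = logits - log_z

shortlist_sum = np.exp(log_probs[shortlist]).sum()  # exactly 1 by construction
full_sum = np.exp(log_probs).sum()                  # strictly greater than 1
# log_probs[2] = 3.0 - log(e^1 + e^0) > 0: an unbounded, positive "score".
```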
You must maintain the same batching as used in decoding! The size and contents of the generated shortlist depend on the particular batch, specifically the distinct tokens it contains.
For performance, the cross-entropy operation in Marian implements the softmax sum as part of its node operation. This implementation is different, and uses several node operations to accomplish the same result.
Finally, decoding and scoring are two distinct modes of operation, using different code paths and therefore different expression graphs: the decoder generates tokens sequentially, while the scorer is given them ahead of time. As such, floating-point errors propagate differently, and results may be numerically different.
Added dependencies: none
How to test
Using the same shortlist settings (e.g. --shortlist lex.s2t.gz 100 100), you should see roughly similar word scores when rescoring decoder output.
Checklist