Understanding NER architecture #9182
-
Dear spaCy team, I'm trying to find out how a (custom) NER model is structured in general. As far as I understand, the embedding + encoding in Tok2Vec is done with Bloom embeddings for basic token vectorization and "residualized" convolution + maxpooling-like operations for contextualization. The following steps are a bit unclear to me, as this is where we enter the "Attend" + Predict stage (of the EntityRecognizer) and the state-based parsing is applied. In particular, I was unable to find the lines in the source code where the various features from the buffer and the previous entities are pulled, as described in:

What I could extract from EntityRecognizer was the following:
In addition, it is unclear to me how the current state is included when computing the features on which the state update depends, as described on the slide (in the video below) in line five [feature = get_features(state, state_weights)]. As far as I understand, L1 (precomputable_affine) is computed in advance and thus cannot be equivalent to line five. Also, it seems to me that the Predict stage refers to layer L2, where the 64-dim, contextualized, token-related vector is used to predict the next action, so layer L2 corresponds to line six [probs = mlp(features)]? So far it (wrongly?) appears to me that the parsing behaves more like a "feed-forward" sequence tagger than a state machine, with contextualization coming only from the convolutions and not from state propagation. My questions in short:
I'd be delighted to fully understand the NER architecture. :) To make my current reading of the slide concrete, I've added a toy sketch of the loop below. Cheers
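Here is a toy, self-contained version of how I currently read lines five and six of the slide's pseudocode. Every name, the feature scheme and the action set are placeholders I made up for illustration; they are not spaCy internals:

```python
import numpy as np

# Toy rendering of my (possibly wrong) reading of lines five and six on the
# slide. All names, the feature scheme and the action set are made up.
ACTIONS = ["SHIFT", "BEGIN", "IN", "LAST", "OUT"]

def get_features(state, token_vectors):
    # line five: the features are pulled from the *current* state -- here
    # simply the contextual vector of the first token still on the buffer
    return token_vectors[state["buffer_pos"]]

def mlp(features, weights):
    # line six: probs = mlp(features)
    return features @ weights

def parse(token_vectors, weights):
    state = {"buffer_pos": 0, "actions": []}
    while state["buffer_pos"] < len(token_vectors):   # until the state is final
        features = get_features(state, token_vectors)
        probs = mlp(features, weights)
        action = ACTIONS[int(np.argmax(probs))]
        state["actions"].append(action)               # the chosen action updates the state
        state["buffer_pos"] += 1                      # each toy action consumes one token
    return state["actions"]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 64))                    # 64-dim contextualized token vectors
print(parse(vectors, rng.normal(size=(64, len(ACTIONS)))))
```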
-
It's a bit tricky to match up the implementation against the algorithm description on a line-by-line basis, because of the bulk computation.

For each token in the input batch, we precompute the pre-activation hidden layer values for each feature-position it could be in, where the feature-positions are things like "first item of the stack", "first item of the buffer", etc. This precomputation happens before we start stepping through the state.

Once we're stepping through the state, we calculate the feature tokens, and resolve them to the array indices into the precomputed values. We then sum together those items, and apply whatever activation we need to the result. This produces the hidden layer values that can then be used to predict the action scores. There's a separate, optimized implementation of this for prediction, and another implementation that's used during training.
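Very roughly, in numpy terms, it looks something like the sketch below. This is a simplified illustration with made-up names and shapes, not the actual spaCy/Thinc code (which also handles batching, padding and the backward pass differently):

```python
import numpy as np

# Illustrative sizes: contextual vectors from tok2vec, 6 feature positions
# (e.g. "first item of the buffer", "last entity", ...), a 64-dim hidden layer.
n_tokens, token_dim, n_feats, hidden_dim, n_actions = 50, 96, 6, 64, 5
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_tokens, token_dim))
W = rng.normal(size=(n_feats, token_dim, hidden_dim))
b = np.zeros(hidden_dim)
W_out = rng.normal(size=(hidden_dim, n_actions))

# Bulk precomputation, done once per batch before stepping through any state:
# for every token, its pre-activation hidden-layer contribution in every
# feature position it might end up in. Shape: (n_tokens, n_feats, hidden_dim).
precomputed = np.einsum("td,fdh->tfh", tokens, W)

def action_scores(feature_token_ids):
    """feature_token_ids[f] is the index of the token currently filling
    feature position f for this state, or -1 if that position is empty."""
    hidden = b.copy()
    for f, t in enumerate(feature_token_ids):
        if t >= 0:
            hidden += precomputed[t, f]   # just a lookup and a sum, no matmul here
    hidden = np.maximum(hidden, 0.0)      # whatever activation is used (ReLU here)
    return hidden @ W_out                 # scores for the possible transitions

# At each step of the state machine we only resolve which token fills which
# feature position and reuse the precomputed values.
print(action_scores([3, 4, 5, -1, 2, 0]))
```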