Understanding NER architecture #9182
-
Dear spaCy team, I'm trying to find out how a (custom) NER model is structured in general. As far as I understand, the embedding + encoding in Tok2Vec is done with Bloom embeddings for basic token vectorization and "residualized" convolution + maxpooling-like operations for contextualization. The following steps are a bit unclear to me, as this is where we enter the "Attend" + Predict stage (of the EntityRecognizer) and the state-based parsing is applied. In particular, I was unable to find the lines in the source code where the various features from the buffer and the previous entities are pulled, as described in:

What I could extract from EntityRecognizer was the following:
In addition, it is unclear to me how the current state is included when computing the features on which the state update depends, as described on the slide (in the video below) in line five [feature = get_features(state, state_weights)]. As far as I understand, L1 (precomputable_affine) is computed in advance and thus cannot be equivalent to line five. Also, it seems to me that the Predict stage refers to layer L2, where the 64-dim, contextualized, token-related vector is used to predict the next action, so layer L2 corresponds to line six [probs = mlp(features)]? So far it (wrongly?) appears to me that the parsing behaves more like a "feed-forward" sequence tagger than a state machine, with contextualization coming only from the convolutions and not from state propagation. My questions in short:
I'd be delighted to fully understand the NER architecture. :) To make my current reading of the slide concrete, I've added a toy sketch of the loop below. Cheers
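Here is a toy, self-contained version of how I currently read lines five and six of the slide's pseudocode. Every name, the feature scheme and the action set are placeholders I made up for illustration; they are not spaCy internals:

```python
import numpy as np

# Toy rendering of my (possibly wrong) reading of lines five and six on the
# slide. All names, the feature scheme and the action set are made up.
ACTIONS = ["SHIFT", "BEGIN", "IN", "LAST", "OUT"]

def get_features(state, token_vectors):
    # line five: the features are pulled from the *current* state -- here
    # simply the contextual vector of the first token still on the buffer
    return token_vectors[state["buffer_pos"]]

def mlp(features, weights):
    # line six: probs = mlp(features)
    return features @ weights

def parse(token_vectors, weights):
    state = {"buffer_pos": 0, "actions": []}
    while state["buffer_pos"] < len(token_vectors):   # until the state is final
        features = get_features(state, token_vectors)
        probs = mlp(features, weights)
        action = ACTIONS[int(np.argmax(probs))]
        state["actions"].append(action)               # the chosen action updates the state
        state["buffer_pos"] += 1                      # each toy action consumes one token
    return state["actions"]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 64))                    # 64-dim contextualized token vectors
print(parse(vectors, rng.normal(size=(64, len(ACTIONS)))))
```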
-
It's a bit tricky to match up the implementation against the algorithm description on a line-by-line basis, because of the bulk computation.

For each token in the input batch, we precompute the pre-activation hidden layer values for each feature-position it could be in, where the feature-positions are things like "first item of the stack", "first item of the buffer", etc. This precomputation happens before we start stepping through the state.

Once we're stepping through the state, we calculate the feature tokens, and resolve them to the array indices into the precomputed values. We then sum together those items, and apply whatever activation we need to the result. This produces the hidden layer values that can then be used to predict the action scores. There's a separate, optimized implementation of this for prediction, and another implementation that's used during training.
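Very roughly, in numpy terms, it looks something like the sketch below. This is a simplified illustration with made-up names and shapes, not the actual spaCy/Thinc code (which also handles batching, padding and the backward pass differently):

```python
import numpy as np

# Illustrative sizes: contextual vectors from tok2vec, 6 feature positions
# (e.g. "first item of the buffer", "last entity", ...), a 64-dim hidden layer.
n_tokens, token_dim, n_feats, hidden_dim, n_actions = 50, 96, 6, 64, 5
rng = np.random.default_rng(0)
tokens = rng.normal(size=(n_tokens, token_dim))
W = rng.normal(size=(n_feats, token_dim, hidden_dim))
b = np.zeros(hidden_dim)
W_out = rng.normal(size=(hidden_dim, n_actions))

# Bulk precomputation, done once per batch before stepping through any state:
# for every token, its pre-activation hidden-layer contribution in every
# feature position it might end up in. Shape: (n_tokens, n_feats, hidden_dim).
precomputed = np.einsum("td,fdh->tfh", tokens, W)

def action_scores(feature_token_ids):
    """feature_token_ids[f] is the index of the token currently filling
    feature position f for this state, or -1 if that position is empty."""
    hidden = b.copy()
    for f, t in enumerate(feature_token_ids):
        if t >= 0:
            hidden += precomputed[t, f]   # just a lookup and a sum, no matmul here
    hidden = np.maximum(hidden, 0.0)      # whatever activation is used (ReLU here)
    return hidden @ W_out                 # scores for the possible transitions

# At each step of the state machine we only resolve which token fills which
# feature position and reuse the precomputed values.
print(action_scores([3, 4, 5, -1, 2, 0]))
```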