WikiNeural to .spacy Format #12135
Hi there, I'd like to convert the Russian portion of the WikiNeural dataset (https://github.com/Babelscape/wikineural) into .spacy format. Currently it's in one-token-per-line format, and I used the built-in `spacy convert` command to convert it. However, the F1 score after training is extremely low (in the 0.01-0.05 range), and I think the problem might be with how spaCy reconstructs the tokenized data into documents. Has anyone else had this issue? A preview of the data before conversion:
Note: the original file had 3 columns: Index, Token, Label.
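For concreteness, the one-token-per-line data is shaped roughly like this (the tokens below are made up for illustration, not taken from the actual Russian file). Splitting on tabs shows the extra index column that the token/label converters don't expect:

```python
# Illustrative three-column WikiNeural-style lines (index, token, IOB label).
# These tokens are invented for the example; the real file is Russian text.
sample = """0\tBarack\tB-PER
1\tObama\tI-PER
2\tvisited\tO
3\tMoscow\tB-LOC
"""

rows = [line.split("\t") for line in sample.strip().splitlines()]
# Each row has 3 fields; the NER converters expect token/label pairs,
# so the leading index column needs to be dropped before conversion.
print(rows[0])  # ['0', 'Barack', 'B-PER']
```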
Replies: 1 comment 1 reply
After removing the first column of token indices, training an NER model from this data seems to work as expected:
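A minimal preprocessing sketch for dropping that first column, assuming tab-separated columns; the filenames and the `ner` converter choice in the comment are assumptions to adapt to your setup:

```python
def strip_index_column(text: str) -> str:
    """Drop the leading index column from tab-separated token lines,
    preserving blank lines (sentence boundaries)."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            out.append("")  # keep sentence boundary
        else:
            out.append("\t".join(line.split("\t")[1:]))  # token and label only
    return "\n".join(out) + "\n"

# Illustrative input; the real file is Russian one-token-per-line data.
three_col = "0\tToken1\tO\n1\tToken2\tB-LOC\n\n0\tToken3\tO\n"
print(strip_index_column(three_col))
```

After writing the two-column file out, it can be converted with something like `python -m spacy convert train_2col.iob ./corpus -c ner -n 10` (the `-n 10` groups sentences into multi-sentence docs; the filenames here are hypothetical).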
If you're using a transformer model, also double-check that it's one that's appropriate for Russian. It looks like we don't have a Russian-specific default, so you may need to specify the model yourself.
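As an assumption (not stated in the thread), one way to point the pipeline at a multilingual model that covers Russian is to set the transformer name in the training config, e.g.:

```ini
# Hypothetical config fragment; "xlm-roberta-base" is one multilingual
# option that covers Russian, not an official spaCy default for it.
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"
```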