Error while training Spacy NER with Hindi dataset #11726

srikamalteja · 2022-10-31T16:48:01Z


Hi, I'm trying to train a spaCy NER transformer model on a Hindi dataset. The problem is that the F1, precision and recall scores do not change at all during training, as shown in the training table (please look at the attached image).

Steps to reproduce

!pip install -U spacy-transformers

I successfully converted the .conll files to .spacy files using the spacy convert command. After conversion, there are 30559 training documents and 3665 test documents.

Training - !python -m spacy train /final_config.cfg --output /content/output --paths.train train.spacy --paths.dev test.spacy --gpu-id 0

Final configuration file.

[paths]
  train = null
  dev = null
  vectors = null
  init_tok2vec = null
  
  [system]
  gpu_allocator = "pytorch"
  seed = 0
  
  [nlp]
  lang = "hi"
  pipeline = ["transformer","ner"]
  batch_size = 128
  disabled = []
  before_creation = null
  after_creation = null
  after_pipeline_creation = null
  tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
  
  [components]
  
  [components.ner]
  factory = "ner"
  incorrect_spans_key = null
  moves = null
  scorer = {"@scorers":"spacy.ner_scorer.v1"}
  update_with_oracle_cut_size = 100
  
  [components.ner.model]
  @architectures = "spacy.TransitionBasedParser.v2"
  state_type = "ner"
  extra_state_tokens = false
  hidden_width = 64
  maxout_pieces = 2
  use_upper = false
  nO = null
  
  [components.ner.model.tok2vec]
  @architectures = "spacy-transformers.TransformerListener.v1"
  grad_factor = 1.0
  pooling = {"@layers":"reduce_mean.v1"}
  upstream = "*"
  
  [components.transformer]
  factory = "transformer"
  max_batch_items = 50
  set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
  
  [components.transformer.model]
  @architectures = "spacy-transformers.TransformerModel.v3"
  name = "roberta-base"
  mixed_precision = false
  
  [components.transformer.model.get_spans]
  @span_getters = "spacy-transformers.strided_spans.v1"
  window = 128
  stride = 96
  
  [components.transformer.model.grad_scaler_config]
  
  [components.transformer.model.tokenizer_config]
  use_fast = false
  
  [components.transformer.model.transformer_config]
  
  [corpora]
  
  [corpora.dev]
  @readers = "spacy.Corpus.v1"
  path = ${paths.dev}
  max_length = 0
  gold_preproc = false
  limit = 0
  augmenter = null
  
  [corpora.train]
  @readers = "spacy.Corpus.v1"
  path = ${paths.train}
  max_length = 0
  gold_preproc = false
  limit = 0
  augmenter = null
  
  [training]
  accumulate_gradient = 3
  dev_corpus = "corpora.dev"
  train_corpus = "corpora.train"
  seed = ${system.seed}
  gpu_allocator = ${system.gpu_allocator}
  dropout = 0.1
  patience = 1600
  max_epochs = 0
  max_steps = 20000
  eval_frequency = 10
  frozen_components = []
  annotating_components = []
  before_to_disk = null
  
  [training.batcher]
  @batchers = "spacy.batch_by_padded.v1"
  discard_oversize = false
  size = 50
  buffer = 256
  get_length = null
  
  [training.logger]
  @loggers = "spacy.ConsoleLogger.v1"
  progress_bar = true
  
  [training.optimizer]
  @optimizers = "Adam.v1"
  beta1 = 0.9
  beta2 = 0.999
  L2_is_weight_decay = true
  L2 = 0.01
  grad_clip = 1.0
  use_averages = false
  eps = 0.00000001
  
  [training.optimizer.learn_rate]
  @schedules = "warmup_linear.v1"
  warmup_steps = 250
  total_steps = 20000
  initial_rate = 0.00005
  
  [training.score_weights]
  ents_f = 1.0
  ents_p = 0.0
  ents_r = 0.0
  ents_per_type = null
  
  [pretraining]
  
  [initialize]
  vectors = ${paths.vectors}
  init_tok2vec = ${paths.init_tok2vec}
  vocab_data = null
  lookups = null
  before_init = null
  after_init = null
  
  [initialize.components]
  
  [initialize.tokenizer]

I have modified the default values of a few parameters in the config file because I was getting a CUDA out of memory exception.

lang = "hi"
max_batch_items = 50
eval_frequency = 10
size = 50
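The config above pairs eval_frequency = 10 with a warmup_linear.v1 schedule (warmup_steps = 250), so the first rows of the training table are produced at very small learning rates. The following is a minimal sketch of the usual warmup-then-linear-decay shape, written from the parameter names in the config and not copied from the library; the function name warmup_linear_rate is hypothetical. It is only meant to show how the three values interact, not to explain the zero scores.

```python
def warmup_linear_rate(step, initial_rate=0.00005, warmup_steps=250, total_steps=20000):
    """Learning rate at `step`: linear warmup to initial_rate, then linear decay to zero.

    Assumption: this mirrors the generic warmup/decay shape implied by the
    config values; it is an illustration, not the library's own code.
    """
    if step < warmup_steps:
        # Ramp up from 0 to initial_rate over the warmup steps.
        factor = step / max(1, warmup_steps)
    else:
        # Decay linearly so the rate reaches 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return initial_rate * factor


# Evaluation steps 0 and 10 fall at the very start of the warmup phase.
for step in (0, 10, 250, 10000, 20000):
    print(step, warmup_linear_rate(step))
```

Under these assumptions, the rate only reaches its configured value at step 250, so scores logged in the first few evaluations say little about whether training will eventually work.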

Info about spaCy

- spaCy version: 3.4.2
- Notebook: Google Colab
- Platform: Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.15
- Pipelines: en_core_web_sm (3.4.1)

Answered by polm


polm · 2022-11-01T03:37:40Z


Sorry you're having trouble with this. To make it easier for us to help you, please do not post screenshots of terminal output, which are hard to read, and please read the Markdown guide on formatting code blocks.

Scores that are all zero usually indicate a data problem. Did you try using spacy debug data to check whether there were problems with your data?

Is your text data Romanized? If not, then the roberta-base Transformer model won't work well, as none of your words will be in the vocabulary. You could try using a Hindi Transformer model (just change the name in the config) or use a non-Transformer tok2vec (that's probably easiest).

Let us know if those don't help.
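As a rough illustration of "just change the name in the config": spaCy's train command accepts dotted config overrides on the command line, so the transformer name can be swapped without editing the file. The sketch below only assembles the argument list from the command posted above; the helper build_train_command is hypothetical, and ai4bharat/indic-bert is used as the example name because it is the model the author reports using later in this thread.

```python
import shlex


def build_train_command(config_path, output_dir, train_path, dev_path,
                        transformer_name=None, gpu_id=0):
    """Assemble the `spacy train` argument list used in this thread.

    Hypothetical helper for illustration; it does not run training itself.
    """
    cmd = [
        "python", "-m", "spacy", "train", config_path,
        "--output", output_dir,
        "--paths.train", train_path,
        "--paths.dev", dev_path,
        "--gpu-id", str(gpu_id),
    ]
    if transformer_name is not None:
        # A dotted override replaces the matching value from the config file,
        # here [components.transformer.model] name.
        cmd += ["--components.transformer.model.name", transformer_name]
    return cmd


cmd = build_train_command(
    "/final_config.cfg", "/content/output", "train.spacy", "test.spacy",
    transformer_name="ai4bharat/indic-bert",
)
# Print a shell-safe version of the command for copy/paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

The same list can be passed to subprocess.run if you prefer to launch training from Python instead of a notebook shell cell.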

2 replies

srikamalteja Nov 1, 2022
Author

Thanks for your reply. I ran spacy debug data and everything seems fine. Here is the output.

Output

=============================== Training stats ===============================
Language: hi
Training pipeline: transformer, ner
24881 training docs
8359 evaluation docs
⚠ 16 training examples also in evaluation data

============================== Vocab & Vectors ==============================
ℹ 509637 total word(s) in the data (37768 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 3 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
✔ No entities crossing sentence boundaries

================================== Summary ==================================
✔ 6 checks passed
⚠ 1 warning
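The one warning above concerns 16 training examples that also appear in the evaluation data. A minimal sketch of how such overlap could be located and removed, assuming the examples are available as plain text strings; the function names and the toy sentences are hypothetical, and debug data may use a different matching criterion than raw text equality.

```python
def find_overlap(train_texts, dev_texts):
    """Return texts present in both splits, ignoring surrounding whitespace."""
    dev_seen = {text.strip() for text in dev_texts}
    overlap = []
    for text in train_texts:
        key = text.strip()
        # Keep first-seen order and report each shared text once.
        if key in dev_seen and key not in overlap:
            overlap.append(key)
    return overlap


def drop_overlap(train_texts, dev_texts):
    """Return the dev texts with anything also found in train removed."""
    shared = set(find_overlap(train_texts, dev_texts))
    return [text for text in dev_texts if text.strip() not in shared]


# Toy data, not taken from the author's corpus.
train = ["sentence one", "sentence two ", "sentence three"]
dev = ["sentence two", "sentence four"]

shared = find_overlap(train, dev)
clean_dev = drop_overlap(train, dev)
print(shared, clean_dev)
```

With only 16 duplicates among several thousand evaluation documents, this affects how trustworthy the scores are, not whether they are zero.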

Note: I haven't Romanized the text, as I'm not sure what that means. Are there any links on how to Romanize Hindi text?

Also, below is the sample of my Hindi dataset

परभूपुर B-LOC भारत B-LOC के O उत्तर B-LOC प्रदेश I-LOC राज्य O के O इलाहाबाद B-LOC जिले O के O हंडिया B-LOC प्रखण्ड O में O स्थित O एक O गाँव O है। O
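To make the sample easier to check by eye, here is a small sketch that splits the line above into token/tag pairs and groups the B-/I- tags into entities. It assumes the flattened layout shown in this post (whitespace-separated, alternating token and tag); a real CoNLL file has one token per line, and the function names are hypothetical.

```python
def parse_flat_conll(line):
    """Split 'token tag token tag ...' into a list of (token, tag) pairs."""
    parts = line.split()
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating token/tag pairs")
    return list(zip(parts[0::2], parts[1::2]))


def bio_to_entities(pairs):
    """Group B-/I- tagged tokens into (entity_text, label) tuples."""
    entities = []
    tokens, label = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the currently open entity.
            tokens.append(token)
        else:
            # "O", or an I- tag with no matching open entity (dropped here).
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


sample = (
    "परभूपुर B-LOC भारत B-LOC के O उत्तर B-LOC प्रदेश I-LOC राज्य O के O "
    "इलाहाबाद B-LOC जिले O के O हंडिया B-LOC प्रखण्ड O में O स्थित O एक O गाँव O है। O"
)
pairs = parse_flat_conll(sample)
entities = bio_to_entities(pairs)
for text, label in entities:
    print(label, text)
```

If the entities printed here look right but training still scores zero, the annotation format itself is probably not the issue.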

I have tried both a Hindi transformer model and a non-transformer model (tok2vec), but I still get only zeros after every training step.

Thanks.

polm Nov 2, 2022

Thank you for providing the output of debug data, that does look OK.

Note: I haven't Romanized the text, as I'm not sure what that means. Are there any links on how to Romanize Hindi text?

Romanizing means writing your text in the Latin alphabet rather than in another script such as Devanagari. I am not suggesting that you do this. However, the roberta-base model you're using as a base was trained only on English. While it kind of works with other languages that use Latin script, with languages that use a non-Latin script it may not work at all, as every word can be outside the model's vocabulary.

I would recommend you try changing roberta-base to a model with support for Hindi. A quick search shows roberta-hindi exists, so you could try that.
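A quick way to confirm which script a corpus is actually written in is to look at the Unicode names of its characters. This is a minimal sketch using only the standard library; the function name script_share is hypothetical, and it checks the script only, not whether a given model's vocabulary covers the text.

```python
import unicodedata


def script_share(text, script_prefix="DEVANAGARI"):
    """Fraction of letters and combining marks whose Unicode name starts with script_prefix."""
    # Only letters (L*) and marks (M*) count; digits, spaces and punctuation are ignored.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(
        1 for ch in relevant
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(relevant)


print(script_share("भारत"))              # Devanagari text
print(script_share("Bharat"))            # Romanized text
print(script_share("Bharat", "LATIN"))   # same text, checked against Latin
```

A share close to 1.0 for DEVANAGARI means the text is not Romanized, which is the situation where an English-only base model is a poor fit.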

srikamalteja · 2022-11-14T17:37:43Z


Thanks @polm, your suggestion helped me get something working. First I tried Tok2Vec, which worked and gave a decent F-score of 0.76. Then I tried the Hindi transformer ai4bharat/indic-bert, adjusting the batch size and other data-related training parameters in the config file, and this time I got a better F-score of 0.83.
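For anyone comparing these numbers, this is a minimal sketch of how precision, recall and F-score relate for exact-match entity spans. It is the textbook definition under the assumption that an entity counts only when start, end and label all match; the function name prf and the toy spans are hypothetical, and spaCy's own scorer may handle edge cases differently.

```python
def prf(gold_spans, predicted_spans):
    """Precision, recall and F-score over exact-match (start, end, label) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        # No correct predictions at all: every score is zero.
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy spans as (start_token, end_token, label); not taken from the author's data.
gold = [(0, 1, "LOC"), (1, 2, "LOC"), (3, 5, "LOC")]
pred = [(0, 1, "LOC"), (3, 4, "LOC")]

print(prf(gold, pred))   # one of two predictions is correct
print(prf(gold, []))     # a model that predicts nothing scores zero everywhere
```

The second call shows the pattern from the start of this thread: when the model predicts no correct entities, all three scores stay at zero together.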

Test results:

Thanks again.

1 reply

polm Nov 15, 2022

Glad you got it working, thanks for letting us know it worked OK!
