Query on Multi language model #11984
Replies: 1 comment
You can use a config like the one you have for input in multiple languages, and you actually don't even have to use the xx language code. The question is whether that'll do what you want. The main thing the language code setting does is change the tokenizer. The xx tokenizer is the same as the English tokenizer, but without any of the tokenizer exceptions and so on, so the output will be quite different sometimes. It can also change a number of lexical features.

Note there are potential issues with training on input in multiple languages. For example, a word with different meanings in each language might confuse the model, or if the languages share too few words the model might be over-extended and have trouble learning. xlm-roberta is trained with these issues in mind, but you still might expect decreased performance compared to a monolingual transformer.

Also note it probably won't work well for languages that really need a custom tokenizer because they don't use spaces to separate words, like Japanese.
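A quick way to see the tokenizer difference for yourself (a minimal sketch, assuming spaCy is installed; the sample sentence is just an illustration):

```python
import spacy

# Blank pipelines differ only in their language defaults, including the tokenizer.
nlp_en = spacy.blank("en")  # English tokenizer, with exceptions for contractions etc.
nlp_xx = spacy.blank("xx")  # multi-language tokenizer, no English-specific exceptions

text = "Don't tokenize this the same way."
print([t.text for t in nlp_en(text)])
print([t.text for t in nlp_xx(text)])
# The two token lists can differ, since "xx" drops the English tokenizer exceptions.
```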
Hi,
My query is: can I train a blank spaCy model to support multiple languages using the xlm-roberta-base pretrained model? If so, is the config below correct?

Complete sample config file:
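(The actual config is not shown here; as a minimal sketch, the relevant sections of such a config might look like the following, in the spacy-transformers config style. The pipeline layout and values are illustrative assumptions, not a complete or verified file.)

```ini
# Illustrative fragment only: the remaining sections (training, ner component,
# etc.) would follow the output of spaCy's quickstart / `spacy init config`.
[nlp]
lang = "xx"
pipeline = ["transformer","ner"]

[components.transformer]
factory = "transformer"

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "xlm-roberta-base"
```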