Releases · stanfordnlp/stanza

26 Feb 06:57

AngledLuffa

v1.11.1

516b071

v1.11.1 Latest

Latest

New annotator!

Add a connection to the Morpheme Segmentation processor: 450ca74 #1527 https://github.com/TheWelcomer/MorphSeg Thank you @TheWelcomer !

Interface improvements

Use platformdirs to put the downloaded models in the system cache directory by default. Thank you @McSinyx ! #1541 Note that this means if you have not set a default path, your existing ~/stanza_resources will be obsolete and you will now have models (with a version number) in .cache/stanza

New / Updated Models

Add Abkhaz models from the fasttextwiki word vectors and the abnc UD dataset. This involved making the tagger & depparse train finetuned word vectors with a lower cutoff for small pretrains, as the fasttextwiki vectors were quite small for Abkhaz. #485 49f97a4 76f3335 Can add more test-only UD datasets on request, but the results seem low enough that we aren't doing it by default
ANG NER model downloaded from here: https://github.com/dmetola/Old_English-OEDT/tree/main 714072d

Bugfixes

Patch the depparse to not produce <PAD> as a relation type. A more principled fix would be to rebuild all the models, but this will work for now 284e9b4

Model Improvements

When training smaller POS datasets, finetune more words if the embedding is small. Makes it more likely that a small embedding is useful, since we can cover everything in the training set. 76f3335
Process !!! and ??? the same as ! and ? in the pos and depparse, addressing the downstream errors caused by unknown strings of punct. 5fd1d50 #1532
Train the tokenizer to recognize non-ascii variants of ! and ? with augmentation, addressing the tokenizer errors found for punct that doesn't exist in the training data #1532 d69c33f
Modify the depparse model to scale scores so that only one root is ever chosen. See https://aclanthology.org/2020.emnlp-main.390.pdf https://aclanthology.org/2021.emnlp-main.823/ 88c0cf6 c50fa5c

Other improvements

Fix random typos 4af05f6 Thanks @thecaptain789
Add an early termination for coref option, as requested in #1531 1d30e90
Update the semgrex client to allow for results to come back in non-sentence order, allowing for future addition of sort operators 46eb340
Fix a minor memory waste f8d62fe
Use the UD udtools package instead of having our own copy of the scoring script b20cd3a
As requested in #1523, allow for speaker information passing to the coref annotator: c4201b9 dc50998 1df3f8b
Add a convenience method to retag a conllu file in a Pipeline. call pipe.process_conllu(text) where text is a conllu file. 74fbdc4

Contributors

McSinyx, TheWelcomer, and thecaptain789

Assets 2

05 Oct 06:45

AngledLuffa

v1.11.0

156f86b

v1.11.0

Training upgrades

Should now be possible to train all annotators on Windows: stanfordnlp/stanza-train#20 #1439 The issue was twofold: a perl script shell call (which could actually be installed, but was annoying for non-perl users) and an overreliance on temp files, which can be opened twice in Unix but not in Windows. Fixed in 2677e77 d5c7b7f #1514

Model upgrades

Tokenizer can support the pretrained charlm now. This significantly improves the MWT performance on Hebrew, for example. #1511
Building tokenizers with pretrained charlm exposed a possible issue with the tokenizer including spaces when tokenizing when an MWT is split across two words. The effect occurred in Hebrew, but an English example would be wo n't tokenized as a single token with embedded space. Augmenting the training to enforce word splits across those spaces fixed the issue. 52cea78
use PackedSequence for the tokenizer - is slower, but results are stable when using inputs of different lengths: 4433e83 #1472
If a Tokenizer training set consistently has spaces between the ends of words and punctuations, the resulting trained model may not properly recognize the same text with periods at the end of the word. For example, this is a test . vs this is a test. Reported in #1504 Fixed for VI by 6878d8e
Coref now includes a zeros predictor - this predicts when a mention for certain datasets (such as Spanish) is a pro-drop mention. This behavior occurs by adding an empty node to the sentence. It can be disabled with the coref_use_zeros=False flag to the Pipeline. #1502

Model improvements

Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https://aclanthology.org/2025.udw-1.11/
Tamil coreference model from KBC
update English lemmatizer with more verbs and ADJ from Prof. Lapalme
also, French lemmatizer changes with corrections from Prof. Lapalme
create a German lemmatizer using GSD data and a set of ADJ from Wiktionary
add GRC models mixed with a copy of the data with the diacritics stripped. because those work worse on GRC with diacritics, the originals are still the default: 5beca58
add a Thai TUD dataset from https://github.com/nlp-chula/TUD (not yet included in UD): bca078c
NER model for ANG: 68a56aa https://github.com/dmetola/Old_English-OEDT/tree/main
NER models for Hindi, Telegu, and Urdu: #1469, model built from https://github.com/ltrc/IL-NER, added in a4902df

Other interface improvements

Fix conparser SyntaxWarning: #1513 thanks to @orenl
improve efficiency of reading conllu documents: f15f0bc
sort CoNLLU features when outputting a doc, as is standard: aa20fbb
semgrex interface improvements: search all files, only output failed matches, process all documents at once
turn coref max_train_len into a parameter: 1f98d8f #1465
allow for combined depparse models with multiple training files in a zip file (easier to mix training data): be94ac6
lemmatizer can skip blank lemmas (useful when training using partially complete lemma data): 7c34714
if using pretokenized text in the NER, try to use the token text to extract the text (previously would crash): ab249f6
don't retokenize pretokenized sentences: #1466 #1464
remove stray test output files: 2e4735a thanks to @otakutyrant

Constituency parser

relative attention layer, similar to that used in https://aclanthology.org/2023.findings-emnlp.25/ #1474
output some basic analysis of errors: 5503c4c
current best conparser published at SyntaxFest 2025: https://aclanthology.org/2025.iwpt-1.4/

Package dependency updates

remove verbose from ReduceLROnPlateau: 1015b6b thanks to @otakutyrant
update usage of xml.etree.ElementTree to match updated python interface: 7ca8750 thanks to @otakutyrant
cover up a jieba warning - package has not been updated in many years, not likely to be updated to fix deprecation errors any time soon. 0afdb61 thanks to @otakutyrant
drop support for Python 3.8: 6420c3d thanks to @otakutyrant
update tomli version requirement, #1444 thanks to @BLKSerene

Contributors

orenl, BLKSerene, and otakutyrant

Assets 2

29 Dec 06:54

AngledLuffa

v1.10.1

af3d42b

v1.10.1 - rebuild with UD 2.15

In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish. We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.

Other notable changes:

Include a contextual lemmatizer in English for 's -> be or have in the default_accurate package. Also built is a HI model. Others potentially to follow. Now with fewer bugs at startup. #1422
Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
Pytorch compatibility: set weights_only=True when loading models. #1430 #1429
augment MWT tokenization to accommodate unexpected ' characters, including " used in "s - #1437 #1436
when training the lemmatizer, take advantage of CorrectForm annotations in the UD treebanks dbdf429
add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
add VLSP 2023 constituency dataset: 1159d0d

Bugfixes:

raise_for_status earlier when failing to download something, so that the proper error gets displayed.
Thank you @pattersam #1432
Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
missing text for a Document does not cause the NER model to crash: 0732628 #1428
tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423

Contributors

pattersam

Assets 2

23 Dec 04:27

AngledLuffa

v1.10.0

ad17b27

v1.10.0 - rebuild with UD 2.15

Other notable changes:

Include a contextual lemmatizer in English for 's -> be or have in the default_accurate package. Also built is a HI model. Others potentially to follow. #1422
Upgrade the FR NER model to a gold edited version of WikiNER: https://huggingface.co/datasets/danrun/WikiNER-fr-gold ad1f938
Pytorch compatibility: set weights_only=True when loading models. #1430 #1429
augment MWT tokenization to accommodate unexpected ' characters, including " used in "s - #1437 #1436
when training the lemmatizer, take advantage of CorrectForm annotations in the UD treebanks dbdf429
add hand-lemmatized French verbs and English words to the "combined" lemmatizers, thanks to Prof. Lapalme: 99f7038
add VLSP 2023 constituency dataset: 1159d0d

Bugfixes:

raise_for_status earlier when failing to download something, so that the proper error gets displayed.
Thank you @pattersam #1432
Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: 53081c2
reset the start/end character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: 1a36efb #1436
similar to the start/end char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: 215c69e
missing text for a Document does not cause the NER model to crash: 0732628 #1428
tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: f59ccd8 #1423

Contributors

pattersam

Assets 2

12 Sep 23:17

AngledLuffa

v1.9.2

539760c

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
actually include the visualization: #1421 thank you @bollwyvl

Contributors

bollwyvl

Assets 2

12 Sep 19:40

AngledLuffa

v1.9.1

174768a

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romantic languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0
actually include the visualization: #1421 thank you @bollwyvl

Contributors

bollwyvl

Assets 2

12 Sep 07:23

AngledLuffa

v1.9.0

b999102

Multilingual Coref

multilingual coref!

Added models which cover several different languages: one for combined Germanic and Romantic languages, one for the Slavic languages available in UDCoref #1406

new features

streamlit visualizer for semgrex/ssurgeon #1396
updates to the constituency parser ensemble #1387
accuracy improvements to the IN_ORDER oracle #1391
Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words. Currently for EN and HE #1417 #1419
download_method=None now turns off HF downloads as well, for use in instances with no access to internet #1408 #1399

new models

Spanish combined models #1395
Add IACLT knesset to the HE combined models
NER based on IACLT
XCL (Classical Armenian) models with word vectors from Caval

bugfixes

update tqdm usage to remove some duplicate code: #1413 3de69ca
long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: #1410
Occasionally train the tokenizer with the sentence final punctuation of a batch removed. This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation. This was also related to the Spanish tokenization issue 56350a0

Assets 2

0 Join discussion

20 Apr 18:58

AngledLuffa

v1.8.2

6e442a6

Old English, MWT improvements, and better memory management of Peft

Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.

Old English

Add Old English (ANG) annotation! Thank you to @dmetola #1365

MWT improvements

Fix words ending with -nna split into MWT stanfordnlp/handparsed-treebank@2c48d40 #1366
Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks) #1371 #1378
Mark start_char and end_char on an MWT if it is composed of exactly its subwords 2384089 #1361

Peft memory management

Previous versions were loading multiple copies of the transformer in order to use adapters. To save memory, we can use Peft's capacity to attach multiple adapters to the same transformer instead as long as they have different names. This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models. huggingface/peft#1523 #1381 #1384

Other bugfixes and minor upgrades

Fix crash when trying to load previously unknown language #1360 381736f
Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched: d180ae0 #1367
Try to avoid OOM in the POS in the Pipeline by reducing its max batch length 4271813
Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka) 597d48f

Other upgrades

Add * to the list of functional tags to drop in the constituency parser, helping Icelandic annotation 57bfa8b #1356 (comment)
Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: 4048cae 15b136b
Add a constituency model for German 7a4f48c 86ddaab #1368

Contributors

Jemoka and dmetola

Assets 2

01 Mar 06:47

AngledLuffa

v1.8.1

c2d72bd

PEFT Integration (with bugfixes)

Integrating PEFT into several different annotators

We integrate PEFT into our training pipeline for several different models. This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the default_accurate model.

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results #1320
Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. #1335
NER also trained with peft: unfortunately, no consistent improvements to scores #1336
depparse includes peft: no consistent improvements yet #1337 #1344
Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser #1341
Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. #1347
Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. #1348
Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. #1346 #1345

Features

Include SpacesAfter annotations on words in the CoNLL output of documents: #1315 #1322
Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
wandb support for coref #1338
Coref annotator breaks length ties using POS if available #1326 c4c3de5

Bugfixes

Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
Coref training rounding error #1342
Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice #1354
V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. #1350 #1294
Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
Clarify error when a language is only partially handled: da01644 #1310

Additional 1.8.1 Bugfixes

Older POS models not loaded correctly... need to use .get() 13ee3d5 #1357
Debug logging for the Constituency retag pipeline to better support someone working on Icelandic 6e2520f #1356
device arg in MultilingualPipeline would crash if device was passed for an individual Pipeline: 44058a0

Contributors

ider-zh and sterliakov

Assets 2

25 Feb 07:38

AngledLuffa

v1.8.0

17eb6fc

PEFT integration

Integrating PEFT into several different annotators

The biggest gains observed are with the constituency parser and the sentiment classifier.

Previously, the default_accurate package used transformers where the head was trained but the transformer itself was not finetuned.

Model improvements

POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results #1320
Sentiment trained with peft on the transformer: noticeably improves results for each model. SST scores go from 68 F1 w/ charlm, to 70 F1 w/ transformer, to 74-75 F1 with finetuned or Peft finetuned transformer. #1335
NER also trained with peft: unfortunately, no consistent improvements to scores #1336
depparse includes peft: no consistent improvements yet #1337 #1344
Dynamic oracle for top-down constituent parser scheme. Noticeable improvement in the scores for the topdown parser #1341
Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies. Example improvement, 87.01 to 88.11 on ID_ICON dataset. #1347
Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used. Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset. #1348
Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data. Typical example would be split email addresses in the EWT training set. #1346 #1345

Features

Include SpacesAfter annotations on words in the CoNLL output of documents: #1315 #1322
Lemmatizer operates in caseless mode if all of its training data was caseless. Most relevant to the UD Latin treebanks. #1331 #1330
wandb support for coref #1338
Coref annotator breaks length ties using POS if available #1326 c4c3de5

Bugfixes

Using a proxy with download_resources_json was broken: #1318 #1317 Thank you @ider-zh
Fix deprecation warnings for escape sequences: #1321 #1293 Thank you @sterliakov
Coref training rounding error #1342
Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice #1354
V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits. No idea if this actually produces reasonable results for words after the token limit. #1350 #1294
Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: #1333 #1339 f1fbaaa
Clarify error when a language is only partially handled: da01644 #1310

Contributors

ider-zh and sterliakov

Assets 2

0 Join discussion

Releases: stanfordnlp/stanza

v1.11.1

New annotator!

Interface improvements

New / Updated Models

Bugfixes

Model Improvements

Other improvements

Contributors

Uh oh!

v1.11.0

Training upgrades

Model upgrades

Model improvements

Other interface improvements

Constituency parser

Package dependency updates

Contributors

Uh oh!

v1.10.1 - rebuild with UD 2.15

Contributors

Uh oh!

v1.10.0 - rebuild with UD 2.15

Contributors

Uh oh!

Multilingual Coref

multilingual coref!

new features

new models

bugfixes

Contributors

Uh oh!

Multilingual Coref

multilingual coref!

new features

new models

bugfixes

Contributors

Uh oh!

Multilingual Coref

multilingual coref!

new features

new models

bugfixes

Uh oh!

Old English, MWT improvements, and better memory management of Peft

Old English

MWT improvements

Peft memory management

Other bugfixes and minor upgrades

Other upgrades

Contributors

Uh oh!

PEFT Integration (with bugfixes)

Integrating PEFT into several different annotators

Model improvements

Features

Bugfixes

Additional 1.8.1 Bugfixes

Contributors

Uh oh!

PEFT integration

Integrating PEFT into several different annotators

Model improvements

Features

Bugfixes

Contributors

Uh oh!