Open
Description
The FlairTagger (and possibly the CRFTagger) skips empty documents, so the list of output documents is shorter than the list of input documents.
We should either allow empty documents, or raise a warning stating that empty strings should not be passed.
Reproducible example
from pprint import pprint
from deidentify.base import Document
from deidentify.taggers import FlairTagger
from deidentify.tokenizer import TokenizerFactory
documents = [
    Document(name="doc_01", text=""),
    Document(name="doc_02", text="Stukje tekst met de naam Jan Jansen."),
    Document(name="doc_03", text=""),
]
tokenizer = TokenizerFactory().tokenizer(corpus="ons", disable=("tagger", "ner"))
tagger = FlairTagger(
    model="model_bilstmcrf_ons_fast-v0.2.0", tokenizer=tokenizer, verbose=False
)
annotated_docs = tagger.annotate(documents)
print(f"len(documents) = {len(documents)}")
print(f"len(annotated_docs) = {len(annotated_docs)}")
pprint(annotated_docs)
Actual:
len(documents) = 3
len(annotated_docs) = 1
[Document(name=doc_02). Chars: 36, Annotations: 1]
Expected:
len(documents) = 3
len(annotated_docs) = 3
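Possible workaround until this is fixed: a minimal sketch of a wrapper that annotates only the non-empty documents and puts the results back at their original positions, so the output length matches the input length. It also raises a warning, in line with the second option above. The helper name `annotate_keep_empty` and its parameters are hypothetical and not part of deidentify. The sketch assumes documents expose a `.text` attribute and that the tagger returns results in input order, and it treats whitespace-only text as empty, which is an assumption about the tagger's behaviour.

```python
import warnings
from typing import Callable, List, Sequence, TypeVar

D = TypeVar("D")


def annotate_keep_empty(
    documents: Sequence[D],
    annotate: Callable[[List[D]], List[D]],
    get_text: Callable[[D], str] = lambda doc: doc.text,
) -> List[D]:
    """Hypothetical helper: annotate non-empty documents, keep input order."""
    # Positions of documents that contain text. Whitespace-only text is
    # treated as empty here (an assumption, not confirmed tagger behaviour).
    non_empty_idx = [i for i, doc in enumerate(documents) if get_text(doc).strip()]

    skipped = len(documents) - len(non_empty_idx)
    if skipped:
        warnings.warn(
            f"{skipped} empty document(s) were not annotated and are "
            "returned unchanged."
        )

    annotated = annotate([documents[i] for i in non_empty_idx])
    if len(annotated) != len(non_empty_idx):
        raise ValueError(
            "Tagger returned a different number of documents than it was given."
        )

    # Start from the inputs so empty documents keep their position,
    # then overwrite the non-empty slots with the annotated results.
    result = list(documents)
    for i, doc in zip(non_empty_idx, annotated):
        result[i] = doc
    return result
```

With the example above, this would be called as `annotated_docs = annotate_keep_empty(documents, tagger.annotate)`. Empty documents come back unannotated, which may or may not be the behaviour we want from the tagger itself.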