Description
System Info
- `transformers` version: 5.0.0.dev0
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
- Python version: 3.12.3
- `huggingface_hub` version: 1.3.2
- `safetensors` version: 0.7.0
- `accelerate` version: 1.12.0
- Accelerate config: not installed
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- GPU type: NVIDIA L4
- NVIDIA driver version: 550.90.07
- CUDA version: 12.4
Who can help?
@zucchini-nlp (multimodal model)
@ArthurZucker (tokenizer)
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
NER use case:

```python
from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")

words = ["Total", "Amount", ":", "$1,234.56"]
boxes = [[100, 200, 300, 250], [310, 200, 450, 250], [460, 200, 480, 250], [490, 200, 650, 250]]
word_labels = [0, 0, 0, 1]

try:
    encoding = tokenizer(words, boxes=boxes, word_labels=word_labels)
    print(encoding["labels"])
except Exception as e:
    print(e)
```

Batched training data prep with truncation/padding:
```python
from transformers import LayoutLMv2Processor
from datasets import load_dataset
import textwrap

try:
    processor = LayoutLMv2Processor.from_pretrained(
        "microsoft/layoutlmv2-base-uncased",
        apply_ocr=False,
    )
    dataset = load_dataset("nielsr/funsd", split="train")
    images = [img.convert("RGB") for img in dataset["image"]]
    words = list(dataset["words"])
    boxes = list(dataset["bboxes"])
    word_labels = list(dataset["ner_tags"])
    encoding = processor(
        images,
        words,
        boxes=boxes,
        word_labels=word_labels,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    print(encoding["input_ids"].shape)
except Exception as e:
    print("\n".join(textwrap.wrap(str(e), width=160)))
```

`LayoutLMv2Tokenizer` crashes with an `AttributeError` when `word_labels` is passed for NER token classification. In the second use case, calling the processor with `padding="max_length"` and `truncation=True` raises a downstream `ValueError` asking that those very flags be set, even though both are set correctly. (More details in the PR; the screenshots there show what happens after the first attribute issue is fixed but before the second fix is applied.)
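For reference, the word-label-to-token-label alignment the tokenizer is expected to perform can be sketched in plain Python. This is a minimal illustration, not the library's implementation; the `word_ids` list below is made up for the example (fast tokenizers expose such a mapping via `BatchEncoding.word_ids()`):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label only the first subword of each word; mask the rest with ignore_index."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            labels.append(ignore_index)  # special token or continuation subword
        else:
            labels.append(word_labels[word_id])
        previous = word_id
    return labels

# "$1,234.56" (word 3) splits into several subwords; only the first keeps its label.
# Hypothetical [CLS] ... [SEP] layout:
word_ids = [None, 0, 1, 2, 3, 3, 3, None]
print(align_labels(word_ids, [0, 0, 0, 1]))
# -> [-100, 0, 0, 0, 1, -100, -100, -100]
```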
Current Repro Output:
Expected behavior
- `encoding["labels"]` should return a list in which subword tokens are masked with the default `ignore_index` (-100) of `nn.CrossEntropyLoss`.
- `encoding["input_ids"].shape` should return the expected `torch.Size()`.
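The effect of the -100 mask can be illustrated without torch: masked positions contribute nothing to the loss average. This is a simplified sketch of what `nn.CrossEntropyLoss(ignore_index=-100)` does, not the torch implementation, and the log-probability numbers are illustrative only:

```python
import math

def masked_cross_entropy(log_probs, labels, ignore_index=-100):
    """Average negative log-likelihood, skipping positions labeled ignore_index."""
    total, count = 0.0, 0
    for lp, label in zip(log_probs, labels):
        if label == ignore_index:
            continue  # masked subword / special token: excluded from the loss
        total += -lp[label]
        count += 1
    return total / count

# Two classes; per-position log-probabilities (illustrative numbers).
log_probs = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.5), math.log(0.5)],
]
labels = [0, 1, -100]  # last position masked, so it does not affect the loss
print(masked_cross_entropy(log_probs, labels))
```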