Custom pipeline with an MS LMv3 model #142

frivas-at-navteca · 2023-04-25T09:39:56Z


Hello everyone, first post here and a newbie with the library.

First of all, I want to thank you, Janis, for creating this library. I also watched the video in which you talk about it. One thing I noticed is how humbly you spoke about this huge and amazing amount of work you have put out there for everyone to use. This is definitely an underrated library, and I am so happy I discovered it.

I have been working on a custom pipeline and, as a newbie, I have faced a few challenges. One of them is using HFLayoutLmv3TokenClassifier by following the example provided in the class docstring.

The categories list in that example does not match the required structure: the example passes a list instead of a Mapping[str, TypeOrStr]. It is true that categories is optional; however, if it is not provided, categories_bio and categories_semantics have to be provided instead. This is what I get when I check the categories from the training data:

from datasets import load_dataset
dataset = load_dataset("nielsr/funsd")
labels = dataset['train'].features['ner_tags'].feature.names
# ['O', 'B-HEADER', 'I-HEADER', 'B-QUESTION', 'I-QUESTION', 'B-ANSWER', 'I-ANSWER']
label = {idx: name for idx, name in enumerate(labels)}
# {0: 'O', 1: 'B-HEADER', 2: 'I-HEADER', 3: 'B-QUESTION', 4: 'I-QUESTION', 5: 'B-ANSWER', 6: 'I-ANSWER'}

This is similar to the example in the docstring, but it is not the expected format.

I have also checked an example from the getting started notebook:

categories=dd.ModelCatalog.get_profile("fasttext/lid.176.bin").categories

I get something like this:

{'__label__en': <Languages.english>,
 '__label__ru': <Languages.russian>,
 '__label__de': <Languages.german>,
 '__label__fr': <Languages.french>,
 '__label__it': <Languages.italian>,
 '__label__ja': <Languages.japanese>,
 '__label__es': <Languages.spanish>,
 '__label__ceb': <Languages.cebuano>,
 '__label__tr': <Languages.turkish>,
 '__label__pt': <Languages.portuguese>,
 '__label__uk': <Languages.ukrainian>,
....

I am not entirely sure how to get this working, or at least how to create the categories data that is expected. May I kindly ask you to clarify what data I should provide to HFLayoutLmv3TokenClassifier so that I can use it in the pipeline?

Thank you very much!

Reference:

deepdoctection/deepdoctection/extern/hflayoutlm.py, line 479 in 16f286a:

class HFLayoutLmv3TokenClassifier(HFLayoutLmTokenClassifierBase):

Answered by JaMe76


JaMe76 (Maintainer) · 2023-04-25T14:07:26Z

Hi,

thank you very much for your feedback. Really appreciate that!

Now for your question:

Have you seen the tutorial for training LayoutLMv1 on FUNSD?

https://deepdoctection.readthedocs.io/en/latest/tutorials/layoutlm_for_token_classification/

This is not the model you want to try, but it gives you a starting point.

Regarding your issues:

Hugging Face datasets and deepdoctection datasets are not the same and are not compatible. You will need to download the dataset and place it in the respective directory yourself. Once the dataset is in place, you can retrieve the deepdoctection dataset using

dataset_train = dd.get_dataset("funsd")

The advantage of the concept: Once you establish a dataflow like

df = dataset_train.dataflow.build()
df.reset_state()

for dp in df:
    ...

All datapoints dp will already have the intrinsic data structure of the library. You can also retrieve the categories of a labelled dataset.
Categories are defined in groups, each group being a string Enum, e.g. <TokenClassWithTag.b_answer> or b_answer. TypeOrStr refers to this Enum type, i.e. a category is given either as an Enum member or as its string.
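
To make the structure concrete, here is a minimal plain-Python sketch of a string-valued Enum and a Mapping[str, TypeOrStr] built from the FUNSD tags quoted in the question. It does not import deepdoctection: ExampleTokenTag, the local TypeOrStr alias and build_categories are hypothetical stand-ins, and the 1-based string keys are an assumption that should be checked against the library's own docstrings.

```python
from enum import Enum
from typing import Dict, List, Union


class ExampleTokenTag(str, Enum):
    """Hypothetical stand-in for a string Enum such as TokenClassWithTag."""

    O = "O"
    B_HEADER = "B-HEADER"
    I_HEADER = "I-HEADER"
    B_QUESTION = "B-QUESTION"
    I_QUESTION = "I-QUESTION"
    B_ANSWER = "B-ANSWER"
    I_ANSWER = "I-ANSWER"


# Local alias for illustration only: a category is an Enum member or a plain string.
TypeOrStr = Union[ExampleTokenTag, str]


def build_categories(labels: List[str], start: int = 1) -> Dict[str, TypeOrStr]:
    """Map stringified ids to Enum members, keeping unknown tags as strings.

    start=1 is an assumption; use the id offset your model actually expects.
    """
    categories: Dict[str, TypeOrStr] = {}
    for idx, name in enumerate(labels, start=start):
        try:
            categories[str(idx)] = ExampleTokenTag(name)  # lookup by value
        except ValueError:
            categories[str(idx)] = name  # tag not in the Enum: keep the raw string
    return categories


funsd_labels = ["O", "B-HEADER", "I-HEADER", "B-QUESTION",
                "I-QUESTION", "B-ANSWER", "I-ANSWER"]
categories = build_categories(funsd_labels)
```

The point of the sketch is only the shape: string keys, and values that are Enum members (or strings that can be resolved to them), rather than a bare list or an int-keyed dict.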

You can get the LayoutLMv3 base model from the model registry:

profile = dd.ModelCatalog.get_profile("microsoft/layoutlmv3-base/pytorch_model.bin")

But profile.categories is empty because this is only the base model, without a dense top layer. The head is determined by the number of categories once the model is fine-tuned on a specific dataset.
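
As a rough illustration of why the base profile carries no categories: the size of the classification head follows from the label set chosen for fine-tuning. The sketch below derives the values one would typically hand to a Hugging Face config (num_labels, id2label, label2id) from the FUNSD tags quoted in the question. head_spec is a hypothetical helper, and no model is loaded.

```python
from typing import Dict, List, Tuple


def head_spec(labels: List[str]) -> Tuple[int, Dict[int, str], Dict[str, int]]:
    """Return (num_labels, id2label, label2id) for a list of unique tag names."""
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    id2label = dict(enumerate(labels))  # 0-based, in the given order
    label2id = {name: idx for idx, name in id2label.items()}
    return len(labels), id2label, label2id


funsd_labels = ["O", "B-HEADER", "I-HEADER", "B-QUESTION",
                "I-QUESTION", "B-ANSWER", "I-ANSWER"]
num_labels, id2label, label2id = head_spec(funsd_labels)
```

With transformers installed, values like these are usually passed as keyword arguments when loading the base checkpoint for token classification, which is what gives the model a head of the right size.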

Conceptually, nothing differs from the transformers library, because all deepdoctection model classes are wrappers around the model classes of the library that implements them. The only things that are not used are transformers.LayoutLMv3Processor and transformers.LayoutLMv3TokenizerFast, because they would impose some limitations (e.g. on choosing the OCR package or on using approaches like sliding windows).
