Creation of custom entity model using spacy and fine tuning using transformer #9468
-
I am trying to create an entity model and fine-tune it using a BERT model. The training data is cleaned HTML data scraped from different websites, then tagged and verified. The issue is that there are several URLs of the same category, the data format differs between them, and some documents are very large. A sample is given below.
This is the data cleaned from a single URL. There are thousands of similar documents, and most are much larger. When we use BERT to fine-tune the model, the token length cannot exceed 512. For this type of data, how is it possible to fine-tune if the token length is more than 512?
Replies: 1 comment 3 replies
-
spaCy handles long documents in Transformers using span getters, which pass slices of the original document to the Transformer to get vectors and then combine those to get the final representation. This makes it so that 256 tokens isn't a hard limit on the length of a Doc. If you don't like the way the default span getters work, it's possible to implement a custom one; for your example document, it might make sense to split on newlines, for example.
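To make the idea concrete, here is a minimal sketch of the splitting logic such a custom span getter would perform. Plain Python strings and lists stand in for spaCy `Doc`/`Span` objects so the sketch is self-contained; in a real pipeline you would wrap this logic in a function registered via `@spacy.registry.span_getters` (as spacy-transformers does for its built-in getters) and return `Span` slices of each `Doc` instead of token lists.

```python
# Sketch: split a long text into chunks on newline boundaries,
# falling back to fixed-size windows when a single line is still
# longer than the transformer's limit. The function name and the
# max_len parameter are illustrative, not a spaCy API.

def newline_spans(text, max_len=512):
    """Return lists of tokens, one list per chunk, each at most
    max_len tokens long, preferring to cut on newlines."""
    spans = []
    for line in text.split("\n"):
        tokens = line.split()
        if not tokens:
            continue
        # If a single line still exceeds max_len, cut it into windows.
        for start in range(0, len(tokens), max_len):
            spans.append(tokens[start:start + max_len])
    return spans

chunks = newline_spans("first line\nsecond line with more words", max_len=4)
# Each chunk can now be passed to the transformer independently,
# and the resulting vectors combined for the full document.
```

The same shape applies in spaCy proper: a span getter receives a batch of `Doc` objects and returns, for each one, a list of `Span` slices for the transformer to process.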