Creation of custom entity model using spacy and fine tuning using transformer #9468
-
I am trying to create an entity model and fine-tune it using a BERT model. The training data is cleaned HTML data scraped from different websites, then tagged and verified. The issue is that there are several URLs of the same category, the data format differs between them, and some documents are very large. A sample is given below.
This is the data cleaned from a single URL. There are thousands of similar documents, and most are much larger. When we use BERT to fine-tune the model, the token length cannot exceed 512. For this type of data, how is it possible to fine-tune if the token length is more than 512?
Replies: 1 comment 3 replies
-
spaCy handles long documents in Transformers using span getters, which pass slices of the original document to the Transformer to get vectors and then combine those to get the final representation. This makes it so that 256 tokens isn't a hard limit on the length of a Doc. If you don't like the way the default span getters work, it's possible to implement a custom one; for your example document, it might make sense to split on newlines, for example.
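To make the idea concrete, here is a minimal sketch of the splitting logic such a custom span getter would perform. Plain Python strings and lists stand in for spaCy `Doc`/`Span` objects so the sketch is self-contained; in a real pipeline you would wrap this logic in a function registered via `@spacy.registry.span_getters` (as spacy-transformers does for its built-in getters) and return `Span` slices of each `Doc` instead of token lists.

```python
# Sketch: split a long text into chunks on newline boundaries,
# falling back to fixed-size windows when a single line is still
# longer than the transformer's limit. The function name and the
# max_len parameter are illustrative, not a spaCy API.

def newline_spans(text, max_len=512):
    """Return lists of tokens, one list per chunk, each at most
    max_len tokens long, preferring to cut on newlines."""
    spans = []
    for line in text.split("\n"):
        tokens = line.split()
        if not tokens:
            continue
        # If a single line still exceeds max_len, cut it into windows.
        for start in range(0, len(tokens), max_len):
            spans.append(tokens[start:start + max_len])
    return spans

chunks = newline_spans("first line\nsecond line with more words", max_len=4)
# Each chunk can now be passed to the transformer independently,
# and the resulting vectors combined for the full document.
```

The same shape applies in spaCy proper: a span getter receives a batch of `Doc` objects and returns, for each one, a list of `Span` slices for the transformer to process.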