Setting up training file & config to allow SpanCat to use transformer embeddings from adjacent sentences #11341
-
Hi all, I am fascinated by the recent addition of SpanCat, and I have already played around with it for some use cases (it works nicely)! I am planning to train a SpanCat model that considers context beyond the immediate sentence when predicting categories. When training a SpanCat model so that predictions are made using Transformer embeddings that span across sentence boundaries, how should I set up the IOB dataset and the training configuration? Here are some more details on the project. I appreciate any insights from the community!

Dataset
The following is an example sentence for how the annotation scheme looks in the IOB format.
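(The original example is not shown here; the following is a hypothetical illustration of the IOB scheme, with placeholder labels.)

```
The        O
patient    O
reported   O
a          O
severe     B-SYMPTOM
headache   I-SYMPTOM
yesterday  O
.          O
```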
Question regarding the IOB dataset
Intended behavior during training
Does batch setting help to tailor to my needs?
Here is the config file I have, which worked fine when I used a single-sentence dataset (i.e., the basic config works as intended).
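(The full config is not reproduced; below is a minimal sketch of the sections most relevant here, with placeholder values.)

```ini
[nlp]
lang = "en"
pipeline = ["transformer", "spancat"]

[components.transformer]
factory = "transformer"

[components.spancat]
factory = "spancat"
spans_key = "sc"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]
```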
I really appreciate any insights into how to make this possible! Thank you so much in advance!
-
The most important thing for your data is that your categories are well designed and that the annotations reflect the output you want consistently. The architecture used to get features from your raw text will not change that.

It sounds like you have a lot of questions about how spaCy works internally. It's good to understand what your tools are doing, but to answer these it's probably easiest if you try the default settings first, just to have a complete functional pipeline, and then tweak it from there so you can measure iterative improvements.

Regarding "document boundaries": spaCy just uses lists of Docs in training. It doesn't have any other conception of a division within a Doc, and many components, including Transformers, don't consider sentence boundaries. In this case it sounds like you should just treat each of your three-sentence fragments as a Doc, so this would be set up exactly the same as your single-sentence dataset. You could also reassemble your fragments into larger docs if you have the metadata to do that, but it's not clear that either approach would be superior, so I would start with the simpler approach (using the fragments directly) to get a baseline. (It's not clear to me what resource constraints caused you to split the docs into three-sentence segments. Was it limited hardware, limited annotation resources, or...?)

To attach the metadata to the Doc you can use underscore attributes.

Regarding how the Transformer handles long documents, see the docs on span getters: basically, long docs are sliced up before being passed to the Transformer, and then the results of the slices are combined to get the representation of each token. Exactly how this is done is configurable.
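As a concrete sketch of treating each three-sentence fragment as its own Doc and attaching metadata via underscore attributes, here is a minimal example (the text, character offsets, label, and `source_id` attribute are all hypothetical; `"sc"` is spancat's default spans key):

```python
import spacy
from spacy.tokens import Doc, DocBin

# Register a custom extension once so fragment metadata lives on the Doc.
Doc.set_extension("source_id", default=None)

nlp = spacy.blank("en")
db = DocBin(store_user_data=True)  # needed so underscore data is serialized

# One three-sentence fragment becomes one training Doc.
text = "First sentence. Second sentence with a target span. Third sentence."
doc = nlp(text)
doc.spans["sc"] = [doc.char_span(39, 50, label="CATEGORY")]  # "target span"
doc._.source_id = "doc-17-fragment-2"  # hypothetical provenance metadata

db.add(doc)
db.to_disk("./train.spacy")
```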
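For the slicing itself, the span getter is configured on the transformer component; the stock strided-spans getter, for instance, cuts long Docs into overlapping windows (the values below are the ones used in spaCy's stock transformer configs):

```ini
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
```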