Train custom sentence segmentation model #10011
-
I would like to train a custom sentence segmentation model. I've looked over the documentation about trainable pipelines and it's a bit too advanced for me. I was wondering if someone could distill this into a few simple steps for me. Thanks.

First I need a configuration file containing settings that determine the kind of machine learning model that is used. What are the most critical factors here? Does spaCy always use a specific built-in neural network algorithm for its training (something from TensorFlow?), or do the structures of the learning algorithms vary? Does the user decide the number of layers in the network? Does segmentation fall under "parser", or what kind of pipeline component is it?

I then have to train it on data. How do I prepare the data? I understand I should show the model a string and then how I want it segmented, some number of times. From the command line, do I have to manually write the locations of the sents in the JSON format, or is there an interactive mode where I break up sentences and spaCy logs that in the format it uses? (I can't afford Prodigy.) I didn't understand the format I saw: there were strings and usually two integers in brackets. Should I be writing the character index of the sentence start? Is it okay if the strings I pass contain several sentence breakpoints, say 10 segmentation points per string? How many examples do I need, and how does one know? On what principle is that based?

I saw you can create a spacy.blank("en") object and pass the Doc object the data that way, with doc(text=text, sents=sents) or something like that, where sents is the list of sentences, but I don't know how to extrapolate that to 100 examples. Should I put it all in one spaCy object, or one per example?

Once I have the data file, I just call spacy train, and then make sure to load the trained model into spaCy when I use it - fair enough. Could someone please assist me with filling in the gaps in my knowledge here? Thank you very much.
-
You want a SentenceRecognizer. If you're having trouble with the docs, it's fine to just use the default settings.
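If it helps, a rough sketch of the default workflow (the file and directory names here are just placeholders): generate a config containing only the senter component with `python -m spacy init config config.cfg --lang en --pipeline senter`, then run `python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy`. The generated config already picks a model architecture and its settings for you, so you don't need to decide things like the number of layers yourself.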
You can find notes on making training data for a sentence segmentation model here. The JSON format isn't suitable for manual preparation - it should be easier to just put one sentence per line and merge them together into a training doc (as described at the link). An example should contain several sentences so that the model can learn sentence boundaries, but it shouldn't be so long that it's hard to keep in memory; around 10 sentences per Doc should be fine. You probably need a few hundred examples to get started with training. There's no particular way to calculate that - it's just a guess based on experience with other models.
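As a concrete illustration, here's a minimal sketch of one way to turn a one-sentence-per-line text file into a .spacy training file, using DocBin and the sent_starts argument on Doc. The file names, the group size of 10 sentences, and the blank English pipeline are illustrative assumptions - adjust them to your data:

```python
# Minimal sketch: build senter training data from a file with one sentence per line.
# "sentences.txt", "train.spacy" and the group size of 10 are placeholder choices.
import spacy
from spacy.tokens import Doc, DocBin

nlp = spacy.blank("en")
# Only SENT_START needs to be serialized for sentence segmentation training
# (token texts are always stored by DocBin).
doc_bin = DocBin(attrs=["SENT_START"])

with open("sentences.txt", encoding="utf8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Group roughly 10 sentences into each training Doc.
for i in range(0, len(sentences), 10):
    chunk = sentences[i:i + 10]
    words = []
    sent_starts = []
    for sent in chunk:
        sent_doc = nlp.make_doc(sent)  # tokenize only, no other processing
        for j, token in enumerate(sent_doc):
            words.append(token.text)
            sent_starts.append(j == 0)  # True only for the first token of a sentence
    # Default spacing (one space between tokens) is good enough for this sketch.
    doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")
```

You'd build a second, smaller file the same way to use as your dev set (the `--paths.dev` file mentioned above).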