Streaming dataset for custom NER training #11065
-
I'm trying to figure out how to stream a large dataset to custom-train a transformer NER model on new labels. I've reviewed the "Data Utilities" section on streaming datasets and the "create_even_odd_corpus" example, where the final yielded Example is of the form `yield Example.from_dict(...)`. However, for an NER model I can't tell from the documentation what format the annotation dict should take.

My current training set is in the form of:

I've reviewed the discussion in this thread, where there was a recommendation to split the training set into small DocBins and save/load them to disk, but this seems more like a hack. If anyone has any advice, I'd really appreciate it.
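For reference, a minimal sketch of that DocBin approach might look like the following. It assumes training records of the form `(text, [(start, end, label), ...])` and a helper name `save_chunk` that I've made up for illustration; it is not code from the linked thread:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

def save_chunk(records, path):
    """Write one chunk of (text, [(start, end, label), ...]) records to disk as a DocBin."""
    db = DocBin()
    for text, spans in records:
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in spans:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # drop spans that don't align to token boundaries
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk(path)  # e.g. save_chunk(chunk, f"corpus/train_{i}.spacy")
```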
Replies: 1 comment
-
So that example uses `Example.from_dict`, but first note there's no requirement you do that at all; the only requirement is that you supply an Example. If you already have code that creates Examples from your training data you can re-use that (and I think we have code to deal with that format in our docs already).

If you want to use `from_dict`, it links to docs describing the required input format. NER uses the `entities` attribute.

Also note you can stream your corpus just by setting `max_epochs = -1`, and a custom corpus reader isn't required to do that. There are other things to look out for - you probably want to shuffle your data, and you may need to supply labels up front - but streaming i…

Also the advice in #8456 is still current and valid.
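A rough sketch of what this looks like in practice, with hedges: the annotation dict for NER uses the `entities` key with `(start_char, end_char, label)` offsets, and a streamed corpus can be supplied by a registered reader that yields Examples. The reader name `stream_ner_examples.v1`, the `path` parameter, and the `load_records` helper below are placeholders, not part of spaCy:

```python
import random
import spacy
from spacy.training import Example

def load_records(path):
    """Placeholder for your own loading code; should yield
    (text, [(start, end, label), ...]) tuples."""
    yield "Apple is looking at buying a U.K. startup.", [(0, 5, "ORG")]

@spacy.registry.readers("stream_ner_examples.v1")  # hypothetical reader name
def create_reader(path: str):
    def read_examples(nlp):
        records = list(load_records(path))
        random.shuffle(records)  # streamed corpora aren't shuffled for you
        for text, spans in records:
            doc = nlp.make_doc(text)
            # "entities" is the annotation key the NER component trains from
            yield Example.from_dict(doc, {"entities": spans})
    return read_examples

# In the training config you would then point [corpora.train] at the reader
# (@readers = "stream_ner_examples.v1", path = "...") and set
# [training] max_epochs = -1 so the corpus is streamed rather than loaded fully.
```

For genuinely huge datasets you'd shuffle at the file or chunk level rather than calling `list(...)` on everything, but the shape of the reader stays the same.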