How to split corpus into train/dev? #7822
Replies: 2 comments
-
We can consider whether more built-in options make sense here. Currently, your best option is to write your own corpus reader. Here's what the provided one looks like: spaCy/spacy/training/corpus.py, lines 22 to 39 in ed561cf. You'd want to modify the options and how the corpus is returned to split a single file into partitions.
-
Taking the first 80% and the last 20% assumes the corpus is already thoroughly randomized. But then it should be easy enough to just split the DATASET_path into two directories by hand (assuming there are multiple smaller DocBin files). For those using the convert utility, perhaps this could be a convert utility feature...
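For the "split into two directories" route, here is a minimal sketch using only the Python standard library. It is not part of spaCy: the function name `split_docbin_files`, the directory arguments, and the default 80/20 ratio are all assumptions made for illustration. It shuffles the file list with a fixed seed, so the result does not depend on the files already being in random order. Note that this only randomizes at the file level, not the documents inside each file.

```python
import random
import shutil
from pathlib import Path


def split_docbin_files(src_dir, train_dir, dev_dir, train_fraction=0.8, seed=0):
    """Copy the .spacy files in src_dir into train_dir and dev_dir.

    Hypothetical helper, not a spaCy API. Returns (n_train, n_dev).
    """
    # Sort first so the seeded shuffle is reproducible across platforms.
    files = sorted(Path(src_dir).glob("*.spacy"))
    random.Random(seed).shuffle(files)

    # Everything before the cut goes to train, the rest to dev.
    cut = int(len(files) * train_fraction)

    train_dir, dev_dir = Path(train_dir), Path(dev_dir)
    train_dir.mkdir(parents=True, exist_ok=True)
    dev_dir.mkdir(parents=True, exist_ok=True)

    for path in files[:cut]:
        shutil.copy2(path, train_dir / path.name)
    for path in files[cut:]:
        shutil.copy2(path, dev_dir / path.name)
    return cut, len(files) - cut
```

The two output directories could then be given as the train and dev paths, since `Corpus` accepts a directory as well as a single file.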
-
Is it possible to add a feature to Corpus that yields a given percentage of the data from the start or from the end, so that I can divide my dataset into train and dev?

    train = Corpus(DATASET_Path, 0.8)   # 80% from the start
    dev = Corpus(DATASET_Path, -0.2)    # 20% from the end

There is a limit option, but it won't work if we need to set a limit from the end (to get the dev Corpus).
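The positional split being asked for can be sketched in plain Python. This is only an illustration of the requested semantics, not an existing `Corpus` option: `head_tail_split` and its `train_fraction` parameter are hypothetical names. It uses a single cut index so the two parts never overlap and never leave a gap.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def head_tail_split(
    items: Sequence[T], train_fraction: float = 0.8
) -> Tuple[List[T], List[T]]:
    """Return (first part, remaining part) of items, split by position.

    Hypothetical helper, not a spaCy API. No shuffling is done, so the
    input is assumed to be in random order already.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    # One cut index serves both halves, so they always partition the input.
    cut = int(len(items) * train_fraction)
    return list(items[:cut]), list(items[cut:])
```

With spaCy, `items` could be the list of docs read from a single file via `DocBin().from_disk(path).get_docs(nlp.vocab)`, and each half could be written back out with `DocBin(docs=...).to_disk(...)`.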