How to split a corpus into train/dev? #7822

d5555 · 2021-04-18T17:04:49Z

Is it possible to add a feature to Corpus to get an iterable over a percentage of the data, counted from the start or from the end, so I can divide my dataset into train and dev?
train = Corpus(DATASET_Path, 0.8)  # 80% from the start
dev = Corpus(DATASET_Path, -0.2)  # 20% from the end

There is a limit option, but it won't work when the limit has to be set from the end (to get the dev Corpus).

Your Environment

Operating System:
spaCy Version Used: 3.0.5

adrianeboyd · 2021-04-19T08:05:39Z

We can consider whether more built-in options make sense here.

Currently your best option is to write your own corpus reader. Here's what the provided one looks like:

spaCy/spacy/training/corpus.py

Lines 22 to 39 in ed561cf

    
    @util.registry.readers("spacy.Corpus.v1")
    def create_docbin_reader(
        path: Optional[Path],
        gold_preproc: bool,
        max_length: int = 0,
        limit: int = 0,
        augmenter: Optional[Callable] = None,
    ) -> Callable[["Language"], Iterable[Example]]:
        if path is None:
            raise ValueError(Errors.E913)
        util.logger.debug(f"Loading corpus from path: {path}")
        return Corpus(
            path,
            gold_preproc=gold_preproc,
            max_length=max_length,
            limit=limit,
            augmenter=augmenter,
        )

You'd want to modify the options and how the corpus is returned to split a single file into partitions.
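
As a rough illustration of that idea, here is a minimal sketch of a wrapper that restricts any reader callable to a leading or trailing share of its items, following the signed-fraction convention from the question. The names `make_partition_reader` and `toy_reader` are hypothetical and not part of spaCy; the sketch only assumes the reader is a callable that takes `nlp` and returns an iterable, as in the signature above.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator

Reader = Callable[[Any], Iterable[Any]]


def make_partition_reader(reader: Reader, fraction: float) -> Callable[[Any], Iterator[Any]]:
    """Wrap `reader` so it yields only a leading or trailing share of its items.

    fraction > 0 keeps the first `fraction` of the items,
    fraction < 0 keeps the last `abs(fraction)` of the items.
    """
    if fraction == 0 or not -1.0 <= fraction <= 1.0:
        raise ValueError("fraction must be non-zero and between -1 and 1")

    def partition(nlp: Any) -> Iterator[Any]:
        # The first pass only counts the items, so the data is read twice.
        total = sum(1 for _ in reader(nlp))
        # int() rounds down, so a boundary item may land in neither partition.
        size = int(total * abs(fraction))
        if fraction > 0:
            return islice(reader(nlp), size)
        return islice(reader(nlp), total - size, None)

    return partition


# Hypothetical stand-in for a real corpus reader: ignores `nlp`, yields ten items.
def toy_reader(nlp: Any) -> Iterable[int]:
    return iter(range(10))


train = make_partition_reader(toy_reader, 0.8)
dev = make_partition_reader(toy_reader, -0.2)
print(list(train(None)))  # the first eight items
print(list(dev(None)))    # the last two items
```

In a real project, a wrapper like this could presumably be applied to a `Corpus` instance inside a custom function registered with `@spacy.registry.readers` and referenced from the `[corpora.train]` and `[corpora.dev]` config blocks. Note that a positional split is only meaningful if the order of the documents is stable between passes.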

mbrunecky · 2021-10-13T16:09:18Z

Doing the first 80% / last 20% split assumes the corpus is already thoroughly randomized. But then it should be easy enough to just split the DATASET_Path into two directories by hand (assuming there are multiple smaller DocBins).
Another approach to splitting the dataset into 'train' and 'dev' is to draw a random value for each doc: if the value is < 0.8 the doc goes into 'train', otherwise into 'dev'.
I am not sure this enhancement is needed, because (at least in my experience) preparing the dataset takes a lot of effort (I am creating DocBins with my own code), and splitting it into train/dev is just a small piece of that.

But for those using the convert utility perhaps this could be a convert utility feature...
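
The per-document random assignment described above can be sketched roughly as follows. This is only an illustration: `random_split` is a hypothetical helper, not a spaCy function, and the integers stand in for real `Doc` objects. A fixed seed is assumed so that the split is reproducible.

```python
import random
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    docs: Iterable[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Assign each item to train with probability `train_fraction`, else to dev."""
    rng = random.Random(seed)  # seeded generator, so the split is repeatable
    train: List[T] = []
    dev: List[T] = []
    for doc in docs:
        # One random draw per document decides which partition it goes to.
        if rng.random() < train_fraction:
            train.append(doc)
        else:
            dev.append(doc)
    return train, dev


# Integers stand in for documents here.
train_docs, dev_docs = random_split(range(1000))
print(len(train_docs), len(dev_docs))
```

The resulting sizes are only approximately 80/20, since each document is assigned independently. With spaCy, each list could then be collected into its own `DocBin` with `DocBin.add` and saved with `DocBin.to_disk`, giving separate train and dev files.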
