
common-voice/wikipedia-data


wikipedia-data

/word-frequency/ contains word-frequency analyses computed from full Wikipedia dumps in different languages. We used cvstools to generate them.

  • word-frecuency.es.txt - Spanish Wikipedia (2955930 words)
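
As a rough illustration of what a word-frequency analysis involves, here is a minimal Python sketch. It is not how cvstools works internally. The tokenization rule (lowercased runs of letters) and the names `WORD_RE` and `count_words` are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters, so "años" stays one token
# while digits, underscores and punctuation act as separators.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(lines):
    """Return a Counter of lowercased word tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


# Tiny stand-in for a Wikipedia dump; a real dump would be streamed line by line.
sample = ["El gato y el perro.", "El perro duerme."]
freq = count_words(sample)

# Print the two most frequent words with their counts.
for word, n in freq.most_common(2):
    print(word, n)
```

Streaming the input line by line keeps memory use proportional to the vocabulary size rather than to the size of the dump.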

/complex-words/ contains lists of the less common words found in different Wikipedias. These lists can be used as blacklists to filter out words that are complex, non-native, or contain unusual character combinations. Each language uses a different word-frequency limit.

  • complex.es.txt - Spanish Wikipedia, words with 80 or fewer occurrences (2827258 words)
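
The thresholding idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to build the files in this repository. The helper names `build_blacklist` and `is_clean`, the in-memory `freq` mapping, and the whitespace tokenization are all hypothetical, and the on-disk format of the .txt files is not assumed here.

```python
def build_blacklist(freq, max_count):
    """Return the set of words that occur max_count times or fewer."""
    return {word for word, n in freq.items() if n <= max_count}


def is_clean(sentence, blacklist):
    """Return True if no whitespace-separated token of the sentence is blacklisted.

    Tokens are lowercased but punctuation is not stripped, so a real
    pipeline would likely need a more careful tokenizer.
    """
    return not any(tok in blacklist for tok in sentence.lower().split())


# Hypothetical counts, invented for the example.
freq = {"casa": 1200, "perro": 950, "xylqz": 3, "straße": 80}

# 80 mirrors the limit quoted for the Spanish list; other languages differ.
blacklist = build_blacklist(freq, max_count=80)

print(is_clean("La casa del perro", blacklist))
print(is_clean("Un xylqz raro", blacklist))
```

Words that never appear in the frequency table (such as "la" above) are treated as acceptable in this sketch; whether unseen words should pass or fail is a design choice left to the consumer of the list.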

About

Assorted analyses and files derived from Wikipedia text
