dylanburati/wikiplain

wikiplain

A toolkit for processing Wikimedia XML dumps and Wikitext. Also includes a part-of-speech tagging TCP service.

I use these to take an English Wikipedia snapshot, a collection of Reddit post logs, and the UMBC WebBase corpus and estimate the level of name recognition for each article's subject. This helps when curating the default People, Places, and Characters decks in my trivia game.
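To give a rough idea of what isolating pages from a Wikimedia XML dump involves, here is a minimal sketch using only the Python standard library. It is not wikiplain's API: `iter_pages`, `local_name`, and the `SAMPLE` document are hypothetical names made up for this illustration, and the sketch assumes the usual export layout of `<page>` elements containing a `<title>` and a revision `<text>`.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in a dump stream.

    Hypothetical helper for illustration; not part of wikiplain.
    """
    # iterparse reads the stream incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Drop this page's children once the caller is done with it.
        elem.clear()


# Tiny made-up document standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[word]].</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a compressed dump, the same generator should accept a file object from the standard library's `bz2.open(path, "rb")` in place of the `BytesIO` stand-in.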

About

Isolate pages from Wikimedia dumps and process them with Pandoc
