GitHub - dylanburati/wikiplain: Isolate pages from Wikimedia dumps and process them with Pandoc

wikiplain

A toolkit for processing Wikimedia XML dumps and Wikitext. Also includes a part-of-speech tagging TCP service.

I use these to take an English Wikipedia snapshot, a collection of Reddit post logs, and the UMBC webbase corpus and estimate the level of name recognition for each article's subject. This helps when curating the default People, Places, and Characters decks in my trivia game.
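As a rough illustration of what isolating pages from a Wikimedia XML dump involves, here is a minimal sketch that uses only the Python standard library. It is not wikiplain's API: the names `iter_pages`, `local_name`, and `SAMPLE` are hypothetical, and the sample document is a hand-written stand-in for the MediaWiki export format (a `<mediawiki>` root holding `<page>` elements, each with a `<title>` and a `<revision>` containing `<text>`).

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, wikitext) for each <page> element, one at a time.

    Hypothetical helper for illustration; assumes the export-format layout
    described above and keeps the last <text> found in each page.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the finished page's children to limit memory use


# Hand-written stand-in for a dump; real dumps are far larger.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Ada Lovelace</title>
    <revision><text>'''Ada Lovelace''' was a [[mathematician]].</text></revision>
  </page>
  <page>
    <title>Pandoc</title>
    <revision><text>'''Pandoc''' is a document converter.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

Because `iterparse` reads the stream incrementally, the same function could in principle be given a compressed dump opened with `bz2.open(path, "rb")` instead of an in-memory buffer. The Wikitext yielded for each page would then still need a separate conversion step, such as Pandoc, to become plain text.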

Folders and files (52 commits):

pywikiplain
services
src
vendor/nom-sql
.gitignore
.gitmodules
Cargo.lock
Cargo.toml
README.md
__init__.pyi
get_enwiki.sh
poetry.lock
py.typed
pyproject.toml
sql.pyi
