Data extraction from MediaWiki pages made easy.
Metafacture-Mediawiki is a plugin for Metafacture. It provides modules for extracting information from MediaWiki pages such as Wikipedia articles. Currently, modules for extracting links and templates exist. Adding new extraction modules is easy.
The plugin relies on the excellent Sweble wikitext parser for parsing wikitext into abstract syntax trees.
- Extracts basic metadata about pages from MediaWiki XML documents
- Extracts simple information from wikitext using regular expressions (fast but not suitable for complex tasks)
- Wraps the Sweble wikitext parser for conveniently parsing wikitext into an abstract syntax tree within a Flux flow
- Extracts links and templates from abstract syntax trees created by Sweble and turns them into a Metafacture event stream
- Makes writing additional extraction modules easy
- Supports running multiple extraction modules hassle-free
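To illustrate the regular-expression approach mentioned in the feature list, here is a minimal, self-contained sketch in plain Java. It does not use the Metafacture-Mediawiki API: the class `WikiLinkSketch`, the method `extractLinkTargets`, and the pattern itself are hypothetical names and choices made for this example only. The sketch collects the targets of internal links written as `[[Target]]` or `[[Target|label]]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not part of Metafacture-Mediawiki.
final class WikiLinkSketch {

    // Matches [[Target]] and [[Target|label]]; group 1 captures the target.
    private static final Pattern LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    // Returns the link targets in the order in which they appear in the text.
    static List<String> extractLinkTargets(String wikitext) {
        List<String> targets = new ArrayList<>();
        Matcher matcher = LINK.matcher(wikitext);
        while (matcher.find()) {
            targets.add(matcher.group(1).trim());
        }
        return targets;
    }

    public static void main(String[] args) {
        String text = "'''Berlin''' is the capital of [[Germany]] "
                + "and lies on the [[Spree|river Spree]].";
        System.out.println(extractLinkTargets(text)); // prints [Germany, Spree]
    }
}
```

A pattern like this is fast, but it cannot follow nested constructs such as a link inside an image caption or inside a template. That is the kind of task for which parsing the wikitext into an abstract syntax tree is the better fit.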
Metafacture-Mediawiki can be used as a plugin in the Metafacture distribution or as a Java library in your own programs.
The Metafacture-Mediawiki plugin will soon be available for download on the Culturegraph Software Website.
Releases of Metafacture-Mediawiki will soon be available on Maven Central.
Development snapshots are distributed via Sonatype OSS. To use the snapshots, add the Sonatype repository and the Metafacture-Mediawiki dependency to your project's POM:
<repositories>
  <repository>
    <id>sonatype-nexus-snapshots</id>
    <name>Sonatype Nexus Snapshots</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.culturegraph</groupId>
    <artifactId>metafacture-mediawiki</artifactId>
    <version>0.0.0-SNAPSHOT</version>
  </dependency>
</dependencies>
The documentation of Metafacture-Mediawiki can be found in the Wiki.
Copyright 2013 Deutsche Nationalbibliothek.
Metafacture-Mediawiki is distributed under the Apache 2.0 License.