Description
A quick search shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but adding it seems feasible, either from scratch on the Wikipedia API (1) or using existing server-side tools (2).
Assess interest
- Assess how many Wikipedia languages are not in UNILEX. See "Comparing languages of LinguaLibre vs UNILEX", unicode-org/unilex#14.
- Assess the quality of Wikipedia raw text data in minority languages.
- Compare the gain to other available public corpora such as Tatoeba (358 languages).
Crawling via API
Load the available list of articles for each Wikipedia, then scrape the pages. If the list is too large, the crawl could be limited to max=n articles.
Given an ISO code such as Ndonga's ng:
- download the "List of page titles in main namespace" archive (see below)
- load the article titles into a Python list variable (python)
- code a crawler in /Lib/corpuscrawler/util.py, following other crawlers as examples 1, which queries the Wikipedia API, extracts the valuable text, and saves the text. (python)
- update the relevant crawlers in /Lib/corpuscrawler/
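The steps above can be sketched roughly as follows. This is a minimal sketch, not existing CorpusCrawler code: `read_titles` and `extract_url` are hypothetical helper names, the titles archive is assumed to be a gzip file with one title per line and a `page_title` header line, and the query assumes the TextExtracts module (`prop=extracts`) is enabled on the target wiki.

```python
import gzip
import io
from urllib.parse import urlencode


def read_titles(gz_bytes, limit=None):
    """Return page titles from a gzipped all-titles-in-ns0 archive."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as f:
        titles = [line.strip() for line in f if line.strip()]
    # Assumption: the first line is a "page_title" column header.
    if titles and titles[0] == "page_title":
        titles = titles[1:]
    # limit=None keeps everything; limit=n keeps the first n titles.
    return titles[:limit]


def extract_url(lang, title):
    """Build an API URL asking for the plain text of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))
```

Fetching each URL (for example with `urllib.request.urlopen`) and saving the returned text is left out here, since that part would have to follow whatever conventions the existing crawlers in util.py use.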
Wikipedia API provides text
Various formats available:
format: The format of the output.
- json: Output data in JSON format.
- jsonfm: Output data in JSON format (pretty-print in HTML).
- none: Output nothing.
- php: Output data in serialised PHP format.
- phpfm: Output data in serialised PHP format (pretty-print in HTML).
- rawfm: Output data, including debugging elements, in JSON format (pretty-print in HTML).
- xml: Output data in XML format.
- xmlfm: Output data in XML format (pretty-print in HTML).
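With the JSON format, the article text can be pulled out of a response in a few lines. A sketch, assuming the response shape of a `prop=extracts` query (pages keyed by page id, each carrying a `title` and an `extract` field); `texts_from_response` is a hypothetical helper and the sample payload is hand-made for illustration, not real API output.

```python
import json


def texts_from_response(raw_json):
    """Return {title: plain text} for every page that has an extract."""
    data = json.loads(raw_json)
    # Assumption: pages live under query -> pages, keyed by page id.
    pages = data.get("query", {}).get("pages", {})
    return {
        page["title"]: page["extract"]
        for page in pages.values()
        if page.get("extract")
    }


# Hand-made sample imitating the assumed response shape.
sample = json.dumps({
    "query": {
        "pages": {
            "1": {"pageid": 1, "ns": 0, "title": "Example", "extract": "Some text."}
        }
    }
})
print(texts_from_response(sample))
```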
List of Wikipedias (~300)
- List_of_Wikipedias
- List of dumps - Wikipedia and other wiki projects.
List of articles per Wikipedia
For convenience, I use the tiny Ndonga (ng) Wikipedia (8 articles), which is easier to explore by hand.
For a larger demo, you could also inspect similar URLs with the ISO code of:
| Language | Native | iso | Articles |
|---|---|---|---|
| Ndonga | Oshiwambo | ng | 8 |
| Inuktitut | ᐃᓄᒃᑎᑐᑦ/inuktitut | iu | 514 |
| Samoan | Gagana Samoa | sm | 985 |
| Igbo | Igbo | ig | 2,085 |
| Central Bikol | Bikol Central | bcl | 10,824 |
Namespaces
On all wikis. See also here
- 0: (main)
- 1: Talk:
- 2: User:
- 3: User_talk:
Dumps & paths
- List of dumps
- /ngwiki/20200220 - manual (change the date)
- /ngwiki/latest - directory
- /ngwiki-latest-all-titles.gz
- /ngwiki-latest-all-titles-in-ns0.gz - articles only
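The paths above follow a regular pattern, so the archive URL for any language can be derived from its code. A sketch, assuming the dumps are served from dumps.wikimedia.org and that hyphens in a language code become underscores in the wiki's database name; `titles_dump_url` is a hypothetical helper.

```python
def titles_dump_url(lang, date="latest"):
    """Build the URL of the main-namespace titles archive for one wiki."""
    # Assumption: codes such as "zh-min-nan" map to "zh_min_nanwiki".
    wiki = lang.replace("-", "_") + "wiki"
    return "https://dumps.wikimedia.org/%s/%s/%s-%s-all-titles-in-ns0.gz" % (
        wiki, date, wiki, date)


print(titles_dump_url("ng"))
print(titles_dump_url("ng", date="20200220"))
```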
Using Wikipedia extractors?
- https://github.com/attardi/wikiextractor#installation
- https://github.com/hugolpz/induction-wikipedia-corpus / https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
- https://github.com/telecoms-intelligence/induction-wikipedia-corpus
Hybrid approach
- ISO: get the list of all local wikis' ISO codes.
- Download: loop over each language code and download the dump.
- Extract: use one of the extractors above, then zip each language.
- Cloud: put the text result online.
- Crawl: in util.py, code a simple crawler which gets just that .zip, converts it back to txt content, and adds it to the corpora.
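The last step, turning a downloaded .zip back into text, could look roughly like this. A sketch only: `texts_from_zip` is a hypothetical helper, and it assumes each archive holds UTF-8 encoded .txt files, which depends on how the Extract and Cloud steps end up packaging the output.

```python
import io
import zipfile


def texts_from_zip(zip_bytes):
    """Return {member name: decoded text} for every .txt file in a zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if name.endswith(".txt")  # skip anything that is not plain text
        }
```

Working on bytes rather than a file path keeps the helper independent of how the archive is fetched, so it could sit behind whatever download function the crawler already has.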
cc: @brawer