Add Wikipedia crawler ? (300+ languages)

A [quick search](https://github.com/google/corpuscrawler/search?q=wikipedia) shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on top of the Wikipedia API (1) or using existing server-side tools (2).

## Assess interest
1. Assess how many Wikipedia languages are not in UNILEX. See https://github.com/unicode-org/unilex/issues/14 .
2. Assess the quality of Wikipedia raw text data in minority languages.
3. Compare the gain against other available public corpora such as Tatoeba (358 languages).

## Crawling via API
Load the available list of articles for each Wikipedia, then scrape those pages. If a wiki is too large, the crawl could be limited to `max=n` articles.

Given an ISO code such as Ndonga's `ng`:
- [ ] download the _List of page titles in main namespace_ archive (see below)
- [ ] load the article titles into a Python list (python)
- [ ] code a crawler in [/Lib/corpuscrawler/util.py](https://github.com/google/corpuscrawler/blob/master/Lib/corpuscrawler/util.py), following other crawlers as examples ([1](https://github.com/google/corpuscrawler/blob/master/Lib/corpuscrawler/util.py#L791-L809)), which queries the Wikipedia API, extracts the valuable text, and saves it (python)
- [ ] update the relevant crawlers in [/Lib/corpuscrawler/](https://github.com/google/corpuscrawler/blob/master/Lib/corpuscrawler/)
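
The first two steps could look roughly like the sketch below. It is only an illustration: `read_titles` and `max_articles` are hypothetical names (not part of corpuscrawler), and it assumes the archive is a gzipped UTF-8 text file with one title per line, possibly starting with a `page_title` header row, and with underscores standing in for spaces.

```python
import gzip
import io


def read_titles(gz_bytes, max_articles=None):
    """Return article titles from an all-titles-in-ns0.gz payload (hypothetical helper)."""
    titles = []
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            # Stop early when the optional cap (the `max=n` idea) is reached.
            if max_articles is not None and len(titles) >= max_articles:
                break
            title = line.strip()
            # Assumption: the dump may start with a "page_title" header row.
            if not title or title == "page_title":
                continue
            # Assumption: dump titles use underscores where the wiki shows spaces.
            titles.append(title.replace("_", " "))
    return titles
```

The function takes the raw bytes rather than a URL, so the download step can stay in whatever fetching code `util.py` already has.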

### Wikipedia API provides text
- [en + wikipedia + Dragon + xmlfm](https://en.wikipedia.org/w/api.php?action=query&titles=Dragon&prop=extracts&explaintext&redirects&converttitles&callback=?&format=xmlfm)
- [ca + wikipedia + Drac + jsonfm](https://ca.wikipedia.org/w/api.php?action=query&titles=Drac&prop=extracts&explaintext&redirects&converttitles&callback=?&format=jsonfm)

Various formats are available:
* `format` : The format of the output.
  * `json` : Output data in JSON format.
  * `jsonfm` : Output data in JSON format (pretty-print in HTML).
  * `none` : Output nothing.
  * `php` : Output data in serialised PHP format.
  * `phpfm` : Output data in serialised PHP format (pretty-print in HTML).
  * `rawfm` : Output data, including debugging elements, in JSON format (pretty-print in HTML).
  * `xml` : Output data in XML format.
  * `xmlfm` : Output data in XML format (pretty-print in HTML).
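
For a crawler, the machine-readable JSON output is probably the most convenient. Below is a minimal sketch of the same query as the two links above, using only the standard library. The helper names (`build_extract_url`, `parse_extracts`, `fetch_plain_text`) and the User-Agent string are made up for illustration, and the parsing assumes the usual `query` → `pages` → `extract` layout of the JSON response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(lang, title):
    """Build the plain-text extract query for one article (hypothetical helper)."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts",
        "explaintext": 1,
        "redirects": 1,
        "converttitles": 1,
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))


def parse_extracts(payload):
    """Map page title to plain text, skipping pages without an extract."""
    pages = json.loads(payload).get("query", {}).get("pages", {})
    return {p["title"]: p["extract"] for p in pages.values() if p.get("extract")}


def fetch_plain_text(lang, title):
    """Download and parse one article; needs network access."""
    # The User-Agent value is a placeholder, not an agreed project identifier.
    request = Request(build_extract_url(lang, title),
                      headers={"User-Agent": "corpuscrawler-sketch/0.1"})
    with urlopen(request) as response:
        return parse_extracts(response.read().decode("utf-8"))
```

For example, `fetch_plain_text("ca", "Drac")` should return a one-entry dictionary with the article text, assuming the page exists.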

### List of Wikipedias (~300)
* [List_of_Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias#Notes_for_details_table)
* [List of dumps](https://dumps.wikimedia.org/backup-index.html) - Wikipedia and other wiki projects.

### List of articles per Wikipedia
For convenience, I use the tiny Ndonga (`ng`) Wikipedia (8 articles), which is easy to explore by hand.
* [List of all page titles](https://dumps.wikimedia.org/ngwiki/20210220/ngwiki-20210220-all-titles.gz)
* [List of page titles in main namespace](https://dumps.wikimedia.org/ngwiki/20210220/ngwiki-20210220-all-titles-in-ns0.gz)

For a larger demo, you could also inspect similar URLs using the ISO codes of:

Language | Native | iso | Articles
-- | -- | -- | --
Ndonga | Oshiwambo | ng | 8
Inuktitut | ᐃᓄᒃᑎᑐᑦ/inuktitut | iu | 514
Samoan | Gagana Samoa | sm | 985
Igbo | Igbo | ig | 2,085
Central Bikol | Bikol Central | bcl | 10,824

### Namespaces
These namespace numbers are the same on all wikis. See also [Wikipedia:Namespace](https://en.wikipedia.org/wiki/Wikipedia:Namespace).
* `0`: (main)
* `1`: Talk:
* `2`: User:
* `3`: User_talk:

### Dumps & paths
* [List of dumps](https://dumps.wikimedia.org/backup-index.html) 
  * [/ngwiki/20200220](https://dumps.wikimedia.org/ngwiki/20200220/) - manual (change the date)
  * [/ngwiki/latest](https://dumps.wikimedia.org/ngwiki/latest/) - directory
    * [/ngwiki-latest-all-titles.gz](https://dumps.wikimedia.org/ngwiki/latest/ngwiki-latest-all-titles.gz)
    * [/ngwiki-latest-all-titles-in-ns0.gz](https://dumps.wikimedia.org/ngwiki/latest/ngwiki-latest-all-titles-in-ns0.gz) - articles only
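
Since the paths above follow a regular pattern, the URL for any language can be derived from its code. A small sketch, where `titles_dump_url` is a hypothetical helper name; replacing hyphens with underscores for codes such as `zh-min-nan` is my assumption about how the dump directories are named and should be checked against the list of dumps.

```python
def titles_dump_url(code, date="latest", main_namespace_only=True):
    """Return the URL of the page-titles archive for one Wikipedia (hypothetical helper)."""
    # Assumption: hyphenated language codes use underscores in the dump name.
    wiki = code.replace("-", "_") + "wiki"
    suffix = "all-titles-in-ns0.gz" if main_namespace_only else "all-titles.gz"
    return "https://dumps.wikimedia.org/%s/%s/%s-%s-%s" % (wiki, date, wiki, date, suffix)
```

With the defaults, `titles_dump_url("ng")` gives the `ngwiki-latest-all-titles-in-ns0.gz` link listed above; passing a date such as `"20210220"` gives the dated variant.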

## Using Wikipedia extractors ?
- https://github.com/attardi/wikiextractor#installation
- https://github.com/hugolpz/induction-wikipedia-corpus / https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html
- https://github.com/telecoms-intelligence/induction-wikipedia-corpus

## Hybrid approach
- ISO: get the list of ISO codes of all local wikis.
- Download: loop over each language code and download its dump.
- Extract: run one of the extractors above, then zip the text for each language.
- Cloud: put the resulting archives online.
- Crawl: in `util.py`, code a simple crawler which fetches just that .zip, converts it back to plain text, and adds it to the corpora.
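
The last step could be as small as the sketch below. It is not based on existing corpuscrawler code: `iter_zip_texts` is a hypothetical name, and it assumes the archive holds UTF-8 `.txt` files, one per article or per batch, which depends on how the extract step is set up.

```python
import io
import zipfile


def iter_zip_texts(zip_bytes):
    """Yield (file name, text) for every .txt member of a zip archive (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Sort the names so the output order is stable between runs.
        for name in sorted(archive.namelist()):
            if name.endswith(".txt"):
                # Assumption: the extracted text was saved as UTF-8.
                yield name, archive.read(name).decode("utf-8")
```

Working on bytes keeps the sketch independent of how the archive is downloaded.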

cc: @brawer 
