Commit 155d81b

Add detailed description for data (#4)
* update descriptions
* update wiki name for versioning
* add facts
* warning about license
* update tables (add new fields)
* minor docs update
* license warning more prominent
* rename field
* change 'size' field
* update tables & script for generation
* renamed `size` => `file_size`; removed `type`
* fix field in generation script, regenerate README
1 parent c803452 commit 155d81b

3 files changed: +238 -98 lines changed
README.md

Lines changed: 31 additions & 29 deletions
@@ -1,14 +1,16 @@
11
# Gensim data
22

3-
This repository keeps the models and datasets for the [gensim](https://github.com/RaRe-Technologies/gensim) download API. It serves as the data storage, and shouldn't be used directly (unless you're adding new datasets to it).
3+
This repository contains the pre-trained models and text corpora for the [Gensim](https://github.com/RaRe-Technologies/gensim) download API. It serves as a data storage for Gensim and shouldn't be used directly.
44

5-
💡 When you use the gensim download API, **all data will be stored in the `~/gensim-data` folder**.
5+
💡 When you use the Gensim download API, **all data will be stored in your `~/gensim-data` folder**.
66

7-
In current repository, all data stored as attachment files in [github-releases](https://github.com/RaRe-Technologies/gensim-data/releases)
7+
This repository stores the actual (large) data files as attachments in its [releases](https://github.com/RaRe-Technologies/gensim-data/releases) on Github.
8+
9+
🔴 **Each dataset comes with its own license, which the users should study carefully before using the dataset!**
810

911
## Quickstart
1012

11-
To load a model/dataset, use either the Python or command line interface:
13+
To load a model or corpus, use either the Python or command line interface:
1214

1315
- **Python API**
1416

@@ -82,31 +84,31 @@ To load a model/dataset, use either the Python or command line interface:
8284
```
8385

8486
## Available data
85-
86-
### Corpora
87-
88-
| name | source | description |
89-
|------|--------|-------------|
90-
| 20-newsgroups | http://qwone.com/~jason/20Newsgroups/ | The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups |
91-
| fake-news | Kaggle | It contains text and metadata scraped from 244 websites tagged as 'bullshit' here by the BS Detector Chrome Extension by Daniel Sieradski. |
92-
| text8 | http://mattmahoney.net/dc/text8.zip | Cleaned small sample from wikipedia |
93-
| wiki-en | https://dumps.wikimedia.org/enwiki/20171001/ | Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz` |
94-
95-
### Pretrained models
96-
97-
| name | description | related papers | preprocessing | parameters |
98-
|------|-------------|------------|--------|---------------|
99-
| glove-twitter-100 | Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-100.txt` | dimensions = 100 |
100-
| glove-twitter-200 | Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-200.txt` | dimensions = 200 |
101-
| glove-twitter-25 | Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-25.txt` | dimensions = 25 |
102-
| glove-twitter-50 | Pre-trained vectors, 2B tweets, 27B tokens, 1.2M vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-50.txt` | dimensions = 50 |
103-
| glove-wiki-gigaword-100 | Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-100.txt` | dimensions = 100 |
104-
| glove-wiki-gigaword-200 | Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-200.txt` | dimentions = 200 |
105-
| glove-wiki-gigaword-300 | Pre-trained vectors, Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt` | dimensions = 300 |
106-
| glove-wiki-gigaword-50 | Pre-trained vectors ,Wikipedia 2014 + Gigaword 5,6B tokens, 400K vocab, uncased. https://nlp.stanford.edu/projects/glove/ | https://nlp.stanford.edu/pubs/glove.pdf | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt` | dimensions = 50 |
107-
| word2vec-google-news-300 | Pre-trained vectors trained on part of Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality', https://code.google.com/archive/p/word2vec/ | https://arxiv.org/abs/1301.3781, https://arxiv.org/abs/1310.4546, https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf | - | dimensions = 300 |
108-
109-
(generated by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
87+
### Datasets
88+
| name | file size | read_more | description | license |
89+
|------|-----------|-----------|-------------|---------|
90+
| 20-newsgroups | 13 MB | <ul><li>http://qwone.com/~jason/20Newsgroups/</li></ul> | The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups. | not found |
91+
| fake-news | 19 MB | <ul><li>https://www.kaggle.com/mrisdal/fake-news</li></ul> | News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read. | https://creativecommons.org/publicdomain/zero/1.0/ |
92+
| quora-duplicate-questions | 20 MB | <ul><li>https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs</li></ul> | Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not. | probably https://www.quora.com/about/tos |
93+
| text8 | 31 MB | <ul><li>http://mattmahoney.net/dc/textdata.html</li></ul> | First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets. | not found |
94+
| wiki-english-20171001 | 6214 MB | <ul><li>https://dumps.wikimedia.org/enwiki/20171001/</li></ul> | Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz` | https://dumps.wikimedia.org/legal.html |
95+
96+
### Models
97+
| name | num vectors | file size | base dataset | read_more | description | parameters | preprocessing | license |
98+
|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|
99+
| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/) | <ul><li>dimension - 100</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-100.txt`. | http://opendatacommons.org/licenses/pddl/ |
100+
| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 200</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-200.txt`. | http://opendatacommons.org/licenses/pddl/ |
101+
| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 25</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-25.txt`. | http://opendatacommons.org/licenses/pddl/ |
102+
| glove-twitter-50 | 1193514 | 199 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/) | <ul><li>dimension - 50</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-50.txt`. | http://opendatacommons.org/licenses/pddl/ |
103+
| glove-wiki-gigaword-100 | 400000 | 128 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 100</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-100.txt`. | http://opendatacommons.org/licenses/pddl/ |
104+
| glove-wiki-gigaword-200 | 400000 | 252 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 200</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-200.txt`. | http://opendatacommons.org/licenses/pddl/ |
105+
| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 300</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt`. | http://opendatacommons.org/licenses/pddl/ |
106+
| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 50</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`. | http://opendatacommons.org/licenses/pddl/ |
107+
| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | <ul><li>https://code.google.com/archive/p/word2vec/</li> <li>https://arxiv.org/abs/1301.3781</li> <li>https://arxiv.org/abs/1310.4546</li> <li>https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf</li></ul> | Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/). | <ul><li>dimension - 300</li></ul> | - | not found |
108+
109+
110+
111+
(generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
110112

111113

112114
# Want to add a new corpus or model?
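The regenerated tables in this diff only render if every `list.json` entry carries the fields that `generate_table.py` reads. As a minimal sketch, a contributor could check a new entry before regenerating the README. The helper `missing_fields` and the sample byte count are hypothetical and not part of the repository; the field names are the ones the script in this commit accesses (`preprocessing` is left out because the script reads it with `.get`).

```python
# Hypothetical helper, not part of gensim-data: checks that a list.json-style
# entry has the fields the table generator in this commit reads.

DATASET_FIELDS = ("file_size", "read_more", "description", "license")
MODEL_FIELDS = DATASET_FIELDS + ("num_records", "base_dataset", "parameters")


def missing_fields(entry, is_model=False):
    """Return the required field names that `entry` lacks, in declaration order."""
    required = MODEL_FIELDS if is_model else DATASET_FIELDS
    return [field for field in required if field not in entry]


# Sample entry modelled on the text8 row of the Datasets table.
text8 = {
    "file_size": 31 * 2 ** 20,  # assumed byte count; the README only shows "31 MB"
    "read_more": ["http://mattmahoney.net/dc/textdata.html"],
    "description": "First 100,000,000 bytes of plain text from Wikipedia.",
    "license": "not found",
}

print(missing_fields(text8))                 # checked as a dataset entry
print(missing_fields(text8, is_model=True))  # same entry checked as a model
```

An empty list means the entry has everything the corresponding table needs; any names returned are the keys that would raise `KeyError` in the generator.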

generate_table.py

Lines changed: 16 additions & 9 deletions
@@ -16,25 +16,32 @@ def generate_table(fn):
1616

1717
print("## Available data")
1818
print("### Datasets")
19-
print("| name | source | description |")
20-
print("|------|--------|-------------|")
19+
print("| name | file size | read_more | description | license |")
20+
print("|------|-----------|-----------|-------------|---------|")
2121
for name, other in datasets:
2222
if name.startswith("__testing_"):
2323
continue
24-
print("| {name} | {source} | {description} |".format(
25-
name=name, source=other["source"], description=other["description"]
24+
25+
links = "<ul>" + " ".join("<li>{}</li>".format(link) for link in other["read_more"]) + "</ul>"
26+
print("| {name} | {size} | {links} | {description} | {license} |".format(
27+
name=name, links=links, description=other["description"],
28+
size="{} MB".format(other["file_size"] // 2 ** 20), license=other["license"]
2629
))
2730

2831
print("")
2932
print("### Models")
30-
print("| name | description | papers | preprocessing | parameters |")
31-
print("|------|-------------|------------|--------|---------------|")
33+
print("| name | num vectors | file size | base dataset | read_more | description | parameters | preprocessing | license |")
34+
print("|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|")
3235
for name, other in models:
3336
if name.startswith("__testing_"):
3437
continue
35-
print("| {name} | {description} | {papers} | {preprocessing} | {parameters} |".format(
36-
name=name, description=other["description"], parameters=other.get("parameters", "-"),
37-
papers=other.get("papers", "-"), preprocessing=other.get("preprocessing", "-")
38+
39+
links = "<ul>" + " ".join("<li>{}</li>".format(link) for link in other["read_more"]) + "</ul>"
40+
parameters = "<ul>" + " ".join("<li>{} - {}</li>".format(k, v) for (k, v) in other["parameters"].items()) + "</ul>"
41+
print("| {name} | {num_vectors} | {size} | {base_dataset} | {links} | {description} | {parameters} | {preprocessing} | {license} |".format(
42+
name=name, num_vectors=other["num_records"], size="{} MB".format(other["file_size"] // 2 ** 20),
43+
base_dataset=other["base_dataset"], links=links, description=other["description"], parameters=parameters,
44+
preprocessing=other.get("preprocessing", "-"), license=other["license"]
3845
))
3946
print("\n(generated by {} based on {})".format(__file__, fn))
4047

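The new row formatting in the hunk above can be sketched as a standalone snippet. It mirrors the added lines for the Datasets table: links are wrapped in an inline `<ul>`, and the size is the byte count integer-divided by `2 ** 20`. The helper names `html_list`, `size_in_mb`, and `dataset_row`, and the sample byte count, are assumptions made for illustration; the script itself inlines these expressions.

```python
def html_list(items):
    """Render items as an inline HTML list, as the script does for table cells."""
    return "<ul>" + " ".join("<li>{}</li>".format(item) for item in items) + "</ul>"


def size_in_mb(num_bytes):
    """Whole mebibytes, rounded down (the script uses `// 2 ** 20`)."""
    return "{} MB".format(num_bytes // 2 ** 20)


def dataset_row(name, info):
    """Build one Markdown row of the Datasets table from a list.json-style entry."""
    return "| {name} | {size} | {links} | {description} | {license} |".format(
        name=name,
        size=size_in_mb(info["file_size"]),
        links=html_list(info["read_more"]),
        description=info["description"],
        license=info["license"],
    )


# Sample entry; the byte count is assumed, since the README only shows "13 MB".
example = {
    "file_size": 13 * 2 ** 20 + 500,
    "read_more": ["http://qwone.com/~jason/20Newsgroups/"],
    "description": "Approximately 20,000 newsgroup posts.",
    "license": "not found",
}

print(dataset_row("20-newsgroups", example))
```

The Models table follows the same pattern, with the `parameters` cell built by passing `"{} - {}".format(k, v)` pairs to the same list rendering. Because the division rounds down, any file smaller than 1 MiB would be shown as `0 MB`.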