Commit 40ad3e4

markroxor authored and menshikh-iv committed
Add numberbatch word embeddings. Fix #9 (#10)
* numberbatch to json
* numberbatch to json
* added checksum
* update conceptnet info
* update name
1 parent fa71854 commit 40ad3e4

2 files changed: 18 additions, 1 deletion

README.md

Lines changed: 3 additions & 1 deletion
@@ -98,6 +98,7 @@ To load a model or corpus, use either the Python or command line interface:
### Models
| name | num vectors | file size | base dataset | read_more | description | parameters | preprocessing | license |
|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|
+| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe, and OpenSubtitles 2016 | <ul><li>http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972</li> <li>https://github.com/commonsense/conceptnet-numberbatch</li> <li>http://conceptnet.io/</li></ul> | ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. | <ul><li>dimension - 300</li></ul> | - | https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt |
| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/) | <ul><li>dimension - 100</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-100.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 200</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-200.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 25</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-twitter-25.txt`. | http://opendatacommons.org/licenses/pddl/ |
@@ -107,7 +108,8 @@ To load a model or corpus, use either the Python or command line interface:
| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 300</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-300.txt`. | http://opendatacommons.org/licenses/pddl/ |
| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | <ul><li>https://nlp.stanford.edu/projects/glove/</li> <li>https://nlp.stanford.edu/pubs/glove.pdf</li></ul> | Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). | <ul><li>dimension - 50</li></ul> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`. | http://opendatacommons.org/licenses/pddl/ |
| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | <ul><li>https://code.google.com/archive/p/word2vec/</li> <li>https://arxiv.org/abs/1301.3781</li> <li>https://arxiv.org/abs/1310.4546</li> <li>https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf</li></ul> | Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https://code.google.com/archive/p/word2vec/). | <ul><li>dimension - 300</li></ul> | - | not found |
-| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>window_size - 10</li> <li>dimension - 300</li></ul> | The corpus was lemmatized and tagged with Universal PoS | not found |
+| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | <ul><li>https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models</li> <li>http://rusvectores.org/en/</li> <li>https://github.com/RaRe-Technologies/gensim-data/issues/3</li></ul> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. | <ul><li>window_size - 10</li> <li>dimension - 300</li></ul> | The corpus was lemmatized and tagged with Universal PoS | https://creativecommons.org/licenses/by/4.0/deed.en |
+

(table generated automatically by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
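
The `file size` column appears to be derived from the `file_size` byte count in list.json. A minimal sketch, assuming the table reports whole mebibytes rounded down; this assumption matches both entries visible in this commit, but it is not taken from generate_table.py, and `size_in_mb` is a hypothetical helper:

```python
def size_in_mb(num_bytes: int) -> int:
    """Convert a byte count to whole mebibytes, rounding down (assumed convention)."""
    return num_bytes // (1024 * 1024)


# file_size values copied from list.json in this commit
print(size_in_mb(1225497562))  # conceptnet-numberbatch-17-06-300 -> 1168
print(size_in_mb(208427381))   # word2vec-ruscorpora-300 -> 198
```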

list.json

Lines changed: 15 additions & 0 deletions
@@ -122,6 +122,21 @@
         }
     },
     "models": {
+        "conceptnet-numberbatch-17-06-300": {
+            "num_records": 1917247,
+            "file_size": 1225497562,
+            "base_dataset": "ConceptNet, word2vec, GloVe, and OpenSubtitles 2016",
+            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/conceptnet-numberbatch-17-06-300/__init__.py",
+            "license": "https://github.com/commonsense/conceptnet-numberbatch/blob/master/LICENSE.txt",
+            "parameters": {
+                "dimension": 300
+            },
+            "description": "ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting.",
+            "read_more": ["http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972", "https://github.com/commonsense/conceptnet-numberbatch", "http://conceptnet.io/"],
+            "checksum": "fd642d457adcd0ea94da0cd21b150847",
+            "file_name": "conceptnet-numberbatch-17-06-300.gz",
+            "parts": 1
+        },
         "word2vec-ruscorpora-300": {
             "num_records": 184973,
             "file_size": 208427381,
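
The `checksum` value in the new entry is 32 hexadecimal characters, which is consistent with an MD5 digest. The following is a sketch of how a downloaded archive could be checked against its list.json entry, assuming the checksum is MD5 over the whole file. The helpers `md5_of_file` and `verify_download` and the local path are hypothetical and not part of gensim-data:

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, entry):
    """Return True if the file's size and MD5 match the list.json entry (assumed MD5)."""
    return (
        os.path.getsize(path) == entry["file_size"]
        and md5_of_file(path) == entry["checksum"]
    )


# Values copied from the entry added in this commit
entry = {
    "file_size": 1225497562,
    "checksum": "fd642d457adcd0ea94da0cd21b150847",
}
# Hypothetical local path; uncomment once the archive has been downloaded:
# print(verify_download("conceptnet-numberbatch-17-06-300.gz", entry))
```

Checking the size first is cheap and catches truncated downloads before the full file has to be hashed.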
