
Commit 3b36c8a

Authored by rohit901 and mpenkov
fix broken link in documentation (#3148)
* fixes broken link in documentation
* make -C docs/src html
* Update CHANGELOG.md

Co-authored-by: Michael Penkov <[email protected]>
1 parent f3f6d2d commit 3b36c8a

10 files changed: +145 −69 lines

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -22,6 +22,7 @@ Changes
 * [#3125](https://github.com/RaRe-Technologies/gensim/pull/3125): Improve & unify docs for dirichlet priors, by [@jonaschn](https://github.com/jonaschn)
 * [#3133](https://github.com/RaRe-Technologies/gensim/pull/3133): Update link to Hoffman paper (online VB LDA), by [@jonaschn](https://github.com/jonaschn)
 * [#3141](https://github.com/RaRe-Technologies/gensim/pull/3141): Update link for online LDA paper, by [@dymil](https://github.com/dymil)
+* [#3148](https://github.com/RaRe-Technologies/gensim/pull/3148): Fix broken link in documentation, by [@rohit901](https://github.com/rohit901)
 * [#3155](https://github.com/RaRe-Technologies/gensim/pull/3155): Correct parameter name in documentation of fasttext.py, by [@bizzyvinci](https://github.com/bizzyvinci)

 ## 4.0.1, 2021-04-01
```
(Two binary files changed, 50.8 KB and 1.36 KB; previews not shown.)

docs/src/auto_examples/core/run_corpora_and_vector_spaces.ipynb

Lines changed: 10 additions & 10 deletions
```diff
@@ -15,7 +15,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "\nCorpora and Vector Spaces\n=========================\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n\n"
+   "\n# Corpora and Vector Spaces\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n"
   ]
  },
  {
```
```diff
@@ -33,7 +33,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\nFrom Strings to Vectors\n------------------------\n\nThis time, let's start from documents represented as strings:\n\n\n"
+   "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\n## From Strings to Vectors\n\nThis time, let's start from documents represented as strings:\n\n\n"
   ]
  },
  {
```
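For context, the cell this hunk heads into converts strings to sparse vectors via a `Dictionary`. A minimal sketch of that pattern (the two toy documents below are assumptions; the tutorial itself uses nine):

```python
from gensim import corpora

# toy tokenized documents (hypothetical; stand-ins for the tutorial's nine)
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system'],
]
dictionary = corpora.Dictionary(texts)  # maps each distinct token to an integer id

# convert a new document to bag-of-words: a sparse list of (token_id, count) pairs
new_doc = "Human computer interaction"
# tokens absent from the dictionary ("interaction") are simply ignored
print(dictionary.doc2bow(new_doc.lower().split()))
```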
```diff
@@ -141,7 +141,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\nCorpus Streaming -- One Document at a Time\n-------------------------------------------\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n"
+   "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\n## Corpus Streaming -- One Document at a Time\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n"
   ]
  },
  {
```
```diff
@@ -152,7 +152,7 @@
   },
   "outputs": [],
   "source": [
-   "from smart_open import open  # for transparently opening remote files\n\n\nclass MyCorpus:\n    def __iter__(self):\n        for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n            # assume there's one document per line, tokens separated by whitespace\n            yield dictionary.doc2bow(line.lower().split())"
+   "from smart_open import open  # for transparently opening remote files\n\n\nclass MyCorpus:\n    def __iter__(self):\n        for line in open('https://radimrehurek.com/mycorpus.txt'):\n            # assume there's one document per line, tokens separated by whitespace\n            yield dictionary.doc2bow(line.lower().split())"
   ]
  },
  {
```
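The streamed-corpus pattern whose URL this hunk fixes can be exercised end to end roughly as follows (a sketch reusing the corrected link; building the dictionary from the same stream is an assumption made to keep the snippet self-contained):

```python
from gensim import corpora
from smart_open import open  # transparently opens local paths and remote URLs

URL = 'https://radimrehurek.com/mycorpus.txt'  # the corrected link from this commit

# one streamed pass to collect token statistics; the corpus never sits in RAM
dictionary = corpora.Dictionary(line.lower().split() for line in open(URL))

class MyCorpus:
    """Yields one bag-of-words vector per document (one document per line)."""
    def __iter__(self):
        for line in open(URL):
            yield dictionary.doc2bow(line.lower().split())

for vector in MyCorpus():  # documents are read lazily, one at a time
    print(vector)
```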
```diff
@@ -177,7 +177,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n"
+   "Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n"
   ]
  },
  {
```
```diff
@@ -224,14 +224,14 @@
   },
   "outputs": [],
   "source": [
-   "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n    dictionary.token2id[stopword]\n    for stopword in stoplist\n    if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once\ndictionary.compactify()  # remove gaps in id sequence after words that were removed\nprint(dictionary)"
+   "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n    dictionary.token2id[stopword]\n    for stopword in stoplist\n    if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once\ndictionary.compactify()  # remove gaps in id sequence after words that were removed\nprint(dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\nCorpus Formats\n---------------\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Market Matrix format <http://math.nist.gov/MatrixMarket/formats.html>`_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n"
+   "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\n## Corpus Formats\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Market Matrix format <http://math.nist.gov/MatrixMarket/formats.html>`_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n"
   ]
  },
  {
```
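The Matrix Market round trip that this cell's prose describes looks roughly like the following (a sketch; the `/tmp` path is an assumption):

```python
from gensim import corpora

corpus = [[(1, 0.5)], []]  # a toy corpus of 2 documents: one nonzero entry, one empty document
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)  # save in Matrix Market format

# loading is lazy too: MmCorpus streams documents off disk on demand
mm = corpora.MmCorpus('/tmp/corpus.mm')
for doc in mm:
    print(doc)
```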
```diff
@@ -357,7 +357,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n<https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py>`_ for an example.\n\nCompatibility with NumPy and SciPy\n----------------------------------\n\nGensim also contains `efficient utility functions <http://radimrehurek.com/gensim/matutils.html>`_\nto help converting from/to numpy matrices\n\n"
+   "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n<https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py>`_ for an example.\n\n## Compatibility with NumPy and SciPy\n\nGensim also contains `efficient utility functions <http://radimrehurek.com/gensim/matutils.html>`_\nto help converting from/to numpy matrices\n\n"
   ]
  },
  {
```
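The NumPy/SciPy utilities this cell points to live in `gensim.matutils`; a short sketch of the round trips (the matrix shapes here are arbitrary assumptions):

```python
import numpy as np
import scipy.sparse
from gensim import matutils

numpy_matrix = np.random.randint(10, size=(5, 2))   # 5 terms x 2 documents
corpus = matutils.Dense2Corpus(numpy_matrix)        # columns become documents
dense = matutils.corpus2dense(corpus, num_terms=5)  # and back to a dense matrix

scipy_sparse = scipy.sparse.random(5, 2, density=0.5, format='csc')
corpus = matutils.Sparse2Corpus(scipy_sparse)
csc = matutils.corpus2csc(corpus)                   # back to a CSC sparse matrix
```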
```diff
@@ -393,7 +393,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "What Next\n---------\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\nReferences\n----------\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n   `Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.\n\n"
+   "## What Next\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\n## References\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n   `Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.\n\n"
   ]
  },
  {
```
```diff
@@ -424,7 +424,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.6.5"
+  "version": "3.8.5"
  }
 },
 "nbformat": 4,
```

docs/src/auto_examples/core/run_corpora_and_vector_spaces.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -138,7 +138,7 @@

 class MyCorpus:
     def __iter__(self):
-        for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
+        for line in open('https://radimrehurek.com/mycorpus.txt'):
             # assume there's one document per line, tokens separated by whitespace
             yield dictionary.doc2bow(line.lower().split())

```
```diff
@@ -154,7 +154,7 @@ def __iter__(self):
 # in RAM at once. You can even create the documents on the fly!

 ###############################################################################
-# Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that
+# Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that
 # each document occupies one line in a single file is not important; you can mold
 # the `__iter__` function to fit your input format, whatever it is.
 # Walking directories, parsing XML, accessing the network...
```
```diff
@@ -180,7 +180,7 @@ def __iter__(self):
 # Similarly, to construct the dictionary without loading all texts into memory:

 # collect statistics about all tokens
-dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))
+dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))
 # remove stop words and words that appear only once
 stop_ids = [
     dictionary.token2id[stopword]
```
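As an aside (not part of this commit), the manual stop-word and hapax filtering in the hunk above has a one-call counterpart, `Dictionary.filter_extremes`; the cutoff values below are illustrative assumptions, not the tutorial's:

```python
from gensim import corpora
from smart_open import open

dictionary = corpora.Dictionary(
    line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt')
)
# drop tokens in fewer than 2 documents or in more than half of all documents,
# roughly matching the manual once_ids/stop_ids filtering; ids are compacted too
dictionary.filter_extremes(no_below=2, no_above=0.5)
print(dictionary)
```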
Lines changed: 1 addition & 1 deletion
```diff
@@ -1 +1 @@
-6b98413399bca9fd1ed8fe420da85692
+55a8a886f05e5005c5f66d57569ee79d
```
