|
15 | 15 | "cell_type": "markdown", |
16 | 16 | "metadata": {}, |
17 | 17 | "source": [ |
18 | | - "\nCorpora and Vector Spaces\n=========================\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n\n" |
| 18 | + "\n# Corpora and Vector Spaces\n\nDemonstrates transforming text into a vector space representation.\n\nAlso introduces corpus streaming and persistence to disk in various formats.\n" |
19 | 19 | ] |
20 | 20 | }, |
21 | 21 | { |
|
33 | 33 | "cell_type": "markdown", |
34 | 34 | "metadata": {}, |
35 | 35 | "source": [ |
36 | | - "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\nFrom Strings to Vectors\n------------------------\n\nThis time, let's start from documents represented as strings:\n\n\n" |
| 36 | + "First, let\u2019s create a small corpus of nine short documents [1]_:\n\n\n## From Strings to Vectors\n\nThis time, let's start from documents represented as strings:\n\n\n" |
37 | 37 | ] |
38 | 38 | }, |
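The code cell that follows (elided from this hunk) defines the corpus itself. For context, a minimal sketch of what it contains, assuming the nine Deerwester et al. (1990) document titles cited in footnote [1]:

```python
# Sketch of the toy corpus: nine short documents, assumed to be the
# Deerwester et al. (1990) titles cited in footnote [1].
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
```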
39 | 39 | { |
|
141 | 141 | "cell_type": "markdown", |
142 | 142 | "metadata": {}, |
143 | 143 | "source": [ |
144 | | - "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\nCorpus Streaming -- One Document at a Time\n-------------------------------------------\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n" |
| 144 | + "By now it should be clear that the vector feature with ``id=10`` stands for the question \"How many\ntimes does the word `graph` appear in the document?\" and that the answer is \"zero\" for\nthe first six documents and \"one\" for the remaining three.\n\n\n## Corpus Streaming -- One Document at a Time\n\nNote that `corpus` above resides fully in memory, as a plain Python list.\nIn this simple example, it doesn't matter much, but just to make things clear,\nlet's assume there are millions of documents in the corpus. Storing all of them in RAM won't do.\nInstead, let's assume the documents are stored in a file on disk, one document per line. Gensim\nonly requires that a corpus must be able to return one document vector at a time:\n\n\n" |
145 | 145 | ] |
146 | 146 | }, |
147 | 147 | { |
|
152 | 152 | }, |
153 | 153 | "outputs": [], |
154 | 154 | "source": [ |
155 | | - "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus:\n def __iter__(self):\n for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())" |
| 155 | + "from smart_open import open # for transparently opening remote files\n\n\nclass MyCorpus:\n def __iter__(self):\n for line in open('https://radimrehurek.com/mycorpus.txt'):\n # assume there's one document per line, tokens separated by whitespace\n yield dictionary.doc2bow(line.lower().split())" |
156 | 156 | ] |
157 | 157 | }, |
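For clarity, a short usage sketch of the streaming corpus defined above; `dictionary` is assumed to be the `corpora.Dictionary` built earlier in the notebook:

```python
# No documents are loaded yet: MyCorpus() just creates the object.
# Vectors are produced lazily, one per line, on each iteration.
corpus_memory_friendly = MyCorpus()

for vector in corpus_memory_friendly:  # only one vector resides in RAM at a time
    print(vector)
```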
158 | 158 | { |
|
177 | 177 | "cell_type": "markdown", |
178 | 178 | "metadata": {}, |
179 | 179 | "source": [ |
180 | | - "Download the sample `mycorpus.txt file here <./mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n" |
| 180 | + "Download the sample `mycorpus.txt file here <https://radimrehurek.com/mycorpus.txt>`_. The assumption that\neach document occupies one line in a single file is not important; you can mold\nthe `__iter__` function to fit your input format, whatever it is.\nWalking directories, parsing XML, accessing the network...\nJust parse your input to retrieve a clean list of tokens in each document,\nthen convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`.\n\n" |
181 | 181 | ] |
182 | 182 | }, |
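As one illustration of molding `__iter__` to a different layout, a hedged sketch of a corpus that walks a directory tree instead of reading a single file; the one-document-per-`.txt`-file layout and the reuse of `dictionary` are assumptions, not part of the original notebook:

```python
import os


class MyDirCorpus:
    """Sketch: one document per *.txt file under a directory tree."""

    def __init__(self, top_dir):
        self.top_dir = top_dir

    def __iter__(self):
        for root, _dirs, files in os.walk(self.top_dir):
            for fname in sorted(files):
                if fname.endswith('.txt'):
                    with open(os.path.join(root, fname)) as fin:
                        # one document per file, tokens separated by whitespace
                        yield dictionary.doc2bow(fin.read().lower().split())
```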
183 | 183 | { |
|
224 | 224 | }, |
225 | 225 | "outputs": [], |
226 | 226 | "source": [ |
227 | | - "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n dictionary.token2id[stopword]\n for stopword in stoplist\n if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once\ndictionary.compactify() # remove gaps in id sequence after words that were removed\nprint(dictionary)" |
| 227 | + "# collect statistics about all tokens\ndictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/mycorpus.txt'))\n# remove stop words and words that appear only once\nstop_ids = [\n dictionary.token2id[stopword]\n for stopword in stoplist\n if stopword in dictionary.token2id\n]\nonce_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]\ndictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once\ndictionary.compactify() # remove gaps in id sequence after words that were removed\nprint(dictionary)" |
228 | 228 | ] |
229 | 229 | }, |
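A side note on the cell above: for frequency-based pruning specifically, `Dictionary.filter_extremes` achieves a similar effect in one call (it cannot apply an explicit stoplist, though). A sketch with illustrative thresholds, not taken from the original notebook:

```python
# Drop tokens appearing in fewer than 2 documents (hapaxes) or in more
# than half of all documents (a rough statistical stand-in for stop words).
dictionary.filter_extremes(no_below=2, no_above=0.5)
print(dictionary)
```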
230 | 230 | { |
231 | 231 | "cell_type": "markdown", |
232 | 232 | "metadata": {}, |
233 | 233 | "source": [ |
234 | | - "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\nCorpus Formats\n---------------\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Market Matrix format <http://math.nist.gov/MatrixMarket/formats.html>`_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n" |
| 234 | + "And that is all there is to it! At least as far as bag-of-words representation is concerned.\nOf course, what we do with such a corpus is another question; it is not at all clear\nhow counting the frequency of distinct words could be useful. As it turns out, it isn't, and\nwe will need to apply a transformation on this simple representation first, before\nwe can use it to compute any meaningful document vs. document similarities.\nTransformations are covered in the next tutorial\n(`sphx_glr_auto_examples_core_run_topics_and_transformations.py`),\nbut before that, let's briefly turn our attention to *corpus persistency*.\n\n\n## Corpus Formats\n\nThere exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk.\n`Gensim` implements them via the *streaming corpus interface* mentioned earlier:\ndocuments are read from (resp. stored to) disk in a lazy fashion, one document at\na time, without the whole corpus being read into main memory at once.\n\nOne of the more notable file formats is the `Market Matrix format <http://math.nist.gov/MatrixMarket/formats.html>`_.\nTo save a corpus in the Matrix Market format:\n\ncreate a toy corpus of 2 documents, as a plain Python list\n\n" |
235 | 235 | ] |
236 | 236 | }, |
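The serialization cell itself is elided from this hunk; for context, a minimal sketch of what saving in Matrix Market format looks like, assuming `corpora` is already imported:

```python
# Toy corpus of 2 documents: the first holds one feature (id=1, weight 0.5),
# the second is empty.
corpus = [[(1, 0.5)], []]

# Stream the corpus to disk in Matrix Market format, one document at a time.
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
```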
237 | 237 | { |
|
357 | 357 | "cell_type": "markdown", |
358 | 358 | "metadata": {}, |
359 | 359 | "source": [ |
360 | | - "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n<https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py>`_ for an example.\n\nCompatibility with NumPy and SciPy\n----------------------------------\n\nGensim also contains `efficient utility functions <http://radimrehurek.com/gensim/matutils.html>`_\nto help converting from/to numpy matrices\n\n" |
| 360 | + "In this way, `gensim` can also be used as a memory-efficient **I/O format conversion tool**:\njust load a document stream using one format and immediately save it in another format.\nAdding new formats is dead easy, check out the `code for the SVMlight corpus\n<https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py>`_ for an example.\n\n## Compatibility with NumPy and SciPy\n\nGensim also contains `efficient utility functions <http://radimrehurek.com/gensim/matutils.html>`_\nto help converting from/to numpy matrices\n\n" |
361 | 361 | ] |
362 | 362 | }, |
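A hedged sketch of the round-trips the paragraph above refers to, via `gensim.matutils`; the matrix shapes and random contents here are arbitrary:

```python
import numpy as np
import scipy.sparse

import gensim

# Dense case: a random 5-term x 2-document matrix (documents as columns).
numpy_matrix = np.random.randint(10, size=[5, 2])
corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
numpy_matrix_back = gensim.matutils.corpus2dense(corpus, num_terms=5)

# Sparse case: same idea for SciPy CSC matrices.
scipy_sparse_matrix = scipy.sparse.random(5, 2, density=0.5, format='csc')
corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
scipy_csc_matrix_back = gensim.matutils.corpus2csc(corpus)
```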
363 | 363 | { |
|
393 | 393 | "cell_type": "markdown", |
394 | 394 | "metadata": {}, |
395 | 395 | "source": [ |
396 | | - "What Next\n---------\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\nReferences\n----------\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n `Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.\n\n" |
| 396 | + "## What Next\n\nRead about `sphx_glr_auto_examples_core_run_topics_and_transformations.py`.\n\n## References\n\nFor a complete reference (Want to prune the dictionary to a smaller size?\nOptimize converting between corpora and NumPy/SciPy arrays?), see the `apiref`.\n\n.. [1] This is the same corpus as used in\n `Deerwester et al. (1990): Indexing by Latent Semantic Analysis <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>`_, Table 2.\n\n" |
397 | 397 | ] |
398 | 398 | }, |
399 | 399 | { |
|
424 | 424 | "name": "python", |
425 | 425 | "nbconvert_exporter": "python", |
426 | 426 | "pygments_lexer": "ipython3", |
427 | | - "version": "3.6.5" |
| 427 | + "version": "3.8.5" |
428 | 428 | } |
429 | 429 | }, |
430 | 430 | "nbformat": 4, |
|