4 | 4 | Convert a collection of raw documents to a matrix representing a bag-of-words structure.
5 | 5 |
6 | 6 | Essentially, a bag-of-words approach to representing documents in a matrix consists of
7 | | -a count of every word in the document corpus/collection for every document. This is a simple
8 | | -but often quite powerful way of representing documents as vectors. The end representation is
| 7 | +a count of every word in the document corpus/collection for every document. This is a simple
| 8 | +but often quite powerful way of representing documents as vectors. The resulting representation is
9 | 9 | a matrix with rows representing every document in the corpus and columns representing every word
10 | | -in the corpus. The value for each cell is the raw count of a particular word in a particular
| 10 | +in the corpus. The value for each cell is the raw count of a particular word in a particular
11 | 11 | document.
12 | 12 |
13 | 13 | Similarly to the `TfidfTransformer`, the vocabulary considered can be restricted
14 | 14 | to words occurring in a maximum or minimum portion of documents.
15 | 15 |
16 | 16 | The parameters `max_doc_freq` and `min_doc_freq` restrict the vocabulary
17 | | -that the transformer will consider. `max_doc_freq` indicates that terms in only
18 | | -up to the specified percentage of documents will be considered. For example, if
| 17 | +that the transformer will consider. `max_doc_freq` indicates that terms in only
| 18 | +up to the specified percentage of documents will be considered. For example, if
19 | 19 | `max_doc_freq` is set to 0.9, terms that are in more than 90% of documents
20 | | -will be removed. Similarly, the `min_doc_freq` parameter restricts terms in the
21 | | -other direction. A value of 0.01 means that only terms that are at least in 1% of
| 20 | +will be removed. Similarly, the `min_doc_freq` parameter restricts terms in the
| 21 | +other direction. A value of 0.01 means that only terms that are at least in 1% of
22 | 22 | documents will be included.
23 | 23 | """
24 | 24 | mutable struct BagOfWordsTransformer <: AbstractTextTransformer
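
The counting and filtering that the docstring describes can be sketched in plain Julia. This is only an illustration under stated assumptions, not the package's implementation: `bag_of_words` is a hypothetical helper, tokenization is a naive lowercase whitespace split, and both thresholds are treated as inclusive fractions of the number of documents.

```julia
# Hypothetical helper for illustration; not the BagOfWordsTransformer code.
# Assumes whitespace tokenization and inclusive document-frequency bounds.
function bag_of_words(docs::Vector{String};
                      max_doc_freq::Float64 = 1.0,
                      min_doc_freq::Float64 = 0.0)
    tokenized = [String.(split(lowercase(d))) for d in docs]
    n_docs = length(docs)

    # Document frequency: the number of documents containing each term.
    doc_freq = Dict{String,Int}()
    for tokens in tokenized
        for term in unique(tokens)
            doc_freq[term] = get(doc_freq, term, 0) + 1
        end
    end

    # Keep only terms whose share of documents lies within the bounds.
    vocab = sort([t for (t, df) in doc_freq
                  if min_doc_freq <= df / n_docs <= max_doc_freq])
    index = Dict(t => j for (j, t) in enumerate(vocab))

    # Rows are documents, columns are vocabulary terms, cells are raw counts.
    counts = zeros(Int, n_docs, length(vocab))
    for (i, tokens) in enumerate(tokenized), t in tokens
        j = get(index, t, 0)
        j > 0 && (counts[i, j] += 1)
    end
    return counts, vocab
end

docs = ["the cat sat", "the dog sat", "the bird flew"]
counts, vocab = bag_of_words(docs; max_doc_freq = 0.9, min_doc_freq = 0.5)
```

In this toy corpus "the" appears in every document, so the 0.9 upper bound drops it. "sat" appears in two of three documents and is kept. The remaining terms each appear in one document and fall below the 0.5 lower bound.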