Commit c14b763

Merge pull request #115621 from likebupt/add-2-modules-0519

add 2 modules, group data into bins, convert word to vector

2 parents 2980ddb + b1c65f7 commit c14b763

4 files changed: +265 -9 lines changed

Lines changed: 142 additions & 0 deletions

@@ -0,0 +1,142 @@
---
title: "Convert Word to Vector"
titleSuffix: Azure Machine Learning
description: Learn how to use three provided Word2Vec models to extract a vocabulary and its corresponding word embeddings from a corpus of text.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: reference

author: likebupt
ms.author: keli19
ms.date: 05/19/2020
---
# Convert Word to Vector

This article describes how to use the **Convert Word to Vector** module in Azure Machine Learning designer (preview) to apply one of several Word2Vec models (Word2Vec, FastText, or the GloVe pre-trained model) to a corpus of text that you specify as input, and to generate a vocabulary with word embeddings.

This module uses the Gensim library. For more information about Gensim, including tutorials and an explanation of the algorithms, see its [official website](https://radimrehurek.com/gensim/apiref.html).

### More about Convert Word to Vector

Converting words to vectors, or word vectorization, is a natural language processing (NLP) technique that uses a language model to map words into a vector space. Each word is represented by a vector of real numbers, and words with similar meanings have similar representations.

Word embeddings can be used as the initial input for downstream NLP tasks such as text classification and sentiment analysis.
Among the many word embedding technologies, this module implements three widely used methods: two online-training models, Word2Vec and FastText, and one pre-trained model, glove-wiki-gigaword-100. Online-training models are trained on your input data. Pre-trained models are trained offline on a much larger text corpus (for example, Wikipedia or Google News), typically containing about 100 billion words, so the word embeddings stay constant during word vectorization. Pre-trained word models offer benefits such as reduced training time, better encoded word vectors, and improved overall performance.

+ Word2Vec is one of the most popular techniques for learning word embeddings by using a shallow neural network. The theory is discussed in this paper, available as a PDF download: [Efficient Estimation of Word Representations in Vector Space, Mikolov, Tomas, et al](https://arxiv.org/pdf/1301.3781.pdf). The implementation in this module is based on the [Gensim library for Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html).

+ FastText theory is explained in this paper, available as a PDF download: [Enriching Word Vectors with Subword Information, Bojanowski, Piotr, et al](https://arxiv.org/pdf/1607.04606.pdf). The implementation in this module is based on the [Gensim library for FastText](https://radimrehurek.com/gensim/models/fasttext.html).

+ The GloVe pre-trained model, glove-wiki-gigaword-100, is a collection of pre-trained vectors based on a Wikipedia text corpus, which contains 5.6B tokens and a 400K uncased vocabulary. A PDF is available: [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).
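As a rough sketch of the difference between the strategies, the following Gensim snippet trains the two online models on a toy tokenized corpus and downloads the pre-trained GloVe vectors. This is an illustration, not the module's actual source; it assumes Gensim 3.x, where the vector length parameter is named `size` (Gensim 4 renames it to `vector_size`).

```python
from gensim.models import Word2Vec, FastText
import gensim.downloader as api

# Toy tokenized corpus; the module would use your preprocessed text column instead.
corpus = [["apple", "campus", "cupertino"], ["nasdaq", "100", "component"]]

w2v = Word2Vec(corpus, sg=1, size=100, min_count=1)   # online training
ft = FastText(corpus, sg=1, size=100, min_count=1)    # online training with subword information
glove = api.load("glove-wiki-gigaword-100")           # pre-trained vectors, downloaded once
```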
## How to configure Convert Word to Vector

This module requires a dataset that contains a column of text. Preprocessed text is better.

1. Add the **Convert Word to Vector** module to your pipeline.

2. As input for the module, provide a dataset that contains one or more text columns.

3. For **Target column**, choose one column that contains the text to process.

    Because this module creates a vocabulary from the text, the content of different columns differs, which leads to different vocabulary contents. That is why the module accepts only one target column.

4. For **Word2Vec strategy**, choose from `GloVe pretrained English Model`, `Gensim Word2Vec`, and `Gensim FastText`.

5. If **Word2Vec strategy** is `Gensim Word2Vec` or `Gensim FastText`:

    + **Word2Vec Training Algorithm**. Choose from `Skip_gram` and `CBOW`. The difference is introduced in the original [paper](https://arxiv.org/pdf/1301.3781.pdf).

        The default method is `Skip_gram`.

    + **Length of word embedding**. Specify the dimensionality of the word vectors. Corresponds to the `size` parameter in Gensim.

        The default embedding size is 100.

    + **Context window size**. Specify the maximum distance between the word being predicted and the current word. Corresponds to the `window` parameter in Gensim.

        The default window size is 5.

    + **Number of epochs**. Specify the number of epochs (iterations) over the corpus. Corresponds to the `iter` parameter in Gensim.

        The default number of epochs is 5.

6. For **Maximum vocabulary size**, specify the maximum number of words in the generated vocabulary.

    If there are more unique words than this, the infrequent ones are pruned.

    The default vocabulary size is 10,000.

7. For **Minimum word count**, provide a minimum word count. The module ignores all words that have a frequency lower than this value.

    The default value is 5.

8. Submit the pipeline.
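Conceptually, the online-training settings map onto Gensim's `Word2Vec` arguments. The following sketch shows the module defaults expressed as a Gensim 3.x call; it's an approximation rather than the module's actual implementation, and `max_final_vocab` (available in Gensim 3.4 and later) is assumed here as the closest counterpart of **Maximum vocabulary size**.

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized target column; real input is one token list per row.
tokenized_docs = [["nasdaq", "100", "component", "s", "p", "500"]] * 20

model = Word2Vec(
    tokenized_docs,
    sg=1,                    # Word2Vec Training Algorithm: Skip_gram (sg=0 is CBOW)
    size=100,                # Length of word embedding
    window=5,                # Context window size
    iter=5,                  # Number of epochs
    min_count=5,             # Minimum word count
    max_final_vocab=10000,   # approximates Maximum vocabulary size (Gensim >= 3.4)
)
```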
## Examples

The module has one output:

+ **Vocabulary with embeddings**: Contains the generated vocabulary, together with each word's embedding. Each dimension occupies one column.

### Result examples

To illustrate how the **Convert Word to Vector** module works, the following example applies this module, with the default settings, to the preprocessed Wikipedia SP 500 Dataset provided in Azure Machine Learning (preview).

#### Source dataset

The dataset contains a category column, as well as the full text fetched from Wikipedia. This table shows only a few representative examples.

|text|
|----------|
|nasdaq 100 component s p 500 component foundation founder location city apple campus 1 infinite loop street infinite loop cupertino california cupertino california location country united states...|
|br nasdaq 100 nasdaq 100 component br s p 500 s p 500 component industry computer software foundation br founder charles geschke br john warnock location adobe systems...|
|s p 500 s p 500 component industry automotive industry automotive predecessor general motors corporation 1908 2009 successor...|
|s p 500 s p 500 component industry conglomerate company conglomerate foundation founder location city fairfield connecticut fairfield connecticut location country usa area...|
|br s p 500 s p 500 component foundation 1903 founder william s harley br arthur davidson harley davidson founder arthur davidson br walter davidson br william a davidson location...|
#### Output vocabulary with embeddings

The following table contains the output of this module, taking the Wikipedia SP 500 dataset as input. The leftmost column shows the vocabulary; each word's embedding vector is represented by the values in the remaining columns of the same row.

|Vocabulary|Embedding dim 0|Embedding dim 1|Embedding dim 2|Embedding dim 3|Embedding dim 4|Embedding dim 5|...|Embedding dim 99|
|----------|---------------|---------------|---------------|---------------|---------------|---------------|---|----------------|
|nasdaq|-0.375865|0.609234|0.812797|-0.002236|0.319071|-0.591986|...|0.364276|
|component|0.081302|0.40001|0.121803|0.108181|0.043651|-0.091452|...|0.636587|
|s|-0.34355|-0.037092|-0.012167|0.151542|0.601019|0.084501|...|0.149419|
|p|-0.133407|0.073244|0.170396|0.326706|0.213463|-0.700355|...|0.530901|
|foundation|-0.166819|0.10883|-0.07933|-0.073753|0.262137|0.045725|...|0.27487|
|founder|-0.297408|0.493067|0.316709|-0.031651|0.455416|-0.284208|...|0.22798|
|location|-0.375213|0.461229|0.310698|0.213465|0.200092|0.314288|...|0.14228|
|city|-0.460828|0.505516|-0.074294|-0.00639|0.116545|0.494368|...|-0.2403|
|apple|0.05779|0.672657|0.597267|-0.898889|0.099901|0.11833|...|0.4636|
|campus|-0.281835|0.29312|0.106966|-0.031385|0.100777|-0.061452|...|0.05978|
|infinite|-0.263074|0.245753|0.07058|-0.164666|0.162857|-0.027345|...|-0.0525|
|loop|-0.391421|0.52366|0.141503|-0.105423|0.084503|-0.018424|...|-0.0521|

In this example, we used the default `Gensim Word2Vec` as the **Word2Vec strategy**, with **Word2Vec Training Algorithm** set to `Skip_gram` and **Length of word embedding** set to 100. Therefore, there are 100 embedding columns.
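To reproduce a table of this shape outside the designer, a trained Gensim model's vectors can be arranged into one row per word and one column per dimension. This is a minimal sketch, assuming Gensim 3.x (where the vocabulary lives in `model.wv.vocab`), not the module's actual implementation:

```python
import pandas as pd
from gensim.models import Word2Vec

tokenized_docs = [["s", "p", "500", "component", "foundation"]] * 20  # toy stand-in corpus
model = Word2Vec(tokenized_docs, sg=1, size=100, min_count=5)

words = list(model.wv.vocab)  # Gensim 3.x; use model.wv.key_to_index in Gensim 4
table = pd.DataFrame(
    model.wv[words],  # one row of vector values per word
    index=pd.Index(words, name="Vocabulary"),
    columns=[f"Embedding dim {i}" for i in range(model.vector_size)],
)
```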
## Technical notes

This section contains tips and answers to frequently asked questions.

+ Difference between online-training and pre-trained models:

    In this **Convert Word to Vector** module, we provide three different strategies: two online-training models and one pre-trained model. The online-training models use your input dataset as training data, and generate the vocabulary and word vectors during training. The pre-trained model is already trained on a much larger text corpus, such as Wikipedia or Twitter text, so it is effectively a fixed collection of (word, embedding) pairs.

    If the GloVe pre-trained model is chosen as the word vectorization strategy, the module summarizes a vocabulary from the input dataset and generates an embedding vector for each word from the pre-trained model, without online training. Using the pre-trained model can save training time, and it tends to perform better when the input dataset is relatively small. A sketch of this lookup behavior follows these notes.

+ Embedding size:

    In general, the length of word embedding is set to a few hundred (for example, 100, 200, or 300) to achieve good performance, because a small embedding size means a small vector space, which can cause word embedding collisions.

    For pre-trained models, the length of the word embeddings is fixed. In this implementation, the embedding size of glove-wiki-gigaword-100 is 100.
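The lookup behavior described in the first note can be approximated with Gensim's downloader API. This is a hedged sketch with a hypothetical `tokenized_docs` input; the module's actual handling of out-of-vocabulary words may differ:

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # fixed (word, embedding) collection

# Summarize a vocabulary from the input dataset, then look up each word.
tokenized_docs = [["apple", "campus", "cupertino"], ["nasdaq", "component"]]
vocab = {word for doc in tokenized_docs for word in doc}
embeddings = {word: glove[word] for word in vocab if word in glove}
```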
## Next steps

See the [set of modules available](module-reference.md) to Azure Machine Learning.

For a list of errors specific to the designer (preview) modules, see [Machine Learning Error codes](designer-error-codes.md).
Lines changed: 110 additions & 0 deletions

@@ -0,0 +1,110 @@
---
title: "Group Data into Bins"
titleSuffix: Azure Machine Learning
description: Learn how to use the Group Data into Bins module to group numbers or change the distribution of continuous data.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: reference

author: likebupt
ms.author: keli19
ms.date: 05/19/2020
---
# Group Data into Bins

This article describes how to use the [Group Data into Bins](group-data-into-bins.md) module in Azure Machine Learning designer (preview) to group numbers or change the distribution of continuous data.

The [Group Data into Bins](group-data-into-bins.md) module supports multiple options for binning data. You can customize how the bin edges are set and how values are apportioned into the bins. For example, you can:

+ Manually type a series of values to serve as the bin boundaries.
+ Assign values to bins by using *quantiles*, or percentile ranks.
+ Force an even distribution of values into the bins.
### More about binning and grouping

*Binning* or grouping data (sometimes called *quantization*) is an important tool in preparing numerical data for machine learning. It's useful in scenarios like these:

+ A column of continuous numbers has too many unique values to model effectively. So you automatically or manually assign the values to groups, to create a smaller set of discrete ranges.

+ You want to replace a column of numbers with categorical values that represent specific ranges.

    For example, you might want to group values in an age column by specifying custom ranges, such as 1-15, 16-22, 23-30, and so forth for user demographics.

+ A dataset has a few extreme values, all well outside the expected range, and these values have an outsized influence on the trained model. To mitigate the bias in the model, you might transform the data to a uniform distribution by using the quantiles method.

    With this method, the [Group Data into Bins](group-data-into-bins.md) module determines the ideal bin locations and bin widths to ensure that approximately the same number of samples falls into each bin. Then, depending on the normalization method you choose, the values in the bins are either transformed to percentiles or mapped to a bin number.
### Examples of binning

Binning with the **quantiles** method changes the distribution of numeric values: compared to the raw data, the binned data is transformed to a unit-normal scale.

Because there are so many ways to group data, all customizable, we recommend that you experiment with different methods and values. The sketch that follows shows the three binning modes in pandas terms.
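As a rough analogy (not the module's implementation), the three binning modes correspond to familiar pandas operations:

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])

# Quantiles: equal-height bins, about the same number of samples per bin
quantiles = pd.qcut(values, q=5, labels=False)

# Equal Width: five bins with the same interval between starting and end values
equal_width = pd.cut(values, bins=5, labels=False)

# Custom Edges: an edge at 0 splits the values into two bins (<= 0 and > 0)
custom = pd.cut(values, bins=[float("-inf"), 0, float("inf")], labels=False)
```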
## How to configure Group Data into Bins

1. Add the **Group Data into Bins** module to your pipeline in the designer (preview). You can find this module in the **Data Transformation** category.

2. Connect the dataset that has the numerical data to bin. Quantization can be applied only to columns that contain numeric data.

    If the dataset contains non-numeric columns, use the [Select Columns in Dataset](select-columns-in-dataset.md) module to select a subset of columns to work with.

3. Specify the binning mode. The binning mode determines the other parameters, so be sure to select the **Binning mode** option first. The following types of binning are supported:

    **Quantiles**: The quantile method assigns values to bins based on percentile ranks. Quantile binning is also known as equal-height binning.

    **Equal Width**: With this option, you must specify the total number of bins. The values from the data column are placed in the bins such that each bin has the same interval between starting and end values. As a result, some bins might have more values if data is clumped around a certain point.

    **Custom Edges**: You can specify the values that begin each bin. The edge value is always the lower boundary of the bin. For example, assume you want to group values into two bins: one with values greater than 0, and one with values less than or equal to 0. In this case, for bin edges, you would type 0 in **Comma-separated list of bin edges**. The output of the module would be 1 and 2, indicating the bin index for each row value. Note that the comma-separated value list must be in ascending order, such as 1, 3, 5, 7.
4. **Number of bins**: If you're using the **Quantiles** or **Equal Width** binning mode, use this option to specify how many bins, or *quantiles*, you want to create.

5. For **Columns to bin**, use the column selector to choose the columns that have the values you want to bin. Columns must be a numeric data type.

    The same binning rule is applied to all applicable columns that you choose. Therefore, if you need to bin some columns by using a different method, use a separate instance of [Group Data into Bins](group-data-into-bins.md) for each set of columns.

    > [!WARNING]
    > If you choose a column that is not an allowed type, a run-time error is generated. The module returns an error as soon as it finds any column of a disallowed type. If you get an error, review all selected columns. The error does not list all invalid columns.
6. For **Output mode**, indicate how you want to output the quantized values.

    + **Append**: Creates a new column with the binned values and appends that to the input table.

    + **Inplace**: Replaces the original values with the new values in the dataset.

    + **ResultOnly**: Returns just the result columns.
7. If you select the **Quantiles** binning mode, use the **Quantile normalization** option to determine how values are normalized before being sorted into quantiles. Note that normalizing values transforms the values, but doesn't affect the final number of bins.

    The following normalization types are supported (a pandas sketch of each follows these steps):

    + **Percent**: Values are normalized within the range [0,100].

    + **PQuantile**: Values are normalized within the range [0,1].

    + **QuantileIndex**: Values are normalized within the range [1,number of bins].

8. If you choose the **Custom Edges** option, type a comma-separated list of numbers to use as *bin edges* in the **Comma-separated list of bin edges** text box. The values mark the points that divide bins. Therefore, if you type one bin edge value, two bins are generated; if you type two bin edge values, three bins are generated, and so forth.

    The values must be sorted in the order that the bins are created, from lowest to highest.
9. **Tag columns as categorical**: Select this option to indicate that the quantized columns should be handled as categorical variables.

10. Submit the pipeline.
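As referenced in step 7, here is a hedged pandas approximation of the three quantile normalization types; the exact ranking and tie-breaking rules the module uses are an assumption:

```python
import pandas as pd

values = pd.Series([3.1, 0.5, 12.0, 7.4, 0.5, 9.9])

pquantile = values.rank(pct=True)   # PQuantile: percentile ranks in (0, 1]
percent = pquantile * 100           # Percent: the same ranks scaled to (0, 100]
quantile_index = pd.qcut(values, q=3, labels=False) + 1  # QuantileIndex: 1..number of bins
```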
### Results

The [Group Data into Bins](group-data-into-bins.md) module returns a dataset in which each element has been binned according to the specified mode.

It also returns a **Binning transformation**, which is a function that can be passed to the [Apply Transformation](apply-transformation.md) module to bin new samples of data by using the same binning mode and parameters.

> [!TIP]
> Remember, if you use binning on your training data, you must use the same binning method on the data that you use for testing and prediction. This includes the binning method, bin locations, and bin widths.
>
> To ensure that data is always transformed by using the same binning method, we recommend that you save useful data transformations, and then apply them to other datasets, by using the [Apply Transformation](apply-transformation.md) module.
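The fit-once, apply-everywhere pattern in the tip has a close analogue outside the designer. As an illustration (scikit-learn's `KBinsDiscretizer`, not the module itself), the bin edges are learned from the training data only and then reused on new data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

train = np.array([[1.0], [2.0], [3.0], [8.0], [13.0], [21.0]])
test = np.array([[2.5], [9.0]])

binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
binner.fit(train)                       # learn bin edges from the training data only

train_binned = binner.transform(train)  # bin indices 0..n_bins-1
test_binned = binner.transform(test)    # the same edges reused on new samples
```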
## Next steps

See the [set of modules available](module-reference.md) to Azure Machine Learning.
