-The dataset object we instantiated has a `describe()` method that generates statistics for each feature column in the dataset. Category densities will be calculated for each categorical feature value and statistics such as mean, median, and standard deviation will be output for the continuous feature columns.
+The dataset object we instantiated has a `describe()` method that generates statistics for each feature column in the dataset. Category densities will be calculated for each categorical feature value and statistics such as mean, median, and standard deviation will be output for the continuous feature columns. The return value is a report object that can be echoed out to the terminal.

```php
$stats = $dataset->describe();
+
+echo $stats;
```

Here is the output of the first two columns in the credit card dataset. We can see that the first column `credit_limit` has a mean of 167,484 and the distribution of values is skewed to the left. We also know that column two `gender` contains two categories and that there are more females than males (60 / 40) represented in this dataset. Generate and examine the dataset stats for yourself and see if you can identify any other interesting characteristics of the dataset.
@@ -265,27 +267,21 @@ Here is the output of the first two columns in the credit card dataset. We can s
]
```
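
To make the reported numbers less opaque, here is a minimal sketch in plain PHP of the kind of statistics described above: mean, median, and standard deviation for a continuous column, and category densities for a categorical one. It is not Rubix ML's implementation. The helper names `describeContinuous()` and `describeCategorical()` and the sample values are made up for illustration, and the sketch assumes a population (not sample) standard deviation.

```php
<?php

// Hypothetical helper: summary statistics for one continuous column.
function describeContinuous(array $values): array
{
    $n = count($values);
    $mean = array_sum($values) / $n;

    // The median is the middle value of the sorted column.
    sort($values);
    $mid = intdiv($n, 2);
    $median = $n % 2 === 0
        ? ($values[$mid - 1] + $values[$mid]) / 2
        : $values[$mid];

    // Population variance: the average squared deviation from the mean.
    $variance = 0.0;
    foreach ($values as $value) {
        $variance += ($value - $mean) ** 2;
    }
    $variance /= $n;

    return ['mean' => $mean, 'median' => $median, 'std_dev' => sqrt($variance)];
}

// Hypothetical helper: the share of samples that fall into each category.
function describeCategorical(array $values): array
{
    $n = count($values);
    $densities = [];
    foreach (array_count_values($values) as $category => $count) {
        $densities[$category] = $count / $n;
    }

    return $densities;
}

$limits = [20000, 50000, 50000, 120000, 360000];           // made-up values
$genders = ['female', 'female', 'male', 'female', 'male']; // made-up values

print_r(describeContinuous($limits));
print_r(describeCategorical($genders));
```

In this toy column the mean (120,000) sits well above the median (50,000) because one large value pulls it up, which is the kind of asymmetry worth looking for in the real report.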

-### Visualizing the Dataset
-The credit card dataset has 25 features and after one-hot encoding that number becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 to 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/en/latest/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.
-
-Before we continue, we'll need to prepare the dataset for embedding since, like Logistic Regression, t-SNE is only compatible with continuous features. We can perform the necessary transformations on the dataset by passing the transformers to the `apply()` method on the dataset object like we did earlier in the tutorial.
+In addition, we'll save the stats to a JSON file so we can reference it later.

```php
-use Rubix\ML\Transformers\OneHotEncoder;
-use Rubix\ML\Transformers\ZScaleStandardizer;
-
-$dataset->apply(new OneHotEncoder())
-    ->apply(new ZScaleStandardizer());
+$stats->toJSON()->write('stats.json');
```
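
For readers who want to see what "reference it later" can look like, the following is a small sketch using only PHP's built-in JSON functions, not the report object's own methods. The statistics array and the file name are made up for illustration.

```php
<?php

// Made-up statistics standing in for the real report contents.
$stats = [
    ['type' => 'continuous', 'mean' => 120000.0, 'median' => 50000.0],
    ['type' => 'categorical', 'densities' => ['female' => 0.6, 'male' => 0.4]],
];

// Hypothetical file name inside the system temp directory.
$path = sys_get_temp_dir() . '/stats-example.json';

// Serialize to human-readable JSON and write it to disk.
file_put_contents($path, json_encode($stats, JSON_PRETTY_PRINT));

// Later: load the file back into a PHP array.
$loaded = json_decode(file_get_contents($path), true);

echo $loaded[1]['densities']['female'] . PHP_EOL;
```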

-> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) or another standardizer is not always necessary; however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
+### Visualizing the Dataset
+The credit card dataset has 25 features and after one-hot encoding that number becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 to 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/en/latest/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.

-We don't need the entire dataset to generate a decent embedding so we'll take 1,000 random samples from the dataset and only embed those. The `head()` method on the dataset object will return the first *n* samples and labels from the dataset in a new dataset object. Randomizing the dataset beforehand removes any bias from the order in which the data were collected and inserted.
+We don't need the entire dataset to generate a decent embedding so we'll take 2,000 random samples from the dataset and only embed those. The `head()` method on the dataset object will return the first *n* samples and labels from the dataset in a new dataset object. Randomizing the dataset beforehand removes any bias from the order in which the data were collected and inserted.

```php
use Rubix\ML\Datasets\Labeled;

-$dataset = $dataset->randomize()->head(1000);
+$dataset = $dataset->randomize()->head(2000);
```
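
The idea behind randomizing and then taking the head can be sketched in plain PHP as follows. This is an illustration of the technique, not the library's code. The samples, labels, and subset size are made up, and the sketch assumes the samples and labels are kept in two parallel arrays.

```php
<?php

// Made-up stand-ins for the real samples and labels.
$samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]];
$labels = ['yes', 'no', 'no', 'yes', 'no'];

// Shuffle the row indices once so each sample keeps its own label.
$order = array_keys($samples);
shuffle($order);

// Keep the first $n rows of the shuffled order.
$n = 3;
$subsetSamples = [];
$subsetLabels = [];
foreach (array_slice($order, 0, $n) as $index) {
    $subsetSamples[] = $samples[$index];
    $subsetLabels[] = $labels[$index];
}

echo count($subsetSamples) . ' samples kept' . PHP_EOL;
```

Shuffling a shared index list, instead of shuffling the two arrays separately, is what keeps every sample paired with its label.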
### Instantiating the Embedder
@@ -298,6 +294,18 @@ $embedder = new TSNE(2, 20.0, 20);
```

### Embedding the Dataset
+Before we continue, we'll need to prepare the dataset for embedding since, like Logistic Regression, t-SNE is only compatible with continuous features. We can perform the necessary transformations on the dataset by passing the transformers to the `apply()` method on the dataset object like we did earlier in the tutorial.
+
+```php
+use Rubix\ML\Transformers\OneHotEncoder;
+use Rubix\ML\Transformers\ZScaleStandardizer;
+
+$dataset->apply(new OneHotEncoder())
+    ->apply(new ZScaleStandardizer());
+```
+
+> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) or another standardizer is not always necessary; however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
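
As a rough illustration of what these two transformations do to a single column, here is a plain PHP sketch. It is not the code behind `OneHotEncoder` or `ZScaleStandardizer`. The helpers `oneHotEncode()` and `zScale()` and their inputs are hypothetical, and the sketch assumes population variance when scaling.

```php
<?php

// Hypothetical helper: expand a categorical column into one binary column per category.
function oneHotEncode(array $column): array
{
    $categories = array_values(array_unique($column));

    $encoded = [];
    foreach ($column as $value) {
        $row = [];
        foreach ($categories as $category) {
            $row[] = $value === $category ? 1 : 0;
        }
        $encoded[] = $row;
    }

    return $encoded;
}

// Hypothetical helper: center a column to mean 0 and scale it to unit variance.
function zScale(array $column): array
{
    $n = count($column);
    $mean = array_sum($column) / $n;

    $variance = 0.0;
    foreach ($column as $value) {
        $variance += ($value - $mean) ** 2;
    }

    // Fall back to 1.0 so a constant column does not divide by zero.
    $stdDev = sqrt($variance / $n) ?: 1.0;

    return array_map(fn ($value) => ($value - $mean) / $stdDev, $column);
}

// Two categories become two binary columns: [[1, 0], [0, 1], [1, 0]].
print_r(oneHotEncode(['single', 'married', 'single']));

print_r(zScale([10.0, 20.0, 30.0]));
```

Because each categorical column is replaced by as many binary columns as it has categories, the feature count grows after encoding, which is how the dataset goes from 25 features to 93.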
+
Since an Embedder is a [Transformer](https://docs.rubixml.com/en/latest/transformers/api.md) at heart, you can use the newly instantiated t-SNE embedder to embed the samples in a dataset using the `apply()` method.

```php
@@ -307,7 +315,7 @@ $dataset->apply($embedder);
When the embedding is complete, we can save the dataset to a file so we can open it later in our favorite plotting software.