Commit 3edc33c

Update to Rubix ML 0.1.0
1 parent a5bb4e4 commit 3edc33c

5 files changed: +4116 -2108 lines changed

README.md

Lines changed: 23 additions & 15 deletions
@@ -77,7 +77,7 @@ Since Logistic Regression implements the [Verbose](https://docs.rubixml.com/en/l
 ```php
 use Rubix\ML\Other\Loggers\Screen;

-$estimator->setLogger(new Screen('credit'));
+$estimator->setLogger(new Screen());
 ```

 ### Training
@@ -231,10 +231,12 @@ $dataset = Labeled::fromIterator(new CSV('dataset.csv', true))
 ```

 ### Describing the Dataset
-The dataset object we instantiated has a `describe()` method that generates statistics for each feature column in the dataset. Category densities will be calculated for each categorical feature value and statistics such as mean, median, and standard deviation will be output for the continuous feature columns.
+The dataset object we instantiated has a `describe()` method that generates statistics for each feature column in the dataset. Category densities will be calculated for each categorical feature value and statistics such as mean, median, and standard deviation will be output for the continuous feature columns. The return value is a report object that can be echoed out to the terminal.

 ```php
 $stats = $dataset->describe();
+
+echo $stats;
 ```

 Here is the output of the first two columns in the credit card dataset. We can see that the first column `credit_limit` has a mean of 167,484 and the distribution of values is skewed to the left. We also know that column two `gender` contains two categories and that there are more females than males (60 / 40) represented in this dataset. Generate and examine the dataset stats for yourself and see if you can identify any other interesting characteristics of the dataset.
@@ -265,27 +267,21 @@ Here is the output of the first two columns in the credit card dataset. We can s
 ]
 ```

-### Visualizing the Dataset
-The credit card dataset has 25 features and after one hot encoding it becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 - 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/en/latest/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.
-
-Before we continue, we'll need to prepare the dataset for embedding since, like Logistic Regression, T-SNE is only compatible with continuous features. We can perform the necessary transformations on the dataset by passing the transformers to the `apply()` method on the dataset object like we did earlier in the tutorial.
+In addition, we'll save the stats to a JSON file so we can reference it later.

 ```php
-use Rubix\ML\Transformers\OneHotEncoder;
-use Rubix\ML\Transformers\ZScaleStandardizer;
-
-$dataset->apply(new OneHotEncoder())
-->apply(new ZScaleStandardizer());
+$stats->toJSON()->write('stats.json');
 ```

-> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) or another standardizer is not always necessary, however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
+### Visualizing the Dataset
+The credit card dataset has 25 features and after one hot encoding it becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 - 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/en/latest/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.

-We don't need the entire dataset to generate a decent embedding so we'll take 1,000 random samples from the dataset and only embed those. The `head()` method on the dataset object will return the first *n* samples and labels from the dataset in a new dataset object. Randomizing the dataset beforehand will remove the bias as to the sequence that the data was collected and inserted.
+We don't need the entire dataset to generate a decent embedding so we'll take 2,000 random samples from the dataset and only embed those. The `head()` method on the dataset object will return the first *n* samples and labels from the dataset in a new dataset object. Randomizing the dataset beforehand will remove the bias as to the sequence that the data was collected and inserted.

 ```php
 use Rubix\ML\Datasets\Labeled;

-$dataset = $dataset->randomize()->head(1000);
+$dataset = $dataset->randomize()->head(2000);
 ```

 ### Instantiating the Embedder
@@ -298,6 +294,18 @@ $embedder = new TSNE(2, 20.0, 20);
 ```

 ### Embedding the Dataset
+Before we continue, we'll need to prepare the dataset for embedding since, like Logistic Regression, T-SNE is only compatible with continuous features. We can perform the necessary transformations on the dataset by passing the transformers to the `apply()` method on the dataset object like we did earlier in the tutorial.
+
+```php
+use Rubix\ML\Transformers\OneHotEncoder;
+use Rubix\ML\Transformers\ZScaleStandardizer;
+
+$dataset->apply(new OneHotEncoder())
+->apply(new ZScaleStandardizer());
+```
+
+> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/en/latest/transformers/z-scale-standardizer.html) or another standardizer is not always necessary, however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
+
 Since an Embedder is a [Transformer](https://docs.rubixml.com/en/latest/transformers/api.md) at heart, you can use the newly instantiated t-SNE embedder to embed the samples in a dataset using the `apply()` method.

 ```php
@@ -307,7 +315,7 @@ $dataset->apply($embedder);
 When the embedding is complete, we can save the dataset to a file so we can open it later in our favorite plotting software.

 ```php
-file_put_contents('embedding.csv', $dataset->toCsv());
+$dataset->toCSV()->write('embedding.csv');
 ```

 Now we're ready to execute the explore script and plot the embedding using our favorite plotting software.
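
The note in the README diff above says that Logistic Regression and t-SNE both benefit from centered and standardized data. As a rough illustration of what that means, here is a minimal plain-PHP sketch that shifts each column to zero mean and scales it to unit variance. The `zScaleColumns()` helper and the sample values are hypothetical and are not part of Rubix ML or the credit card dataset. Using the population standard deviation is also an assumption made for this sketch, and the library's transformer may differ in its details.

```php
<?php

declare(strict_types=1);

/**
 * Center each column to zero mean and scale it to unit variance.
 * Hypothetical helper for illustration only; not part of Rubix ML.
 *
 * @param array $samples rows of continuous (float) feature values
 * @return array the standardized rows
 */
function zScaleColumns(array $samples): array
{
    $n = count($samples);

    if ($n === 0) {
        return [];
    }

    $numColumns = count($samples[0]);
    $means = [];
    $stdDevs = [];

    for ($j = 0; $j < $numColumns; ++$j) {
        $column = array_column($samples, $j);
        $mean = array_sum($column) / $n;

        // Sum of squared deviations from the column mean.
        $ssd = 0.0;

        foreach ($column as $value) {
            $ssd += ($value - $mean) ** 2;
        }

        // Population standard deviation (an assumption for this sketch).
        $stdDev = sqrt($ssd / $n);

        $means[$j] = $mean;
        // Leave constant columns unscaled to avoid dividing by zero.
        $stdDevs[$j] = $stdDev > 0.0 ? $stdDev : 1.0;
    }

    $scaled = [];

    foreach ($samples as $i => $row) {
        foreach ($row as $j => $value) {
            $scaled[$i][$j] = ($value - $means[$j]) / $stdDevs[$j];
        }
    }

    return $scaled;
}

// Made-up values: a large-range column next to a small-range column.
$scaled = zScaleColumns([
    [20000.0, 24.0],
    [120000.0, 26.0],
    [90000.0, 34.0],
]);
```

After scaling, both columns sit on a comparable range, so a distance-based method such as t-SNE is not dominated by whichever feature happens to have the largest raw magnitude.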

composer.json

Lines changed: 1 addition & 2 deletions
@@ -20,8 +20,7 @@
 ],
 "require": {
 "php": ">=7.2",
-"league/csv": "^9.5",
-"rubix/ml": "0.1.0-rc3"
+"rubix/ml": "0.1.0"
 },
 "suggest": {
 "ext-tensor": "For faster training and inference"
