
Commit 039ac83

Fix links to docs
1 parent 0789526 commit 039ac83

2 files changed: +18 −18 lines changed


LICENSE

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 MIT License
 
-Copyright (c) 2020 Rubix ML
-Copyright (c) 2020 Andrew DalPino
+Copyright (c) 2021 Rubix ML
+Copyright (c) 2021 Andrew DalPino
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

README.md

Lines changed: 16 additions & 16 deletions
@@ -1,5 +1,5 @@
 # Rubix ML - Credit Card Default Predictor
-An example Rubix ML project that predicts the probability of a customer defaulting on their credit card bill next month using a [Logistic Regression](https://docs.rubixml.com/classifiers/logistic-regression.html) estimator and a 30,000 sample dataset of credit card customers. We'll also describe the dataset using statistics and visualize it using a manifold learning technique called [t-SNE](https://docs.rubixml.com/embedders/t-sne.html).
+An example Rubix ML project that predicts the probability of a customer defaulting on their credit card bill next month using a [Logistic Regression](https://docs.rubixml.com/latest/classifiers/logistic-regression.html) estimator and a 30,000 sample dataset of credit card customers. We'll also describe the dataset using statistics and visualize it using a manifold learning technique called [t-SNE](https://docs.rubixml.com/latest/embedders/t-sne.html).
 
 - **Difficulty:** Medium
 - **Training time:** Minutes
@@ -20,12 +20,12 @@ $ composer create-project rubix/credit
 ## Tutorial
 
 ### Introduction
-The dataset provided to us contains 30,000 labeled samples from customers of a Taiwanese credit card issuer. Our objective is to train an estimator that predicts the probability of a customer defaulting on their credit card bill the next month. Since this is a *binary* classification problem (*will* default or *won't* default) we can use the binary classifier [Logistic Regression](https://docs.rubixml.com/classifiers/logistic-regression.html) which implements the Probabilistic interface to make our predictions. Logistic Regression is a supervised learner that trains a linear model using an algorithm called *Gradient Descent* under the hood.
+The dataset provided to us contains 30,000 labeled samples from customers of a Taiwanese credit card issuer. Our objective is to train an estimator that predicts the probability of a customer defaulting on their credit card bill the next month. Since this is a *binary* classification problem (*will* default or *won't* default) we can use the binary classifier [Logistic Regression](https://docs.rubixml.com/latest/classifiers/logistic-regression.html) which implements the Probabilistic interface to make our predictions. Logistic Regression is a supervised learner that trains a linear model using an algorithm called *Gradient Descent* under the hood.
 
 > **Note:** The source code for this example can be found in the [train.php](https://github.com/RubixML/Credit/blob/master/train.php) file in project root.
 
 ### Extracting the Data
-In Rubix ML, data are passed in specialized containers called [Dataset objects](https://docs.rubixml.com/datasets/api.html). We'll start by extracting the data provided in the `dataset.csv` file using the built-in [CSV](https://docs.rubixml.com/extractors/csv.html) extractor and then instantiating a [Labeled](https://docs.rubixml.com/datasets/labeled.html) dataset object from it using the `fromIterator()` factory method.
+In Rubix ML, data are passed in specialized containers called [Dataset objects](https://docs.rubixml.com/latest/datasets/api.html). We'll start by extracting the data provided in the `dataset.csv` file using the built-in [CSV](https://docs.rubixml.com/latest/extractors/csv.html) extractor and then instantiating a [Labeled](https://docs.rubixml.com/latest/datasets/labeled.html) dataset object from it using the `fromIterator()` factory method.
 
 ```php
 use Rubix\ML\Datasets\Labeled;
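Before moving on, the idea in the Introduction hunk above can be made concrete. Below is a minimal plain-PHP sketch, independent of Rubix ML's internals, of how a trained logistic regression model maps a weighted sum of features to a probability; the weights and bias are made-up values for illustration only.

```php
<?php

// The logistic (sigmoid) function squashes any real number into (0, 1).
function sigmoid(float $z) : float
{
    return 1.0 / (1.0 + exp(-$z));
}

// Logistic regression inference: a weighted sum of the features plus a bias,
// passed through the sigmoid, yields the probability of the positive class.
function predictProba(array $features, array $weights, float $bias) : float
{
    $z = $bias;

    foreach ($features as $i => $value) {
        $z += $weights[$i] * $value;
    }

    return sigmoid($z);
}

echo predictProba([0.5, -1.2, 0.3], [0.8, 0.4, -0.6], 0.1) . PHP_EOL; // approx. 0.46
```

In the real estimator the weights and bias are what Gradient Descent learns during training; calling `proba()` on a trained Logistic Regression estimator performs essentially this computation for every sample.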
@@ -35,11 +35,11 @@ $dataset = Labeled::fromIterator(new CSV('dataset.csv', true));
 ```
 
 ### Dataset Preparation
-Since data types cannot be inferred from the CSV format, the entire dataset will be loaded in as strings. We'll need to convert the numeric types to their integer and floating point number counterparts before proceeding. Lucky for us, the [Numeric String Converter](https://docs.rubixml.com/transformers/numeric-string-converter.html) accomplishes this task automatically.
+Since data types cannot be inferred from the CSV format, the entire dataset will be loaded in as strings. We'll need to convert the numeric types to their integer and floating point number counterparts before proceeding. Lucky for us, the [Numeric String Converter](https://docs.rubixml.com/latest/transformers/numeric-string-converter.html) accomplishes this task automatically.
 
-The categorical features such as gender, education, and marital status - as well as the continuous features such as age and credit limit are now in the appropriate format. However, the Logistic Regression estimator is not compatible with categorical features directly so we'll need to [One Hot Encode](https://docs.rubixml.com/transformers/one-hot-encoder.html) them to convert them into continuous ones. *One hot* encoding takes a categorical feature column and transforms the values into a vector of binary features where the feature that represents the active category is high (1) and all others are low (0).
+The categorical features such as gender, education, and marital status - as well as the continuous features such as age and credit limit are now in the appropriate format. However, the Logistic Regression estimator is not compatible with categorical features directly so we'll need to [One Hot Encode](https://docs.rubixml.com/latest/transformers/one-hot-encoder.html) them to convert them into continuous ones. *One hot* encoding takes a categorical feature column and transforms the values into a vector of binary features where the feature that represents the active category is high (1) and all others are low (0).
 
-In addition, it is a good practice to center and scale the dataset as it helps speed up the convergence of the Gradient Descent learning algorithm. To do that, we'll chain another transformation to the dataset called [Z Scale Standardizer](https://docs.rubixml.com/transformers/z-scale-standardizer.html) which standardizes the data by dividing each column over its Z score.
+In addition, it is a good practice to center and scale the dataset as it helps speed up the convergence of the Gradient Descent learning algorithm. To do that, we'll chain another transformation to the dataset called [Z Scale Standardizer](https://docs.rubixml.com/latest/transformers/z-scale-standardizer.html) which standardizes the data by dividing each column over its Z score.
 
 ```php
 use Rubix\ML\Transformers\NumericStringConverter;
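Outside of the library, the two transformations described in the hunk above amount to very little code. Here is a minimal plain-PHP sketch of what one hot encoding and z scale standardization do to a single column; the column values are hypothetical, and the Rubix ML transformers handle all of this automatically.

```php
<?php

// One hot encoding: a categorical value becomes one binary feature per
// category, with a 1 in the position of the active category and 0 elsewhere.
function oneHot(string $value, array $categories) : array
{
    return array_map(fn ($category) => $category === $value ? 1 : 0, $categories);
}

// Z scale standardization: center each value on the column mean and divide
// by the standard deviation so the column has mean 0 and unit variance.
function zScale(array $column) : array
{
    $mean = array_sum($column) / count($column);

    $variance = array_sum(array_map(fn ($x) => ($x - $mean) ** 2, $column)) / count($column);

    $stdDev = sqrt($variance);

    return array_map(fn ($x) => ($x - $mean) / $stdDev, $column);
}

print_r(oneHot('university', ['graduate', 'university', 'high school'])); // [0, 1, 0]

print_r(zScale([20.0, 30.0, 40.0])); // approx. [-1.22, 0.0, 1.22]
```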
@@ -58,11 +58,11 @@ We'll need to set some of the data aside so that it can be used later for testin
 ```
 
 ### Instantiating the Learner
-You'll notice that [Logistic Regression](https://docs.rubixml.com/classifiers/logistic-regression.html) has a few parameters to consider. These parameters are called *hyper-parameters* as they have a global effect on the behavior of the algorithm during training and inference. For this example, we'll specify the first three hyper-parameters, the *batch size* and the Gradient Descent *optimizer* with its *learning rate*.
+You'll notice that [Logistic Regression](https://docs.rubixml.com/latest/classifiers/logistic-regression.html) has a few parameters to consider. These parameters are called *hyper-parameters* as they have a global effect on the behavior of the algorithm during training and inference. For this example, we'll specify the first three hyper-parameters, the *batch size* and the Gradient Descent *optimizer* with its *learning rate*.
 
 As previously mentioned, Logistic Regression trains using an algorithm called Gradient Descent. Specifically, it uses a form of GD called *Mini-batch* Gradient Descent that feeds small batches of the randomized dataset through the learner at a time. The size of the batch is determined by the *batch size* hyper-parameter. A small batch size typically trains faster but produces a rougher gradient for the learner to traverse. For our example, we'll pick 256 samples per batch but feel free to play with this setting on your own.
 
-The next hyper-parameter is the GD Optimizer which controls the update step of the algorithm. Most optimizers have a global learning rate setting that allows you to control the size of each Gradient Descent step. The [Step Decay](https://docs.rubixml.com/neural-network/optimizers/step-decay.html) optimizer gradually decreases the learning rate by a given factor every *n* steps from its global setting. This allows training to be fast at first and then slow down as it gets closer to reaching the minima of the gradient. We'll choose to decay the learning rate every 100 steps with a starting rate of 0.01. To instantiate the learner, pass the hyper-parameters to the Logistic Regression constructor.
+The next hyper-parameter is the GD Optimizer which controls the update step of the algorithm. Most optimizers have a global learning rate setting that allows you to control the size of each Gradient Descent step. The [Step Decay](https://docs.rubixml.com/latest/neural-network/optimizers/step-decay.html) optimizer gradually decreases the learning rate by a given factor every *n* steps from its global setting. This allows training to be fast at first and then slow down as it gets closer to reaching the minima of the gradient. We'll choose to decay the learning rate every 100 steps with a starting rate of 0.01. To instantiate the learner, pass the hyper-parameters to the Logistic Regression constructor.
 
 ```php
 use Rubix\ML\Classifiers\LogisticRegression;
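The effect of the decay schedule can be sketched with a toy calculation. The code below uses the classic multiplicative step decay rule; Rubix ML's Step Decay optimizer follows the same idea (decay every *n* steps from a starting rate), though its exact formula may differ, so treat the numbers here as illustrative only.

```php
<?php

// Classic step decay: every $every steps, multiply the learning rate by a
// fixed $decay factor, so training takes large steps at first and smaller
// steps as it approaches the minimum of the cost function.
function stepDecayRate(float $rate, int $every, float $decay, int $step) : float
{
    return $rate * $decay ** intdiv($step, $every);
}

// With a starting rate of 0.01, halving every 100 steps:
echo stepDecayRate(0.01, 100, 0.5, 0) . PHP_EOL;   // 0.01
echo stepDecayRate(0.01, 100, 0.5, 150) . PHP_EOL; // 0.005
echo stepDecayRate(0.01, 100, 0.5, 350) . PHP_EOL; // 0.00125
```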
@@ -72,7 +72,7 @@ $estimator = new LogisticRegression(256, new StepDecay(0.01, 100));
 ```
 
 ### Setting a Logger
-Since Logistic Regression implements the [Verbose](https://docs.rubixml.com/verbose.html) interface, we can hand it a [PSR-3](https://www.php-fig.org/psr/psr-3/) compatible logger instance and it will log helpful information to the console during training. We'll use the [Screen](https://docs.rubixml.com/other/loggers/screen.html) logger that comes built-in with Rubix ML, but feel free to choose any great PHP logger such as [Monolog](https://github.com/Seldaek/monolog) or [Analog](https://github.com/jbroadway/analog) to do the job as well.
+Since Logistic Regression implements the [Verbose](https://docs.rubixml.com/latest/verbose.html) interface, we can hand it a [PSR-3](https://www.php-fig.org/psr/psr-3/) compatible logger instance and it will log helpful information to the console during training. We'll use the [Screen](https://docs.rubixml.com/latest/other/loggers/screen.html) logger that comes built-in with Rubix ML, but feel free to choose any great PHP logger such as [Monolog](https://github.com/Seldaek/monolog) or [Analog](https://github.com/jbroadway/analog) to do the job as well.
 
 ```php
 use Rubix\ML\Other\Loggers\Screen;
@@ -88,7 +88,7 @@ $estimator->train($dataset);
 ```
 
 ### Training Loss
-The `steps()` method on Logistic Regression outputs the value of the [Cross Entropy](https://docs.rubixml.com/neural-network/cost-functions/cross-entropy.html) cost function at each epoch from the last training session. You can plot those values by dumping them to a CSV file and then importing them into your favorite plotting software such as [Plotly](https://plot.ly/) or [Tableau](https://public.tableau.com/en-us/s/).
+The `steps()` method on Logistic Regression outputs the value of the [Cross Entropy](https://docs.rubixml.com/latest/neural-network/cost-functions/cross-entropy.html) cost function at each epoch from the last training session. You can plot those values by dumping them to a CSV file and then importing them into your favorite plotting software such as [Plotly](https://plot.ly/) or [Tableau](https://public.tableau.com/en-us/s/).
 
 ```php
 $losses = $estimator->steps();
@@ -101,7 +101,7 @@ You'll notice that the loss should be decreasing at each epoch and changes in th
 ### Cross Validation
 Once the learner has been trained, the next step is to determine if the final model can generalize well to the real world. For this process, we'll need the testing data that we set aside earlier. We'll go ahead and generate two reports that compare the predictions outputted by the estimator with the ground truth labels from the testing set.
 
-The [Multiclass Breakdown](https://docs.rubixml.com/cross-validation/reports/multiclass-breakdown.html) report gives us detailed metrics (Accuracy, F1 Score, MCC) about the model's performance at the class level. In addition, [Confusion Matrix](https://docs.rubixml.com/cross-validation/reports/confusion-matrix.html) is a table that compares the number of predictions for a particular class with the actual ground truth. We can wrap both of these reports in an [Aggregate Report](https://docs.rubixml.com/cross-validation/reports/aggregate-report.html) to generate them both at the same time.
+The [Multiclass Breakdown](https://docs.rubixml.com/latest/cross-validation/reports/multiclass-breakdown.html) report gives us detailed metrics (Accuracy, F1 Score, MCC) about the model's performance at the class level. In addition, [Confusion Matrix](https://docs.rubixml.com/latest/cross-validation/reports/confusion-matrix.html) is a table that compares the number of predictions for a particular class with the actual ground truth. We can wrap both of these reports in an [Aggregate Report](https://docs.rubixml.com/latest/cross-validation/reports/aggregate-report.html) to generate them both at the same time.
 
 ```php
 use Rubix\ML\CrossValidation\Reports\AggregateReport;
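What the Confusion Matrix report tabulates can be sketched in a few lines of plain PHP; the predictions and labels below are made up for illustration, and the report class computes this for you.

```php
<?php

// A confusion matrix counts each (predicted class, actual class) pair: the
// cell $matrix[$predicted][$actual] is the number of samples predicted as
// $predicted that actually belong to $actual. Diagonal cells are correct
// predictions; off-diagonal cells are errors.
function confusionMatrix(array $predictions, array $labels) : array
{
    $classes = array_values(array_unique(array_merge($predictions, $labels)));

    $matrix = [];

    foreach ($classes as $predicted) {
        $matrix[$predicted] = array_fill_keys($classes, 0);
    }

    foreach ($predictions as $i => $prediction) {
        ++$matrix[$prediction][$labels[$i]];
    }

    return $matrix;
}

$matrix = confusionMatrix(
    ['yes', 'no', 'yes', 'no', 'no'],
    ['yes', 'no', 'no', 'no', 'yes']
);

print_r($matrix);
// ['yes' => ['yes' => 1, 'no' => 1], 'no' => ['yes' => 1, 'no' => 2]]
```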
@@ -274,7 +274,7 @@ $stats->toJSON()->write('stats.json');
 ```
 
 ### Visualizing the Dataset
-The credit card dataset has 25 features and after one hot encoding it becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 - 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.
+The credit card dataset has 25 features and after one hot encoding it becomes 93. Thus, the vector space for this dataset is *93-dimensional*. Visualizing this type of high-dimensional data with the human eye is only possible by reducing the number of dimensions to something that makes sense to plot on a chart (1 - 3 dimensions). Such dimensionality reduction is called *Manifold Learning* because it seeks to find a lower-dimensional manifold of the data. Here we will use a popular manifold learning algorithm called [t-SNE](https://docs.rubixml.com/latest/embedders/t-sne.html) to help us visualize the data by embedding it into only two dimensions.
 
 We don't need the entire dataset to generate a decent embedding so we'll take 2,500 random samples from the dataset and only embed those. The `head()` method on the dataset object will return the first *n* samples and labels from the dataset in a new dataset object. Randomizing the dataset beforehand will remove the bias as to the sequence that the data was collected and inserted.
@@ -285,7 +285,7 @@ $dataset = $dataset->randomize()->head(2500);
 ```
 
 ### Instantiating the Embedder
-[T-SNE](https://docs.rubixml.com/embedders/t-sne.html) stands for t-Distributed Stochastic Neighbor Embedding and is a powerful non-linear dimensionality reduction algorithm suited for visualizing high-dimensional datasets. The first hyper-parameter is the number of dimensions of the target embedding. Since we want to be able to plot the embedding as a 2-d scatterplot we'll set this parameter to the integer `2`. The next hyper-parameter is the learning rate which controls the rate at which the embedder updates the target embedding. The last hyper-parameter we'll set is called the *perplexity* and can be thought of as the number of nearest neighbors to consider when computing the variance of the distribution of a sample. Refer to the documentation for a full description of the hyper-parameters.
+[T-SNE](https://docs.rubixml.com/latest/embedders/t-sne.html) stands for t-Distributed Stochastic Neighbor Embedding and is a powerful non-linear dimensionality reduction algorithm suited for visualizing high-dimensional datasets. The first hyper-parameter is the number of dimensions of the target embedding. Since we want to be able to plot the embedding as a 2-d scatterplot we'll set this parameter to the integer `2`. The next hyper-parameter is the learning rate which controls the rate at which the embedder updates the target embedding. The last hyper-parameter we'll set is called the *perplexity* and can be thought of as the number of nearest neighbors to consider when computing the variance of the distribution of a sample. Refer to the documentation for a full description of the hyper-parameters.
 
 ```php
 use Rubix\ML\Embedders\TSNE;
@@ -304,9 +304,9 @@ $dataset->apply(new OneHotEncoder())
     ->apply(new ZScaleStandardizer());
 ```
 
-> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/transformers/z-scale-standardizer.html) or another standardizer is not always necessary, however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
+> **Note:** Centering and standardizing the data with [Z Scale Standardizer](https://docs.rubixml.com/latest/transformers/z-scale-standardizer.html) or another standardizer is not always necessary, however, it just so happens that both Logistic Regression and t-SNE benefit when the data are centered and standardized.
 
-Since an Embedder is a [Transformer](https://docs.rubixml.com/transformers/api.md) at heart, you can use the newly instantiated t-SNE embedder to embed the samples in a dataset using the `apply()` method.
+Since an Embedder is a [Transformer](https://docs.rubixml.com/latest/transformers/api.md) at heart, you can use the newly instantiated t-SNE embedder to embed the samples in a dataset using the `apply()` method.
 
 ```php
 $dataset->apply($embedder);
@@ -330,7 +330,7 @@ Here is an example of what a typical 2-dimensional embedding looks like when plo
 > **Note**: Due to the stochastic nature of the t-SNE algorithm, every embedding will look a little different from the last. The important information is contained in the overall *structure* of the data.
 
 ### Next Steps
-Congratulations on completing the tutorial! The Logistic Regression estimator we just trained is able to achieve the same results as in the original paper, however, there are other estimators in Rubix ML to choose from that may perform better. Consider the same problem using an ensemble method such as [AdaBoost](https://docs.rubixml.com/classifiers/adaboost.html) or [Random Forest](https://docs.rubixml.com/classifiers/random-forest.html) as a next step.
+Congratulations on completing the tutorial! The Logistic Regression estimator we just trained is able to achieve the same results as in the original paper, however, there are other estimators in Rubix ML to choose from that may perform better. Consider the same problem using an ensemble method such as [AdaBoost](https://docs.rubixml.com/latest/classifiers/adaboost.html) or [Random Forest](https://docs.rubixml.com/latest/classifiers/random-forest.html) as a next step.
 
 ## Slide Deck
 You can refer to the [slide deck](https://docs.google.com/presentation/d/1ZteG0Rf3siS_o-8x2r2AWw95ntcCggmmEHUfwQiuCnk/edit?usp=sharing) that accompanies this example project if you need extra help or a more in depth look at the math behind Logistic Regression, Gradient Descent, and the Cross Entropy cost function.
