Added LinearRegressor.SGD constructor (#231)

gyrdym · web-flow · commit c2e771e9f517 · 2022-05-22T22:01:23.000+03:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,8 @@
 # Changelog
 
+## 16.14.0
+- `LinearRegressor.SGD` constructor added
+
 ## 16.13.0
 - `RandomBinaryProjectionSearcher`:
     - Distance type considered
diff --git a/README.md b/README.md
@@ -44,6 +44,10 @@ it in web applications.
     - [LogisticRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor-class.html). 
     A class that performs linear binary classification of data. To use this kind of classifier your data has to be 
     [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).
+    
+    - [LogisticRegressor.SGD](https://pub.dev/documentation/ml_algo/latest/ml_algo/LogisticRegressor/LogisticRegressor.SGD.html). 
+    Implementation of the logistic regression algorithm based on stochastic gradient descent with L2 regularisation. 
+    To use this kind of classifier your data has to be [linearly separable](https://en.wikipedia.org/wiki/Linear_separability).
 
     - [SoftmaxRegressor](https://pub.dev/documentation/ml_algo/latest/ml_algo/SoftmaxRegressor-class.html). 
     A class that performs linear multiclass classification of data. To use this kind of classifier your data has to be 
@@ -100,7 +104,7 @@ in your dependencies:
 
 ````
 dependencies:
-  ml_dataframe: ^1.4.2
+  ml_dataframe: ^1.5.0
   ml_preprocessing: ^7.0.2
 ````
 
@@ -125,11 +129,8 @@ We have 2 options here:
 
 - Download the dataset from [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).
 
-- Or we may simply use [getPimaIndiansDiabetesDataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/getPimaIndiansDiabetesDataFrame.html) function
-from [ml_dataframe](https://pub.dev/packages/ml_dataframe) package. The function returns a ready to use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance
-filled with `Pima Indians Diabetes Database` data.
-
-If we chose the first option, we should do the following: 
+<details>
+<summary>Instructions</summary>
 
 #### For a desktop application: 
 
@@ -142,18 +143,7 @@ final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv');
 
 #### For a flutter application:
 
-Be sure that you have ml_dataframe package version at least 1.0.0 and ml_algo package version at least 16.0.0 
-in your pubspec.yaml:
-
-````
-dependencies:
-  ...
-  ml_algo: ^16.11.2
-  ml_dataframe: ^1.4.2
-  ...
-````
-
-Then it's needed to add the dataset to the flutter assets by adding the following config in the pubspec.yaml:
+Add the dataset to the Flutter assets by adding the following config to the pubspec.yaml:
 
 ````
 flutter:
@@ -168,10 +158,30 @@ can access the dataset:
 import 'package:flutter/services.dart' show rootBundle;
 import 'package:ml_dataframe/ml_dataframe.dart';
 
-final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
-final samples = DataFrame.fromRawCsv(rawCsvContent);
+void main() async {
+  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
+  final samples = DataFrame.fromRawCsv(rawCsvContent);
+}
+```
+</details>
+
+- Or we may simply use the [getPimaIndiansDiabetesDataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/getPimaIndiansDiabetesDataFrame.html) function
+from the [ml_dataframe](https://pub.dev/packages/ml_dataframe) package. The function returns a ready-to-use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance
+filled with `Pima Indians Diabetes Database` data.
+
+<details>
+<summary>Instructions</summary>
+
+```dart
+import 'package:ml_dataframe/ml_dataframe.dart';
+
+void main() {
+  final samples = getPimaIndiansDiabetesDataFrame();
+}
 ```
 
+</details>
+
 ### Prepare datasets for training and testing
 
 Data in this file is represented by 768 records and 8 features. The 9th column is a label column, it contains either 0 or 1 
@@ -475,7 +485,7 @@ final targetName = 'col_13';
 then let's shuffle the data:
 
 ```dart
-samples.shuffle();
+final shuffledSamples = samples.shuffle();
 ```
 
 Now it's the time to prepare data splits. Let's split the data into train and test subsets using the library's [splitData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/split_data.dart) 
@@ -501,7 +511,7 @@ e.g. stochastic gradient descent algorithm:
 
 ```dart
 final model = LinearRegressor.SGD(
-  samples
+  shuffledSamples,
   targetName,
   iterationLimit: 90,
 );
@@ -511,7 +521,7 @@ or linear regression based on coordinate descent with Lasso regularization:
 
 ```dart
 final model = LinearRegressor.lasso(
-  samples
+  shuffledSamples,
   targetName,
   iterationLimit: 90,
 );
@@ -538,14 +548,16 @@ import 'dart:io';
 import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_dataframe/ml_dataframe.dart';
 
-final file = File('housing_model.json');
-final encodedModel = await file.readAsString();
-final model = LinearRegressor.fromJson(encodedModel);
-final unlabelledData = await fromCsv('some_unlabelled_data.csv');
-final prediction = model.predict(unlabelledData);
-
-print(prediction.header);
-print(prediction.rows);
+void main() async {
+  final file = File('housing_model.json');
+  final encodedModel = await file.readAsString();
+  final model = LinearRegressor.fromJson(encodedModel);
+  final unlabelledData = await fromCsv('some_unlabelled_data.csv');
+  final prediction = model.predict(unlabelledData);
+    
+  print(prediction.header);
+  print(prediction.rows);
+}
 ```
 
 <details>
@@ -556,8 +568,7 @@ import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_dataframe/ml_dataframe.dart';
 
 void main() async {
-  final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' '))
-    ..shuffle();
+  final samples = (await fromCsv('datasets/housing.csv', headerExists: false, columnDelimiter: ' ')).shuffle();
   final targetName = 'col_13';
   final splits = splitData(samples, [0.8]);
   final trainData = splits[0];
@@ -582,8 +593,7 @@ import 'package:ml_dataframe/ml_dataframe.dart';
 
 void main() async {
   final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
-  final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ')
-    ..shuffle();
+  final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ').shuffle();
   final targetName = 'col_13';
   final splits = splitData(samples, [0.8]);
   final trainData = splits[0];
diff --git a/e2e/logistic_regressor/logistic_regressor_sgd_test.dart b/e2e/logistic_regressor/logistic_regressor_sgd_test.dart
@@ -4,21 +4,19 @@ import 'package:ml_linalg/vector.dart';
 import 'package:test/test.dart';
 
 Future<Vector> evaluateLogisticRegressor(MetricType metric, DType dtype) {
-  final samples = getPimaIndiansDiabetesDataFrame().shuffle();
+  final samples = getPimaIndiansDiabetesDataFrame().shuffle(seed: 12);
   final numberOfFolds = 5;
-  final targetNames = ['Outcome'];
   final validator = CrossValidator.kFold(
     samples,
     numberOfFolds: numberOfFolds,
   );
-  final createClassifier = (DataFrame trainSamples) => LogisticRegressor(
+  final createClassifier = (DataFrame trainSamples) => LogisticRegressor.SGD(
         trainSamples,
-        targetNames.first,
-        optimizerType: LinearOptimizerType.gradient,
-        iterationsLimit: 100,
-        learningRateType: LearningRateType.exponential,
-        batchSize: trainSamples.rows.length,
-        probabilityThreshold: 0.5,
+        'Outcome',
+        seed: 10,
+        iterationsLimit: 50,
+        initialLearningRate: 1e-4,
+        learningRateType: LearningRateType.constant,
         dtype: dtype,
       );
 
@@ -29,7 +27,7 @@ Future<Vector> evaluateLogisticRegressor(MetricType metric, DType dtype) {
 }
 
 Future main() async {
-  group('LogisticRegressor', () {
+  group('LogisticRegressor.SGD', () {
     test(
         'should return adequate score on pima indians diabetes dataset using '
         'accuracy metric, dtype=DType.float32', () async {
diff --git a/lib/src/classifier/classifier.dart b/lib/src/classifier/classifier.dart
@@ -31,7 +31,7 @@ abstract class Classifier extends Predictor {
   ///    908    |    404    |    503    |     -100       |       100      |     -100
   ///
   /// If a prediction algorithm meets 100 in a target column, it will
-  /// interpret the value as a positive outcome for the appropriate class
+  /// interpret the value as a positive outcome for the corresponding class
   num get positiveLabel;
 
   /// A value using to encode negative class.
@@ -57,6 +57,6 @@ abstract class Classifier extends Predictor {
   ///    908    |    404    |    503    |     -100       |       100      |     -100
   ///
   /// If a prediction algorithm meets -100 in a target column, it will
-  /// interpret the value as a negative outcome for the appropriate class
+  /// interpret the value as a negative outcome for the corresponding class
   num get negativeLabel;
 }
diff --git a/lib/src/classifier/logistic_regressor/logistic_regressor.dart b/lib/src/classifier/logistic_regressor/logistic_regressor.dart
diff --git a/pubspec.yaml b/pubspec.yaml