Commit f65088d

Toy datasets used (#226)

1 parent 3ab125e commit f65088d

10 files changed: +37 additions, -952 deletions


CHANGELOG.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -1,5 +1,8 @@
 # Changelog
 
+## 16.11.3
+- Toy datasets from `ml_dataframe` package used
+
 ## 16.11.2
 - `KDTree`:
   - `fromIterable` constructor, default value for splitting strategy changed
```
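
The changelog entry refers to the toy-dataset loaders that ship with `ml_dataframe` (such as `loadPimaIndiansDiabetesDataset` and `loadIrisDataset`, both named in this commit). A minimal sketch of what "used" means in practice - obtaining a bundled dataset instead of reading a CSV from disk; the inspection calls (`header`, `rows`) are from `DataFrame`'s public API, but treat the exact shapes as an assumption:

```dart
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  // The loader returns a ready-to-use DataFrame; no CSV download is needed.
  final samples = await loadPimaIndiansDiabetesDataset();

  // Inspect what we got: column names (the target column is 'Outcome')
  // and the number of records.
  print(samples.header);
  print(samples.rows.length);
}
```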

README.md

Lines changed: 24 additions & 11 deletions

`````diff
@@ -96,7 +96,7 @@ in your dependencies:
 
 ````
 dependencies:
-  ml_dataframe: ^1.0.0
+  ml_dataframe: ^1.4.2
   ml_preprocessing: ^7.0.2
 ````
 
@@ -117,7 +117,15 @@ import 'package:ml_preprocessing/ml_preprocessing.dart';
 
 ### Read a dataset's file
 
-Download the dataset from [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).
+We have two options here:
+
+- Download the dataset from [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).
+
+- Or simply use the [loadPimaIndiansDiabetesDataset](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/loadPimaIndiansDiabetesDataset.html) function
+from the [ml_dataframe](https://pub.dev/packages/ml_dataframe) package. The function returns a ready-to-use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance
+filled with the `Pima Indians Diabetes Database` data.
+
+If we chose the first option, we should do the following:
 
 #### For a desktop application:
 
@@ -136,8 +144,8 @@ in your pubspec.yaml:
 ````
 dependencies:
   ...
-  ml_algo: ^16.0.0
-  ml_dataframe: ^1.0.0
+  ml_algo: ^16.11.2
+  ml_dataframe: ^1.4.2
   ...
 ````
 
@@ -164,10 +172,10 @@ final samples = DataFrame.fromRawCsv(rawCsvContent);
 
 Data in this file is represented by 768 records and 8 features. The 9th column is a label column, it contains either 0 or 1
 on each row. This column is our target - we should predict a class label for each observation. The column's name is
-`class variable (0 or 1)`. Let's store it:
+`Outcome`. Let's store it:
 
 ````dart
-final targetColumnName = 'class variable (0 or 1)';
+final targetColumnName = 'Outcome';
 ````
 
 Now it's the time to prepare data splits. Since we have a smallish dataset (only 768 records), we can't afford to
@@ -333,8 +341,10 @@ import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';
 
 void main() async {
+  // Another option - to use a toy dataset:
+  // final samples = await loadPimaIndiansDiabetesDataset();
   final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
-  final targetColumnName = 'class variable (0 or 1)';
+  final targetColumnName = 'Outcome';
   final splits = splitData(samples, [0.7]);
   final validationData = splits[0];
   final testData = splits[1];
@@ -376,8 +386,10 @@ import 'package:ml_preprocessing/ml_preprocessing.dart';
 
 void main() async {
   final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
+  // Another option - to use a toy dataset:
+  // final samples = await loadPimaIndiansDiabetesDataset();
   final samples = DataFrame.fromRawCsv(rawCsvContent);
-  final targetColumnName = 'class variable (0 or 1)';
+  final targetColumnName = 'Outcome';
   final splits = splitData(samples, [0.7]);
   final validationData = splits[0];
   final testData = splits[1];
@@ -565,7 +577,7 @@ import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_dataframe/ml_dataframe.dart';
 
 void main() async {
-  final rawCsvContent = await rootBundle.loadString('assets/datasets/pima_indians_diabetes_database.csv');
+  final rawCsvContent = await rootBundle.loadString('assets/datasets/housing.csv');
   final samples = DataFrame.fromRawCsv(rawCsvContent, fieldDelimiter: ' ')
     ..shuffle();
   final targetName = 'col_13';
@@ -587,7 +599,8 @@ void main() async {
 Let's try to classify data from a well-known [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset using a non-linear algorithm - [decision trees](https://en.wikipedia.org/wiki/Decision_tree)
 
 First, you need to download the data and place it in a proper place in your file system. To do so you should follow the
-instructions which are given in the [Logistic regression](#logistic-regression) section.
+instructions which are given in the [Logistic regression](#logistic-regression) section. Or you may use the [loadIrisDataset](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/loadIrisDataset.html)
+function, which returns a ready-to-use [DataFrame](https://pub.dev/documentation/ml_dataframe/latest/ml_dataframe/DataFrame-class.html) instance filled with the `Iris` dataset.
 
 After loading the data, it's needed to preprocess it. We should drop the `Id` column since the column doesn't make sense.
 Also, we need to encode the 'Species' column - originally, it contains 3 repeated string labels, to feed it to the classifier
@@ -599,7 +612,7 @@ import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';
 
 void main() async {
-  final samples = (await fromCsv('path/to/iris/dataset.csv'))
+  final samples = (await loadIrisDataset())
     .shuffle()
     .dropSeries(seriesNames: ['Id']);
 
`````
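
Put together, the README changes above amount to a toy-dataset variant of the logistic-regression walkthrough. This is a sketch only: `loadPimaIndiansDiabetesDataset`, `splitData`, and the `'Outcome'` target come from the commit itself, while the constructor defaults and the `assess`/`MetricType.accuracy` calls are assumptions based on the packages' public APIs:

```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  // Toy-dataset option: no CSV file on disk and no asset bundling required.
  final samples = await loadPimaIndiansDiabetesDataset();
  final targetColumnName = 'Outcome';

  // 70% of the records for training, the rest held out for testing.
  final splits = splitData(samples, [0.7]);
  final trainData = splits[0];
  final testData = splits[1];

  // Train with default hyperparameters and check accuracy on the hold-out.
  final model = LogisticRegressor(trainData, targetColumnName);
  final accuracy = model.assess(testData, MetricType.accuracy);

  print('Accuracy on the held-out split: $accuracy');
}
```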

e2e/_datasets/iris.csv

Lines changed: 0 additions & 151 deletions
This file was deleted.
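
Deleting the checked-in `e2e/_datasets/iris.csv` fixture is consistent with the README change: the same data is now available through `loadIrisDataset`. A sketch of the decision-tree example against the bundled dataset follows; the `Encoder.label` preprocessing step and the hyperparameter values are illustrative assumptions, not taken from this commit:

```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  // Load the bundled Iris data and drop the meaningless 'Id' column.
  final samples = (await loadIrisDataset())
      .shuffle()
      .dropSeries(seriesNames: ['Id']);

  // 'Species' holds three repeated string labels; encode them as numbers
  // before feeding the data to the classifier.
  final encoder = Encoder.label(samples, columnNames: ['Species']);
  final processed = encoder.process(samples);

  // Illustrative hyperparameters; tune them for real use.
  final model = DecisionTreeClassifier(
    processed,
    'Species',
    minError: 0.3,
    minSamplesCount: 5,
    maxDepth: 4,
  );

  // Sanity check on the training data itself (not a proper evaluation).
  print(model.assess(processed, MetricType.accuracy));
}
```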
