Commit 7a467e6

Reproducible ML sample (#202)
1 parent 8433070 commit 7a467e6

8 files changed: 260 additions and 1 deletion

README.md

Lines changed: 2 additions & 1 deletion
@@ -18,7 +18,7 @@ Some of the samples require LocalStack Pro features. Please make sure to properl
 
 ## Outline
 
-| Sample Name | Description |
+| Sample Name | Description |
 | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
 | [Serverless Websockets](serverless-websockets) | API Gateway V2 websocket APIs deployed via the Serverless framework |
 | [RDS Database Queries](rds-db-queries) | Running queries locally against an RDS database |
@@ -54,6 +54,7 @@ Some of the samples require LocalStack Pro features. Please make sure to properl
 | [Glue for ETL jobs](glue-etl-jobs) | Using Glue API to run local ETL jobs |
 | [Message Queue broker](mq-broker) | Using MQ API to run local message queue brokers |
 | [ELB Load Balancing](elb-load-balancing) | Using ELBv2 Application Load Balancers locally, deployed via the Serverless framework |
+| [Reproducible ML](reproducible-ml) | Train, save and evaluate a scikit-learn machine learning model using AWS Lambda and S3 |
 
 
 ## Checking out a single sample

reproducible-ml/Makefile

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+export AWS_ACCESS_KEY_ID ?= test
+export AWS_SECRET_ACCESS_KEY ?= test
+export AWS_DEFAULT_REGION=us-east-1
+
+usage:
+	@fgrep -h "##" $(MAKEFILE_LIST) | fgrep -v fgrep | sed -e 's/\\$$//' | sed -e 's/##//'
+
+install:
+	@which localstack || pip install localstack
+	@which awslocal || pip install awscli-local
+
+run:
+	./run.sh
+
+start:
+	localstack start -d
+
+stop:
+	@echo
+	localstack stop
+
+ready:
+	@echo Waiting on the LocalStack container...
+	@localstack wait -t 30 && echo Localstack is ready to use! || (echo Gave up waiting on LocalStack, exiting. && exit 1)
+
+logs:
+	@localstack logs > logs.txt
+
+.PHONY: usage install start run stop ready logs test-ci
+

reproducible-ml/README.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# LocalStack Demo: Train, save and evaluate a scikit-learn machine learning model
+
+In this tutorial, we will train a simple machine-learning model that recognizes handwritten digits in images.
+We will use the following services:
+
+* an S3 bucket to host our training data;
+* a Lambda function to train the model and save it to the S3 bucket;
+* a Lambda layer that contains the dependencies for our training code;
+* a second Lambda function to download the saved model and perform a prediction with it.
+
+## Prerequisites
+
+* LocalStack
+* Docker
+* `awslocal` CLI
+
+## Installing
+
+To install the dependencies:
+```
+make install
+```
+
+## Starting LocalStack
+
+Make sure that LocalStack is started:
+```
+LOCALSTACK_API_KEY=... DEBUG=1 localstack start
+```
+
+## Running
+
+The entire workflow is executed by the `run.sh` script. To trigger it, execute:
+```
+make run
+```
+The model is first trained by the `ml-train` Lambda function and then uploaded to the S3 bucket.
+A second Lambda function, `ml-predict`, downloads the model and runs predictions on a test set of digit images.
+The logs of the Lambda invocations should be visible in the LocalStack container output (with `DEBUG=1` enabled):
+
+```bash
+null
+>START RequestId: 65dc894d-25e0-168e-dea1-a3e8bfdb563b Version: $LATEST
+> --> prediction result: [8 8 4 9 0 8 9 8 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 9 6 7 8 9
+...
+...
+> 9 5 4 8 8 4 9 0 8 9 8]
+> END RequestId: 6...
+```
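
The prediction result shown in the README's log excerpt is a NumPy array printed across several log lines. As a rough sketch that is not part of this commit, the digits could be recovered from captured log text along these lines. The helper name `parse_predictions` is hypothetical, and the code assumes the array is the first bracketed span after `prediction result:`.

```python
import re
from typing import List


def parse_predictions(log_text: str) -> List[int]:
    """Return the digits printed after 'prediction result:' in Lambda log text."""
    # Assumption: the array is the first [...] span after the marker and may
    # wrap over several lines, each prefixed with '>' by the log output.
    match = re.search(r"prediction result:\s*\[(.*?)\]", log_text, flags=re.DOTALL)
    if match is None:
        return []
    # Each prediction is a single digit 0-9; '>' prefixes and elided '...'
    # lines contain no digits and are therefore skipped.
    return [int(token) for token in re.findall(r"\b\d\b", match.group(1))]


sample_log = (
    "> --> prediction result: [8 8 4 9 0\n"
    "...\n"
    "> 9 5 4 8]\n"
    "> END RequestId: 6...\n"
)
print(parse_predictions(sample_log))
```

Note that when NumPy elides the middle of a long array with `...`, the elided values cannot be recovered this way; only the printed digits are returned.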

reproducible-ml/digits.csv.gz

56.2 KB
Binary file not shown.

reproducible-ml/digits.rst

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
+.. _digits_dataset:
+
+Optical recognition of handwritten digits dataset
+--------------------------------------------------
+
+**Data Set Characteristics:**
+
+    :Number of Instances: 5620
+    :Number of Attributes: 64
+    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
+    :Missing Attribute Values: None
+    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
+    :Date: July; 1998
+
+This is a copy of the test set of the UCI ML hand-written digits datasets
+https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
+
+The data set contains images of hand-written digits: 10 classes where
+each class refers to a digit.
+
+Preprocessing programs made available by NIST were used to extract
+normalized bitmaps of handwritten digits from a preprinted form. From a
+total of 43 people, 30 contributed to the training set and different 13
+to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
+4x4 and the number of on pixels are counted in each block. This generates
+an input matrix of 8x8 where each element is an integer in the range
+0..16. This reduces dimensionality and gives invariance to small
+distortions.
+
+For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
+T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
+L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
+1994.
+
+.. topic:: References
+
+  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
+    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
+    Graduate Studies in Science and Engineering, Bogazici University.
+  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
+  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
+    Linear dimensionality reduction using relevance weighted LDA. School of
+    Electrical and Electronic Engineering Nanyang Technological University.
+    2005.
+  - Claudio Gentile. A New Approximate Maximal Margin Classification
+    Algorithm. NIPS. 2000.
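
The preprocessing described in the dataset notes (32x32 bitmaps reduced to 8x8 matrices by counting the "on" pixels in each 4x4 block) can be illustrated with a small sketch. This is not the NIST routine itself; the function name `block_counts` is hypothetical, and the code assumes a square bitmap of 0/1 values whose side length is divisible by the block size.

```python
from typing import List

Bitmap = List[List[int]]


def block_counts(bitmap: Bitmap, block: int = 4) -> Bitmap:
    """Count the 'on' pixels in each non-overlapping block x block tile."""
    size = len(bitmap)
    return [
        [
            # Sum the 0/1 pixel values inside the tile whose top-left corner is (r, c).
            sum(bitmap[r + dr][c + dc] for dr in range(block) for dc in range(block))
            for c in range(0, size, block)
        ]
        for r in range(0, size, block)
    ]


# A 32x32 bitmap with every pixel on yields an 8x8 matrix in which every entry is 16.
full = [[1] * 32 for _ in range(32)]
features = block_counts(full)
print(len(features), len(features[0]), features[0][0])
```

Each entry lies in the range 0..16 because a 4x4 block contains at most 16 on pixels, which matches the attribute range stated in the dataset description.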

reproducible-ml/infer.py

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# simple Lambda function running predictions with a scikit-learn model trained on the digits classification dataset
+# see https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
+import boto3
+import numpy
+from joblib import load
+
+
+def handler(event, context):
+    # download the model and the test set from S3
+    s3_client = boto3.client("s3")
+    s3_client.download_file(Bucket="pods-test", Key="test-set.npy", Filename="test-set.npy")
+    s3_client.download_file(Bucket="pods-test", Key="model.joblib", Filename="model.joblib")
+
+    with open("test-set.npy", "rb") as f:
+        X_test = numpy.load(f)
+
+    clf = load("model.joblib")
+
+    predicted = clf.predict(X_test)
+    print("--> prediction result:", predicted)
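
The inference handler prints the raw predictions only. If the true labels were also available (an assumption: this commit stores only the test inputs in S3, not the labels), a minimal accuracy check could look like the following sketch. The function name `accuracy` is hypothetical and is not part of the commit.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if len(predicted) == 0:
        raise ValueError("cannot compute the accuracy of an empty prediction list")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(predicted)


# Three of the four hypothetical predictions match their labels.
print(accuracy([8, 8, 4, 9], [8, 3, 4, 9]))
```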

reproducible-ml/run.sh

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+zip lambda.zip train.py
+zip infer.zip infer.py
+
+# push the Lambda code and the training data to S3 (same bucket that train.py and infer.py read from)
+awslocal s3 mb s3://pods-test
+awslocal s3 cp lambda.zip s3://pods-test/lambda.zip
+awslocal s3 cp infer.zip s3://pods-test/infer.zip
+awslocal s3 cp digits.rst s3://pods-test/digits.rst
+awslocal s3 cp digits.csv.gz s3://pods-test/digits.csv.gz
+
+# define the Lambda function that trains the ML model
+awslocal lambda create-function --function-name ml-train \
+    --runtime python3.8 --role r1 --handler train.handler --timeout 600 \
+    --code '{"S3Bucket":"pods-test","S3Key":"lambda.zip"}' \
+    --layers arn:aws:lambda:us-east-1:446751924810:layer:python-3-8-scikit-learn-0-22-0:3
+
+awslocal lambda create-function --function-name ml-predict \
+    --runtime python3.8 --role r1 --handler infer.handler --timeout 600 \
+    --code '{"S3Bucket":"pods-test","S3Key":"infer.zip"}' \
+    --layers arn:aws:lambda:us-east-1:446751924810:layer:python-3-8-scikit-learn-0-22-0:3
+
+# invoke the Lambda function to train and save the model
+awslocal lambda invoke --function-name ml-train test.tmp
+
+# invoke the Lambda function to evaluate the model on the test set
+awslocal lambda invoke --function-name ml-predict test.tmp
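
The deployment steps in `run.sh` (create both functions from zip files in S3, then invoke training before prediction) could also be expressed in Python. The sketch below is an illustration, not part of the commit. The helper names `function_spec` and `deploy_and_run` are hypothetical, the bucket is passed as a parameter, and the client is injected so the ordering logic can be exercised without a running LocalStack instance. The keyword names follow the boto3 Lambda client's `create_function` and `invoke` calls.

```python
from typing import Any, Dict, List

# Layer ARN copied from run.sh.
SKLEARN_LAYER = (
    "arn:aws:lambda:us-east-1:446751924810:layer:python-3-8-scikit-learn-0-22-0:3"
)


def function_spec(name: str, handler: str, bucket: str, key: str) -> Dict[str, Any]:
    """Build keyword arguments for a Lambda client's create_function call."""
    return {
        "FunctionName": name,
        "Runtime": "python3.8",
        "Role": "r1",  # placeholder role, as in run.sh
        "Handler": handler,
        "Timeout": 600,
        "Code": {"S3Bucket": bucket, "S3Key": key},
        "Layers": [SKLEARN_LAYER],
    }


def deploy_and_run(lambda_client: Any, specs: List[Dict[str, Any]]) -> None:
    """Create every function first, then invoke them in the given order."""
    for spec in specs:
        lambda_client.create_function(**spec)
    for spec in specs:
        lambda_client.invoke(FunctionName=spec["FunctionName"])
```

With boto3 installed, a client pointed at LocalStack could be created with `boto3.client("lambda", endpoint_url="http://localhost:4566")`; the endpoint URL is an assumption about the local setup.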

reproducible-ml/train.py

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
+# simple Lambda function training a scikit-learn model on the digits classification dataset
+# see https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
+
+import os
+import boto3
+import numpy
+from sklearn import datasets, svm, metrics
+from sklearn.utils import Bunch
+from sklearn.model_selection import train_test_split
+from joblib import dump, load
+import io
+
+
+def handler(event, context):
+
+    digits = load_digits()
+
+    # flatten the images
+    n_samples = len(digits.images)
+    data = digits.images.reshape((n_samples, -1))
+
+    # Create a classifier: a support vector classifier
+    clf = svm.SVC(gamma=0.001)
+
+    # Split data into 50% train and 50% test subsets
+    X_train, X_test, y_train, y_test = train_test_split(
+        data, digits.target, test_size=0.5, shuffle=False
+    )
+
+    # Learn the digits on the train subset
+    clf.fit(X_train, y_train)
+
+    # Dump the trained model to S3
+    s3_client = boto3.client("s3")
+    buffer = io.BytesIO()
+    dump(clf, buffer)
+    s3_client.put_object(Body=buffer.getvalue(), Bucket="pods-test", Key="model.joblib")
+
+    # Save the test-set to the S3 bucket
+    numpy.save('test-set.npy', X_test)
+    with open('test-set.npy', 'rb') as f:
+        s3_client.put_object(Body=f, Bucket="pods-test", Key="test-set.npy")
+
+
+def load_digits(*, n_class=10, return_X_y=False, as_frame=False):
+    # download files from S3
+    s3_client = boto3.client("s3")
+    s3_client.download_file(Bucket="pods-test", Key="digits.csv.gz", Filename="digits.csv.gz")
+    s3_client.download_file(Bucket="pods-test", Key="digits.rst", Filename="digits.rst")
+
+    # code below based on sklearn/datasets/_base.py
+
+    data = numpy.loadtxt('digits.csv.gz', delimiter=',')
+    with open('digits.rst') as f:
+        descr = f.read()
+    target = data[:, -1].astype(int, copy=False)
+    flat_data = data[:, :-1]
+    images = flat_data.view()
+    images.shape = (-1, 8, 8)
+
+    if n_class < 10:
+        idx = target < n_class
+        flat_data, target = flat_data[idx], target[idx]
+        images = images[idx]
+
+    feature_names = ['pixel_{}_{}'.format(row_idx, col_idx)
+                     for row_idx in range(8)
+                     for col_idx in range(8)]
+
+    frame = None
+    target_columns = ['target', ]
+    if as_frame:
+        frame, flat_data, target = datasets._convert_data_dataframe(
+            "load_digits", flat_data, target, feature_names, target_columns)
+
+    if return_X_y:
+        return flat_data, target
+
+    return Bunch(data=flat_data,
+                 target=target,
+                 frame=frame,
+                 feature_names=feature_names,
+                 target_names=numpy.arange(10),
+                 images=images,
+                 DESCR=descr)
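
The training handler calls `train_test_split` with `test_size=0.5` and `shuffle=False`, so the split is deterministic: the leading samples are used for training and the trailing samples for testing. The sketch below imitates that behavior in plain Python to make the reproducibility argument concrete. It is a simplified stand-in, not scikit-learn's implementation, and the name `ordered_split` is hypothetical; it assumes the test subset size is rounded up, which is how the unshuffled split is understood to behave.

```python
import math
from typing import List, Sequence, Tuple


def ordered_split(items: Sequence, test_size: float = 0.5) -> Tuple[List, List]:
    """Split without shuffling: leading items train, trailing items test."""
    # Assumption: the test subset size is rounded up and the rest is used for training.
    n_test = math.ceil(len(items) * test_size)
    n_train = len(items) - n_test
    return list(items[:n_train]), list(items[n_train:])


train, test = ordered_split(list(range(10)))
print(train, test)
```

Because no random state is involved, repeated runs on the same input file produce the same train and test subsets, and therefore the same fitted model inputs.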
