Reproducible ML sample (#202)

giograno · web-flow · commit 7a467e62de62 · 2022-12-27T18:55:52.000+01:00
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ Some of the samples require LocalStack Pro features. Please make sure to properl
 
 ## Outline
 
-| Sample Name                                                 | Description                                                                                        |
+| Sample Name                                                    | Description                                                                                        |
 | -------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
 | [Serverless Websockets](serverless-websockets)                 | API Gateway V2 websocket APIs deployed via the Serverless framework                                |
 | [RDS Database Queries](rds-db-queries)                         | Running queries locally against an RDS database                                                    |
@@ -54,6 +54,7 @@ Some of the samples require LocalStack Pro features. Please make sure to properl
 | [Glue for ETL jobs](glue-etl-jobs)                             | Using Glue API to run local ETL jobs                                                               |
 | [Message Queue broker](mq-broker)                              | Using MQ API to run local message queue brokers                                                    |
 | [ELB Load Balancing](elb-load-balancing)                       | Using ELBv2 Application Load Balancers locally, deployed via the Serverless framework              |
+| [Reproducible ML](reproducible-ml)                             | Train, save and evaluate a scikit-learn machine learning model using AWS Lambda and S3             |
 
 
 ## Checking out a single sample
diff --git a/reproducible-ml/Makefile b/reproducible-ml/Makefile
@@ -0,0 +1,30 @@
+export AWS_ACCESS_KEY_ID ?= test
+export AWS_SECRET_ACCESS_KEY ?= test
+export AWS_DEFAULT_REGION=us-east-1
+
+usage:       
+	@fgrep -h "##" $(MAKEFILE_LIST) | fgrep -v fgrep | sed -e 's/\\$$//' | sed -e 's/##//'
+
+install:     
+	@which localstack || pip install localstack
+	@which awslocal || pip install awscli-local
+
+run:         
+	./run.sh
+
+start:
+	localstack start -d
+
+stop:
+	@echo
+	localstack stop
+
+ready:
+	@echo Waiting on the LocalStack container...
+	@localstack wait -t 30 && echo LocalStack is ready to use! || (echo Gave up waiting on LocalStack, exiting. && exit 1)
+
+logs:
+	@localstack logs > logs.txt
+	
+.PHONY: usage install start run stop ready logs
+
diff --git a/reproducible-ml/README.md b/reproducible-ml/README.md
@@ -0,0 +1,49 @@
+# LocalStack Demo: Train, save and evaluate a scikit-learn machine learning model
+
+In this tutorial, we will train a simple machine-learning model that recognizes handwritten digits in an image.
+We will use the following services:
+
+* an S3 bucket to host our training data;
+* a Lambda function to train and save the model to an S3 bucket;
+* a Lambda layer that contains the dependencies for our training code;
+* a second Lambda function to download the saved model and perform a prediction with it.
+
+## Prerequisites
+
+* LocalStack
+* Docker
+* `awslocal` CLI
+
+## Installing
+
+To install the dependencies:
+```
+make install
+```
+
+## Starting LocalStack
+
+Make sure that LocalStack is started:
+```
+LOCALSTACK_API_KEY=... DEBUG=1 localstack start
+```
+
+## Running
+
+The entire workflow is executed by the `run.sh` script. To trigger it, execute:
+```
+make run
+```
+The `ml-train` Lambda function first trains the model and uploads it to the S3 bucket.
+A second Lambda function, `ml-predict`, then downloads the model and runs predictions on a test set of handwritten digits.
+The logs of the Lambda invocations should be visible in the LocalStack container output (with `DEBUG=1` enabled):
+
+```bash
+null
+>START RequestId: 65dc894d-25e0-168e-dea1-a3e8bfdb563b Version: $LATEST
+> --> prediction result: [8 8 4 9 0 8 9 8 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 9 6 7 8 9
+...
+...
+>  9 5 4 8 8 4 9 0 8 9 8]
+> END RequestId: 6...
+```
diff --git a/reproducible-ml/digits.csv.gz b/reproducible-ml/digits.csv.gz
diff --git a/reproducible-ml/digits.rst b/reproducible-ml/digits.rst
@@ -0,0 +1,46 @@
+.. _digits_dataset:
+
+Optical recognition of handwritten digits dataset
+--------------------------------------------------
+
+**Data Set Characteristics:**
+
+    :Number of Instances: 5620
+    :Number of Attributes: 64
+    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
+    :Missing Attribute Values: None
+    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
+    :Date: July; 1998
+
+This is a copy of the test set of the UCI ML hand-written digits datasets
+https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
+
+The data set contains images of hand-written digits: 10 classes where
+each class refers to a digit.
+
+Preprocessing programs made available by NIST were used to extract
+normalized bitmaps of handwritten digits from a preprinted form. From a
+total of 43 people, 30 contributed to the training set and different 13
+to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
+4x4 and the number of on pixels are counted in each block. This generates
+an input matrix of 8x8 where each element is an integer in the range
+0..16. This reduces dimensionality and gives invariance to small
+distortions.
+
+For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
+T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
+L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
+1994.
+
+.. topic:: References
+
+  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
+    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
+    Graduate Studies in Science and Engineering, Bogazici University.
+  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
+  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
+    Linear dimensionality reduction using relevance weighted LDA. School of
+    Electrical and Electronic Engineering Nanyang Technological University.
+    2005.
+  - Claudio Gentile. A New Approximate Maximal Margin Classification
+    Algorithm. NIPS. 2000.
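
Review note on `digits.rst`: the description above says each 32x32 bitmap is divided into non-overlapping 4x4 blocks and the "on" pixels in each block are counted, giving an 8x8 matrix of integers in the range 0..16. The sketch below illustrates that counting step only. It is not the NIST preprocessing code and not part of this commit; the function name `count_on_pixels` and the example bitmap are hypothetical.

```python
# Illustrative sketch only: not part of this commit and not the NIST code.
def count_on_pixels(bitmap, block=4):
    """Count the 'on' (1) pixels in each non-overlapping block x block tile."""
    size = len(bitmap)
    if size % block != 0 or any(len(row) != size for row in bitmap):
        raise ValueError("bitmap must be square with a side divisible by block")
    cells = size // block
    return [
        [
            sum(
                bitmap[r * block + dr][c * block + dc]
                for dr in range(block)
                for dc in range(block)
            )
            for c in range(cells)
        ]
        for r in range(cells)
    ]


# Hypothetical input: a 32x32 bitmap with a filled square on rows/columns 4..11.
example_bitmap = [
    [1 if 4 <= r < 12 and 4 <= c < 12 else 0 for c in range(32)]
    for r in range(32)
]
example_features = count_on_pixels(example_bitmap)
```

Each output cell summarizes 16 input pixels, which is why the values fall in 0..16 and why shifting a stroke by a pixel or two changes the features only slightly.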
diff --git a/reproducible-ml/infer.py b/reproducible-ml/infer.py
@@ -0,0 +1,20 @@
+# simple Lambda function running predictions with a scikit-learn model trained on the digits classification dataset
+# see https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
+import boto3
+import numpy
+from joblib import load
+
+
+def handler(event, context):
+    # download the model and the test set from S3 (/tmp is the writable path in Lambda)
+    s3_client = boto3.client("s3")
+    s3_client.download_file(Bucket="reproducible-ml", Key="test-set.npy", Filename="/tmp/test-set.npy")
+    s3_client.download_file(Bucket="reproducible-ml", Key="model.joblib", Filename="/tmp/model.joblib")
+
+    with open("/tmp/test-set.npy", "rb") as f:
+        X_test = numpy.load(f)
+
+    clf = load("/tmp/model.joblib")
+
+    predicted = clf.predict(X_test)
+    print("--> prediction result:", predicted)
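
Review note on the train/infer hand-off: `train.py` serializes the fitted model into an in-memory buffer and stores the bytes under an object key, and `infer.py` later loads the model from that key and predicts. The sketch below reproduces that round trip with the standard library only, so it can be run without LocalStack. Everything in it is a stand-in chosen for illustration: `pickle` replaces `joblib`, the `fake_bucket` dict replaces the S3 bucket, and a tiny nearest-centroid model replaces the SVC.

```python
import io
import pickle

# Hypothetical stand-in for the S3 bucket: maps object keys to raw bytes.
fake_bucket = {}


def fit_centroids(X, y):
    """Toy model (assumption): the mean feature vector of each label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def train_and_upload(X, y, key="model.pkl"):
    """Mirror of the training step: fit, dump to a buffer, store the bytes."""
    model = fit_centroids(X, y)
    buffer = io.BytesIO()
    pickle.dump(model, buffer)
    fake_bucket[key] = buffer.getvalue()


def download_and_predict(X, key="model.pkl"):
    """Mirror of the inference step: fetch the bytes, load, predict."""
    model = pickle.load(io.BytesIO(fake_bucket[key]))

    def nearest(row):
        return min(
            model,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(row, model[label])),
        )

    return [nearest(row) for row in X]


train_and_upload([[0, 0], [0, 1], [10, 10], [10, 11]], [0, 0, 1, 1])
predictions = download_and_predict([[1, 1], [9, 9]])
```

The two steps share nothing except the object key, which is why the bucket name and key must match between the two Lambda functions.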
diff --git a/reproducible-ml/run.sh b/reproducible-ml/run.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+zip lambda.zip train.py
+zip infer.zip infer.py
+
+# push training data to S3
+awslocal s3 mb s3://reproducible-ml
+awslocal s3 cp lambda.zip s3://reproducible-ml/lambda.zip
+awslocal s3 cp infer.zip s3://reproducible-ml/infer.zip
+awslocal s3 cp digits.rst s3://reproducible-ml/digits.rst
+awslocal s3 cp digits.csv.gz s3://reproducible-ml/digits.csv.gz
+
+# define the Lambda functions that train and evaluate the ML model
+awslocal lambda create-function --function-name ml-train \
+  --runtime python3.8 --role r1 --handler train.handler --timeout 600 \
+   --code '{"S3Bucket":"reproducible-ml","S3Key":"lambda.zip"}' \
+   --layers arn:aws:lambda:us-east-1:446751924810:layer:python-3-8-scikit-learn-0-22-0:3
+
+awslocal lambda create-function --function-name ml-predict \
+  --runtime python3.8 --role r1 --handler infer.handler --timeout 600 \
+   --code '{"S3Bucket":"reproducible-ml","S3Key":"infer.zip"}' \
+   --layers arn:aws:lambda:us-east-1:446751924810:layer:python-3-8-scikit-learn-0-22-0:3
+
+# invoke the lambda function to train and save the model
+awslocal lambda invoke --function-name ml-train test.tmp
+
+# invoke the lambda function to evaluate the model on the test set
+awslocal lambda invoke --function-name ml-predict test.tmp
diff --git a/reproducible-ml/train.py b/reproducible-ml/train.py
@@ -0,0 +1,85 @@
+# simple Lambda function training a scikit-learn model on the digits classification dataset
+# see https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
+
+import io
+
+import boto3
+import numpy
+from joblib import dump
+from sklearn import datasets, svm
+from sklearn.model_selection import train_test_split
+from sklearn.utils import Bunch
+
+
+def handler(event, context):
+
+    digits = load_digits()
+
+    # flatten the images
+    n_samples = len(digits.images)
+    data = digits.images.reshape((n_samples, -1))
+
+    # Create a classifier: a support vector classifier
+    clf = svm.SVC(gamma=0.001)
+
+    # Split data into 50% train and 50% test subsets
+    X_train, X_test, y_train, y_test = train_test_split(
+        data, digits.target, test_size=0.5, shuffle=False
+    )
+
+    # Learn the digits on the train subset
+    clf.fit(X_train, y_train)
+
+    # Dump the trained model to S3
+    s3_client = boto3.client("s3")
+    buffer = io.BytesIO()
+    dump(clf, buffer)
+    s3_client.put_object(Body=buffer.getvalue(), Bucket="reproducible-ml", Key="model.joblib")
+
+    # Save the test set to the S3 bucket (/tmp is the writable path in Lambda)
+    numpy.save("/tmp/test-set.npy", X_test)
+    with open("/tmp/test-set.npy", "rb") as f:
+        s3_client.put_object(Body=f, Bucket="reproducible-ml", Key="test-set.npy")
+
+
+def load_digits(*, n_class=10, return_X_y=False, as_frame=False):
+    # download files from the bucket created by run.sh
+    s3_client = boto3.client("s3")
+    s3_client.download_file(Bucket="reproducible-ml", Key="digits.csv.gz", Filename="/tmp/digits.csv.gz")
+    s3_client.download_file(Bucket="reproducible-ml", Key="digits.rst", Filename="/tmp/digits.rst")
+
+    # code below based on sklearn/datasets/_base.py
+
+    data = numpy.loadtxt("/tmp/digits.csv.gz", delimiter=",")
+    with open("/tmp/digits.rst") as f:
+        descr = f.read()
+    target = data[:, -1].astype(int, copy=False)
+    flat_data = data[:, :-1]
+    images = flat_data.view()
+    images.shape = (-1, 8, 8)
+
+    if n_class < 10:
+        idx = target < n_class
+        flat_data, target = flat_data[idx], target[idx]
+        images = images[idx]
+
+    feature_names = ['pixel_{}_{}'.format(row_idx, col_idx)
+                     for row_idx in range(8)
+                     for col_idx in range(8)]
+
+    frame = None
+    target_columns = ['target', ]
+    if as_frame:
+        frame, flat_data, target = datasets._convert_data_dataframe(
+            "load_digits", flat_data, target, feature_names, target_columns)
+
+    if return_X_y:
+        return flat_data, target
+
+    return Bunch(data=flat_data,
+                 target=target,
+                 frame=frame,
+                 feature_names=feature_names,
+                 target_names=numpy.arange(10),
+                 images=images,
+                 DESCR=descr)
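
Review note on the split in `train.py`: `train_test_split(..., test_size=0.5, shuffle=False)` keeps the rows in file order, so the training set is the leading part of the data and the test set is the trailing part. That is what makes the sample reproducible from run to run. The sketch below shows the idea with plain lists. The helper name `ordered_split` is hypothetical, and the rounding rule (test set size rounded up) is my reading of scikit-learn's behaviour and should be treated as an assumption.

```python
import math


def ordered_split(samples, test_size=0.5):
    """Split without shuffling: leading rows for training, trailing rows for testing."""
    n_test = math.ceil(test_size * len(samples))  # assumption: test size rounds up
    n_train = len(samples) - n_test
    return samples[:n_train], samples[n_train:]


train_rows, test_rows = ordered_split(list(range(7)))
```

With seven rows, this yields three training rows and four test rows, and no row appears in both parts.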