Commit bbccb22

Merge pull request #243 from datashinobi/master
MLS Sample contribution
2 parents 9f60194 + 08c5583 commit bbccb22

5 files changed, 416 additions and 0 deletions
Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# Build a predictive model with RevoScalePy using SQL Server 2017 Machine Learning Services

This sample shows how to create a predictive model using RevoScalePy in conjunction with the Python machine learning stack.

The dataset used in this tutorial is based on Vélib, a large-scale public bike rental service in Paris. The service currently offers about 14,500 bicycles and 1,230 stations (http://en.velib.paris.fr/).

The dataset covers one month of observations from the 8th district of Paris, sampled every 15 minutes.

### Contents

[About this sample](#about-this-sample)

[Before you begin](#before-you-begin)

[Run this sample](#run-this-sample)

[Sample details](#sample-details)

[Disclaimers](#disclaimers)

## About this sample

This sample consists of a binary classifier that predicts whether a particular bike station is empty or not.
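
As a minimal sketch of what "empty or not" means here, the snippet below labels a station from its count of available bikes. It uses plain Python rather than the pandas code in the sample, and the function name `label_station` is hypothetical. The default threshold of 1 and the encoding (0 for an empty station, 1 otherwise) follow the sample's `LabelDefiner` class.

```python
def label_station(available_bikes, threshold=1):
    """Return 0 if the station counts as empty, 1 otherwise.

    Hypothetical helper mirroring the rule: availablebikes < threshold -> 0.
    """
    return 0 if available_bikes < threshold else 1


# Example readings for one station over four consecutive samples
# (illustrative values, not taken from the dataset).
readings = [5, 1, 0, 3]
labels = [label_station(n) for n in readings]
print(labels)
```

Raising the threshold would treat nearly empty stations as empty too, which may be closer to what a rider cares about.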

- **Applies to:** SQL Server 2017 CTP 2.0 or higher
- **Key features:** SQL Server Machine Learning Services
- **Workload:** SQL Server Machine Learning Services
- **Programming Language:** Python, T-SQL
- **Author:** Yassine Khelifi

## Before you begin

To run this sample, you need the following prerequisites:

1. [Download this DB backup file](https://sq14samples.blob.core.windows.net/data/velibDB.bak) and restore it using setup.sql.

**Software prerequisites:**

1. [SQL Server 2017 CTP 2.0](https://www.microsoft.com/en-us/sql-server/sql-server-2017) (or higher) with Machine Learning Services (Python) installed
2. [SQL Server Management Studio](https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms)
3. [Python Tools for Visual Studio](https://www.visualstudio.com/vs/python/) or another Python IDE

## Run this sample
1. From SQL Server Management Studio or SQL Server Data Tools, connect to your SQL Server 2017 database and execute setup.sql to restore the sample DB you downloaded.

2. From Python Tools for Visual Studio, open the Python tools command under the Tools menu and add the Machine Learning Services Python environment using the path that matches your installation (see https://docs.microsoft.com/en-us/visualstudio/python/python-environments):

   * "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES" if you run the in-database Python server
   * "C:\Program Files\Microsoft SQL Server\140\PYTHON_SERVER" if you have the standalone Machine Learning Server installed

3. Create a new Python project from existing code, pointing it to the downloaded Python source files and to the Machine Learning Services Python environment defined in step 2.
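
To confirm that the selected interpreter really is the Machine Learning Services environment, one quick check is whether it can locate the `revoscalepy` package. The following is a minimal sketch using only the standard library; the helper name `has_revoscalepy` is hypothetical and not part of the sample.

```python
import importlib.util
import sys


def has_revoscalepy():
    """Return True if the revoscalepy package can be found by this interpreter."""
    return importlib.util.find_spec("revoscalepy") is not None


# Show which interpreter is running and whether it can see revoscalepy.
print("Interpreter:", sys.executable)
print("revoscalepy available:", has_revoscalepy())
```

If the second line prints `False`, the project is probably bound to a different Python environment than the one added in step 2.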

## Sample details

#### datasource.py
This Python script defines the class that pulls data from the SQL Server database and provides access to the SQL Server compute context.

#### pipeline.py
This Python file defines the machine learning pipeline that performs feature engineering and the classifier that fits the RevoScalePy binary logistic regression.

#### runner.py
This Python file defines the startup code and the main method from which to execute the solution.

#### setup.sql
Restores the sample DB (make sure to update the path to the .bak file).

## Disclaimers
The dataset used in this sample is obtained from JCDecaux: https://developer.jcdecaux.com/#/opendata/license
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
from revoscalepy.computecontext.RxComputeContext import RxComputeContext
from revoscalepy.computecontext.RxInSqlServer import RxInSqlServer
from revoscalepy.computecontext.RxInSqlServer import RxSqlServerData
from revoscalepy.etl.RxImport import rx_import_datasource


class DataSource():

    def __init__(self, connectionstring):
        """Data source remote compute context

        Args:
            connectionstring: connection string to the SQL server.

        """
        self.__connectionstring = connectionstring
        # Initialise here so getcomputecontext() raises the intended
        # RuntimeError (not AttributeError) when called before loaddata().
        self.__computeContext = None

    def loaddata(self):
        dataSource = RxSqlServerData(sqlQuery = "select * from dbo.trainingdata", verbose=True, reportProgress =True,
                                     connectionString = self.__connectionstring)

        self.__computeContext = RxInSqlServer(connectionString = self.__connectionstring, autoCleanup = True)
        data = rx_import_datasource(dataSource)

        return data

    def getcomputecontext(self):

        if self.__computeContext is None:
            raise RuntimeError("Data must be loaded before requesting computecontext!")

        return self.__computeContext
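
`DataSource` expects a connection string to the SQL Server instance. The sketch below assembles an ODBC-style string for Windows authentication. The helper `build_connection_string` is hypothetical (it is not part of the sample or of revoscalepy), and the server and database names are placeholders to adapt to your own setup.

```python
def build_connection_string(server, database, driver="SQL Server"):
    """Assemble an ODBC-style connection string using Windows authentication.

    All arguments are placeholders; adjust them to your own instance.
    """
    parts = [
        ("Driver", driver),
        ("Server", server),
        ("Database", database),
        ("Trusted_Connection", "True"),
    ]
    return ";".join("{}={}".format(key, value) for key, value in parts) + ";"


# "localhost" and "velib" are example values, not names confirmed by the sample.
conn_str = build_connection_string("localhost", "velib")
print(conn_str)
```

The resulting string would then be passed to the class as `DataSource(conn_str)`.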
Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
'''
Pipeline implementation

'''

import numpy as np
import pandas as pd
import time
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
from sklearn.preprocessing import StandardScaler
from revoscalepy.functions.RxLogit import rx_logit_ex
from revoscalepy.functions.RxPredict import rx_predict_ex


#=========================
# Features engineering
#=========================

class OutliersHandler(BaseEstimator, TransformerMixin):
    """Handle outliers"""

    def fit(self, x, y = None):
        return self

    def transform(self, df):
        # A station cannot hold more bikes than it has stands: cap the count.
        df.availablebikes = np.where(df.availablebikes > df.bikestands, df.bikestands, df.availablebikes)
        return df

class LabelDefiner(BaseEstimator, TransformerMixin):
    """
    Defines target variable
    Binary label 0 empty station, 1 otherwise
    """

    def __init__(self, availability_threshold = 1):
        self.threshold = availability_threshold

    def fit(self, x, y = None):
        return self

    def transform(self, df):
        df['label'] = np.where(df.availablebikes < self.threshold, 0, 1)
        return df


class DateTimeFeaturesExtractor(BaseEstimator, TransformerMixin):
    """Extract Datetime features"""

    def fit(self, x, y = None):
        return self

    def transform(self, df):
        df['lastupdate'] = pd.to_datetime(df['lastupdate'])
        df['day'] = df.lastupdate.dt.day.astype(int)
        df['month'] = df.lastupdate.dt.month.astype(int)
        df['hour'] = df.lastupdate.dt.hour.astype(int)
        df['minute'] = df.lastupdate.dt.minute.astype(int)
        # dayofweek: Monday=0 ... Sunday=6, so values above 4 are Saturday/Sunday.
        df['isweekend'] = np.where(df['lastupdate'].dt.dayofweek > 4, 1, 0)
        df.sort_values(by='lastupdate', inplace = True)
        return df
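
The calendar features extracted by `DateTimeFeaturesExtractor` can be illustrated on a single timestamp with the standard library alone. This is a sketch, not the sample's code: the helper `datetime_features` and the timestamp format are assumptions, while the weekend rule (day-of-week index above 4) matches the class above.

```python
from datetime import datetime


def datetime_features(timestamp):
    """Return the calendar features derived from one timestamp string."""
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "day": ts.day,
        "month": ts.month,
        "hour": ts.hour,
        "minute": ts.minute,
        # weekday(): Monday=0 ... Sunday=6, same convention as pandas dayofweek.
        "isweekend": 1 if ts.weekday() > 4 else 0,
    }


# 2017-05-06 was a Saturday; the timestamp format is an assumption.
print(datetime_features("2017-05-06 08:15:00"))
```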


class TSFeaturesExtractor(BaseEstimator, TransformerMixin):
    """Extract time series related features"""

    def __init__(self, max_lags = 4):
        self.__max_lags = max_lags

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        # sort_values returns a new frame by default; sort in place so the
        # lags below are computed in chronological order.
        df.sort_values(['lastupdate','stationid'], ascending = [True, True], inplace = True)

        # lag0 is the previous observation (shift by 1), lag1 the one before it, and so on.
        for i in range(self.__max_lags):
            df['lag' + str(i)] = df.groupby(['stationid'])['availablebikes'].shift(i + 1)

        df['1st_derivative'] = df.groupby('stationid')['lag0'].transform(lambda x: np.gradient(x))
        df['2nd_derivative'] = df.groupby('stationid')['1st_derivative'].transform(lambda x: np.gradient(x))
        # Spectral features over each station's readings within one hour of one day.
        df['fft_max_coeff'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform(lambda x: np.amax(np.abs(np.fft.rfft(x))))
        df['fft_energy'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform(lambda x: np.sum((np.abs(np.fft.rfft(x))) ** 2))

        return df
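
The per-station lag columns built in `TSFeaturesExtractor` can be sketched without pandas. The helper `add_lags` below is hypothetical and works on a list of dictionaries that is assumed to be in chronological order already; where pandas would produce `NaN` for missing history, this sketch uses `None`.

```python
from collections import defaultdict, deque


def add_lags(rows, max_lags=4):
    """Attach lag0..lag{max_lags-1} to each row, computed per station.

    lag0 is the previous reading for that station, lag1 the one before it,
    and so on; missing history yields None.
    """
    # One bounded history per station, most recent reading first.
    history = defaultdict(lambda: deque(maxlen=max_lags))
    out = []
    for row in rows:
        past = history[row["stationid"]]
        enriched = dict(row)
        for i in range(max_lags):
            enriched["lag" + str(i)] = past[i] if i < len(past) else None
        out.append(enriched)
        # Record the current reading only after its lags were computed.
        past.appendleft(row["availablebikes"])
    return out


# Illustrative readings for two stations, interleaved in time.
rows = [
    {"stationid": 1, "availablebikes": 4},
    {"stationid": 2, "availablebikes": 9},
    {"stationid": 1, "availablebikes": 3},
    {"stationid": 1, "availablebikes": 0},
]
lagged = add_lags(rows, max_lags=2)
for r in lagged:
    print(r)
```

Because each lag only looks backwards, the current value of `availablebikes` (from which the label is derived) never leaks into the lag features.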


class StatisticalFeaturesExtractor(BaseEstimator, TransformerMixin):
    """Extract statistical related features"""

    def __init__(self, max_lags = 4):
        self.__max_lags = max_lags

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        # All statistics are computed per station and per hour of each day.
        df['var'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform('var')
        df['cumrelfreq'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].cumsum() / self.__max_lags
        # Note: the 'mad' (mean absolute deviation) aggregation was removed in pandas 2.0.
        df['mad'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform('mad')
        df['idxmax'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform(\
            lambda x: np.argmax(x.ravel()))
        df['idxmin'] = df.groupby(['stationid', 'month', 'day', 'hour'])['lag0'].transform(\
            lambda x: np.argmin(x.ravel()))
        return df


class FeaturesExcluder(BaseEstimator, TransformerMixin):
    """features to exclude"""

    def __init__(self, features = ['availablebikes', 'bikestands','lastupdate', 'zipcode','month', 'day']):
        self.__exclusionlist = features

    def fit(self, X, y = None):
        return self

    def transform(self, df):
        df.drop(self.__exclusionlist, axis = 1, inplace = True)
        return df


class FeaturesScaler(BaseEstimator, TransformerMixin):
    """Z-score scaler """

    def fit(self, X, y = None):
        return self

    def transform(self, df):
        if df.isnull().any().any():
            df.dropna(inplace = True)
        excluded_cols = ['stationid', 'label','hour', 'minute', 'isweekend']
        # Name the output columns in the order they are concatenated below
        # (unscaled columns first), independent of the input column order.
        cols = excluded_cols + [c for c in df.columns if c not in excluded_cols]

        X = StandardScaler().fit_transform(df.drop(excluded_cols, axis=1, inplace = False))
        # .values replaces the deprecated DataFrame.as_matrix().
        X = np.concatenate((df.loc[:, excluded_cols].values, X), axis = 1)

        df_out = pd.DataFrame(X, columns = cols)

        return df_out
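
For reference, the z-score that `StandardScaler` applies to each column subtracts the column mean and divides by the population standard deviation. A plain-Python sketch of that arithmetic for one column follows; the helper `zscore` is hypothetical, and the constant-column branch is a choice made for this sketch.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardise values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Constant column: return zeros rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative column with mean 5 and population standard deviation 2.
scaled = zscore([2, 4, 4, 4, 5, 5, 7, 9])
print(scaled)
```

Columns such as `stationid`, `hour` and `label` are left out of the scaling in the class above because they are identifiers, categorical inputs or the target.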


class RxClassifier(BaseEstimator, ClassifierMixin):
    """RevoScalePy logistic regression binary classifier wrapped in sklearn estimator"""

    def __init__(self, computecontext):
        self.__computecontext = computecontext
        # Set by fit(); initialised here so predict() raises the intended
        # RuntimeError (not AttributeError) when called before fitting.
        self.__clf = None

    def fit(self, X, y = None):
        """Fit model to training data

        Args:
            X (pandas DataFrame): training data.
            y (None): Not used; the target variable is passed in X.

        return: coefficients (pandas DataFrame)

        """

        formula = "label ~ F(stationid) + F(hour) + F(minute) + isweekend + lag0 + \
                lag1 + lag2 + lag3 + 1st_derivative + 2nd_derivative\
                + fft_max_coeff + fft_energy + var + cumrelfreq + mad + idxmax + idxmin"

        start = time.time()
        self.__clf = rx_logit_ex(formula, data = X, compute_context = self.__computecontext, report_progress = 3, verbose = 1)
        end = time.time()

        print("Training time duration: %.2f seconds" % (end - start))
        return self.__clf.coefficients

    def predict(self, X):
        """
        Perform classification on X

        Args:
            X (pandas DataFrame): prediction input dataset

        return: prediction results vector (numpy array)
        """
        if self.__clf is None:
            raise RuntimeError("Model must be fitted before calling predict!")

        predict = rx_predict_ex(self.__clf, data = X, compute_context = self.__computecontext)
        predictions = np.where(predict._results['label_Pred'] == 1, 1, 0)

        return predictions
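
Logistic regression usually yields a probability between 0 and 1 rather than a hard class. If `label_Pred` holds such a probability (an assumption; the commit does not show its contents), an exact comparison with 1 would rarely match, and a cutoff such as 0.5 is a common way to derive class labels. A plain-Python sketch with a hypothetical helper `to_labels`:

```python
def to_labels(probabilities, cutoff=0.5):
    """Map predicted probabilities to 0/1 labels using a cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Illustrative values only, not output from the sample's model.
probs = [0.02, 0.49, 0.5, 0.97]
hard_labels = to_labels(probs)
print(hard_labels)
```

The cutoff is a tunable choice: a value above 0.5 would make the classifier more willing to flag a station as empty (label 0).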