# Byzer

This article demonstrates how to use [OpenMLDB](https://github.com/4paradigm/OpenMLDB) and [Byzer](https://www.byzer.org/home) together to build a complete machine learning application. In this example, OpenMLDB receives instructions and data sent by Byzer, performs real-time feature computation on the data, and returns the feature-engineered dataset to Byzer for subsequent machine learning training and inference.

## Preparation

### Install OpenMLDB

1. This example recommends running the OpenMLDB cluster version in a Docker container. For installation steps, please refer to [OpenMLDB Quickstart](../../quickstart/openmldb_quickstart.md).
2. In this example, although the Byzer engine runs on the same host, it needs to access the OpenMLDB services from outside the container, so the service ports of the OpenMLDB cluster need to be exposed. It is recommended to use the `--network host` method, as detailed in the [IP Configuration Documentation - CLI/SDK->containeronebox](../../reference/ip_tips.md#clisdk-containeronebox).
3. For simplicity, we use files to import and export OpenMLDB cluster data, so Byzer and OpenMLDB need to share a file path. Here, we map `/mlsql/admin` to `/byzermnt`, and use `/byzermnt` as the file path in SQL commands that interact with OpenMLDB.
4. We also need to create a database named `db1` in the OpenMLDB cluster and then use this database in Byzer (this step currently cannot be executed from Byzer, and Byzer must specify an existing database when connecting to OpenMLDB).

The commands are as follows:
```
docker run --network host -dit --name openmldb -v /mlsql/admin/:/byzermnt 4pdosc/openmldb:0.8.4 bash
docker exec -it openmldb bash
/work/init.sh
echo "create database db1;" | /work/openmldb/bin/openmldb --zk_cluster=127.0.0.1:2181 --zk_root_path=/openmldb --role=sql_client
exit # exit container
```

### Install Byzer Engine and Byzer Notebook

This example installs the Byzer components using the [Byzer All In One Deployment](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/server/byzer-all-in-one-deployment) and the [Byzer Notebook Binary Installation](https://docs.byzer.org/#/byzer-notebook/zh-cn/installation/install_uninstall).

```{note}
You can also use the [Sandbox Container Deployment](https://docs.byzer.org/#/byzer-lang/zh-cn/installation/containerized-deployment/sandbox-standalone), which starts with a single click; you then only need to install the OpenMLDB plugin offline.

If you use VSCode, you can also choose the [Byzer Plugin in VSCode](https://docs.byzer.org/#/byzer-Lang/zh-cn/installation/vscode/byzer-vscode-extension-installation). The plugin comes with Byzer All In One built in, eliminating the need for manual installation.

For other deployment methods, please refer to the [Byzer Engine Deployment Guidelines](https://docs.byzer.org/#/byzer-Lang/zh-cn/installation/README).
```

1. Install Byzer All In One:
```
wget https://download.byzer.org/byzer/2.3.0/byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0.tar.gz
tar -zxvf byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0.tar.gz
cd byzer-lang-all-in-one-linux-amd64-3.1.1-2.3.0
# If you already have a Java (JDK 8 or higher) environment, you can skip the following two export steps
export JAVA_HOME=$(pwd)/jdk8
export PATH=$JAVA_HOME/bin:$PATH
./bin/byzer.sh start
```
After startup, you can visit `http://<ip>:9003/`.

2. Install Byzer Notebook. Byzer Notebook [requires MySQL](https://docs.byzer.org/#/byzer-notebook/zh-cn/installation/prerequisites); if you do not have a MySQL engine, you can start one with Docker:
```
docker run -d --name mysql -e MYSQL_ROOT_PASSWORD=root -e MYSQL_ROOT_HOST=% -p 3306:3306 byzer/mysql:8.0-20.04_beta
wget https://download.byzer.org/byzer-notebook/1.2.3/Byzer-Notebook-1.2.3.tar.gz
tar -zxvf Byzer-Notebook-1.2.3.tar.gz
cd Byzer-Notebook-1.2.3
./bin/bootstrap.sh start
```
After startup, you can visit `http://<ip>:9002/`. The user name and password are admin/admin. The webpage is shown below. This article uses Byzer Notebook for the demonstration.

![Byzer_Notebook](images/Byzer_Notebook.jpg)

### Byzer OpenMLDB Plugin

This example uses the [OpenMLDB Plugin](https://github.com/byzer-org/byzer-extension/tree/master/byzer-openmldb) provided by Byzer to communicate with OpenMLDB. We can install it in Byzer Notebook: create a notebook, add a cell, and execute the following command:

```
!plugin app add - "byzer-openmldb-3.0";
```

After running the cell, the plugin will be downloaded and installed. This process takes some time.
```{note}
If the installation fails or the download is too slow, you can manually download the jar package and follow the [offline installation guide](https://docs.byzer.org/#/byzer-lang/zh-cn/extension/installation/offline_install) to install and configure it.
```

### Prepare Dataset

This article uses the Kaggle taxi trip duration dataset. For the sake of demonstration, we only use a portion of the data, which can be downloaded from [here](https://openmldb.ai/download/taxi_tour_table_train_simple.csv) and then uploaded to Byzer Notebook.

![byzer upload](images/byzer-upload-data.png)

After uploading, you can find it under Data Catalog > File System in Byzer Notebook.
```{note}
If you prefer to use the full dataset, you can obtain it from the [Kaggle Taxi Trip Duration Prediction Problem](https://www.kaggle.com/c/nyc-taxi-trip-duration/overview). After downloading the dataset locally, upload it to Byzer Notebook.
```
## Machine Learning Process

Create a notebook in Byzer Notebook, and you can start writing the entire machine learning process.

### Step 1: Check Dataset
The dataset from [Prepare Dataset](#prepare-dataset) has been imported into the File System under the path `tmp/upload`. Use the Byzer Lang `load` command to load the data.

```sql
load csv.`tmp/upload/taxi_tour_table_train_simple.csv` where delimiter=","
and header = "true"
as taxi_tour_table_train_simple;
```
After running the cell, you can browse the loaded data.
![byzer load result](images/byzer-load-data.png)
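
Before moving on, it can help to sanity-check the file structure locally. The sketch below mirrors what the `load` statement does, assuming pandas is installed; the inline two-row sample is fabricated for illustration (with the real file you would pass its path to `pd.read_csv`).

```python
import io
import pandas as pd

# Two fabricated rows with the same header as taxi_tour_table_train_simple.csv.
sample = io.StringIO(
    "id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,"
    "pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,"
    "store_and_fwd_flag,trip_duration\n"
    "id0376262,1,2016-06-30 16:39:10,2016-06-30 17:21:36,2,"
    "-73.873093,40.774097,-73.926704,40.856739,N,2546\n"
    "id1244443,2,2016-06-30 16:40:00,2016-06-30 16:55:00,1,"
    "-73.98,40.75,-73.97,40.76,N,900\n"
)
df = pd.read_csv(sample, parse_dates=["pickup_datetime", "dropoff_datetime"])
print(df.shape)   # (2, 11)
print(df.dtypes)
```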

### Step 2: OpenMLDB Create Tables and Import Data

We use the plugin to connect to the OpenMLDB engine. Since the OpenMLDB cluster and Byzer are deployed on the same host, we can access OpenMLDB at the local address `127.0.0.1:2181`. Make sure the OpenMLDB engine is started before running this code block in Byzer Notebook.

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='offline';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
CREATE TABLE IF NOT EXISTS t1(id string, vendor_id int, pickup_datetime timestamp, dropoff_datetime timestamp, passenger_count int, pickup_longitude double, pickup_latitude double, dropoff_longitude double, dropoff_latitude double, store_and_fwd_flag string, trip_duration int);
'''
and `sql-4`='''
LOAD DATA INFILE '/byzermnt/tmp/upload/taxi_tour_table_train_simple.csv'
INTO TABLE t1 options(format='csv',header=true,mode='overwrite');
'''
and db="db1"
and action="ddl";
```
After the task completes, the Result should show `FINISHED`. If it shows `FAILED`, enter the openmldb container to check the job log.

### Step 3: Perform Offline Feature Computation

Usually, this step involves feature design. In this example, however, we skip the design phase and directly use the features designed in [OpenMLDB + LightGBM: Taxi Travel Time Prediction](../../use_case/taxi_tour_duration_prediction.md) for offline feature computation. The processed dataset is exported as local parquet files (parquet is recommended; loading CSV would require an additional schema).

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='offline';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW) INTO OUTFILE '/byzermnt/tmp/feature_data' OPTIONS(mode='overwrite', format='parquet');
'''
and db="db1"
and action="ddl";
```
After the task completes, the Result should show `FINISHED`. If it shows `FAILED`, enter the openmldb container to check the job log. Refresh the Data Catalog in Byzer Notebook to see the generated feature files under `tmp/feature_data` in the File System.
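
The `OVER w` aggregations above are per-vendor (and per-passenger-count) rolling statistics over the preceding day. A minimal pandas analogue, with four fabricated trips, illustrates what `ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW` computes:

```python
import pandas as pd

# Four fabricated trips. For each row, aggregate pickup_latitude over rows
# of the same vendor_id within 1 day preceding, up to and including the
# current row (the same semantics as window w in the SQL above).
df = pd.DataFrame({
    "vendor_id": [1, 1, 1, 2],
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 10:00", "2016-01-01 18:00",
        "2016-01-03 09:00", "2016-01-01 12:00"]),
    "pickup_latitude": [40.77, 40.78, 40.80, 40.70],
}).sort_values("pickup_datetime")

feats = (df.set_index("pickup_datetime")
           .groupby("vendor_id")["pickup_latitude"]
           .rolling("1d")
           .agg(["sum", "max", "min", "mean", "count"]))
print(feats)
```

The trip on 2016-01-03 falls outside the 1-day window of the earlier vendor-1 trips, so its aggregates cover only itself.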

### Step 4: Load Data in Byzer

Load the feature data generated in the previous step into the Byzer environment:
```sql
load parquet.`tmp/feature_data` as feature_data;
```

Convert all int-type fields to double:

```sql
select *,
cast(passenger_count as double) as passenger_count_d,
cast(pc_cnt as double) as pc_cnt_d,
cast(vendor_cnt as double) as vendor_cnt_d
from feature_data
as new_feature_data;
```

Then merge all the feature fields into one vector:

```sql
select vec_dense(array(
passenger_count_d,
vendor_sum_pl,
vendor_max_pl,
vendor_min_pl,
vendor_avg_pl,
pc_sum_pl,
pc_max_pl,
pc_min_pl,
pc_avg_pl,
pc_cnt_d,
vendor_cnt
)) as features, cast(trip_duration as double) as label
from new_feature_data
as training_table;
```
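
`vec_dense(array(...))` packs the listed columns of each row into one dense feature vector, with `trip_duration` cast to double as the label. A rough numpy analogue for a single row (fabricated values, and only three of the eleven columns shown):

```python
import numpy as np

# One fabricated feature row. The column order must match between training
# and prediction, just as the array(...) order must in the SQL above.
row = {"passenger_count_d": 2.0, "vendor_sum_pl": 81.55,
       "vendor_cnt": 2.0, "trip_duration": 2546}
feature_cols = ["passenger_count_d", "vendor_sum_pl", "vendor_cnt"]
features = np.array([row[c] for c in feature_cols])  # the dense vector
label = float(row["trip_duration"])
print(features, label)
```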

### Step 5: Model Training

Use the `train` command with the [Built-In Linear Regression Algorithm](https://docs.byzer.org/#/byzer-lang/zh-cn/ml/algs/linear_regression) to train the model and save it to the path `/model/taxi-trip`.

```sql
train training_table as LinearRegression.`/model/taxi-trip` where
keepVersion="true"
and evaluateTable="training_table"
and `fitParam.0.labelCol`="label"
and `fitParam.0.featuresCol`= "features"
and `fitParam.0.maxIter`="50";
```

```{note}
View the parameters of Byzer's built-in linear regression model with the command `!show et/params/LinearRegression;`.
```
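
As a rough sanity model of what linear-regression training learns from `(features, label)` rows, here is a numpy least-squares fit; the data and coefficients are fabricated stand-ins, not Byzer's actual implementation:

```python
import numpy as np

# Fabricated training data: 100 rows, 3 feature columns, noiseless label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 3.0
y = X @ true_w + true_b

# Least-squares fit with an intercept column; on noiseless data this
# recovers the generating weights exactly.
Xb = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(coef)  # ≈ [ 2.  -1.   0.5  3. ]
```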

### Step 6: Feature Deployment

Use `DEPLOY` (must be executed in online mode) to deploy the feature computation SQL to OpenMLDB under the name `d1`. The deployed SQL should be consistent with the offline feature computation SQL.

```{note}
The deployment name of `DEPLOY` must be unique. If you want to change a deployment after it has succeeded, change the deployment name, or delete the existing deployment `d1` before re-deploying.
```

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='online';
'''
and `sql-1`='''
DEPLOY d1 SELECT trip_duration, passenger_count,
sum(pickup_latitude) OVER w AS vendor_sum_pl,
max(pickup_latitude) OVER w AS vendor_max_pl,
min(pickup_latitude) OVER w AS vendor_min_pl,
avg(pickup_latitude) OVER w AS vendor_avg_pl,
sum(pickup_latitude) OVER w2 AS pc_sum_pl,
max(pickup_latitude) OVER w2 AS pc_max_pl,
min(pickup_latitude) OVER w2 AS pc_min_pl,
avg(pickup_latitude) OVER w2 AS pc_avg_pl,
count(vendor_id) OVER w2 AS pc_cnt,
count(vendor_id) OVER w AS vendor_cnt
FROM t1
WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW);
'''
and db="db1"
and action="ddl";
```

### Step 7: Import Online Data

Online real-time prediction usually requires importing recent historical data into online storage. Besides importing data files, real-time data sources can also be connected in a production environment. For simplicity, this example directly imports the original dataset (real-time prediction uses new real-time data as the request, so there is no concern about predicting on the training feature data).

```sql
run command as FeatureStoreExt.`` where
zkAddress="127.0.0.1:2181"
and zkPath="/openmldb"
and `sql-0`='''
SET @@execute_mode='online';
'''
and `sql-1`='''
SET @@sync_job=true;
'''
and `sql-2`='''
SET @@job_timeout=20000000;
'''
and `sql-3`='''
LOAD DATA INFILE '/byzermnt/tmp/upload/taxi_tour_table_train_simple.csv'
INTO TABLE t1 options(format='csv',mode='append');
'''
and db="db1"
and action="ddl";
```

### Step 8: Model Deployment

Register the previously trained and saved model as a function that can be called directly:

```sql
register LinearRegression.`/model/taxi-trip` as taxi_trip_model_predict;
```
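
Conceptually, `register` turns the saved model into a callable function. A toy Python analogue, with fabricated coefficients standing in for the trained weights:

```python
import numpy as np

# Fabricated linear model standing in for the registered
# taxi_trip_model_predict function.
coef, intercept = np.array([2.0, -1.0, 0.5]), 3.0

def taxi_trip_model_predict(features: np.ndarray) -> float:
    """Apply the linear model to one dense feature vector."""
    return float(features @ coef + intercept)

print(taxi_trip_model_predict(np.array([1.0, 1.0, 1.0])))  # 4.5
```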

### Step 9: Real-Time Predictive Testing

Typically, real-time feature prediction is driven by real-time data. For the convenience of this demonstration, we still perform "real-time feature computation + prediction" in the notebook. We use the [Python Environment](https://docs.byzer.org/#/byzer-lang/zh-cn/python/env) set up from the following requirements file for real-time feature computation.

```
pyarrow==4.0.1
ray[default]==1.8.0
aiohttp==3.7.4
pandas>=1.0.5; python_version < '3.7'
pandas>=1.2.0; python_version >= '3.7'
requests
matplotlib~=3.3.4
uuid~=1.30
pyjava
protobuf==3.20.0 # newly added; if the protobuf version is too high, `import ray` fails
```
```
pip install -r requirements.txt
```

We construct "real-time data" and request OpenMLDB over HTTP to compute the real-time features. The computed features are saved as a file and then loaded into the Byzer environment.
```
!python env "PYTHON_ENV=:";
!python conf "runIn=driver";
!python conf "schema=file";
run command as Ray.`` where
inputTable="command"
and outputTable="test_feature"
and code='''
import numpy as np
import os
import pandas as pd
import ray
import requests
import json
from pyjava.api.mlsql import RayContext,PythonContext

ray_context = RayContext.connect(globals(),None)

resp = requests.post('http://127.0.0.1:9080/dbs/db1/deployments/d1', json=json.loads('{"input":[["id0376262", 1, 1467302350000, 1467304896000, 2, -73.873093, 40.774097, -73.926704, 40.856739, "N", 1]], "need_schema":true}'))

res = json.loads(resp.text)["data"]
schema_names = [(col["name"], col["type"]) for col in res["schema"]]
df = pd.DataFrame.from_records(np.array([tuple(res["data"][0])], dtype=schema_names))
df.to_parquet('/mlsql/admin/tmp/test_feature.parquet')

context.build_result([])
''';
```

Convert all int-type fields of the processed online data (loaded as `feature_data_test`) to double:

```sql
select *,
cast(passenger_count as double) as passenger_count_d,
cast(pc_cnt as double) as pc_cnt_d,
cast(vendor_cnt as double) as vendor_cnt_d
from feature_data_test
as new_feature_data_test;
```

Then perform vectorization:

```sql
select vec_dense(array(
passenger_count_d,
vendor_sum_pl,
vendor_max_pl,
vendor_min_pl,
vendor_avg_pl,
pc_sum_pl,
pc_max_pl,
pc_min_pl,
pc_avg_pl,
pc_cnt_d,
vendor_cnt
)) as features
from new_feature_data_test
as testing_table;
```

Use the processed test set for prediction; the result is the predicted `trip_duration`.

```sql
select taxi_trip_model_predict(testing_table) as predict_label;
```