You'll need:
- An OpenShift 4.8 cluster with admin rights. You can create one by following the instructions here, or via RHPDS (Red Hat staff only).
- The OpenShift command-line interface, `oc`, available here
There are two versions of this workshop you can choose to use:
- an FSI Use Case
- a Telco use case

Both are functionally identical, but use different product data examples applicable to the chosen use case. At various points in the workshop, you use the files appropriate to your chosen use case.
REVISIT: This only has the FSI data files.
If you are running this as a workshop, it is recommended that you fork this repo, as there are changes you can make to your instance of the repo that will simplify the experience for the students. See the section Updating Tool URLs below.
Using the example below:
- Clone (or fork) this repo.
- Change directory into the root directory, ml-workshop.
- Create a variable REPO_HOME for this directory
REVISIT: Change to a non-personal repo, and clone based on a tag/branch, e.g. `git clone -b <tag> --single-branch https://github.com/bryonbaker/ml-workshop`

```shell
git clone https://github.com/bryonbaker/ml-workshop
cd ml-workshop
export REPO_HOME=`pwd`
```
- Log on to OpenShift as a Cluster Administrator. (For RHPDS this is opentlc-mgr.)
- Select the Administrator perspective
- Install the Open Data Hub operator. Click Operators > Operator Hub
OpenShift displays the operator catalogue.
- Click the Filter by keyword text box and type open data hub
OpenShift displays the Open Data Hub Operator tile.
- Click the tile
OpenShift displays a Community Operator warning dialog box.
- Click Continue
OpenShift displays the operator details.
- Click Install
OpenShift prompts for the operator configuration details.

- Accept all defaults and click Install
OpenShift installs the operator and displays a dialog box once complete.

- Click View Operator
OpenShift displays the operator details.

The Open Data Hub Operator is now installed. Proceed to create the workshop project and install Open Data Hub.
We will now create the workshop's project and install Open Data Hub into it.
Before we do this, we need to copy the Open Data Hub KfDef file that instructs the operator which tools to install and how to configure them.
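For orientation, a KfDef is just a custom resource listing the applications the operator should deploy. The outline below is a hypothetical sketch only (the structure follows the kustomizeConfig/repoRef snippet you will see in the installation steps); use the real contents of ml-workshop-limited.yaml, not this sketch:

```yaml
# Hypothetical outline only - use the real ml-workshop-limited.yaml contents.
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: ml-workshop
  namespace: ml-workshop
spec:
  applications:
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: jupyterhub
      name: jupyterhub
```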
Later in these steps you will also need to:
a. Edit the KfDef file you create in OpenShift with the URL of your cluster. Pay careful attention to those steps or Airflow will not run.
b. Update the certificate for Airflow.
Before installing Open Data Hub you need to copy the KfDef file from a public git repository.
** TODO: Change from Faisal's personal repo.**
- Open the KfDef file from the GitHub repository: https://github.com/masoodfaisal/odh-manifests/blob/master/kfdef/ml-workshop-limited.yaml
- Click the Copy Raw Contents button to copy the file contents to your clipboard. Keep the contents in your clipboard; you will use them shortly.
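As an alternative to copying from the browser, you can fetch the file from the command line. The mapping from a GitHub blob URL to its raw-content URL is mechanical, as this sketch shows (the final curl step assumes network access to GitHub):

```shell
# Convert the browser (blob) URL into its raw-content equivalent
BLOB_URL="https://github.com/masoodfaisal/odh-manifests/blob/master/kfdef/ml-workshop-limited.yaml"
RAW_URL=$(echo "$BLOB_URL" | sed -e 's|github\.com|raw.githubusercontent.com|' -e 's|/blob/|/|')
echo "$RAW_URL"
# Then download it with: curl -sLO "$RAW_URL"
```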
- Create the ml-workshop project:
1.1 Click Home > Projects
1.2 Click the Create Project button on the top right of the screen
1.3 Click the Name text box and type ml-workshop
1.4 Click Create
OpenShift creates the project.

- Delete the Limit Range for the project:
2.1 Click Administration > LimitRanges
2.2 Click the hamburger button for ml-workshop-core-resource-limits.

2.3 Click Delete LimitRange
OpenShift removes the LimitRange for the project.
- Install Open Data Hub
2.1 Click Operators > Installed Operators
OpenShift displays all the operators currently installed. Note that the ml-workshop project is unselected and All Projects is selected. You must make ml-workshop the active project.
2.2 Click the Projects drop-down list and click ml-workshop

2.3 Click Open Data Hub Operator.
OpenShift displays the operator's details.

2.4 Click Open Data Hub in the operator toolbar.
OpenShift displays the operand details, of which there are none yet.

2.5 Click the Create KfDef button.
2.6 Click the YAML View radio button
OpenShift displays the KfDef YAML editor.

2.7 Replace the entire YAML file with the KfDef YAML you copied to your clipboard in the Prerequisites step above.
This KfDef file will tell OpenShift how to install and configure ODH.
Before you save the KfDef you must edit one line of code.
2.8 Locate the airflow2 overlay in the code

Around line 57 you will see a value field that contains part of the URL to your OpenShift cluster.
2.9 Replace the value with the URI of your cluster, from the .apps through to the .com, as follows:
```yaml
    - kustomizeConfig:
        overlays:
          - custom-image
        parameters:
          - name: OCP_APPS_URI
            # TODO: Change this uri before applying the KfDef
            value: .apps.cluster-9482.9482.sandbox744.opentlc.com
        repoRef:
          name: manifests
          path: ml-workshop-airflow2
```
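One way to find the correct value is to derive it from your console URL. This is a sketch under assumptions: it uses a sample console URL and assumes the standard console hostname prefix (console-openshift-console); on a live cluster you could obtain the URL with `oc whoami --show-console`:

```shell
# Sample console URL; on a real cluster use: CONSOLE_URL=$(oc whoami --show-console)
CONSOLE_URL="https://console-openshift-console.apps.cluster-9482.9482.sandbox744.opentlc.com"
# Strip everything up to and including the console host prefix,
# leaving the .apps...com apps domain expected by the KfDef value field
APPS_URI="${CONSOLE_URL#*console-openshift-console}"
echo "$APPS_URI"   # .apps.cluster-9482.9482.sandbox744.opentlc.com
```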
2.10 Click Create
OpenShift creates the KfDef and proceeds to deploy ODH.
2.11 Click Workloads > Pods to observe the deployment progress.

Be aware this may take several minutes to complete.
The installation phase of Open Data Hub is now complete. Next you will configure the workshop environment.
If you are running ODH for a workshop then you need to configure the users. If you are using the environment as a demo then you can jump forward to the Configure Tools section.
- In a terminal window, type the following commands:
```shell
cd $REPO_HOME/scripts
./setup-users.sh
```
Note: User configuration will invalidate any other logins like opentlc-mgr.
For cluster-admin access you should now use user29.
If you need to create users with different credentials, consult this blog, on which these instructions are based.
The password for all users is openshift.
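Assuming the setup script creates users named user1 through user29 (user29 is stated above; the full range is an assumption), each student logs in from a terminal with a command like the ones this sketch prints:

```shell
# Print a login command for each workshop user
# (user naming and range are assumptions; password per the note above)
for i in $(seq 1 29); do
  echo "oc login -u user$i -p openshift"
done
```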
In this section we will upload the files that will be used for feature engineering. The files are located in the data-files directory in the ml-workshop git project you cloned earlier.
- Open the OpenShift console in your browser.
- Click Networking > Routes
- Scroll down to find minio-ml-workshop-ui.
- Click the Minio URL under the Location heading
OpenShift opens a new browser tab, launches the Minio console, and displays the login screen.

- Enter the following credentials:
- Username: minio
- Password: minio123
- Click Login
Minio displays the main console and all of the existing S3 buckets.

- Scroll down to find the rawdata bucket.
- Click Browse.
Minio displays the bucket contents.
You will now upload two folders (customers and products) to the rawdata bucket.
Minio prompts for the folder to upload.
- Navigate to the data-files directory within the git repository: $REPO_HOME/data-files
- Click the customers folder.
- Click Upload.
Minio uploads the folder and all of its files to the rawdata S3 bucket.
- Repeat the upload for the products folder.
- Click the Clean Complete Objects button to reveal the hidden upload controls.
Now you need to set up Superset to talk to your S3 and Kafka raw data via Trino, exposing the data via SQL.
- Click the URL for Superset
OpenShift opens a new browser tab and displays the Superset login page.

- Enter the following credentials:
- Username: admin
- Password: admin
- Click Data > Databases
Superset displays a list of configured databases.

- Click the + DATABASE button
Superset prompts for the database connection details.
- Click the Supported Databases drop-down list
- Scroll down to the entry Trino and click it.
- Copy and paste the following text into the SQL Alchemy URI text box: `trino://admin@trino-service:8080`
- Click Test Connection.
If all steps have been performed correctly, Superset displays the message Connection looks good!
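For context, the SQLAlchemy URI follows the pattern scheme://user@host:port; here the host is the in-cluster Trino service name. This is an illustrative sketch only (the variable names are hypothetical):

```shell
# Assemble the Trino SQLAlchemy URI from its components
SCHEME=trino
DBUSER=admin          # Trino user; no password is configured in this workshop
HOST=trino-service    # in-cluster Kubernetes service name
PORT=8080
URI="${SCHEME}://${DBUSER}@${HOST}:${PORT}"
echo "$URI"   # trino://admin@trino-service:8080
```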
- Click the Advanced tab in the Edit Database form.
Superset prompts for the advanced database configuration.
- Click SQL Lab.
- Complete the form as illustrated in the following figure:
- Click the + QUERY button.
NOTE: DO NOT SAVE THE QUERY. We don't save it because it only needs to be run once per workshop.
- Copy and paste the following into the query editor:
```sql
CREATE TABLE hive.default.customers (
  customerId varchar,
  gender varchar,
  seniorCitizen varchar,
  partner varchar,
  dependents varchar,
  tenure varchar
)
WITH (
  format = 'CSV',
  skip_header_line_count = 1,
  EXTERNAL_LOCATION = 's3a://rawdata/customers'
)
```
- Click Run.
Superset displays Result - true as shown.
- Replace the SQL command with:
```sql
SELECT
  customers.gender,
  customers.seniorcitizen,
  customers.partner,
  customers.dependents,
  customers.tenure,
  products.*
FROM hive.default.customers customers,
     customerchurn.default.data products
WHERE cast(customers.customerId AS VARCHAR) = cast(products.customerId AS VARCHAR)
```
Run the query as shown. You should see a result set spanning personal and product-consumption customer data.

- Click the SAVE AS button.
Superset displays the Save As dialog box.
- Click the Name text box and replace the text with: Kafka-CSV-Join
- Click the SAVE button.
Superset saves the query.
You are now done with setup!







