Commit 1517d7f

Merge pull request #41708 from deguhath/master
Adding files for code test content
2 parents 8a88af1 + 41f9b8f commit 1517d7f

27 files changed

+157
-0
lines changed
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
1+
---
2+
title: "Data science code testing on Azure with UCI adult income prediction dataset - Team Data Science Process (TDSP) and Visual Studio Team Services (VSTS)"
3+
description: Data Science Code Testing with UCI Adult Income Prediction Data
4+
services: machine-learning, team-data-science-process
5+
documentationcenter: ''
6+
author: weig
7+
manager: deguhath
8+
editor: cgronlun
9+
10+
ms.assetid: b8fbef77-3e80-4911-8e84-23dbf42c9bee
11+
ms.service: machine-learning
12+
ms.workload: data-services
13+
ms.tgt_pltfrm: na
14+
ms.devlang: na
15+
ms.topic: article
16+
ms.date: 05/19/2018
17+
ms.author: weig
18+
---
19+
# Data science code testing with UCI adult income prediction dataset
20+
In this article, we provide preliminary guidelines for testing code in a data science workflow. Such testing gives data scientists a systematic and efficient way to check the quality and expected outcome of their code. To show how code testing can be done, we use a Team Data Science Process (TDSP) [project that uses the UCI Adult Income dataset](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome) that we published earlier.
21+
22+
## Introduction on code testing
23+
"Unit testing" is a longstanding practice in software development. But for data science, it is often not clear what that means and how one should test code for the different stages of a data science lifecycle, such as data preparation, data quality examination, modeling, and model deployment. For this article, we replace the term "unit testing" with "code testing". By testing, we mean functions that help assess whether the code for a certain step of a data science lifecycle produces results "as expected". What counts as "as expected" is defined by the person writing the test and depends on the outcome of the function under test, for example, a data quality check or a modeling step.
24+
25+
Useful resources are listed in the References section at the end of this article.
26+
27+
## Visual Studio Team Services (VSTS) for testing framework
28+
In this article, we describe how to perform and automate testing using VSTS. You may decide to use alternative tools. We also show how to set up an automatic build using VSTS and build agents. For build agents, we used an Azure Data Science Virtual Machine (DSVM).
29+
30+
## Overall flow of code testing
31+
The overall workflow for code testing in a data science project looks like this:
32+
33+
<img src="./media/code-test/test-flow-chart.PNG" width="900" height="400">
34+
35+
36+
37+
## Detailed steps
38+
39+
### All steps for setup and execution of code testing and automated build using a build agent and VSTS are detailed below.
40+
41+
1. Create a project in the Visual Studio desktop application:
42+
43+
<img src="./media/code-test/create_project.PNG" width="900" height="700">
44+
45+
2. After you create your project, you can find it in Solution Explorer on the right panel:
46+
47+
![create-repo](./media/code-test/create_python_project_in_vs.PNG)
48+
49+
![solution-explorer](./media/code-test/solution_explorer_in_vs.PNG)
50+
51+
3. Add your project code to the VSTS project code repository:
52+
53+
![create-repo](./media/code-test/create_repo.PNG)
54+
55+
4. Testing code for data processing
56+
Suppose you have done some data preparation work, such as data ingestion, feature engineering, and creating label columns. You want to make sure that your code generates the results you expect. The following checks can be used to test whether the data processing code works properly:
57+
58+
* Check column names are right
59+
60+
![check-columns](./media/code-test/check_column_names.PNG)
61+
62+
* Check response levels are right
63+
64+
![response-level](./media/code-test/check_response_levels.PNG)
65+
66+
* Check response percentage is reasonable
67+
68+
![response-percentage](./media/code-test/check_response_percentage.PNG)
69+
70+
* Check missing rate of each column in the data
71+
72+
![missing-rate](./media/code-test/check_missing_rate.PNG)
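
The four checks above appear only as screenshots, so here is a minimal sketch of what such functions might look like. It is not the code from the screenshots. The column names, label values, thresholds, and the list-of-dicts data layout are all assumptions made for illustration, and the same ideas carry over to a data frame.

```python
from collections import Counter

# Hypothetical expectations for illustration; adjust them to your own data.
EXPECTED_COLUMNS = ["age", "education", "hours_per_week", "income"]
EXPECTED_LEVELS = {"<=50K", ">50K"}

# Tiny made-up sample; each row is one record (assumed non-empty below).
SAMPLE_ROWS = [
    {"age": 39, "education": "Bachelors", "hours_per_week": 40, "income": "<=50K"},
    {"age": 52, "education": None, "hours_per_week": 45, "income": ">50K"},
    {"age": 28, "education": "HS-grad", "hours_per_week": 38, "income": "<=50K"},
    {"age": 44, "education": "Masters", "hours_per_week": 50, "income": ">50K"},
]


def check_column_names(rows, expected_columns=EXPECTED_COLUMNS):
    """Return True if every row has exactly the expected columns."""
    expected = set(expected_columns)
    return all(set(row) == expected for row in rows)


def check_response_levels(rows, label="income", expected_levels=EXPECTED_LEVELS):
    """Return True if the label column contains exactly the expected levels."""
    return {row[label] for row in rows} == set(expected_levels)


def check_response_percentage(rows, label="income", positive=">50K",
                              low=0.05, high=0.95):
    """Return True if the share of positive labels lies within [low, high]."""
    counts = Counter(row[label] for row in rows)
    share = counts[positive] / len(rows)
    return low <= share <= high


def check_missing_rate(rows, max_rate=0.5):
    """Return True if no column has more than max_rate missing (None) values."""
    for column in rows[0].keys():
        missing = sum(1 for row in rows if row.get(column) is None)
        if missing / len(rows) > max_rate:
            return False
    return True
```

Each function returns a Boolean, which makes it easy to wrap in an assertion later. The thresholds are deliberately loose here; what counts as a "reasonable" response percentage or missing rate depends on the dataset.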
73+
74+
75+
5. Testing code for feature engineering
76+
After you have finished the data processing and feature engineering work and have trained a good model, you want to make sure that the model can score new data sets correctly. The following two tests can be used to check the prediction levels and the distribution of label values.
77+
78+
* Check prediction levels
79+
80+
![check-prediction-level](./media/code-test/check_prediction_levels.PNG)
81+
82+
* Check prediction value distribution
83+
84+
![check-prediction-values](./media/code-test/check_prediction_values.PNG)
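
As a rough illustration of the two prediction checks, the sketch below assumes the model returns a list of class labels and a list of scores between 0 and 1. The function names, the expected levels, and the `min_spread` threshold are hypothetical and are not taken from the screenshots.

```python
def check_prediction_levels(predictions, expected_levels=frozenset({0, 1})):
    """Return True if every predicted label is one of the expected levels."""
    return set(predictions) <= set(expected_levels)


def check_prediction_values(probabilities, low=0.0, high=1.0, min_spread=0.01):
    """Return True if scores lie in [low, high] and are not all nearly identical."""
    if not probabilities:
        return False
    in_range = all(low <= p <= high for p in probabilities)
    spread = max(probabilities) - min(probabilities)
    return in_range and spread >= min_spread
```

The spread condition is one simple way to catch a degenerate model that gives every record the same score; a stricter check could compare the score distribution against the one seen during training.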
85+
86+
6. Put all the test functions together
87+
Put all the test functions together into a Python script called **test_functions.py**:
88+
89+
![create-test-func](./media/code-test/create_file_test_func.PNG)
90+
91+
92+
7. After the test code is prepared, you can set up the testing environment in Visual Studio
93+
94+
- Create a Python file called **test1.py**. In this file, create a class that includes all the tests you want to run. In this example, six tests are prepared:
95+
96+
![create-test-class](./media/code-test/create_file_test1_class.PNG)
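
A test class of this kind might be sketched as follows, assuming the tests are built on Python's standard `unittest` module. The class name, the sample labels, and the inline `check_response_levels` stand-in are hypothetical; in the project, the check functions would be imported from **test_functions.py** instead.

```python
import unittest


# Stand-in for a check that would normally be imported from test_functions.py.
def check_response_levels(labels, expected_levels=("<=50K", ">50K")):
    """Return True if the labels contain exactly the expected levels."""
    return set(labels) == set(expected_levels)


class TestDataChecks(unittest.TestCase):
    """Groups related checks; methods whose names start with 'test' are discovered."""

    def setUp(self):
        # Made-up label values; a real test would load the prepared data here.
        self.labels = ["<=50K", ">50K", "<=50K"]

    def test_response_levels(self):
        self.assertTrue(check_response_levels(self.labels))

    def test_unexpected_level_is_rejected(self):
        self.assertFalse(check_response_levels(self.labels + ["unknown"]))
```

Keeping the check functions separate from the test class means the same functions can be reused in notebooks or pipelines, while the class only decides which checks run and on which data.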
97+
98+
8. Running all tests using Test Explorer
99+
These tests are discovered automatically if you put **codetest.testCase** after your class name. Open **Test Explorer** on the right panel and click **Run All**. The tests run sequentially, and Test Explorer reports whether each test succeeded.
100+
101+
![run-tests](./media/code-test/run_tests.PNG)
102+
103+
9. Check in your code to the remote repository
104+
Check in your code to the project repository using git commands. Your most recent work will be reflected in VSTS shortly.
105+
106+
![git-checkin](./media/code-test/git_check_in.PNG)
107+
108+
![most-recent-work](./media/code-test/git_check_in_most_recent_work.PNG)
109+
110+
10. Set up automatic build and test in VSTS
111+
112+
* In the project repository, click **Build and Release**, click **+New** to create a new build process.
113+
114+
![create-new-build](./media/code-test/create_new_build.PNG)
115+
116+
* Follow the prompts on the screen to select your source code location, project name, repository, and branch info
117+
118+
![fill-in-build-info](./media/code-test/fill_in_build_info.PNG)
119+
120+
* Select a template. Because there is no Python project template, we start with an **Empty Process**.
121+
122+
![start-empty-template](./media/code-test/start_empty_process_template.PNG)
123+
124+
* Name the build and select the agent. You can choose **Default**; here, the default lets us use a DSVM to run the build process. For more details about setting up agents, see the [agents documentation](https://docs.microsoft.com/en-us/vsts/build-release/concepts/agents/agents?view=vsts).
125+
126+
![select-agent](./media/code-test/select_agent.PNG)
127+
128+
* Click **+** on the left panel to add a task for this build phase. Because we run our Python script **test1.py** to perform all the checks, this task uses a PowerShell command to run the Python code.
129+
130+
![add-task-powershell](./media/code-test/add_task_powershell.PNG)
131+
132+
* In the PowerShell details, fill in the required information, such as the name and version of PowerShell. Choose **Inline Script**, and in the box below type _python test1.py_. Make sure the environment variable for Python is set up correctly. If you need a different version or kernel of Python, you can explicitly specify the path, as shown in the figure.
133+
134+
![powershell-inline-script](./media/code-test/powershell_scripts.PNG)
135+
136+
* Click **Save & queue** to finish the build definition process.
137+
138+
![save-and-queue-build-definition](./media/code-test/save_and_queue_build_definition.PNG)
139+
140+
11. Automatic build process
141+
Now, every time a new commit is pushed to the code repository (here we use master, but you can define any branch), the build process is initiated automatically. It runs the **test1.py** file on the agent machine to make sure that everything defined in the code executes as planned. If alerts are set up correctly, you are notified by email when the build finishes. You can also check the build status in VSTS. If the build fails, you can dig into the build details to find out which piece is broken.
142+
143+
![build-success-email](./media/code-test/email_build_succeed.PNG)
144+
145+
![build-success-vsts](./media/code-test/vs_online_build_succeed.PNG)
146+
147+
## Next steps
148+
* Refer to the [UCI Income prediction repository](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome) for concrete examples of unit tests for that data science scenario.
149+
* Follow the above outline and the examples from the UCI Income prediction scenario in your own data science projects.
150+
151+
## References
152+
* [Team Data Science Process (TDSP)](https://aka.ms/tdsp)
153+
* [Visual Studio Testing Tools](https://www.visualstudio.com/vs/features/testing-tools/)
154+
* [VSTS Testing Resources](https://www.visualstudio.com/team-services/)
155+
* [Data Science Virtual Machine (DSVM)](https://azure.microsoft.com/services/virtual-machines/data-science-virtual-machines/)