|
| 1 | +--- |
| 2 | +title: "Data science code testing on Azure with UCI adult income prediction dataset - Team Data Science Process (TDSP) and Visual Studio Team Services (VSTS)" |
| 3 | +description: Data Science Code Testing with UCI Adult Income Prediction Data |
| 4 | +services: machine-learning, team-data-science-process |
| 5 | +documentationcenter: '' |
| 6 | +author: weig |
| 7 | +manager: deguhath |
| 8 | +editor: cgronlun |
| 9 | + |
| 10 | +ms.assetid: b8fbef77-3e80-4911-8e84-23dbf42c9bee |
| 11 | +ms.service: machine-learning |
| 12 | +ms.workload: data-services |
| 13 | +ms.tgt_pltfrm: na |
| 14 | +ms.devlang: na |
| 15 | +ms.topic: article |
| 16 | +ms.date: 05/19/2018 |
| 17 | +ms.author: weig |
| 18 | +--- |
| 19 | +# Data science code testing with UCI adult income prediction dataset |
| 20 | +This article gives preliminary guidelines for testing code in a data science workflow. Such testing gives data scientists a systematic and efficient way to check the quality and expected outcome of their code. To show how code testing can be done, we use a previously published Team Data Science Process (TDSP) [project that uses the UCI Adult Income dataset](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome). |
| 21 | + |
| 22 | +## Introduction on code testing |
| 23 | +"Unit testing" is a longstanding practice in software development. For data science, however, it is often unclear what it means and how to test code for the different stages of a data science lifecycle, such as data preparation, data quality examination, modeling, and model deployment. In this article, we replace the term "unit testing" with "code testing". By code testing, we mean functions that help assess whether the code for a certain step of a data science lifecycle produces results "as expected". The person writing the test defines what "as expected" means, depending on the outcome of the function being tested, for example a data quality check or a modeling step. |
| 24 | + |
| 25 | +Useful resources are listed in the References section at the end of this article. |
| 26 | + |
| 27 | +## Visual Studio Team Services (VSTS) for testing framework |
| 28 | +This article describes how to perform and automate testing by using VSTS. You may decide to use alternative tools. It also shows how to set up an automatic build by using VSTS and build agents. For the build agent, we used an Azure Data Science Virtual Machine (DSVM). |
| 29 | + |
| 30 | +## Overall flow of code testing |
| 31 | +The overall workflow for code testing in a data science project looks like this: |
| 32 | + |
| 33 | + <img src="./media/code-test/test-flow-chart.PNG" width="900" height="400"> |
| 34 | + |
| 35 | + |
| 36 | + |
| 37 | +## Detailed steps |
| 38 | + |
| 39 | +### Setting up and running code tests and an automated build with a build agent and VSTS |
| 40 | + |
| 41 | +1. Create a project in the Visual Studio desktop application: |
| 42 | + |
| 43 | + <img src="./media/code-test/create_project.PNG" width="900" height="700"> |
| 44 | + |
| 45 | +2. After you create the project, you can find it in Solution Explorer in the right panel: |
| 46 | + |
| 47 | +  |
| 48 | + |
| 49 | +  |
| 50 | + |
| 51 | +3. Push your project code to the VSTS project code repository: |
| 52 | + |
| 53 | +  |
| 54 | + |
| 55 | +4. Testing code for data processing |
| 56 | +Suppose you have done some data preparation work, such as data ingestion, feature engineering, and creating label columns, and you want to make sure your code generates the results you expect. The following checks can be used to test whether the data processing code works properly: |
| 57 | + |
| 58 | + * Check column names are right |
| 59 | + |
| 60 | +  |
| 61 | + |
| 62 | + * Check response levels are right |
| 63 | + |
| 64 | +  |
| 65 | + |
| 66 | + * Check response percentage is reasonable |
| 67 | + |
| 68 | +  |
| 69 | + |
| 70 | + * Check missing rate of each column in the data |
| 71 | + |
| 72 | +  |
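The four checks above appear only as screenshots, so here is a minimal sketch of what such functions might look like. It is not the code from the sample repository: the function names, column names, label values, and thresholds are all assumptions made for illustration, and the data is held in plain Python dictionaries so the sketch runs without extra packages.

```python
from collections import Counter

# Hypothetical expectations; adjust them to your own prepared dataset.
EXPECTED_COLUMNS = ["age", "education", "hours_per_week", "income"]
EXPECTED_LEVELS = {"<=50K", ">50K"}


def check_column_names(rows, expected=EXPECTED_COLUMNS):
    """Return True if every row has exactly the expected columns, in order."""
    return all(list(row.keys()) == expected for row in rows)


def check_response_levels(rows, label="income", expected=EXPECTED_LEVELS):
    """Return True if the label column contains exactly the expected levels."""
    return {row[label] for row in rows} == expected


def check_response_percentage(rows, label="income", positive=">50K",
                              low=0.05, high=0.95):
    """Return True if the share of positive labels lies within [low, high]."""
    counts = Counter(row[label] for row in rows)
    share = counts[positive] / len(rows)
    return low <= share <= high


def check_missing_rate(rows, max_rate=0.5):
    """Return True if no column has more than max_rate missing (None) values."""
    for column in rows[0]:
        missing = sum(1 for row in rows if row[column] is None)
        if missing / len(rows) > max_rate:
            return False
    return True


# Small made-up sample standing in for the prepared data.
sample = [
    {"age": 39, "education": "Bachelors", "hours_per_week": 40, "income": "<=50K"},
    {"age": 50, "education": None, "hours_per_week": 13, "income": ">50K"},
    {"age": 38, "education": "HS-grad", "hours_per_week": 40, "income": "<=50K"},
    {"age": 53, "education": "Masters", "hours_per_week": None, "income": ">50K"},
]
```

Each function returns a Boolean rather than raising, which makes it easy to wrap the checks in whatever test framework the project uses. The acceptable ranges are judgment calls that depend on the dataset.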
| 73 | + |
| 74 | + |
| 75 | +5. Testing code for model scoring |
| 76 | +After you have done the data processing and feature engineering work and trained a good model, you want to make sure the model can score new data sets correctly. The following two tests can be used to check the prediction levels and the distribution of the predicted label values: |
| 77 | + |
| 78 | + * Check prediction levels |
| 79 | + |
| 80 | +  |
| 81 | + |
| 82 | + * Check prediction value distribution |
| 83 | + |
| 84 | +  |
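As a rough sketch of the two prediction checks above (the screenshots are the authoritative version), the functions below assume that the model output is a list of 0/1 labels. The function names, the expected levels, and the accepted range for the positive share are assumptions for illustration.

```python
def check_prediction_levels(predictions, expected_levels=(0, 1)):
    """Return True if every predicted label is one of the expected levels."""
    return set(predictions) <= set(expected_levels)


def check_prediction_distribution(predictions, positive=1, low=0.05, high=0.6):
    """Return True if the share of positive predictions lies within [low, high]."""
    if not predictions:
        return False
    share = sum(1 for p in predictions if p == positive) / len(predictions)
    return low <= share <= high


# Hypothetical model output for eight scored records.
predictions = [0, 0, 1, 0, 1, 0, 0, 0]
```

A distribution check like this does not prove that the predictions are correct. It only flags output that looks implausible, such as a model that predicts the same class for every record.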
| 85 | + |
| 86 | +6. Put all the test functions together |
| 87 | +Collect all the test functions in a Python script called **test_functions.py**: |
| 88 | + |
| 89 | +  |
| 90 | + |
| 91 | + |
| 92 | +7. After the test code is prepared, set up the testing environment in Visual Studio |
| 93 | + |
| 94 | + - Create a Python file called **test1.py**. In this file, create a class that includes all the tests you want to run. In this example, six tests are prepared. |
| 95 | + |
| 96 | +  |
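The following sketch shows one way a test class like the one in **test1.py** could be structured with the standard library `unittest` module, which discovers tests in classes that derive from `unittest.TestCase` and in methods whose names start with `test`. The class name, the fixtures, and the two simplified check functions are hypothetical. In the project, the checks would be imported from **test_functions.py** instead of being defined inline.

```python
import unittest


# Simplified stand-ins for functions that would live in test_functions.py.
def check_response_levels(labels, expected=frozenset({"<=50K", ">50K"})):
    return set(labels) == set(expected)


def check_prediction_levels(predictions, expected=frozenset({0, 1})):
    return set(predictions) <= set(expected)


class CodeTests(unittest.TestCase):  # hypothetical class name
    def setUp(self):
        # Hypothetical fixtures standing in for the prepared data and model output.
        self.labels = ["<=50K", ">50K", "<=50K"]
        self.predictions = [0, 1, 0, 0]

    def test_response_levels(self):
        self.assertTrue(check_response_levels(self.labels))

    def test_prediction_levels(self):
        self.assertTrue(check_prediction_levels(self.predictions))


def run_tests():
    """Load and run the tests in CodeTests and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CodeTests)
    return unittest.TextTestRunner(verbosity=2).run(suite)


result = run_tests()
```

`result.wasSuccessful()` reports whether every test passed. When the file is run as a script, calling `unittest.main()` instead is a common alternative, because it exits with a non-zero status when a test fails.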
| 97 | + |
| 98 | +8. Running all tests by using Test Explorer |
| 99 | +The tests can be discovered automatically if you put **codetest.testCase** after your class name. Open **Test Explorer** in the right panel and select **Run All**. The tests run sequentially, and Test Explorer reports whether each test succeeded. |
| 100 | + |
| 101 | +  |
| 102 | + |
| 103 | +9. Check in your code to the remote repository |
| 104 | +Check in your code to the project repository by using git commands. Your most recent work will be reflected in VSTS shortly afterward. |
| 105 | + |
| 106 | +  |
| 107 | + |
| 108 | +  |
| 109 | + |
| 110 | +10. Set up automatic build and test in VSTS |
| 111 | + |
| 112 | + * In the project repository, click **Build and Release**, click **+New** to create a new build process. |
| 113 | + |
| 114 | +  |
| 115 | + |
| 116 | + * Follow the prompts on the screen to select your source code location, project name, repository, and branch information. |
| 117 | + |
| 118 | +  |
| 119 | + |
| 120 | + * Select a template. Because there is no Python project template, start with an **Empty Process**. |
| 121 | + |
| 122 | +  |
| 123 | + |
| 124 | + * Name the build and select the agent. You can choose **Default**; in this example, the default agent lets us use a DSVM to complete the build process. For more details about setting up agents, see [Build and release agents](https://docs.microsoft.com/en-us/vsts/build-release/concepts/agents/agents?view=vsts). |
| 125 | + |
| 126 | +  |
| 127 | + |
| 128 | + * Click **+** in the left panel to add a task for this build phase. Because the Python script **test1.py** runs all the checks, this task uses a PowerShell command to run the Python code. |
| 129 | + |
| 130 | +  |
| 131 | + |
| 132 | + * In the PowerShell details, fill in the required information, such as the name and the version of PowerShell. Choose **Inline Script**, and in the box below it, type _python test1.py_. Make sure the environment variable for Python is set up correctly. If you need a different version or kernel of Python, you can explicitly specify its path as shown in the figure. |
| 133 | + |
| 134 | +  |
| 135 | + |
| 136 | + * Click **Save & queue** to finish the build definition process. |
| 137 | + |
| 138 | +  |
| 139 | + |
| 140 | +11. Automatic build process |
| 141 | +Now, every time a new commit is pushed to the code repository (here we use master, but you can define any branch), the build process starts automatically. It runs the **test1.py** file on the agent machine to make sure that everything defined in the code executes as planned. If alerts are set up correctly, you are notified by email when the build finishes. You can also check the build status in VSTS. If the build fails, you can look into the build details to find out which piece is broken. |
| 142 | + |
| 143 | +  |
| 144 | + |
| 145 | +  |
| 146 | + |
| 147 | +## Next steps |
| 148 | +* See the [UCI Income prediction repository](https://github.com/Azure/MachineLearningSamples-TDSPUCIAdultIncome) for concrete examples of unit tests for that data science scenario. |
| 149 | +* Follow the outline and examples above from the UCI Income prediction scenario in your own data science projects. |
| 150 | + |
| 151 | +## References |
| 152 | +* [Team Data Science Process (TDSP)](https://aka.ms/tdsp) |
| 153 | +* [Visual Studio Testing Tools](https://www.visualstudio.com/vs/features/testing-tools/) |
| 154 | +* [VSTS Testing Resources](https://www.visualstudio.com/team-services/) |
| 155 | +* [Data Science Virtual Machine (DSVM)](https://azure.microsoft.com/services/virtual-machines/data-science-virtual-machines/) |