|
1 | 1 | ---
|
2 |
| -title: Resolve data-skew - Azure Data Lake Tools for Visual Studio |
3 |
| -description: Troubleshooting potential solutions for data-skew problems by using Azure Data Lake Tools for Visual Studio. |
| 2 | +title: Resolve data-skew in Azure Data Lake Analytics using tools for Visual Studio |
| 3 | +description: Troubleshoot potential solutions for data-skew problems in Azure Data Lake Analytics by using Azure Data Lake Tools for Visual Studio. |
4 | 4 | ms.reviewer: whhender
|
5 | 5 | ms.service: data-lake-analytics
|
6 | 6 | ms.topic: how-to
|
7 |
| -ms.date: 01/20/2023 |
| 7 | +ms.date: 03/16/2023 |
8 | 8 | ---
|
9 | 9 |
|
10 |
| -# Resolve data-skew problems by using Azure Data Lake Tools for Visual Studio |
| 10 | +# Resolve data-skew problems in Azure Data Lake Analytics using Azure Data Lake Tools for Visual Studio |
11 | 11 |
|
12 | 12 | [!INCLUDE [retirement-flag](includes/retirement-flag.md)]
|
13 | 13 |
|
14 | 14 | ## What is data skew?
|
15 | 15 |
|
16 |
| -Briefly stated, data skew is an over-represented value. Imagine that you've assigned 50 tax examiners to audit tax returns, one examiner for each US state. The Wyoming examiner, because the population there is small, has little to do. In California, however, the examiner is kept busy because of the state's large population. |
| 16 | +Briefly stated, data skew is an over-represented value. Imagine that you've assigned 50 tax examiners to audit tax returns, one examiner for each US state. The Wyoming examiner, because the population there's small, has little to do. In California, however, the examiner is kept busy because of the state's large population. |
17 | 17 |
|
18 | 18 | :::image type="content" source="./media/data-lake-analytics-data-lake-tools-data-skew-solutions/data-skew-problem.png" alt-text="A sample column chart showing the majority of data being grouped into two columns, rather than being evenly spread across categories." lightbox="./media/data-lake-analytics-data-lake-tools-data-skew-solutions/data-skew-problem.png":::
|
19 | 19 |
|
20 | 20 | In our scenario, the data is unevenly distributed across all tax examiners, which means that some examiners must work more than others. In your own job, you frequently experience situations like the tax-examiner example here. In more technical terms, one vertex gets much more data than its peers, a situation that makes the vertex work more than the others and that eventually slows down an entire job. What's worse, the job might fail, because vertices might have, for example, a 5-hour runtime limitation and a 6-GB memory limitation.
|
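As a rough illustration of the paragraph above, the following sketch counts how many rows each key would send to a vertex and compares the busiest vertex with the average. It is a minimal sketch in plain Python, not part of Azure Data Lake Tools; the function names, the sample state codes, and the idea of one vertex per key are assumptions made for illustration.

```python
from collections import Counter

def rows_per_vertex(keys):
    """Count the rows each key (and so, by assumption, each vertex) receives."""
    return Counter(keys)

def skew_ratio(counts):
    """Load of the busiest vertex divided by the mean load across vertices."""
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Hypothetical input: the state code on each tax return.
returns = ["CA"] * 90 + ["WY"] * 2 + ["VT"] * 3 + ["TX"] * 5
counts = rows_per_vertex(returns)
ratio = skew_ratio(counts)
```

A ratio close to 1 suggests evenly spread work. A ratio well above 1 means one vertex does several times the average work, which is the situation the tax-examiner example describes.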
21 | 21 |
|
22 | 22 | ## Resolving data-skew problems
|
23 | 23 |
|
24 |
| -Azure Data Lake Tools for Visual Studio can help detect whether your job has a data-skew problem. If a problem exists, you can resolve it by trying the solutions in this section. |
| 24 | +Azure Data Lake Tools for Visual Studio and Visual Studio Code can help detect whether your job has a data-skew problem. |
| 25 | + |
| 26 | +- [Install Azure Data Lake Tools for Visual Studio](data-lake-analytics-data-lake-tools-get-started.md#install-azure-data-lake-tools-for-visual-studio) |
| 27 | +- [Install Azure Data Lake Tools for Visual Studio Code](data-lake-analytics-data-lake-tools-for-vscode.md) |
| 28 | + |
| 29 | +If a problem exists, you can resolve it by trying the solutions in this section. |
25 | 30 |
|
26 | 31 | ## Solution 1: Improve table partitioning
|
27 | 32 |
|
@@ -129,11 +134,11 @@ You can sometimes write a user-defined operator to deal with complicated process
|
129 | 134 |
|
130 | 135 | ### Option 1: Use a recursive reducer, if possible
|
131 | 136 |
|
132 |
| -By default, a user-defined reducer runs in non-recursive mode, which means that reduce work for a key is distributed into a single vertex. But if your data is skewed, the huge data sets might be processed in a single vertex and run for a long time. |
| 137 | +By default, a user-defined reducer runs in nonrecursive mode, which means that reduce work for a key is distributed into a single vertex. But if your data is skewed, the huge data sets might be processed in a single vertex and run for a long time. |
133 | 138 |
|
134 | 139 | To improve performance, you can add an attribute in your code to define reducer to run in recursive mode. Then, the huge data sets can be distributed to multiple vertices and run in parallel, which speeds up your job.
|
135 | 140 |
|
136 |
| -To change a non-recursive reducer to recursive, you need to make sure that your algorithm is associative. For example, the sum is associative, and the median isn't. You also need to make sure that the input and output for reducer keep the same schema. |
| 141 | +To change a nonrecursive reducer to recursive, you need to make sure that your algorithm is associative. For example, the sum is associative, and the median isn't. You also need to make sure that the input and output for reducer keep the same schema. |
137 | 142 |
|
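The associativity requirement can be checked with a small experiment. The sketch below is plain Python rather than a U-SQL reducer, and the helper names and sample data are hypothetical. It reduces the data in chunks, then reduces the partial results again, which mimics what a recursive reducer does across vertices.

```python
from statistics import median

def chunks(values, size):
    """Split values into consecutive pieces, one per simulated vertex."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def recursive_reduce(values, combine, size):
    """Reduce each chunk, then reduce the partial results once more."""
    partials = [combine(chunk) for chunk in chunks(values, size)]
    return combine(partials)

data = [1, 2, 100, 3, 4, 200, 5, 6, 300]

# Sum is associative: combining partial sums gives the true total.
sum_matches = recursive_reduce(data, sum, 3) == sum(data)

# Median is not: the median of partial medians differs from the true median.
median_matches = recursive_reduce(data, median, 3) == median(data)
```

With this data, `sum_matches` is true and `median_matches` is false. Each partial result is also a single number, the same shape as an input value, which reflects the requirement that a recursive reducer's input and output share a schema.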
138 | 143 | Attribute of recursive reducer:
|
139 | 144 |
|