
Commit de8d284

Merge pull request #231099 from whhender/adla-health-updates
Adla health updates
2 parents 468498d + a68a697 commit de8d284

File tree

2 files changed: +23 −17 lines changed


articles/data-lake-analytics/data-lake-analytics-data-lake-tools-data-skew-solutions.md

Lines changed: 13 additions & 8 deletions
@@ -1,27 +1,32 @@
 ---
-title: Resolve data-skew - Azure Data Lake Tools for Visual Studio
-description: Troubleshooting potential solutions for data-skew problems by using Azure Data Lake Tools for Visual Studio.
+title: Resolve data-skew in Azure Data Lake Analytics using tools for Visual Studio
+description: Troubleshoot potential solutions for data-skew problems in Azure Data Lake Analytics by using Azure Data Lake Tools for Visual Studio.
 ms.reviewer: whhender
 ms.service: data-lake-analytics
 ms.topic: how-to
-ms.date: 01/20/2023
+ms.date: 03/16/2023
 ---

-# Resolve data-skew problems by using Azure Data Lake Tools for Visual Studio
+# Resolve data-skew problems in Azure Data Lake Analytics using Azure Data Lake Tools for Visual Studio

 [!INCLUDE [retirement-flag](includes/retirement-flag.md)]

 ## What is data skew?

-Briefly stated, data skew is an over-represented value. Imagine that you've assigned 50 tax examiners to audit tax returns, one examiner for each US state. The Wyoming examiner, because the population there is small, has little to do. In California, however, the examiner is kept busy because of the state's large population.
+Briefly stated, data skew is an over-represented value. Imagine that you've assigned 50 tax examiners to audit tax returns, one examiner for each US state. The Wyoming examiner, because the population there's small, has little to do. In California, however, the examiner is kept busy because of the state's large population.

 :::image type="content" source="./media/data-lake-analytics-data-lake-tools-data-skew-solutions/data-skew-problem.png" alt-text="A sample column chart showing the majority of data being grouped into two columns, rather than being evenly spread across categories." lightbox="./media/data-lake-analytics-data-lake-tools-data-skew-solutions/data-skew-problem.png":::

 In our scenario, the data is unevenly distributed across all tax examiners, which means that some examiners must work more than others. In your own job, you frequently experience situations like the tax-examiner example here. In more technical terms, one vertex gets much more data than its peers, a situation that makes the vertex work more than the others and that eventually slows down an entire job. What's worse, the job might fail, because vertices might have, for example, a 5-hour runtime limitation and a 6-GB memory limitation.

 ## Resolving data-skew problems

-Azure Data Lake Tools for Visual Studio can help detect whether your job has a data-skew problem. If a problem exists, you can resolve it by trying the solutions in this section.
+Azure Data Lake Tools for Visual Studio and Visual Studio Code can help detect whether your job has a data-skew problem.
+
+- [Install Azure Data Lake Tools for Visual Studio](data-lake-analytics-data-lake-tools-get-started.md#install-azure-data-lake-tools-for-visual-studio)
+- [Install Azure Data Lake Tools for Visual Studio Code](data-lake-analytics-data-lake-tools-for-vscode.md)
+
+If a problem exists, you can resolve it by trying the solutions in this section.

 ## Solution 1: Improve table partitioning
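The skewed distribution this article describes (one vertex receiving far more data than its peers) can be made concrete with a small sketch. This is illustrative only: the rows, key names, and the `factor=3` threshold are invented for this example, not part of the docs change.

```python
from collections import Counter

def find_skewed_keys(records, key_fn, factor=3):
    """Return keys whose record count exceeds `factor` times the mean per-key count."""
    counts = Counter(key_fn(r) for r in records)
    mean = sum(counts.values()) / len(counts)
    return {k: c for k, c in counts.items() if c > factor * mean}

# Hypothetical per-state rows: California is heavily over-represented,
# mirroring the tax-examiner analogy in the article.
rows = [("CA", i) for i in range(1000)] + [("WY", 1), ("VT", 2), ("ND", 3)]
print(find_skewed_keys(rows, key_fn=lambda r: r[0]))  # {'CA': 1000}
```

A key flagged here corresponds to the "busy examiner": the vertex processing it does several times the average work.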

@@ -129,11 +134,11 @@ You can sometimes write a user-defined operator to deal with complicated process

 ### Option 1: Use a recursive reducer, if possible

-By default, a user-defined reducer runs in non-recursive mode, which means that reduce work for a key is distributed into a single vertex. But if your data is skewed, the huge data sets might be processed in a single vertex and run for a long time.
+By default, a user-defined reducer runs in nonrecursive mode, which means that reduce work for a key is distributed into a single vertex. But if your data is skewed, the huge data sets might be processed in a single vertex and run for a long time.

 To improve performance, you can add an attribute in your code to define reducer to run in recursive mode. Then, the huge data sets can be distributed to multiple vertices and run in parallel, which speeds up your job.

-To change a non-recursive reducer to recursive, you need to make sure that your algorithm is associative. For example, the sum is associative, and the median isn't. You also need to make sure that the input and output for reducer keep the same schema.
+To change a nonrecursive reducer to recursive, you need to make sure that your algorithm is associative. For example, the sum is associative, and the median isn't. You also need to make sure that the input and output for reducer keep the same schema.

 Attribute of recursive reducer:
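The associativity requirement in the hunk above (sum works recursively, median doesn't) can be demonstrated with a quick sketch. The `split_reduce` helper is an assumed name that imitates what recursive mode does: reduce each chunk, then reduce the chunk results.

```python
import statistics

def split_reduce(values, reduce_fn, parts=2):
    """Reduce each chunk, then reduce the chunk results (what recursive mode does)."""
    chunks = [values[i::parts] for i in range(parts)]
    return reduce_fn([reduce_fn(chunk) for chunk in chunks])

# Sum is associative: the two-stage reduction matches the single-pass result.
assert split_reduce(list(range(1, 101)), sum) == sum(range(1, 101))  # both 5050

# Median is not: the median of per-chunk medians can differ from the true median.
data = [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(statistics.median(data))                 # 0 (true median)
print(split_reduce(data, statistics.median))   # 0.25 (wrong)
```

This is why the article limits recursive mode to associative algorithms: a non-associative reduce silently produces a different answer once the work is split across vertices.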

articles/data-lake-analytics/data-lake-analytics-diagnostic-logs.md

Lines changed: 10 additions & 9 deletions
@@ -3,7 +3,7 @@ title: Enable and view diagnostic logs for Azure Data Lake Analytics
 description: Understand how to set up and access diagnostic logs for Azure Data Lake Analytics
 ms.service: data-lake-analytics
 ms.topic: how-to
-ms.date: 11/15/2022
+ms.date: 03/16/2023
 ---
 # Accessing diagnostic logs for Azure Data Lake Analytics

@@ -31,7 +31,7 @@ Diagnostic logging allows you to collect data access audit trails. These logs pr
 * Select **Archive to a storage account** to store logs in an Azure storage account. Use this option if you want to archive the data. If you select this option, you must provide an Azure storage account to save the logs to.

-* Select **Stream to an event hub** to stream log data to an Azure Event Hub. Use this option if you have a downstream processing pipeline that is analyzing incoming logs in real time. If you select this option, you must provide the details for the Azure Event Hub you want to use.
+* Select **Stream to an event hub** to stream log data to an Azure Event Hubs. Use this option if you have a downstream processing pipeline that is analyzing incoming logs in real time. If you select this option, you must provide the details for the Azure Event Hubs you want to use.

 * Select **Send to Log Analytics workspace** to send the data to the Azure Monitor service. Use this option if you want to use Azure Monitor logs to gather and analyze logs.

@@ -84,6 +84,10 @@ Diagnostic logging allows you to collect data access audit trails. These logs pr
 `https://adllogs.blob.core.windows.net/insights-logs-requests/resourceId=/SUBSCRIPTIONS/<sub-id>/RESOURCEGROUPS/myresourcegroup/PROVIDERS/MICROSOFT.DATALAKEANALYTICS/ACCOUNTS/mydatalakeanalytics/y=2016/m=07/d=18/h=14/m=00/PT1H.json`

+## Process the log data
+
+Azure Data Lake Analytics provides a sample on how to process and analyze the log data. You can find the sample at [https://github.com/Azure/AzureDataLake/tree/master/Samples/AzureDiagnosticsSample](https://github.com/Azure/AzureDataLake/tree/master/Samples/AzureDiagnosticsSample).
+
 ## Log structure

 The audit and request logs are in a structured JSON format.
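The sample blob path in the hunk above encodes the log window in its `y=`/`m=`/`d=`/`h=`/`m=` segments. A minimal sketch of extracting those parts (the path string is copied from the article, including its `<sub-id>` placeholder, which is left as-is):

```python
import re

# Path layout taken from the sample blob URL in the article.
path = ("https://adllogs.blob.core.windows.net/insights-logs-requests/resourceId=/"
        "SUBSCRIPTIONS/<sub-id>/RESOURCEGROUPS/myresourcegroup/PROVIDERS/"
        "MICROSOFT.DATALAKEANALYTICS/ACCOUNTS/mydatalakeanalytics/"
        "y=2016/m=07/d=18/h=14/m=00/PT1H.json")

# Note `m=` appears twice: month (after y=) and minute (after h=).
match = re.search(r"y=(\d+)/m=(\d+)/d=(\d+)/h=(\d+)/m=(\d+)", path)
year, month, day, hour, minute = match.groups()
print(year, month, day, hour, minute)  # 2016 07 18 14 00
```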
@@ -144,8 +148,8 @@ Here's a sample entry in the JSON-formatted request log. Each blob has one root
 | Path |String |The path the operation was performed on |
 | RequestContentLength |int |The content length of the HTTP request |
 | ClientRequestId |String |The identifier that uniquely identifies this request |
-| StartTime |String |The time at which the server received the request |
-| EndTime |String |The time at which the server sent a response |
+| StartTime |String |The time when the server received the request |
+| EndTime |String |The time when the server sent a response |

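The request-log fields above include **StartTime** and **EndTime**, which together give per-request latency. A hedged sketch (field names come from the table; the entry values and the ISO-style timestamp format are assumptions for illustration, not the documented wire format):

```python
from datetime import datetime

# Hypothetical request-log entry; only the field names are from the article.
entry = {
    "ClientRequestId": "a8642c45-0000-0000-0000-000000000000",
    "Path": "/webhdfs/v1/mydata.csv",
    "StartTime": "2016-07-07T21:02:53.456Z",
    "EndTime": "2016-07-07T21:02:54.012Z",
}

fmt = "%Y-%m-%dT%H:%M:%S.%fZ"  # assumed timestamp format
latency = (datetime.strptime(entry["EndTime"], fmt)
           - datetime.strptime(entry["StartTime"], fmt))
print(latency.total_seconds())  # 0.556
```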
### Audit logs

@@ -181,7 +185,7 @@ Here's a sample entry in the JSON-formatted audit log. Each blob has one root ob
 | category |String |The log category. For example, **Audit**. |
 | operationName |String |Name of the operation that is logged. For example, JobSubmitted. |
 | resultType |String |A substatus for the job status (operationName). |
-| resultSignature |String |Additional details on the job status (operationName). |
+| resultSignature |String |Extra details on the job status (operationName). |
 | identity |String |The user that requested the operation. For example, [email protected]. |
 | properties |JSON |See the next section (Audit log properties schema) for details |

@@ -205,10 +209,7 @@ Here's a sample entry in the JSON-formatted audit log. Each blob has one root ob
 > [!NOTE]
 > **SubmitTime**, **StartTime**, **EndTime**, and **Parallelism** provide information on an operation. These entries only contain a value if that operation has started or completed. For example, **SubmitTime** only contains a value after **operationName** has the value **JobSubmitted**.

-## Process the log data
-
-Azure Data Lake Analytics provides a sample on how to process and analyze the log data. You can find the sample at [https://github.com/Azure/AzureDataLake/tree/master/Samples/AzureDiagnosticsSample](https://github.com/Azure/AzureDataLake/tree/master/Samples/AzureDiagnosticsSample).
-
 ## Next steps

 [Overview of Azure Data Lake Analytics](data-lake-analytics-overview.md)
+[Troubleshoot U-SQL jobs](runtime-troubleshoot.md)
