articles/data-lake-analytics/understand-spark-data-formats.md
6 additions & 6 deletions
@@ -1,11 +1,11 @@
 ---
 title: Understand Apache Spark data formats for Azure Data Lake Analytics U-SQL developers.
 description: This article describes Apache Spark concepts to help U_SQL developers understand differences between U-SQL and Spark data formats.
-ms.reviewer: jasonh
+ms.reviewer: whhender
 ms.service: data-lake-analytics
 ms.topic: how-to
 ms.custom: understand-apache-spark-data-formats
-ms.date: 01/31/2019
+ms.date: 01/20/2022
 ---
 
 # Understand differences between U-SQL and Spark data formats
@@ -21,13 +21,13 @@ In addition to moving your files, you'll also want to make your data, stored in
 Data stored in files can be moved in various ways:
 
 - Write an [Azure Data Factory](../data-factory/introduction.md) pipeline to copy the data from [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account.
-- Write a Spark job that reads the data from the [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account and writes it to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account. Based on your use case, you may want to write it in a different format such as Parquet if you do not need to preserve the original file format.
+- Write a Spark job that reads the data from the [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account and writes it to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account. Based on your use case, you may want to write it in a different format such as Parquet if you don't need to preserve the original file format.
 
 We recommend that you review the article [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)
 
 ## Move data stored in U-SQL tables
 
-U-SQL tables are not understood by Spark. If you have data stored in U-SQL tables, you'll run a U-SQL job that extracts the table data and saves it in a format that Spark understands. The most appropriate format is to create a set of Parquet files following the Hive metastore's folder layout.
+U-SQL tables aren't understood by Spark. If you have data stored in U-SQL tables, you'll run a U-SQL job that extracts the table data and saves it in a format that Spark understands. The most appropriate format is to create a set of Parquet files following the Hive metastore's folder layout.
 
 The output can be achieved in U-SQL with the built-in Parquet outputter and using the dynamic output partitioning with file sets to create the partition folders. [Process more files than ever and use Parquet](/archive/blogs/azuredatalake/process-more-files-than-ever-and-use-parquet-with-azure-data-lake-analytics) provides an example of how to create such Spark consumable data.
 
@@ -40,8 +40,8 @@ After this transformation, you copy the data as outlined in the chapter [Move da
 Furthermore, if you're copying typed data (from tables), then Parquet and Spark may have different precision and scale for some of the typed values (for example, a float) and may treat null values differently. For example, U-SQL has the C# semantics for null values, while Spark has a three-valued logic for null values.
 
 - Data organization (partitioning)
-U-SQL tables provide two level partitioning. The outer level (`PARTITIONED BY`) is by value and maps mostly into the Hive/Spark partitioning scheme using folder hierarchies. You will need to ensure that the null values are mapped to the right folder. The inner level (`DISTRIBUTED BY`) in U-SQL offers 4 distribution schemes: round robin, range, hash, and direct hash.
-Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function than U-SQL. When you output your U-SQL table data, you will probably only be able to map into the value partitioning for Spark and may need to do further tuning of your data layout depending on your final Spark queries.
+U-SQL tables provide two level partitioning. The outer level (`PARTITIONED BY`) is by value and maps mostly into the Hive/Spark partitioning scheme using folder hierarchies. You'll need to ensure that the null values are mapped to the right folder. The inner level (`DISTRIBUTED BY`) in U-SQL offers four distribution schemes: round robin, range, hash, and direct hash.
+Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function than U-SQL. When you output your U-SQL table data, you'll probably only be able to map into the value partitioning for Spark and may need to do further tuning of your data layout depending on your final Spark queries.
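
The partitioning paragraph touched by this change says that the outer `PARTITIONED BY` level maps onto Hive/Spark folder hierarchies and that null values must land in the right folder. The following is a minimal sketch of that mapping, not part of the article or the commit. It assumes the conventional Hive/Spark folder name `__HIVE_DEFAULT_PARTITION__` for null partition values. The function name `partition_path`, the root path, and the column names are hypothetical, and the sketch doesn't escape special characters in values.

```python
from typing import Mapping, Optional

# Assumption: folder name conventionally used by Hive and Spark for a null
# partition value.
HIVE_NULL_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def partition_path(root: str, partition_values: Mapping[str, Optional[object]]) -> str:
    """Build a Hive-style partition folder path: root/col=value/col=value/...

    Hypothetical helper for illustration only; it does not escape special
    characters in partition values.
    """
    parts = [root.rstrip("/")]
    for column, value in partition_values.items():
        # A null (None) partition value is routed to the default-partition folder.
        folder_value = HIVE_NULL_PARTITION if value is None else str(value)
        parts.append(f"{column}={folder_value}")
    return "/".join(parts)


# Hypothetical partition columns "year" and "region".
print(partition_path("/output/sales", {"year": 2019, "region": "EU"}))
print(partition_path("/output/sales", {"year": 2019, "region": None}))
```

A U-SQL job that writes Parquet files with dynamic output partitioning would need to produce folder names of this shape for Spark to discover the partitions, including the null case.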