
Commit c88a83b

1 parent 91b1124 commit c88a83b

1 file changed: 6 additions & 6 deletions

articles/data-lake-analytics/understand-spark-data-formats.md
@@ -1,11 +1,11 @@
 ---
 title: Understand Apache Spark data formats for Azure Data Lake Analytics U-SQL developers.
 description: This article describes Apache Spark concepts to help U-SQL developers understand differences between U-SQL and Spark data formats.
-ms.reviewer: jasonh
+ms.reviewer: whhender
 ms.service: data-lake-analytics
 ms.topic: how-to
 ms.custom: understand-apache-spark-data-formats
-ms.date: 01/31/2019
+ms.date: 01/20/2022
 ---

 # Understand differences between U-SQL and Spark data formats
@@ -21,13 +21,13 @@ In addition to moving your files, you'll also want to make your data, stored in
 Data stored in files can be moved in various ways:

 - Write an [Azure Data Factory](../data-factory/introduction.md) pipeline to copy the data from [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account.
-- Write a Spark job that reads the data from the [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account and writes it to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account. Based on your use case, you may want to write it in a different format such as Parquet if you do not need to preserve the original file format.
+- Write a Spark job that reads the data from the [Azure Data Lake Storage Gen1](../data-lake-store/data-lake-store-overview.md) account and writes it to the [Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-introduction.md) account. Based on your use case, you may want to write it in a different format such as Parquet if you don't need to preserve the original file format.
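The Spark-job option above can be sketched as follows. This is a minimal illustration, not the article's own code; the Gen1/Gen2 account names, container, and folder paths are hypothetical placeholders, and running it requires a Spark environment with credentials for both storage accounts.

```python
# Sketch: read files from an ADLS Gen1 account and rewrite them into an
# ADLS Gen2 account as Parquet (the original file format need not be
# preserved). Both paths below are hypothetical placeholders.
GEN1_PATH = "adl://contosoadls.azuredatalakestore.net/data/events"
GEN2_PATH = "abfss://data@contosoadls2.dfs.core.windows.net/events"

def copy_gen1_to_gen2(src: str = GEN1_PATH, dst: str = GEN2_PATH) -> None:
    """Read CSV files from Gen1 and write them back out to Gen2 as Parquet."""
    # Imported lazily: this only works inside a configured Spark environment.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gen1-to-gen2-copy").getOrCreate()
    df = spark.read.option("header", "true").csv(src)
    df.write.mode("overwrite").parquet(dst)
```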

 We recommend that you review the article [Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2](../storage/blobs/data-lake-storage-migrate-gen1-to-gen2.md)

 ## Move data stored in U-SQL tables

-U-SQL tables are not understood by Spark. If you have data stored in U-SQL tables, you'll run a U-SQL job that extracts the table data and saves it in a format that Spark understands. The most appropriate format is to create a set of Parquet files following the Hive metastore's folder layout.
+U-SQL tables aren't understood by Spark. If you have data stored in U-SQL tables, you'll run a U-SQL job that extracts the table data and saves it in a format that Spark understands. The most appropriate format is to create a set of Parquet files following the Hive metastore's folder layout.

 The output can be achieved in U-SQL with the built-in Parquet outputter and using the dynamic output partitioning with file sets to create the partition folders. [Process more files than ever and use Parquet](/archive/blogs/azuredatalake/process-more-files-than-ever-and-use-parquet-with-azure-data-lake-analytics) provides an example of how to create such Spark consumable data.
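The Hive metastore folder layout the extracted Parquet files should follow is not spelled out here. As a rough illustration (the base path, table name, and partition columns are hypothetical), each partition becomes a `column=value` folder under the table's root:

```python
def hive_partition_path(base: str, table: str, **partition_values) -> str:
    """Build the Hive-metastore-style folder path for one table partition,
    e.g. <base>/<table>/year=2021/month=3. The Parquet files for that
    partition are then written inside this folder."""
    parts = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{base}/{table}/{parts}" if parts else f"{base}/{table}"
```

For example, `hive_partition_path("abfss://data@contoso.dfs.core.windows.net/warehouse", "sales", year=2021, month=3)` yields `abfss://data@contoso.dfs.core.windows.net/warehouse/sales/year=2021/month=3`.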

@@ -40,8 +40,8 @@ After this transformation, you copy the data as outlined in the chapter [Move da
 Furthermore, if you're copying typed data (from tables), then Parquet and Spark may have different precision and scale for some of the typed values (for example, a float) and may treat null values differently. For example, U-SQL has the C# semantics for null values, while Spark has a three-valued logic for null values.
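The null-semantics difference described above can be sketched in a few lines (Python, with `None` standing in for null; the function names are made up for illustration):

```python
def csharp_null_equals(a, b):
    # C# semantics (which U-SQL follows): null == null evaluates to true.
    return a == b

def sql_null_equals(a, b):
    # Three-valued logic (Hive/Spark SQL): any comparison involving NULL
    # yields UNKNOWN, modelled here as None.
    if a is None or b is None:
        return None
    return a == b

assert csharp_null_equals(None, None) is True   # C#: equal
assert sql_null_equals(None, None) is None      # SQL: unknown, not true
```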

 - Data organization (partitioning)
-  U-SQL tables provide two level partitioning. The outer level (`PARTITIONED BY`) is by value and maps mostly into the Hive/Spark partitioning scheme using folder hierarchies. You will need to ensure that the null values are mapped to the right folder. The inner level (`DISTRIBUTED BY`) in U-SQL offers 4 distribution schemes: round robin, range, hash, and direct hash.
-  Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function than U-SQL. When you output your U-SQL table data, you will probably only be able to map into the value partitioning for Spark and may need to do further tuning of your data layout depending on your final Spark queries.
+  U-SQL tables provide two level partitioning. The outer level (`PARTITIONED BY`) is by value and maps mostly into the Hive/Spark partitioning scheme using folder hierarchies. You'll need to ensure that the null values are mapped to the right folder. The inner level (`DISTRIBUTED BY`) in U-SQL offers four distribution schemes: round robin, range, hash, and direct hash.
+  Hive/Spark tables only support value partitioning or hash partitioning, using a different hash function than U-SQL. When you output your U-SQL table data, you'll probably only be able to map into the value partitioning for Spark and may need to do further tuning of your data layout depending on your final Spark queries.
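On mapping null values to the right folder: Hive and Spark conventionally place rows whose partition-key value is NULL in a special default-partition folder. A minimal sketch, assuming that convention:

```python
# "__HIVE_DEFAULT_PARTITION__" is the folder name Hive and Spark use for
# rows whose partition-key value is NULL.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_folder(column: str, value) -> str:
    """Return the folder name for one partition-key value, sending nulls
    to the default partition so Spark reads them back as NULL."""
    if value is None:
        return f"{column}={HIVE_DEFAULT_PARTITION}"
    return f"{column}={value}"
```

For example, `partition_folder("region", None)` yields `region=__HIVE_DEFAULT_PARTITION__`, while `partition_folder("year", 2021)` yields `year=2021`.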

 ## Next steps

0 commit comments