Commit 4a6dfb2

Perf improvements on format-parquet.md
1 parent 348eea1 commit 4a6dfb2

1 file changed

+36
-19
lines changed

articles/data-factory/format-parquet.md

Lines changed: 36 additions & 19 deletions
@@ -16,7 +16,42 @@ ms.author: jianleishen
 
 Follow this article when you want to **parse the Parquet files or write the data into Parquet format**.
 
-Parquet format is supported for the following connectors: [Amazon S3](connector-amazon-simple-storage-service.md), [Amazon S3 Compatible Storage](connector-amazon-s3-compatible-storage.md), [Azure Blob](connector-azure-blob-storage.md), [Azure Data Lake Storage Gen1](connector-azure-data-lake-store.md), [Azure Data Lake Storage Gen2](connector-azure-data-lake-storage.md), [Azure Files](connector-azure-file-storage.md), [File System](connector-file-system.md), [FTP](connector-ftp.md), [Google Cloud Storage](connector-google-cloud-storage.md), [HDFS](connector-hdfs.md), [HTTP](connector-http.md), [Oracle Cloud Storage](connector-oracle-cloud-storage.md) and [SFTP](connector-sftp.md).
+Parquet format is supported for the following connectors:
+
+- [Amazon S3](connector-amazon-simple-storage-service.md)
+- [Amazon S3 Compatible Storage](connector-amazon-s3-compatible-storage.md)
+- [Azure Blob](connector-azure-blob-storage.md)
+- [Azure Data Lake Storage Gen1](connector-azure-data-lake-store.md)
+- [Azure Data Lake Storage Gen2](connector-azure-data-lake-storage.md)
+- [Azure Files](connector-azure-file-storage.md)
+- [File System](connector-file-system.md)
+- [FTP](connector-ftp.md)
+- [Google Cloud Storage](connector-google-cloud-storage.md)
+- [HDFS](connector-hdfs.md)
+- [HTTP](connector-http.md)
+- [Oracle Cloud Storage](connector-oracle-cloud-storage.md)
+- [SFTP](connector-sftp.md).
+
+For a list of supported features for all available connectors, visit the [Connectors Overview](connector-overview.md) article.
+
+## Using Self-hosted Integration Runtime
+
+> [!IMPORTANT]
+> For copy empowered by Self-hosted Integration Runtime e.g. between on-premises and cloud data stores, if you are not copying Parquet files **as-is**, you need to install the **64-bit JRE 8 (Java Runtime Environment) or OpenJDK** and **Microsoft Visual C++ 2010 Redistributable Package** on your IR machine. Check the following paragraph with more details.
+
+For copy running on Self-hosted IR with Parquet file serialization/deserialization, the service locates the Java runtime by firstly checking the registry *`(SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome)`* for JRE, if not found, secondly checking system variable *`JAVA_HOME`* for OpenJDK.
+
+- **To use JRE**: The 64-bit IR requires 64-bit JRE. You can find it from [here](https://go.microsoft.com/fwlink/?LinkId=808605).
+- **To use OpenJDK**: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies of OpenJDK into Self-hosted IR machine, and set system environment variable JAVA_HOME accordingly.
+- **To install Visual C++ 2010 Redistributable Package**: Visual C++ 2010 Redistributable Package is not installed with self-hosted IR installations. You can find it from [here](https://www.microsoft.com/download/details.aspx?id=26999).
+
+> [!TIP]
+> If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: **java.lang.OutOfMemoryError:Java heap space**", you can add an environment variable `_JAVA_OPTIONS` in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.
+
+:::image type="content" source="./media/supported-file-formats-and-compression-codecs/set-jvm-heap-size-on-selfhosted-ir.png" alt-text="Set JVM heap size on Self-hosted IR":::
+
+Example: set variable `_JAVA_OPTIONS` with value `-Xms256m -Xmx16g`. The flag `Xms` specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while `Xmx` specifies the maximum memory allocation pool. This means that JVM will be started with `Xms` amount of memory and will be able to use a maximum of `Xmx` amount of memory. By default, the service uses min 64 MB and max 1G.
+
 
 ## Dataset properties
 
@@ -153,24 +188,6 @@ ParquetSource sink(
 
 Parquet complex data types (e.g. MAP, LIST, STRUCT) are currently supported only in Data Flows, not in Copy Activity. To use complex types in data flows, do not import the file schema in the dataset, leaving schema blank in the dataset. Then, in the Source transformation, import the projection.
 
-## Using Self-hosted Integration Runtime
-
-> [!IMPORTANT]
-> For copy empowered by Self-hosted Integration Runtime e.g. between on-premises and cloud data stores, if you are not copying Parquet files **as-is**, you need to install the **64-bit JRE 8 (Java Runtime Environment) or OpenJDK** and **Microsoft Visual C++ 2010 Redistributable Package** on your IR machine. Check the following paragraph with more details.
-
-For copy running on Self-hosted IR with Parquet file serialization/deserialization, the service locates the Java runtime by firstly checking the registry *`(SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome)`* for JRE, if not found, secondly checking system variable *`JAVA_HOME`* for OpenJDK.
-
-- **To use JRE**: The 64-bit IR requires 64-bit JRE. You can find it from [here](https://go.microsoft.com/fwlink/?LinkId=808605).
-- **To use OpenJDK**: It's supported since IR version 3.13. Package the jvm.dll with all other required assemblies of OpenJDK into Self-hosted IR machine, and set system environment variable JAVA_HOME accordingly.
-- **To install Visual C++ 2010 Redistributable Package**: Visual C++ 2010 Redistributable Package is not installed with self-hosted IR installations. You can find it from [here](https://www.microsoft.com/download/details.aspx?id=26999).
-
-> [!TIP]
-> If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: **java.lang.OutOfMemoryError:Java heap space**", you can add an environment variable `_JAVA_OPTIONS` in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.
-
-:::image type="content" source="./media/supported-file-formats-and-compression-codecs/set-jvm-heap-size-on-selfhosted-ir.png" alt-text="Set JVM heap size on Self-hosted IR":::
-
-Example: set variable `_JAVA_OPTIONS` with value `-Xms256m -Xmx16g`. The flag `Xms` specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while `Xmx` specifies the maximum memory allocation pool. This means that JVM will be started with `Xms` amount of memory and will be able to use a maximum of `Xmx` amount of memory. By default, the service uses min 64 MB and max 1G.
-
 ## Next steps
 
 - [Copy activity overview](copy-activity-overview.md)
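
Note on the moved "Using Self-hosted Integration Runtime" section: the lookup order it describes (registry entry for JRE first, then the `JAVA_HOME` system variable for OpenJDK) can be sketched as below. This is only an illustration for checking an IR machine by hand, not the service's actual implementation. The function names `read_registry_java_home` and `locate_java_home` are hypothetical, and the sketch assumes the key lives under `HKEY_LOCAL_MACHINE` with a `CurrentVersion` value naming the version subkey; the article itself only gives the path `SOFTWARE\JavaSoft\Java Runtime Environment\{Current Version}\JavaHome`.

```python
import os
from typing import Callable, Mapping, Optional

# Registry path quoted in the article. The hive (HKEY_LOCAL_MACHINE) and the
# "CurrentVersion" value name are assumptions, not stated in the article.
JRE_REGISTRY_KEY = r"SOFTWARE\JavaSoft\Java Runtime Environment"


def read_registry_java_home() -> Optional[str]:
    """Return JavaHome of the current JRE from the registry, or None.

    Hypothetical helper: returns None on non-Windows hosts or when the
    key or value is missing.
    """
    try:
        import winreg  # standard library, Windows only
    except ImportError:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, JRE_REGISTRY_KEY) as key:
            version, _ = winreg.QueryValueEx(key, "CurrentVersion")
        version_key = JRE_REGISTRY_KEY + "\\" + str(version)
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, version_key) as key:
            java_home, _ = winreg.QueryValueEx(key, "JavaHome")
        return str(java_home)
    except OSError:
        return None


def locate_java_home(
    registry_lookup: Callable[[], Optional[str]] = read_registry_java_home,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Mirror the documented order: registry (JRE) first, then JAVA_HOME (OpenJDK)."""
    env = os.environ if env is None else env
    java_home = registry_lookup()
    if java_home:
        return java_home
    # Fall back to the environment variable; treat an empty value as unset.
    return env.get("JAVA_HOME") or None
```

Calling `locate_java_home()` with no arguments on the IR machine returns the path that would be found first under these assumptions, or `None` if neither source is set. The lookups are injectable so the ordering can be checked without touching a real registry.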
