---
title: Optimizing sink performance in mapping data flow
titleSuffix: Azure Data Factory & Azure Synapse
description: Learn about optimizing sink performance in mapping data flows in Azure Data Factory and Azure Synapse Analytics pipelines.
author: kromerm
ms.topic: conceptual
ms.author: makromer
ms.service: data-factory
ms.subservice: data-flows
ms.custom: synapse
ms.date: 09/29/2021
---

# Optimizing sinks

When data flows write to sinks, any custom partitioning happens immediately before the write. As with the source, in most cases it is recommended that you keep **Use current partitioning** as the selected partition option. Partitioned data writes significantly faster than unpartitioned data, even if your destination is not partitioned. Below are the individual considerations for the various sink types.

## Azure SQL Database sinks

With Azure SQL Database, the default partitioning should work in most cases. There is a chance that your sink may have too many partitions for your SQL database to handle. If you run into this, reduce the number of partitions output by your SQL Database sink.

### Impact of error row handling on performance

When you enable error row handling ("continue on error") in the sink transformation, the service takes an additional step before writing the compatible rows to your destination table. This additional step carries a small performance penalty, in the range of 5%, with a further small performance hit if you also set the option to write the incompatible rows to a log file.

### Disabling indexes using a SQL script

Disabling indexes before a load into a SQL database can greatly improve the performance of writing to the table. Run the following command before writing to your SQL sink:

`ALTER INDEX ALL ON dbo.[Table Name] DISABLE`

After the write has completed, rebuild the indexes using the following command:

`ALTER INDEX ALL ON dbo.[Table Name] REBUILD`

These can both be done natively using pre- and post-SQL scripts within an Azure SQL Database or Synapse sink in mapping data flows.

:::image type="content" source="media/data-flow/disable-indexes-sql.png" alt-text="Disable indexes":::

> [!WARNING]
> When disabling indexes, the data flow is effectively taking control of the database, and queries are unlikely to succeed at this time. As a result, many ETL jobs are triggered in the middle of the night to avoid this conflict. For more information, learn about the [constraints of disabling SQL indexes](/sql/relational-databases/indexes/disable-indexes-and-constraints).
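
If you want to confirm that every index was re-enabled before queries resume, a quick check against `sys.indexes` works; `dbo.[Table Name]` below is the same placeholder used above:

```sql
-- Lists each index on the table; is_disabled = 1 means the rebuild has not run yet.
SELECT name, is_disabled
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.[Table Name]');
```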

### Scaling up your database

Schedule a resizing of your source and sink Azure SQL Database and data warehouse before your pipeline run to increase throughput and minimize Azure throttling once you reach DTU limits. After your pipeline execution is complete, resize your databases back to their normal run rate.
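
The resize itself can be scripted in T-SQL. A minimal sketch, assuming a hypothetical database named `MyDatabase` on the Standard tier; the `SERVICE_OBJECTIVE` values are illustrative, so substitute the tiers that fit your workload:

```sql
-- Before the pipeline run: scale up for the load.
ALTER DATABASE [MyDatabase] MODIFY (SERVICE_OBJECTIVE = 'S6');

-- After the pipeline completes: scale back to the normal run rate.
ALTER DATABASE [MyDatabase] MODIFY (SERVICE_OBJECTIVE = 'S2');
```

Note that `ALTER DATABASE ... MODIFY` returns before the resize finishes, so if a downstream step depends on the new tier, poll `sys.dm_operation_status` in `master` to confirm the operation has completed.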

## Azure Synapse Analytics sinks

When writing to Azure Synapse Analytics, make sure that **Enable staging** is set to true. This enables the service to write using the [SQL COPY command](/sql/t-sql/statements/copy-into-transact-sql), which effectively loads the data in bulk. You will need to reference an Azure Data Lake Storage Gen2 or Azure Blob Storage account to stage the data when staging is enabled.
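
You do not write the `COPY` statement yourself; the service generates it as part of the staged load. Purely to illustrate what that bulk load looks like, here is a sketch with placeholder storage account, container, and table names:

```sql
COPY INTO dbo.MySinkTable
FROM 'https://mystagingaccount.blob.core.windows.net/staging/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```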

Other than staging, the same best practices apply to Azure Synapse Analytics as to Azure SQL Database.

## File-based sinks

While data flows support a variety of file types, the Spark-native Parquet format is recommended for optimal read and write times.

If the data is evenly distributed, **Use current partitioning** will be the fastest partitioning option for writing files.

### File name options

When writing files, you have a choice of naming options that each have a performance impact.

:::image type="content" source="media/data-flow/file-sink-settings.png" alt-text="Sink options":::

Selecting the **Default** option will write the fastest. Each partition will equate to a file with the Spark default name. This is useful if you are just reading from the folder of data.

Setting a naming **Pattern** will rename each partition file to a more user-friendly name. This operation happens after the write and is slightly slower than choosing the default. **Per partition** allows you to name each individual partition manually.

If a column corresponds to how you wish to output the data, you can select **As data in column**. This reshuffles the data and can impact performance if the columns are not evenly distributed.

**Output to single file** combines all the data into a single partition. This leads to long write times, especially for large datasets. This option is strongly discouraged unless there is an explicit business reason to use it.

## Azure Cosmos DB sinks

When writing to Azure Cosmos DB, altering the throughput and batch size during data flow execution can improve performance. These changes only take effect during the data flow activity run and will return to the original collection settings after conclusion.

**Batch size:** Usually, starting with the default batch size is sufficient. To further tune this value, calculate the rough object size of your data, and make sure that object size * batch size is less than 2 MB. If it is, you can increase the batch size to get better throughput. For example, with documents of roughly 1 KB each, a batch size of 1,000 puts each batch at about 1 MB, leaving room to raise the batch size further.

**Throughput:** Set a higher throughput setting here to allow documents to write faster to Azure Cosmos DB. Keep in mind that a higher throughput setting incurs higher RU costs.

**Write throughput budget:** Use a value smaller than the total RUs per minute. If you have a data flow with a high number of Spark partitions, setting a throughput budget allows more balance across those partitions.

## Next steps

- [Data flow performance overview](concepts-data-flow-performance.md)
- [Optimizing sources](concepts-data-flow-performance-sources.md)
- [Optimizing transformations](concepts-data-flow-performance-transformations.md)
- [Using data flows in pipelines](concepts-data-flow-performance-pipelines.md)

See other Data Flow articles related to performance:

- [Data Flow activity](control-flow-execute-data-flow-activity.md)
- [Monitor Data Flow performance](concepts-data-flow-monitoring.md)
- [Integration Runtime performance](concepts-integration-runtime-performance.md)