Commit 77735c1

Merge pull request #215310 from itechedit/three-flexible-server-how-to-articles

edit pass: three flexible-server-how-to articles

2 parents 1e0a7f2 + f81539a commit 77735c1

3 files changed: +204 additions, -238 deletions
---
title: Upload data in bulk in Azure Database for PostgreSQL - Flexible Server
description: This article discusses best practices for uploading data in bulk in Azure Database for PostgreSQL - Flexible Server.
author: sarat0681
ms.author: sbalijepalli
ms.reviewer: maghan
ms.service: postgresql
ms.topic: conceptual
ms.date: 08/16/2022
ms.custom: template-how-to
---
# Best practices for uploading data in bulk in Azure Database for PostgreSQL - Flexible Server

This article discusses various methods for loading data in bulk in Azure Database for PostgreSQL - Flexible Server, along with best practices for both initial data loads in empty databases and incremental data loads.

## Loading methods

The following data-loading methods are arranged in order from most time consuming to least time consuming:

- Run a single-record `INSERT` command.
- Batch 100 to 1,000 rows per commit. You can use a transaction block to wrap multiple records per commit.
- Run `INSERT` with multiple row values.
- Run the `COPY` command.

The preferred method for loading data into a database is the `COPY` command. If the `COPY` command isn't possible, batch `INSERT` is the next best method. Multi-threading with a `COPY` command is the optimal method for loading data in bulk.
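To make the ranking concrete, here's a minimal sketch that contrasts the methods. The `events` table and the file name are hypothetical, used only for illustration:

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE events (id integer, payload text);

-- Slowest: one single-record INSERT per commit.
INSERT INTO events VALUES (1, 'a');

-- Faster: wrap a batch of single-record INSERTs in one transaction block.
BEGIN;
INSERT INTO events VALUES (2, 'b');
INSERT INTO events VALUES (3, 'c');
COMMIT;

-- Faster still: one INSERT with multiple row values.
INSERT INTO events VALUES (4, 'd'), (5, 'e'), (6, 'f');

-- Fastest: COPY. From psql, \copy streams a client-side file to the server,
-- which works on a flexible server without superuser rights:
-- \copy events FROM 'events.csv' WITH (FORMAT csv, HEADER true)
```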
## Best practices for initial data loads

### Drop indexes

Before you do an initial data load, we recommend that you drop all the indexes in the tables. It's always more efficient to create the indexes after the data is loaded.
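As a sketch (the table and index names here are hypothetical):

```sql
-- Drop the index before the load ...
DROP INDEX IF EXISTS idx_sales_region;

-- ... bulk load the data ...

-- ... then build the index once, over the fully loaded data.
CREATE INDEX idx_sales_region ON sales (region);
```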
### Drop constraints

The main constraints to drop are described here:
* **Unique key constraints**

  To achieve strong performance, we recommend that you drop unique key constraints before an initial data load, and re-create them after the data load is completed. However, dropping unique key constraints cancels the safeguards against duplicated data.

* **Foreign key constraints**

  We recommend that you drop foreign key constraints before the initial data load and re-create them after the data load is completed.

  Changing the `session_replication_role` parameter to `replica` also disables all foreign key checks. However, be aware that making the change can leave data in an inconsistent state if it's not properly used.
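  A minimal sketch of the session-level toggle (run in the same session as the load):

  ```sql
  -- Disable foreign key checks for this session only. Triggers are also
  -- skipped, so use this only when the incoming data is trusted.
  SET session_replication_role = 'replica';

  -- ... run the bulk load ...

  -- Restore the default behavior.
  SET session_replication_role = 'origin';
  ```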
### Unlogged tables

Consider the pros and cons of using unlogged tables before you use them in initial data loads.

Using unlogged tables makes data loads faster. Data that's written to unlogged tables isn't written to the write-ahead log.

The disadvantages of using unlogged tables are:

- They aren't crash-safe. An unlogged table is automatically truncated after a crash or unclean shutdown.
- Data from unlogged tables can't be replicated to standby servers.

To create an unlogged table or change an existing table to an unlogged table, use the following options:
* Create a new unlogged table by using the following syntax:

  ```sql
  CREATE UNLOGGED TABLE <tablename>;
  ```

* Convert an existing logged table to an unlogged table by using the following syntax:

  ```sql
  ALTER TABLE <tablename> SET UNLOGGED;
  ```
### Server parameter tuning

* `autovacuum`: During the initial data load, it's best to turn off `autovacuum`. After the initial load is completed, we recommend that you run a manual `VACUUM ANALYZE` on all tables in the database, and then turn on `autovacuum`.
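The post-load cleanup step might look like this (a sketch; turning `autovacuum` off and on is done through the server parameters, not shown here):

```sql
-- After the initial load completes, clean up dead tuples and refresh
-- planner statistics in one pass across the current database.
VACUUM (ANALYZE, VERBOSE);
```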
> [!NOTE]
> Follow the recommendations here only if there's enough memory and disk space.

* `maintenance_work_mem`: Can be set to a maximum of 2 gigabytes (GB) on a flexible server. `maintenance_work_mem` helps in speeding up autovacuum, index, and foreign key creation.

* `checkpoint_timeout`: On a flexible server, the `checkpoint_timeout` value can be increased to a maximum of 24 hours from the default setting of 5 minutes. We recommend that you increase the value to 1 hour before you load data initially on the flexible server.

* `checkpoint_completion_target`: We recommend a value of 0.9.

* `max_wal_size`: Can be set to the maximum allowed value on a flexible server, which is 64 GB, while you're doing the initial data load.

* `wal_compression`: Can be turned on. Enabling this parameter can incur some extra CPU cost spent on the compression during write-ahead log (WAL) logging and on the decompression during WAL replay.
### Flexible server recommendations

Before you begin an initial data load on the flexible server, we recommend that you:

- Disable high availability on the server. You can enable it after the initial load is completed on the primary server.
- Create read replicas after the initial data load is completed.
- Make logging minimal or disable it altogether during initial data loads (for example, disable pgaudit, pg_stat_statements, and Query Store).
### Re-create indexes and add constraints

Assuming that you dropped the indexes and constraints before the initial load, we recommend that you use high values of `maintenance_work_mem` (as mentioned earlier) for creating indexes and adding constraints. In addition, starting with PostgreSQL version 11, you can modify the following parameters for faster parallel index creation after the initial data load:

* `max_parallel_workers`: Sets the maximum number of workers that the system can support for parallel queries.

* `max_parallel_maintenance_workers`: Controls the maximum number of worker processes that can be used by `CREATE INDEX`.

You can also create the indexes by making the recommended settings at the session level. Here's an example of how to do it:
```sql
SET maintenance_work_mem = '2GB';
-- … (additional session-level settings elided in this diff hunk)
CREATE INDEX test_index ON test_table (test_column);
```
## Best practices for incremental data loads

### Partition tables

We always recommend that you partition large tables. Some advantages of partitioning, especially during incremental loads, include:

- Creating new partitions based on new deltas makes it efficient to add new data to the table.
- Maintaining tables becomes easier. You can drop a partition during an incremental data load to avoid time-consuming deletions in large tables.
- Autovacuum is triggered only on partitions that were changed or added during incremental loads, which makes maintaining statistics on the table easier.
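A minimal sketch of range partitioning by date (the table and partition names are hypothetical):

```sql
-- Hypothetical partitioned table keyed by load date.
CREATE TABLE measurements (
    logdate date NOT NULL,
    reading numeric
) PARTITION BY RANGE (logdate);

-- Each incremental load gets its own partition.
CREATE TABLE measurements_2022_08 PARTITION OF measurements
    FOR VALUES FROM ('2022-08-01') TO ('2022-09-01');

-- Retiring old data is a fast DROP instead of a slow DELETE.
DROP TABLE IF EXISTS measurements_2022_07;
```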
### Maintain up-to-date table statistics

Monitoring and maintaining table statistics is important for query performance on the database. This includes scenarios where you have incremental loads. PostgreSQL uses the autovacuum daemon process to clean up dead tuples and analyze the tables to keep the statistics updated. For more information, see [Autovacuum monitoring and tuning](./how-to-autovacuum-tuning.md).
### Create indexes on foreign key constraints

Creating indexes on foreign keys in the child tables can be beneficial in the following scenarios:

- Data updates or deletions in the parent table. When data is updated or deleted in the parent table, lookups are performed on the child table. To make lookups faster, you can index foreign keys on the child table.
- Queries that join parent and child tables on key columns.
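For example (the parent and child table names here are hypothetical):

```sql
-- Parent and child tables; names are illustrative only.
CREATE TABLE customers (customer_id integer PRIMARY KEY);
CREATE TABLE orders (
    order_id integer PRIMARY KEY,
    customer_id integer REFERENCES customers (customer_id)
);

-- PostgreSQL doesn't index the referencing side automatically; this index
-- speeds up the child-table lookups triggered by parent updates/deletes,
-- and the joins between the two tables on the key column.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```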
### Identify unused indexes

Identify unused indexes in the database and drop them. Indexes are an overhead on data loads. The fewer the indexes on a table, the better the performance during data ingestion.

You can identify unused indexes in two ways: by Query Store and by an index usage query.

**Query Store**

The Query Store feature helps identify indexes that can be dropped based on query usage patterns on the database. For step-by-step guidance, see [Query Store](./concepts-query-store.md).

After you've enabled Query Store on the server, you can use the following query to identify indexes that can be dropped. Connect to the azure_sys database and run:
```sql
SELECT * FROM IntelligentPerformance.DropIndexRecommendations;
```
**Index usage**

You can also use the following query to identify unused indexes:

```sql
SELECT
-- … (column list, FROM clause, and WHERE clause elided in this diff hunk)
ORDER BY 1, 2;
```
The `number_of_scans`, `tuples_read`, and `tuples_fetched` columns indicate index usage. A `number_of_scans` value of zero indicates that an index isn't being used.

### Server parameter tuning

195180
> [!NOTE]
196-
> Please follow the recommendations below only if there is enough memory and disk space.
197-
198-
`maintenance_work_mem`
199-
200-
The maintenance_work_mem parameter can be set to a maximum of 2 GB on Flexible Server. `maintenance_work_mem` helps speed up index creation and foreign key additions.
201-
202-
`checkpoint_timeout`
181+
> Follow the recommendations in the following parameters only if there's enough memory and disk space.
203182
204-
On the Flexible Server, the checkpoint_timeout parameter can be increased to 10 minutes or 15 minutes from the default 5 minutes. Increasing `checkpoint_timeout` to a larger value, such as 15 minutes, can reduce the I/O load, but the downside is that it takes longer to recover if there was a crash. Careful consideration is recommended before making the change.
183+
* `maintenance_work_mem`: This parameter can be set to a maximum of 2 GB on the flexible server. `maintenance_work_mem` helps speed up index creation and foreign key additions.
205184

206-
`checkpoint_completion_target`
185+
* `checkpoint_timeout`: On the flexible server, the `checkpoint_timeout` value can be increased to 10 or 15 minutes from the default setting of 5 minutes. Increasing `checkpoint_timeout` to a larger value, such as 15 minutes, can reduce the I/O load, but the downside is that it takes longer to recover if there's a crash. We recommend careful consideration before you make the change.
207186

208-
A value of 0.9 is always recommended.
187+
* `checkpoint_completion_target`: We recommend a value of 0.9.
209188

210-
`max_wal_size`
189+
* `max_wal_size`: This value depends on SKU, storage, and workload. One way to arrive at the correct value for `max_wal_size` is shown in the following example.
211190

212-
The max_wal_size depends on SKU, storage, and workload.
191+
During peak business hours, arrive at a value by doing the following:
213192

214-
One way to arrive at the correct value for max_wal_size is shown below.
193+
a. Take the current WAL log sequence number (LSN) by running the following query:
  ```sql
  SELECT pg_current_wal_lsn ();
  ```

  b. Wait for `checkpoint_timeout` number of seconds. Take the current WAL LSN by running the following query:

  ```sql
  SELECT pg_current_wal_lsn ();
  ```

  c. Use the two results to check the difference, in GB:

  ```sql
  SELECT round (pg_wal_lsn_diff('LSN value when run second time','LSN value when run first time')/1024/1024/1024,2) WAL_CHANGE_GB;
  ```
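  Putting the steps together (the LSN values below are hypothetical, and the sizing interpretation is an assumption rather than guidance from the original article):

  ```sql
  -- Example with hypothetical LSNs captured one checkpoint_timeout apart.
  SELECT round (pg_wal_lsn_diff('1/8D3B2000','1/12AC1000')/1024/1024/1024,2) WAL_CHANGE_GB;
  -- If this shows, say, roughly 2 GB of WAL per checkpoint interval,
  -- max_wal_size should be at least that amount so checkpoints are driven
  -- by the timeout rather than by WAL volume.
  ```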
* `wal_compression`: Can be turned on. Enabling this parameter can incur some extra CPU cost spent on the compression during WAL logging and on the decompression during WAL replay.
## Next steps

- [Troubleshoot high CPU utilization](./how-to-high-CPU-utilization.md)
- [Troubleshoot high memory utilization](./how-to-high-memory-utilization.md)
- [Configure server parameters](./howto-configure-server-parameters-using-portal.md)
- [Troubleshoot and tune autovacuum](./how-to-autovacuum-tuning.md)
- [Troubleshoot high IOPS utilization](./how-to-high-io-utilization.md)