Commit 637c61e

Merge pull request #214607 from mulander/cosmosdb-howto-ingest-azure-blob-storage
Cosmosdb howto ingest azure blob storage
2 parents b9ae670 + 6d5c3e2 commit 637c61e

File tree: 4 files changed (+357, -7 lines)


articles/cosmos-db/postgresql/TOC.yml (2 additions, 0 deletions)

@@ -128,6 +128,8 @@
     href: howto-connect.md
 - name: Ingest data
   items:
+  - name: Azure Blob Storage (ABS)
+    href: howto-ingest-azure-blob-storage.md
   - name: Azure Stream Analytics (ASA)
     href: howto-ingest-azure-stream-analytics.md
   - name: Azure Data Factory (ADF)

articles/cosmos-db/postgresql/howto-ingest-azure-blob-storage.md (new file, 341 additions)

---
title: Data ingestion with Azure Blob Storage - Azure Cosmos DB for PostgreSQL
description: How to ingest data using Azure Blob Storage as a staging area
ms.author: adamwolk
author: mulander
ms.service: cosmos-db
ms.subservice: postgresql
ms.topic: how-to
ms.date: 10/19/2022
---

# How to ingest data using Azure Blob Storage

[!INCLUDE [PostgreSQL](../includes/appliesto-postgresql.md)]

[Azure Blob Storage](https://azure.microsoft.com/services/storage/blobs/#features) (ABS) is a cloud-native, scalable, durable, and secure storage service. These characteristics make ABS a good choice for storing and moving existing data into the cloud.
17+
18+
This article shows how to use the pg_azure_storage PostgreSQL extension to
19+
manipulate and load data into your Azure Cosmos DB for PostgreSQL directly from
20+
Azure Blob Storage.
21+
22+
## Prepare database and blob storage
23+
24+
To load data from Azure Blob Storage, install the `pg_azure_storage` PostgreSQL
25+
extension in your database:
26+
27+
```sql
28+
SELECT * FROM create_extension('azure_storage');
29+
```

We've prepared a public demonstration dataset for this article. To use your own dataset, follow [migrate your on-premises data to cloud storage](../../storage/common/storage-use-azcopy-migrate-on-premises-data.md) to learn how to get your datasets efficiently into Azure Blob Storage.

> [!NOTE]
>
> Selecting "Container (anonymous read access for containers and blobs)" allows you to ingest files from Azure Blob Storage using their public URLs, and to enumerate the container contents, without configuring an account key in pg_azure_storage. Containers set to access level "Private (no anonymous access)" or "Blob (anonymous read access for blobs only)" require an access key.

## List container contents

There's a demonstration Azure Blob Storage account and container pre-created for this how-to. The container's name is `github`, and it's in the `pgquickstart` account. We can easily see which files are present in the container by using the `azure_storage.blob_list(account, container)` function.

```sql
SELECT path, bytes, pg_size_pretty(bytes), content_type
FROM azure_storage.blob_list('pgquickstart','github');
```

```
-[ RECORD 1 ]--+-------------------
path           | events.csv.gz
bytes          | 41691786
pg_size_pretty | 40 MB
content_type   | application/x-gzip
-[ RECORD 2 ]--+-------------------
path           | users.csv.gz
bytes          | 5382831
pg_size_pretty | 5257 kB
content_type   | application/x-gzip
```

You can filter the output either by using a regular SQL `WHERE` clause, or by using the `prefix` parameter of the `blob_list` UDF. The latter filters the returned rows on the Azure Blob Storage side.

> [!NOTE]
>
> Listing container contents requires an account and access key, or a container with anonymous access enabled.

```sql
SELECT * FROM azure_storage.blob_list('pgquickstart','github','e');
```

```
-[ RECORD 1 ]----+---------------------------------
path             | events.csv.gz
bytes            | 41691786
last_modified    | 2022-10-12 18:49:51+00
etag             | 0x8DAAC828B970928
content_type     | application/x-gzip
content_encoding |
content_hash     | 473b6ad25b7c88ff6e0a628889466aed
```

```sql
SELECT *
FROM azure_storage.blob_list('pgquickstart','github')
WHERE path LIKE 'e%';
```

```
-[ RECORD 1 ]----+---------------------------------
path             | events.csv.gz
bytes            | 41691786
last_modified    | 2022-10-12 18:49:51+00
etag             | 0x8DAAC828B970928
content_type     | application/x-gzip
content_encoding |
content_hash     | 473b6ad25b7c88ff6e0a628889466aed
```

## Load data from ABS

### Load data with the COPY command

Start by creating a sample schema.

```sql
CREATE TABLE github_users
(
    user_id bigint,
    url text,
    login text,
    avatar_url text,
    gravatar_id text,
    display_login text
);

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp
);

CREATE INDEX event_type_index ON github_events (event_type);
CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops);

SELECT create_distributed_table('github_users', 'user_id');
SELECT create_distributed_table('github_events', 'user_id');
```

Loading data into the tables becomes as simple as calling the `COPY` command.

```sql
-- download users and store in table

COPY github_users
FROM 'https://pgquickstart.blob.core.windows.net/github/users.csv.gz';

-- download events and store in table

COPY github_events
FROM 'https://pgquickstart.blob.core.windows.net/github/events.csv.gz';
```

Notice how the extension recognized that the URLs provided to the `COPY` command are from Azure Blob Storage, and that the files we pointed to were gzip compressed, which was also handled automatically for us.

The `COPY` command supports more parameters and formats. In the above example, the format and compression were auto-selected based on the file extensions. You can, however, provide the format directly, similar to the regular `COPY` command.

```sql
COPY github_users
FROM 'https://pgquickstart.blob.core.windows.net/github/users.csv.gz'
WITH (FORMAT 'csv');
```

Currently, the extension supports the following file formats:

|format|description|
|------|-----------|
|csv|Comma-separated values format used by PostgreSQL COPY|
|tsv|Tab-separated values, the default PostgreSQL COPY format|
|binary|Binary PostgreSQL COPY format|
|text|A file containing a single text value (for example, large JSON or XML)|
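
For illustration only, here's a minimal sketch of naming a non-default format explicitly. It assumes a hypothetical tab-separated file `users.tsv` in your own storage account (the public demo container only holds gzipped CSV files), and that the extension accepts the format names from the table above in the `COPY` options:

```sql
-- Hypothetical example: 'mystorageaccount' and users.tsv are placeholders,
-- not part of the public demo dataset.
COPY github_users
FROM 'https://mystorageaccount.blob.core.windows.net/github/users.tsv'
WITH (FORMAT 'tsv');
```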

### Load data with blob_get()

The `COPY` command is convenient, but limited in flexibility. Internally, COPY uses the `blob_get` function, which you can use directly to manipulate data in much more complex scenarios.

```sql
SELECT *
FROM azure_storage.blob_get(
       'pgquickstart', 'github',
       'users.csv.gz', NULL::github_users
     )
LIMIT 3;
```

```
-[ RECORD 1 ]-+--------------------------------------------
user_id       | 21
url           | https://api.github.com/users/technoweenie
login         | technoweenie
avatar_url    | https://avatars.githubusercontent.com/u/21?
gravatar_id   |
display_login | technoweenie
-[ RECORD 2 ]-+--------------------------------------------
user_id       | 22
url           | https://api.github.com/users/macournoyer
login         | macournoyer
avatar_url    | https://avatars.githubusercontent.com/u/22?
gravatar_id   |
display_login | macournoyer
-[ RECORD 3 ]-+--------------------------------------------
user_id       | 38
url           | https://api.github.com/users/atmos
login         | atmos
avatar_url    | https://avatars.githubusercontent.com/u/38?
gravatar_id   |
display_login | atmos
```

> [!NOTE]
>
> In the above query, the file is fully fetched before `LIMIT 3` is applied.

With this function, you can manipulate data on the fly in complex queries and do imports as `INSERT ... SELECT` statements.

```sql
INSERT INTO github_users
SELECT user_id, url, UPPER(login), avatar_url, gravatar_id, display_login
FROM azure_storage.blob_get('pgquickstart', 'github', 'users.csv.gz', NULL::github_users)
WHERE gravatar_id IS NOT NULL;
```

```
INSERT 0 264308
```

In the above command, we filtered the data to accounts with a `gravatar_id` present and uppercased their logins on the fly.

#### Options for blob_get()

In some situations, you may need to control exactly what `blob_get` attempts to do by using the `decoder`, `compression`, and `options` parameters.

`decoder` can be set to `auto` (default) or any of the following values:

|format|description|
|------|-----------|
|csv|Comma-separated values format used by PostgreSQL COPY|
|tsv|Tab-separated values, the default PostgreSQL COPY format|
|binary|Binary PostgreSQL COPY format|
|text|A file containing a single text value (for example, large JSON or XML)|

`compression` can be either `auto` (default), `none`, or `gzip`.
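
As a sketch of what that looks like (assuming the named-argument notation used for `options` further below also applies to `decoder` and `compression`), the auto-detection can be overridden explicitly:

```sql
-- Name the decoder and compression explicitly instead of relying on
-- auto-detection from the file extension.
SELECT count(*)
FROM azure_storage.blob_get(
       'pgquickstart', 'github', 'users.csv.gz', NULL::github_users,
       decoder := 'csv',
       compression := 'gzip'
     );
```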

Finally, the `options` parameter is of type `jsonb`. There are four utility functions that help build values for it.
Each utility function is designated for the decoder matching its name.

|decoder|options function  |
|-------|------------------|
|csv    |`options_csv_get` |
|tsv    |`options_tsv`     |
|binary |`options_binary`  |
|text   |`options_copy`    |

By looking at the function definitions, you can see which parameters are supported by which decoder.

- `options_csv_get` - delimiter, null_string, header, quote, escape, force_not_null, force_null, content_encoding
- `options_tsv` - delimiter, null_string, content_encoding
- `options_copy` - delimiter, null_string, header, quote, escape, force_quote, force_not_null, force_null, content_encoding
- `options_binary` - content_encoding
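
One way to check those definitions yourself is to query the standard PostgreSQL catalogs; this is just a generic `pg_proc` lookup, not part of the extension's interface:

```sql
-- List the options_* helper functions in the azure_storage schema
-- together with their argument lists.
SELECT proname,
       pg_get_function_arguments(oid) AS arguments
FROM pg_proc
WHERE pronamespace = 'azure_storage'::regnamespace
  AND proname LIKE 'options%';
```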

Knowing the above, we can discard records with a null `gravatar_id` during parsing.

```sql
INSERT INTO github_users
SELECT user_id, url, UPPER(login), avatar_url, gravatar_id, display_login
FROM azure_storage.blob_get('pgquickstart', 'github', 'users.csv.gz', NULL::github_users,
     options := azure_storage.options_csv_get(force_not_null := ARRAY['gravatar_id']));
```

```
INSERT 0 264308
```

## Access private storage

1. Obtain your account name and access key

   Without an access key, we won't be allowed to list containers that are set to Private or Blob access levels.

   ```sql
   SELECT * FROM azure_storage.blob_list('mystorageaccount','privdatasets');
   ```

   ```
   ERROR: azure_storage: missing account access key
   HINT: Use SELECT azure_storage.account_add('<account name>', '<access key>')
   ```

   In your storage account, open **Access keys**. Copy the **Storage account name** and copy the **Key** from the **key1** section (you have to select **Show** next to the key first).

   :::image type="content" source="media/howto-ingestion/azure-blob-storage-account-key.png" alt-text="Screenshot of Security + networking > Access keys section of an Azure Blob Storage page in the Azure portal." border="true":::

1. Add the account to pg_azure_storage

   ```sql
   SELECT azure_storage.account_add('mystorageaccount', 'SECRET_ACCESS_KEY');
   ```

   Now you can list containers set to Private and Blob access levels for that storage account, but only as the `citus` user, which has the `azure_storage_admin` role granted to it. If you create a new user named `support`, it won't be allowed to access container contents by default.

   ```sql
   SELECT * FROM azure_storage.blob_list('pgabs','dataverse');
   ```

   ```
   ERROR: azure_storage: current user support is not allowed to use storage account pgabs
   ```

1. Allow the `support` user to use a specific Azure Blob Storage account

   Granting the permission is as simple as calling `account_user_add`.

   ```sql
   SELECT * FROM azure_storage.account_user_add('mystorageaccount', 'support');
   ```

   We can see the allowed users in the output of `account_list`, which shows all accounts with access keys defined.

   ```sql
   SELECT * FROM azure_storage.account_list();
   ```

   ```
      account_name   | allowed_users
   ------------------+---------------
    mystorageaccount | {support}
   (1 row)
   ```

   If you ever decide that the user should no longer have access, just call `account_user_remove`.

   ```sql
   SELECT * FROM azure_storage.account_user_remove('mystorageaccount', 'support');
   ```

## Next steps

Congratulations, you just learned how to load data into Azure Cosmos DB for PostgreSQL directly from Azure Blob Storage.

Learn how to create a [real-time dashboard](tutorial-design-database-realtime.md) with Azure Cosmos DB for PostgreSQL.

articles/cosmos-db/postgresql/media/howto-ingestion/azure-blob-storage-account-key.png (new image, 169 KB)

articles/cosmos-db/postgresql/quickstart-distribute-tables.md (14 additions, 7 deletions)

@@ -8,7 +8,7 @@ ms.service: cosmos-db
 ms.subservice: postgresql
 ms.custom: mvc, mode-ui, ignite-2022
 ms.topic: quickstart
-ms.date: 08/11/2022
+ms.date: 10/14/2022
 ---
 
 # Create and distribute tables

@@ -85,20 +85,27 @@ SELECT create_distributed_table('github_events', 'user_id');
 We're ready to fill the tables with sample data. For this quickstart, we'll use
 a dataset previously captured from the GitHub API.
 
-Run the following commands to download example CSV files and load them into the
-database tables. (The `curl` command downloads the files, and comes
-pre-installed in the Azure Cloud Shell.)
+We're going to use the pg_azure_storage extension to load the data directly from a public container in Azure Blob Storage. First, we need to create the extension in our database:
 
-```
+```sql
+SELECT * FROM create_extension('azure_storage');
+```
+
+Run the following commands to have the database fetch the example CSV files and load them into the
+database tables.
+
+```sql
 -- download users and store in table
 
-\COPY github_users FROM PROGRAM 'curl https://examples.citusdata.com/users.csv' WITH (FORMAT CSV)
+COPY github_users FROM 'https://pgquickstart.blob.core.windows.net/github/users.csv.gz';
 
 -- download events and store in table
 
-\COPY github_events FROM PROGRAM 'curl https://examples.citusdata.com/events.csv' WITH (FORMAT CSV)
+COPY github_events FROM 'https://pgquickstart.blob.core.windows.net/github/events.csv.gz';
 ```
 
+Notice how the extension recognized that the URLs provided to the `COPY` command are from Azure Blob Storage, and that the files we pointed to were gzip compressed, which was also handled automatically for us.
+
 We can review details of our distributed tables, including their sizes, with
 the `citus_tables` view:
