Commit 101ebb1

Update query-parquet-files.md
1 parent 46346ab commit 101ebb1

1 file changed: +28 -49 lines changed

articles/synapse-analytics/sql/query-parquet-files.md

Lines changed: 28 additions & 49 deletions
@@ -21,52 +21,35 @@ Your first step is to **create a database** where the tables will be created. Th
 
 ## Dataset
 
-You can query Parquet files the same way you read CSV files. The only difference is that the FILEFORMAT parameter should be set to PARQUET. Examples in this article show the specifics of reading Parquet files.
+[NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) dataset i used in this sample isYou can query Parquet files the same way you read CSV files. The only difference is that the FILEFORMAT parameter should be set to PARQUET. Examples in this article show the specifics of reading Parquet files.
 
 > [!NOTE]
 > You do not have to specify columns in the OPENROWSET WITH clause when reading parquet files. SQL on-demand will utilize metadata in the Parquet file and bind columns by name.
 
-You'll use the folder *parquet/taxi* for the sample queries. It contains NYC Taxi - Yellow Taxi Trip Records data from July 2016. to June 2018.
-
-Data is partitioned by year and month and the folder structure is as follows:
-
-- year=2016
-  - month=6
-  - ...
-  - month=12
-- year=2017
-  - month=1
-  - ...
-  - month=12
-- year=2018
-  - month=1
-  - ...
-  - month=6
-
 ## Query set of parquet files
 
 You can specify only the columns of interest when you query Parquet files.
 
 ```sql
 SELECT
-    YEAR(pickup_datetime),
-    passenger_count,
+    YEAR(tpepPickupDateTime),
+    passengerCount,
     COUNT(*) AS cnt
 FROM
     OPENROWSET(
-        BULK 'parquet/taxi/*/*/*',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=2018/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT='PARQUET'
     ) WITH (
-        pickup_datetime DATETIME2,
-        passenger_count INT
+        tpepPickupDateTime DATETIME2,
+        passengerCount INT
     ) AS nyc
 GROUP BY
-    passenger_count,
-    YEAR(pickup_datetime)
+    passengerCount,
+    YEAR(tpepPickupDateTime)
 ORDER BY
-    YEAR(pickup_datetime),
-    passenger_count;
+    YEAR(tpepPickupDateTime),
+    passengerCount;
 ```
 
 ## Automatic schema inference
@@ -79,14 +62,13 @@ The sample below shows the automatic schema inference capabilities for Parquet f
 > You don't have to specify columns in the OPENROWSET WITH clause when reading Parquet files. In that case, SQL on-demand Query service will utilize metadata in the Parquet file and bind columns by name.
 
 ```sql
-SELECT
-    COUNT_BIG(*)
-FROM
+SELECT TOP 10 *
+FROM
     OPENROWSET(
-        BULK 'parquet/taxi/year=2017/month=9/*.parquet',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=2018/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT='PARQUET'
-    ) AS nyc;
+    ) AS nyc
 ```
 
 ### Query partitioned data
@@ -98,28 +80,25 @@ The data set provided in this sample is divided (partitioned) into separate subf
 
 ```sql
 SELECT
-    nyc.filepath(1) AS [year],
-    nyc.filepath(2) AS [month],
-    payment_type,
-    SUM(fare_amount) AS fare_total
-FROM
+    YEAR(tpepPickupDateTime),
+    passengerCount,
+    COUNT(*) AS cnt
+FROM
     OPENROWSET(
-        BULK 'parquet/taxi/year=*/month=*/*.parquet',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT='PARQUET'
-    ) AS nyc
+    ) nyc
 WHERE
     nyc.filepath(1) = 2017
     AND nyc.filepath(2) IN (1, 2, 3)
-    AND pickup_datetime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)
+    AND tpepPickupDateTime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)
 GROUP BY
-    nyc.filepath(1),
-    nyc.filepath(2),
-    payment_type
+    passengerCount,
+    YEAR(tpepPickupDateTime)
 ORDER BY
-    nyc.filepath(1),
-    nyc.filepath(2),
-    payment_type;
+    YEAR(tpepPickupDateTime),
+    passengerCount;
 ```
 
 ## Type mapping

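The last hunk keeps the `nyc.filepath(1)` and `nyc.filepath(2)` filters but no longer returns them as columns. The query below is a minimal sketch, not part of this commit, of how the partition values could still be selected with the new paths. It assumes the `YellowTaxi` data source referenced in the diff exists and that its folders follow the `puYear=<year>/puMonth=<month>` layout implied by the `BULK` pattern. The `[year]` and `[month]` aliases are illustrative only.

```sql
-- Sketch only: relies on the 'YellowTaxi' data source and folder layout
-- shown in the diff; neither is verified here.
SELECT
    nyc.filepath(1) AS [year],   -- text matched by the first wildcard (puYear=*)
    nyc.filepath(2) AS [month],  -- text matched by the second wildcard (puMonth=*)
    passengerCount,
    COUNT(*) AS cnt
FROM
    OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT = 'PARQUET'
    ) AS nyc
WHERE
    -- Filtering on filepath() restricts the query to the matching folders.
    nyc.filepath(1) = 2017
    AND nyc.filepath(2) IN (1, 2, 3)
GROUP BY
    nyc.filepath(1),
    nyc.filepath(2),
    passengerCount
ORDER BY
    nyc.filepath(1),
    nyc.filepath(2),
    passengerCount;
```

Because `filepath(n)` returns the path fragment matched by the n-th wildcard in the `BULK` pattern, a third call, `nyc.filepath(3)`, would correspond to the file-name wildcard in `*.snappy.parquet`.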