articles/synapse-analytics/sql/query-parquet-files.md
28 additions & 49 deletions
@@ -21,52 +21,35 @@ Your first step is to **create a database** where the tables will be created. Th

 ## Dataset

-You can query Parquet files the same way you read CSV files. The only difference is that the FORMAT parameter should be set to PARQUET. Examples in this article show the specifics of reading Parquet files.
+The [NYC Yellow Taxi](https://azure.microsoft.com/services/open-datasets/catalog/nyc-taxi-limousine-commission-yellow-taxi-trip-records/) dataset is used in this sample. You can query Parquet files the same way you read CSV files. The only difference is that the FORMAT parameter should be set to PARQUET. Examples in this article show the specifics of reading Parquet files.

 > [!NOTE]
 > You do not have to specify columns in the OPENROWSET WITH clause when reading Parquet files. SQL on-demand will utilize metadata in the Parquet file and bind columns by name.

-You'll use the folder *parquet/taxi* for the sample queries. It contains NYC Taxi - Yellow Taxi Trip Records data from July 2016 to June 2018.
-
-Data is partitioned by year and month and the folder structure is as follows:
-
-- year=2016
-  - month=6
-  - ...
-  - month=12
-- year=2017
-  - month=1
-  - ...
-  - month=12
-- year=2018
-  - month=1
-  - ...
-  - month=6
-
 ## Query set of parquet files

 You can specify only the columns of interest when you query Parquet files.

 ```sql
 SELECT
-    YEAR(pickup_datetime),
-    passenger_count,
+    YEAR(tpepPickupDateTime),
+    passengerCount,
     COUNT(*) AS cnt
 FROM
     OPENROWSET(
-        BULK 'parquet/taxi/*/*/*',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=2018/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT = 'PARQUET'
     ) WITH (
-        pickup_datetime DATETIME2,
-        passenger_count INT
+        tpepPickupDateTime DATETIME2,
+        passengerCount INT
     ) AS nyc
 GROUP BY
-    passenger_count,
-    YEAR(pickup_datetime)
+    passengerCount,
+    YEAR(tpepPickupDateTime)
 ORDER BY
-    YEAR(pickup_datetime),
-    passenger_count;
+    YEAR(tpepPickupDateTime),
+    passengerCount;
 ```

 ## Automatic schema inference
@@ -79,14 +62,13 @@ The sample below shows the automatic schema inference capabilities for Parquet f
 > You don't have to specify columns in the OPENROWSET WITH clause when reading Parquet files. In that case, SQL on-demand Query service will utilize metadata in the Parquet file and bind columns by name.

 ```sql
-SELECT
-    COUNT_BIG(*)
-FROM
+SELECT TOP 10 *
+FROM
     OPENROWSET(
-        BULK 'parquet/taxi/year=2017/month=9/*.parquet',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=2018/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT = 'PARQUET'
-    ) AS nyc;
+    ) AS nyc
 ```

 ### Query partitioned data
@@ -98,28 +80,25 @@ The data set provided in this sample is divided (partitioned) into separate subf

 ```sql
 SELECT
-    nyc.filepath(1) AS [year],
-    nyc.filepath(2) AS [month],
-    payment_type,
-    SUM(fare_amount) AS fare_total
-FROM
+    YEAR(tpepPickupDateTime),
+    passengerCount,
+    COUNT(*) AS cnt
+FROM
     OPENROWSET(
-        BULK 'parquet/taxi/year=*/month=*/*.parquet',
-        DATA_SOURCE = 'SqlOnDemandDemo',
+        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
+        DATA_SOURCE = 'YellowTaxi',
         FORMAT = 'PARQUET'
-    ) AS nyc
+    ) nyc
 WHERE
     nyc.filepath(1) = 2017
     AND nyc.filepath(2) IN (1, 2, 3)
-    AND pickup_datetime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)
+    AND tpepPickupDateTime BETWEEN CAST('1/1/2017' AS datetime) AND CAST('3/31/2017' AS datetime)