Is it possible to query multiple Parquet files at once? (running one SQL query on many files in a folder) #6728
collimarco
started this conversation in
General
Replies: 1 comment 4 replies
-
Yes, you can do this. Here is an example via # /data/99 has a bunch of parquet files with "compatible" schema:
$ ls /data/99 | head
03f0ada5-22ea-4121-99ac-77a61c74479c.parquet
041e28e6-6373-4e8b-873d-c5d6f612edc4.parquet
050cc247-686b-4167-8bdb-f3e42f1ba088.parquet
08243ac7-db62-4d19-83ac-b829d36568b6.parquet
0b309152-36ca-4d90-bdb1-edcf628dafb9.parquet
0bf87579-d9e3-457f-8048-0162394ba8b3.parquet
0ce16343-6850-4f03-800d-75f9810af87b.parquet
10ba4e59-0651-42ca-9d44-ee7939e1f36a.parquet
1125d145-e363-4dd0-8561-e10027cbdb76.parquet
117447dd-4af0-4866-b443-c3ce64d3dfdb.parquet You can query them via datafusion-cli like this: ❯ select * from '/data/99' limit 10;
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| free | host | in | out | time | total | used | used_percent |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:05:50 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:00 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:10 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:20 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:30 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:40 | 12884901888 | 11499995136 | 89.251708984375 |
| 1384906752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T22:06:50 | 12884901888 | 11499995136 | 89.251708984375 |
+------------+---------------------+----+-----+---------------------+-------------+-------------+-----------------+
10 rows in set. Query took 0.035 seconds. You can also use the explicit ❯ create external table t stored as parquet location '/Users/alamb/.influxdb_iox/object_store/1/6/99';
0 rows in set. Query took 0.009 seconds.
❯ select * from t limit 10;
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| free | host | in | out | time | total | used | used_percent |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+
| 624427008 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:20 | 10737418240 | 10112991232 | 94.1845703125 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:40 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:45:50 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:00 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:10 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:20 | 11811160064 | 10983309312 | 92.99094460227273 |
| 827850752 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:30 | 11811160064 | 10983309312 | 92.99094460227273 |
| 861405184 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:40 | 11811160064 | 10949754880 | 92.70685369318183 |
| 861405184 | MacBook-Pro-8.local | 0 | 0 | 2023-06-09T16:46:50 | 11811160064 | 10949754880 | 92.70685369318183 |
+-----------+---------------------+----+-----+---------------------+-------------+-------------+-------------------+ You can also do this explicitly using |
Beta Was this translation helpful? Give feedback.
4 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
I am getting started with Arrow Datafusion and looking at the examples:
https://arrow.apache.org/datafusion/user-guide/example-usage.html
I don't see any way to execute a SQL query on multiple files at the same time.
Is that possible?
Let's say that you have thousands of Parquet files already stored in a folder.
The schema is similar, but it is not identical for all the files. For example:
Is it possible to use Datafusion to query all the files in a directory?
Or it possible to give Datafusion a long list of files to query dynamically?
Ideally each query uses a different set of files (they are grouped in partitions), so it would be better to be able to execute the queries directly on a list of files, without having to perform too many intermediate steps.
Is this possible with Datafusion?
Beta Was this translation helpful? Give feedback.
All reactions