Azure Data Lake Storage is a highly scalable and cost-effective data lake solution for big data analytics.
Azure Data Explorer integrates with Azure Blob Storage and Azure Data Lake Storage Gen2, providing fast, cached, and indexed access to data in the lake. You can analyze and query data in the lake without prior ingestion into Azure Data Explorer. You can also query across ingested and uningested native lake data simultaneously.
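For example, a single query can combine ingested data with uningested lake data. This is a minimal sketch: `TaxiRidesIngested` is a hypothetical ingested table, and `TaxiRides` is the external table created later in this article.

```kusto
// Query hot, ingested data and colder lake data together in one statement.
// TaxiRidesIngested is a hypothetical ingested table; TaxiRides is the
// external table defined in this article.
union TaxiRidesIngested, external_table("TaxiRides")
| count
```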
> [!TIP]
> For the best query performance, ingest your data into Azure Data Explorer. Query data in Azure Data Lake Storage Gen2 without prior ingestion only for historical data or data that is rarely queried. Optimize your query performance in the lake by following these [best practices](#optimize-your-query-performance).
## Create an external table
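As a minimal sketch of the commands this section walks through (the storage URI, account key, and abbreviated schema are placeholders, not the article's actual values), a partitioned external table over the sample taxi data can be defined as follows:

```kusto
// Sketch: an external table over CSV files in Blob Storage, partitioned
// by day on the pickup timestamp. Replace the placeholder URI and key,
// and extend the schema to match your files.
.create external table TaxiRides
(
    trip_id: long,
    vendor_id: string,
    pickup_datetime: datetime,
    dropoff_datetime: datetime,
    passenger_count: long,
    trip_distance: real
)
kind=blob
partition by (Date:datetime = bin(pickup_datetime, 1d))
dataformat=csv
(
    h@'https://<storageaccount>.blob.core.windows.net/<container>;<accountKey>'
)
```

Queries then reference the table through the `external_table()` function. A filter on the partitioned column lets the engine skip folders outside the requested range, for example:

```kusto
// Count rides per day for one week. Filtering on pickup_datetime enables
// partition pruning, because the table is partitioned on bin(pickup_datetime, 1d).
external_table("TaxiRides")
| where pickup_datetime between (datetime(2018-01-01) .. datetime(2018-01-08))
| summarize rides = count() by bin(pickup_datetime, 1d)
| sort by pickup_datetime asc
```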
This query uses partitioning, which optimizes query time and performance.
You can write additional queries to run on the external table *TaxiRides* and learn more about the data.
## Optimize your query performance

Optimize your query performance in the lake by using the following best practices for querying external data.
### Data format
Use a columnar format for analytical queries, for the following reasons:

* Only the columns relevant to a query need to be read.
* Column encoding techniques can reduce data size significantly, which matters because data transfer is usually the bottleneck.

Azure Data Explorer supports the Parquet and ORC columnar formats. Parquet is suggested because its current reader implementation is more optimized than ORC's.
### Azure region
Make sure that the external data resides in the same Azure region as your Azure Data Explorer cluster. This reduces both cost and data fetch time.
### File size
The optimal file size is hundreds of MB (up to 1 GB) per file. Avoid many small files, which incur unneeded overhead, such as a slower file enumeration process and limited use of the columnar format. Note that the number of files should be greater than the number of CPU cores in your Azure Data Explorer cluster.
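To see how many files back an external table, so you can compare that count with your cluster's total core count, you can enumerate its artifacts. This sketch assumes the `TaxiRides` external table from this article:

```kusto
// Count the files the external table references; ideally this number
// exceeds the total CPU cores across the cluster's engine nodes.
.show external table TaxiRides artifacts
| count
```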
### Compression
Use compression to reduce the amount of data being fetched from the remote storage. For Parquet format, use the internal Parquet compression mechanism, which compresses column groups separately and thus allows you to read them separately. To validate the use of the compression mechanism, check that the files are named `<filename>.gz.parquet` or `<filename>.snappy.parquet`, as opposed to `<filename>.parquet.gz`.
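One way to spot-check the naming convention is to list the files behind the external table. This assumes the `TaxiRides` table from this article and that the artifacts output exposes a `Uri` column:

```kusto
// Flag Parquet files that don't use internal Parquet compression: correctly
// compressed files end in .gz.parquet or .snappy.parquet, not .parquet.gz.
.show external table TaxiRides artifacts
| where Uri !endswith ".gz.parquet" and Uri !endswith ".snappy.parquet"
```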
### Partitioning
Organize your data using "folder" partitions that enable the query engine to skip irrelevant paths. When planning partitioning, consider file size and the most common filters in your queries, such as timestamp or tenant ID (or a combination of both).
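As an illustration (the table, columns, and folder layout here are hypothetical, and the `partition by`/`pathformat` syntax may be newer than the version of the article shown), a definition whose partitions mirror a tenant-and-date folder layout might look like this:

```kusto
// Sketch: partitions declared over a hypothetical folder layout
// .../customer_id=Xyz/2018/01/01/..., so filters on customer_id or
// Timestamp let the engine skip irrelevant paths.
.create external table Events
(
    Timestamp: datetime,
    customer_id: string,
    payload: string
)
kind=blob
partition by (CustomerId:string = customer_id, Date:datetime = bin(Timestamp, 1d))
pathformat = ("customer_id=" CustomerId "/" datetime_pattern("yyyy/MM/dd", Date))
dataformat=parquet
(
    h@'https://<storageaccount>.blob.core.windows.net/<container>;<accountKey>'
)
```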
### VM size
Select VM SKUs with more cores and higher network throughput (memory is less important). For more information, see [Select the correct VM SKU for your Azure Data Explorer cluster](manage-cluster-choose-sku.md).