---
title: Filter data by using Azure Data Lake Storage query acceleration (preview) | Microsoft Docs
description: Use query acceleration (preview) to retrieve a subset of data from your storage account.
author: normesta
ms.subservice: data-lake-storage-gen2
ms.service: storage
ms.topic: conceptual
ms.date: 04/21/2020
ms.author: normesta
ms.reviewer: jamsbak
---

# Filter data by using Azure Data Lake Storage query acceleration (preview)

This article shows you how to use query acceleration (preview) to retrieve a subset of data from your storage account.

Query acceleration (preview) is a new capability for Azure Data Lake Storage that enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data that they require to perform a given operation. To learn more, see [Azure Data Lake Storage query acceleration (preview)](data-lake-storage-query-acceleration.md).

> [!NOTE]
> The query acceleration feature is in public preview, and is available in the Canada Central and France Central regions. To review limitations, see the [Known issues](data-lake-storage-known-issues.md) article. To enroll in the preview, see [this form](https://aka.ms/adls/qa-preview-signup).

## Prerequisites

### [.NET](#tab/dotnet)

- To access Azure Storage, you'll need an Azure subscription. If you don't already have a subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

- A **general-purpose v2** storage account. See [Create a storage account](../common/storage-quickstart-create-account.md).

- [.NET SDK](https://dotnet.microsoft.com/download).

### [Java](#tab/java)

- To access Azure Storage, you'll need an Azure subscription. If you don't already have a subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

- A **general-purpose v2** storage account. See [Create a storage account](../common/storage-quickstart-create-account.md).

- [Java Development Kit (JDK)](/java/azure/jdk/?view=azure-java-stable) version 8 or above.

- [Apache Maven](https://maven.apache.org/download.cgi).

  > [!NOTE]
  > This article assumes that you've created a Java project by using Apache Maven. For an example of how to create a project by using Apache Maven, see [Setting up](storage-quickstart-blobs-java.md#setting-up).

---

## Install packages

### [.NET](#tab/dotnet)

1. Download the query acceleration packages. You can obtain a compressed .zip file that contains these packages by using this link: [https://aka.ms/adls/qqsdk/.net](https://aka.ms/adls/qqsdk/.net).

2. Extract the contents of this file to your project directory.

3. Open your project file (*.csproj*) in a text editor, and add these package references inside of the \<Project\> element.

    ```xml
    <ItemGroup>
        <PackageReference Include="Azure.Storage.Blobs" Version="12.5.0-preview.1" />
        <PackageReference Include="Azure.Storage.Common" Version="12.4.0-preview.1" />
        <PackageReference Include="Azure.Storage.QuickQuery" Version="12.0.0-preview.1" />
    </ItemGroup>
    ```

4. Restore the preview SDK packages by using the `dotnet restore` command, and pass the directory that contains the extracted packages as the package source.

    ```console
    dotnet restore --source C:\Users\contoso\myProject
    ```

5. Restore all other dependencies from the public NuGet repository.

    ```console
    dotnet restore
    ```

### [Java](#tab/java)

1. Create a directory in the root of your project. The root directory is the directory that contains the **pom.xml** file.

    > [!NOTE]
    > The examples in this article assume that the name of the directory is **lib**.

2. Download the query acceleration packages. You can obtain a compressed .zip file that contains these packages by using this link: [https://aka.ms/adls/qqsdk/java](https://aka.ms/adls/qqsdk/java).

3. Extract the files in this .zip file to the directory that you created. In our example, that directory is named **lib**.

4. Open the *pom.xml* file in your text editor. Add the following dependency elements to the group of dependencies.

    ```xml
    <!-- Request static dependencies from Maven -->
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-core</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-core-http-netty</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.9.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.8</version>
    </dependency>
    <!-- Local dependencies -->
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-storage-blob</artifactId>
        <version>12.5.0-beta.1</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/lib/azure-storage-blob-12.5.0-beta.1.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-storage-common</artifactId>
        <version>12.5.0-beta.1</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/lib/azure-storage-common-12.5.0-beta.1.jar</systemPath>
    </dependency>
    <dependency>
        <groupId>com.azure</groupId>
        <artifactId>azure-storage-quickquery</artifactId>
        <version>12.0.0-beta.1</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/lib/azure-storage-quickquery-12.0.0-beta.1.jar</systemPath>
    </dependency>
    ```

---

## Add statements

### [.NET](#tab/dotnet)

Add these `using` statements to the top of your code file.

```csharp
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;
using Azure.Storage.Blobs.Specialized;
using Azure.Storage.QuickQuery;
using Azure.Storage.QuickQuery.Models;
```

Query acceleration retrieves CSV- and JSON-formatted data. Therefore, make sure to add `using` statements for any CSV or JSON parsing libraries that you choose to use. The examples that appear in this article parse a CSV file by using the [CsvHelper](https://www.nuget.org/packages/CsvHelper/) library that is available on NuGet. Therefore, we'd add these `using` statements to the top of the code file.

```csharp
using CsvHelper;
using CsvHelper.Configuration;
```

To compile the examples presented in this article, you'll also need to add these `using` statements.

```csharp
using System;
using System.Threading.Tasks;
using System.IO;
using System.Globalization;
using System.Threading;
using System.Linq;
```

### [Java](#tab/java)

Add these `import` statements to the top of your code file.

```java
import com.azure.storage.blob.*;
import com.azure.storage.blob.models.*;
import com.azure.storage.common.*;
import com.azure.storage.quickquery.*;
import com.azure.storage.quickquery.models.*;
import java.io.*;
import org.apache.commons.csv.*;
```

---

## Retrieve data by using a filter

You can use SQL to specify the row filter predicates and column projections in a query acceleration request. The following code queries a CSV file in storage, and returns all rows of data where the third column matches the value `Hemingway, Ernest`.

- In the SQL query, the keyword `BlobStorage` is used to denote the file that is being queried.

- Column references are specified as `_N` where the first column is `_1`. If the source file contains a header row, then you can refer to columns by the name that is specified in the header row.

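As a sketch of the two reference styles described above, the following Java strings express the same filter positionally and by header name. The column header `Author` is a hypothetical header chosen for illustration; it isn't defined by this article's sample data.

```java
public class QuerySyntaxExamples {
    // Positional reference: columns are _1, _2, _3, ... from left to right.
    static final String BY_POSITION =
        "SELECT * FROM BlobStorage WHERE _3 = 'Hemingway, Ernest'";

    // Header-based reference: valid only when the source file has a header
    // row. "Author" is a hypothetical header name used for illustration.
    static final String BY_HEADER =
        "SELECT * FROM BlobStorage WHERE Author = 'Hemingway, Ernest'";

    public static void main(String[] args) {
        System.out.println(BY_POSITION);
        System.out.println(BY_HEADER);
    }
}
```

Either expression can be passed as the query string in the examples that follow.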
### [.NET](#tab/dotnet)

The async method `BlobQuickQueryClient.QueryAsync` sends the query to the query acceleration API, and then streams the results back to the application as a [Stream](https://docs.microsoft.com/dotnet/api/system.io.stream?view=netframework-4.8) object.

```cs
static async Task QueryHemingway(BlockBlobClient blob)
{
    string query = @"SELECT * FROM BlobStorage WHERE _3 = 'Hemingway, Ernest'";
    await DumpQueryCsv(blob, query, false);
}

private static async Task DumpQueryCsv(BlockBlobClient blob, string query, bool headers)
{
    try
    {
        using (var reader = new StreamReader((await blob.GetQuickQueryClient().QueryAsync(query,
                new CsvTextConfiguration() { HasHeaders = headers },
                new CsvTextConfiguration() { HasHeaders = false },
                new ErrorHandler(),
                new BlobRequestConditions(),
                new ProgressHandler(),
                CancellationToken.None)).Value.Content))
        {
            using (var parser = new CsvReader(reader, new CsvConfiguration(CultureInfo.CurrentCulture)
                { HasHeaderRecord = false }))
            {
                while (await parser.ReadAsync())
                {
                    parser.Context.Record.All(cell =>
                    {
                        Console.Out.Write(cell + " ");
                        return true;
                    });
                    Console.Out.WriteLine();
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine("Exception: " + ex.ToString());
    }
}

class ErrorHandler : IBlobQueryErrorReceiver
{
    public void ReportError(BlobQueryError err)
    {
        Console.Error.WriteLine(String.Format("Error: {0}:{1}", err.Name, err.Description));
    }
}

class ProgressHandler : IProgress<long>
{
    public void Report(long value)
    {
        Console.Error.WriteLine("Bytes scanned: " + value.ToString());
    }
}
```

### [Java](#tab/java)

The method `BlobQuickQueryClient.openInputStream()` sends the query to the query acceleration API, and then streams the results back to the application as an `InputStream` object, which can be read like any other `InputStream`.

```java
static void QueryHemingway(BlobClient blobClient) {
    String expression = "SELECT * FROM BlobStorage WHERE _3 = 'Hemingway, Ernest'";
    DumpQueryCsv(blobClient, expression, false);
}

static void DumpQueryCsv(BlobClient blobClient, String query, Boolean headers) {
    try {
        /* Define the serialization of the source blob. */
        BlobQuickQueryDelimitedSerialization input = new BlobQuickQueryDelimitedSerialization()
            .setRecordSeparator('\n')
            .setColumnSeparator(',')
            .setHeadersPresent(headers)
            .setFieldQuote('\0')
            .setEscapeChar('\\');

        /* Define the serialization of the returned results. */
        BlobQuickQueryDelimitedSerialization output = new BlobQuickQueryDelimitedSerialization()
            .setRecordSeparator('\n')
            .setColumnSeparator(',')
            .setHeadersPresent(false)
            .setFieldQuote('\0')
            .setEscapeChar('\n');

        BlobRequestConditions requestConditions = null;

        /* ErrorReceiver determines what to do on errors. */
        ErrorReceiver<BlobQuickQueryError> errorReceiver = System.out::println;

        /* ProgressReceiver details how to log progress. */
        com.azure.storage.common.ProgressReceiver progressReceiver = System.out::println;

        /* Create a query acceleration client to the blob. */
        BlobQuickQueryClient qqClient = new BlobQuickQueryClientBuilder(blobClient)
            .buildClient();

        /* Open the query input stream. */
        InputStream stream = qqClient.openInputStream(query, input, output, requestConditions, errorReceiver, progressReceiver);

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
            /* Read from the stream like you normally would. */
            for (CSVRecord record : CSVParser.parse(reader, CSVFormat.EXCEL)) {
                System.out.println(record.toString());
            }
        }
    } catch (Exception e) {
        System.err.println("Exception: " + e.toString());
    }
}
```

---

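The stream handling in the example above doesn't depend on Azure itself. Here's a minimal, storage-free sketch of the same read pattern, using an in-memory stream and simulated rows as stand-ins for the query results; real results should go through the Apache Commons CSV parser shown above, which handles quoted fields.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StreamReadSketch {
    // Read every line from a stream of query results, one record per line.
    static List<String> readAllLines(InputStream stream) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Simulated query acceleration output: two filtered rows, no header row.
        byte[] results = "123,A Farewell to Arms\n456,The Old Man and the Sea\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String line : readAllLines(new ByteArrayInputStream(results))) {
            System.out.println(line);
        }
    }
}
```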
## Retrieve specific columns

You can scope your results to a subset of columns. That way, you retrieve only the columns that are needed to perform a given calculation. This improves application performance and reduces cost because less data is transferred over the network.

This code retrieves only the `PublicationYear` column for all books in the data set. It also uses the information from the header row in the source file to reference columns in the query.

### [.NET](#tab/dotnet)

```cs
static async Task QueryPublishDates(BlockBlobClient blob)
{
    string query = @"SELECT PublicationYear FROM BlobStorage";
    await DumpQueryCsv(blob, query, true);
}
```

### [Java](#tab/java)

```java
static void QueryPublishDates(BlobClient blobClient)
{
    String expression = "SELECT PublicationYear FROM BlobStorage";
    DumpQueryCsv(blobClient, expression, true);
}
```

---

The following code combines row filtering and column projections into the same query.

### [.NET](#tab/dotnet)

```cs
static async Task QueryMysteryBooks(BlockBlobClient blob)
{
    string query = @"SELECT BibNum, Title, Author, ISBN, Publisher FROM BlobStorage WHERE Subjects LIKE '%Mystery%'";
    await DumpQueryCsv(blob, query, true);
}
```

### [Java](#tab/java)

```java
static void QueryMysteryBooks(BlobClient blobClient)
{
    String expression = "SELECT BibNum, Title, Author, ISBN, Publisher FROM BlobStorage WHERE Subjects LIKE '%Mystery%'";
    DumpQueryCsv(blobClient, expression, true);
}
```

---

## Next steps

- [Query acceleration enrollment form](https://aka.ms/adls/queryaccelerationpreview)
- [Azure Data Lake Storage query acceleration (preview)](data-lake-storage-query-acceleration.md)
- [Query acceleration SQL language reference (preview)](query-acceleration-sql-reference.md)
- Query acceleration REST API reference