Merge pull request #108291 from yossi-karp/ingest-doc-split

megvanhuygen · web-flow · commit 0dbf8055db8f · 2020-03-19T11:46:42.000-07:00
New files based on internal docs + updated TOC
diff --git a/articles/data-explorer/ingestion-properties.md b/articles/data-explorer/ingestion-properties.md
@@ -0,0 +1,40 @@
+---
+title: Data ingestion properties for Azure Data Explorer
+description: Learn about the various data ingestion properties supported by Azure Data Explorer.
+author: orspod
+ms.author: orspodek
+ms.reviewer: tzgitlin
+ms.service: data-explorer
+ms.topic: conceptual
+ms.date: 03/19/2020
+---
+
+# Azure Data Explorer data ingestion properties 
+
+Data ingestion is the process by which data is added to a table and is made available for query in Azure Data Explorer. You add properties to the ingestion command after the `with` keyword.
+
+## Ingestion properties
+
+The following table lists the properties supported by Azure Data Explorer, describes them, and provides examples: 
+
+|Property              |Description                                              |Example                                             |
+|----------------------|---------------------------------------------------------|----------------------------------------------------|
+|`ingestionMapping`    |A string value that indicates how to map data from the source file to the actual columns in the table. Define the `format` value with the relevant mapping type. See [data mappings](/azure/kusto/management/mappings).|`with (format="json", ingestionMapping = "[{\"column\":\"rownumber\", \"Properties\":{\"Path\":\"$.RowNumber\"}}, {\"column\":\"rowguid\", \"Properties\":{\"Path\":\"$.RowGuid\"}}]")`<br>(deprecated: `avroMapping`, `csvMapping`, `jsonMapping`) |
+|`ingestionMappingReference`|A string value that indicates how to map data from the source file to the actual columns in the table using a named mapping policy object. Define the `format` value with the relevant mapping type. See [data mappings](/azure/kusto/management/mappings).|`with (format="csv", ingestionMappingReference = "Mapping1")`<br>(deprecated: `avroMappingReference`, `csvMappingReference`, `jsonMappingReference`)|
+|`creationTime` |The datetime value (formatted as an ISO8601 string) to use as the creation time of the ingested data extents. If unspecified, the current value (`now()`) will be used. Overriding the default is useful when ingesting older data, so that the retention policy will be applied correctly.|`with (creationTime="2017-02-13T11:09:36.7992775Z")`|
+|`extend_schema`|A Boolean value that, if specified, instructs the command to extend the schema of the table (defaults to `false`). This option applies only to `.append` and `.set-or-append` commands. The only allowed schema extensions are those that add columns at the end of the table.|If the original table schema is `(a:string, b:int)`, a valid schema extension would be `(a:string, b:int, c:datetime, d:string)`, but `(a:string, c:datetime)` wouldn't be valid.|
+|`folder` |For [ingest-from-query](/azure/kusto/management/data-ingestion/ingest-from-query) commands, the folder to assign to the table. If the table already exists, this property will override the table's folder.|`with (folder="Tables/Temporary")`|
+|`format` |The data format (see [supported data formats](ingestion-supported-formats.md)).|`with (format="csv")`|
+|`ingestIfNotExists`|A string value that, if specified, prevents ingestion from succeeding if the table already has data tagged with an `ingest-by:` tag with the same value. This ensures idempotent data ingestion. For more information, see [ingest-by: tags](/azure/kusto/management/extents-overview#ingest-by-extent-tags).|The properties `with (ingestIfNotExists='["Part0001"]', tags='["ingest-by:Part0001"]')` indicate that if data with the tag `ingest-by:Part0001` already exists, the current ingestion shouldn't be completed. If it doesn't already exist, this new ingestion should have this tag set (in case a future ingestion attempts to ingest the same data again).|
+|`ignoreFirstRecord` |A Boolean value that, if set to `true`, indicates that ingestion should ignore the first record of every file. This property is useful for files in `CSV` and similar formats, if the first record in the file contains the column names. By default, `false` is assumed.|`with (ignoreFirstRecord=false)`|
+|`persistDetails` |A Boolean value that, if specified, indicates that the command should persist the detailed results (even if successful) so that the [.show operation details](/azure/kusto/management/operations#show-operation-details) command can retrieve them. Defaults to `false`.|`with (persistDetails=true)`|
+|`policy_ingestiontime`|A Boolean value that, if specified, describes whether to enable the [Ingestion Time Policy](/azure/kusto/management/ingestiontimepolicy) on a table that is created by this command. The default is `true`.|`with (policy_ingestiontime=false)`|
+|`recreate_schema` |A Boolean value that, if specified, describes whether the command may recreate the schema of the table. This property applies only to the `.set-or-replace` command. This property takes precedence over the `extend_schema` property if both are set.|`with (recreate_schema=true)`|
+|`tags`|A list of [tags](/azure/kusto/management/extents-overview#extent-tagging) to associate with the ingested data, formatted as a JSON string.|`with (tags="['Tag1', 'Tag2']")`|
+|`validationPolicy`|A JSON string that indicates which validations to run during ingestion. See [Data ingestion](/azure/kusto/management/data-ingestion/) for an explanation of the different options.| `with (validationPolicy='{"ValidationOptions":1, "ValidationImplications":1}')` (this is actually the default policy)|
+|`zipPattern`|Use this property when ingesting data from storage that has a ZIP archive. This is a string value indicating the regular expression to use when selecting which files in the ZIP archive to ingest. All other files in the archive will be ignored.|`with (zipPattern="*.csv")`|
+
+## Next steps
+
+* Learn more about [data ingestion](/azure/data-explorer/ingest-data-overview)
+* Learn more about [supported data formats](ingestion-supported-formats.md)
diff --git a/articles/data-explorer/ingestion-supported-formats.md b/articles/data-explorer/ingestion-supported-formats.md
@@ -0,0 +1,59 @@
+---
+title: Data formats supported by Azure Data Explorer for ingestion
+description: Learn about the various data and compression formats supported by Azure Data Explorer for ingestion.
+author: orspod
+ms.author: orspodek
+ms.reviewer: tzgitlin
+ms.service: data-explorer
+ms.topic: conceptual
+ms.date: 03/19/2020
+---
+
+# Data formats supported by Azure Data Explorer for ingestion
+
+Data ingestion is the process by which data is added to a table and is made available for query in Azure Data Explorer. For all ingestion methods, other than ingest-from-query, the data must be in one of the supported formats. The following table lists and describes the formats that Azure Data Explorer supports for data ingestion.
+
+|Format   |Extension   |Description|
+|---------|------------|-----------|
+|avro     |`.avro`     |An [Avro container file](https://avro.apache.org/docs/current/). The following codecs are supported: `null`, `deflate` (`snappy` is currently not supported).|
+|CSV      |`.csv`      |A text file with comma-separated values (`,`). See [RFC 4180: _Common Format and MIME Type for Comma-Separated Values (CSV) Files_](https://www.ietf.org/rfc/rfc4180.txt).|
+|JSON     |`.json`     |A text file with JSON objects delimited by `\n` or `\r\n`. See [JSON Lines (JSONL)](http://jsonlines.org/).|
+|multijson|`.multijson`|A text file with a JSON array of property bags (each representing a record), or any number of property bags delimited by whitespace, `\n` or `\r\n`. Each property bag can be spread across multiple lines. This format is preferred over `JSON`, unless the data doesn't consist of property bags.|
+|orc      |`.orc`      |An [ORC file](https://en.wikipedia.org/wiki/Apache_ORC).|
+|parquet  |`.parquet`  |A [Parquet file](https://en.wikipedia.org/wiki/Apache_Parquet).|
+|psv      |`.psv`      |A text file with pipe-separated values (<code>&#124;</code>).|
+|raw      |`.raw`      |A text file whose entire contents is a single string value.|
+|scsv     |`.scsv`     |A text file with semicolon-separated values (`;`).|
+|sohsv    |`.sohsv`    |A text file with SOH-separated values. (SOH is ASCII codepoint 1; this format is used by Hive on HDInsight.)|
+|tsv      |`.tsv`      |A text file with tab-separated values (`\t`).|
+|tsve     |`.tsv`      |A text file with tab-separated values (`\t`). A backslash character (`\`) is used for escaping.|
+|txt      |`.txt`      |A text file with lines delimited by `\n`. Empty lines are skipped.|
+
+## Supported data compression formats
+
+Blobs and files can be compressed through any of the following compression algorithms:
+
+|Compression|Extension|
+|-----------|---------|
+|GZip       |.gz      |
+|Zip        |.zip     |
+
+Indicate compression by appending the extension to the name of the blob or file.
+
+For example:
+* `MyData.csv.zip` indicates a blob or a file formatted as CSV, compressed with ZIP (archive or a single file)
+* `MyData.csv.gz` indicates a blob or a file formatted as CSV, compressed with GZip
+
+Blob or file names that include only the compression extension, without the format extension (for example, ), are also supported. In this case, the file format
+must be specified as an ingestion property because it can't be inferred.
+
+> [!NOTE]
+> Some compression formats keep track of the original file extension as part
+> of the compressed stream. This extension is generally ignored for
+> determining the file format. If the file format can't be determined from the (compressed)
+> blob or file name, it must be specified through the `format` ingestion property.
+
+## Next steps
+
+* Learn more about [data ingestion](/azure/data-explorer/ingest-data-overview)
+* Learn more about [Azure Data Explorer data ingestion properties](ingestion-properties.md)
diff --git a/articles/data-explorer/toc.yml b/articles/data-explorer/toc.yml
@@ -32,6 +32,10 @@
     - name: Data ingestion overview
       displayName: pipelines, connectors, plugins, Python, .NET, Java, Node, REST
       href: ingest-data-overview.md
+    - name: Data ingestion properties
+      href: ingestion-properties.md
+    - name: Formats for data ingestion
+      href: ingestion-supported-formats.md
   - name: Kusto Query Language
     items:
     - name: Quick reference guide