articles/purview/concept-scans-and-ingestion.md
Lines changed: 20 additions & 5 deletions
@@ -6,19 +6,24 @@ ms.author: shjia
 ms.service: purview
 ms.subservice: purview-data-map
 ms.topic: conceptual
-ms.date: 02/14/2023
+ms.date: 03/13/2023
 ms.custom: ignite-fall-2021
 ---

 # Scans and ingestion in Microsoft Purview

 This article provides an overview of the Scanning and Ingestion features in Microsoft Purview. These features connect your Microsoft Purview account to your sources to populate the data map and data catalog so you can begin exploring and managing your data through Microsoft Purview.

+-[**Scanning**](#scanning) captures metadata from [data sources](microsoft-purview-connector-overview.md) and brings it to Microsoft Purview.
+-[**Ingestion**](#ingestion) processes metadata and stores it in the data catalog from both:
+- Data source scans
+- Lineage connections
+
 ## Scanning

 After data sources are [registered](manage-data-sources.md) in your Microsoft Purview account, the next step is to scan the data sources. The scanning process establishes a connection to the data source and captures technical metadata like names, file size, columns, and so on. It also extracts schema for structured data sources, applies classifications on schemas, and [applies sensitivity labels if your Microsoft Purview Data Map is connected to a Microsoft Purview compliance portal](create-sensitivity-label.md). The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.

-For each scan there are customizations you can apply so that you're only scanning your sources for the information you need.
+For each scan, there are customizations you can apply so that you're only scanning information you need, rather than the whole source.

 ### Choose an authentication method for your scans

@@ -48,15 +53,15 @@ There are [system scan rule sets](create-a-scan-rule-set.md#system-scan-rule-set

 ### Schedule your scan

-Microsoft Purview gives you a choice of scanning weekly or monthly at a specific time you choose. Weekly scans may be appropriate for data sources with structures that are actively under development or frequently change. Monthly scanning is more appropriate for data sources that change infrequently. A good best practice is to work with the administrator of the source you want to scan to identify a time when compute demands on the source are low.
+Microsoft Purview gives you a choice of scanning weekly or monthly at a specific time you choose. Weekly scans may be appropriate for data sources with structures that are actively under development or frequently change. Monthly scanning is more appropriate for data sources that change infrequently. Best practice is to work with the administrator of the source you want to scan to identify a time when compute demands on the source are low.

 ### How scans detect deleted assets

 A Microsoft Purview catalog is only aware of the state of a data store when it runs a scan. For the catalog to know if a file, table, or container was deleted, it compares the last scan output against the current scan output. For example, suppose that the last time you scanned an Azure Data Lake Storage Gen2 account, it included a folder named *folder1*. When the same account is scanned again, *folder1* is missing. Therefore, the catalog assumes the folder has been deleted.

 #### Detecting deleted files

-The logic for detecting missing files works for multiple scans by the same user as well as by different users. For example, suppose a user runs a one-time scan on a Data Lake Storage Gen2 data store on folders A, B, and C. Later, a different user in the same account runs a different one-time scan on folders C, D, and E of the same data store. Because folder C was scanned twice, the catalog checks it for possible deletions. Folders A, B, D, and E, however, were scanned only once, and the catalog won't check them for deleted assets.
+The logic for detecting missing files works for multiple scans by the same user and by different users. For example, suppose a user runs a one-time scan on a Data Lake Storage Gen2 data store on folders A, B, and C. Later, a different user in the same account runs a different one-time scan on folders C, D, and E of the same data store. Because folder C was scanned twice, the catalog checks it for possible deletions. Folders A, B, D, and E, however, were scanned only once, and the catalog won't check them for deleted assets.

 To keep deleted files out of your catalog, it's important to run regular scans. The scan interval is important, because the catalog can't detect deleted assets until another scan is run. So, if you run scans once a month on a particular store, the catalog can't detect any deleted data assets in that store until you run the next scan a month later.

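Reviewer note: the deletion-detection behavior described in the section on deleted assets can be sketched in a few lines of Python. This is only an illustration of the comparison logic as the article describes it, not Microsoft Purview's implementation. The function names `detect_deleted` and `folders_checked_for_deletion`, and the representation of scan output as a set of asset paths, are assumptions made for the example.

```python
from __future__ import annotations

from collections import Counter


def detect_deleted(previous_scan: set[str], current_scan: set[str]) -> set[str]:
    """Return assets seen in the previous scan output but absent from the current one.

    Hypothetical helper: models "compare the last scan output against the
    current scan output" as a plain set difference.
    """
    return previous_scan - current_scan


def folders_checked_for_deletion(scan_scopes: list[set[str]]) -> set[str]:
    """Return folders covered by at least two scans.

    Hypothetical helper: a folder scanned only once has no earlier output to
    compare against, so it cannot be checked for deleted assets.
    """
    counts = Counter(folder for scope in scan_scopes for folder in scope)
    return {folder for folder, seen in counts.items() if seen >= 2}


# The folder1 example: present in the last scan, missing in the current scan.
last_scan = {"folder1", "folder2"}
this_scan = {"folder2"}
print(sorted(detect_deleted(last_scan, this_scan)))  # ['folder1']

# The two one-time scans: folders A, B, C and then folders C, D, E.
scopes = [{"A", "B", "C"}, {"C", "D", "E"}]
print(sorted(folders_checked_for_deletion(scopes)))  # ['C']
```

Only folder C appears in both scopes, which matches the article's statement that folders A, B, D, and E are not checked for deleted assets.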
@@ -67,7 +72,17 @@ When you enumerate large data stores like Data Lake Storage Gen2, there are mult

 ## Ingestion

-The technical metadata or classifications identified by the scanning process are then sent to Ingestion. The ingestion process is responsible for populating the data map and is managed by Microsoft Purview. Ingestion analyses the input from scan, [applies resource set patterns](concept-resource-sets.md#how-microsoft-purview-detects-resource-sets), populates available [lineage](concept-data-lineage.md) information, and then loads the data map automatically. Assets/schemas can be discovered or curated only after ingestion is complete. So, if your scan is completed but you haven't seen your assets in the data map or catalog, you'll need to wait for the ingestion process to finish.
+Ingestion is the process responsible for populating the data map with metadata gathered through its various processes.
+
+## Ingestion from scans
+
+The technical metadata or classifications identified by the scanning process are then sent to ingestion. Ingestion analyses the input from scan, [applies resource set patterns](concept-resource-sets.md#how-microsoft-purview-detects-resource-sets), populates available [lineage](concept-data-lineage.md) information, and then loads the data map automatically. Assets/schemas can be discovered or curated only after ingestion is complete. So, if your scan is completed but you haven't seen your assets in the data map or catalog, you'll need to wait for the ingestion process to finish.
+
+## Ingestion from lineage connections
+
+Resources like [Azure Data Factory](how-to-link-azure-data-factory.md) and [Azure Synapse](how-to-lineage-azure-synapse-analytics.md) can be connected to Microsoft Purview to bring lineage information into your Microsoft Purview Data Map. For example, when a copy pipeline runs in an Azure Data Factory that has been connected to Microsoft Purview, metadata about inputs, the activity, and outputs are ingested in Microsoft Purview and the information is added to the data map.
+
+For more information about the available lineage connections, see the [lineage user guide](catalog-lineage-user-guide.md).