Commit 227128f

Adding lineage ingestion
1 parent cf3f2e7 commit 227128f

1 file changed, +20 -5 lines changed

articles/purview/concept-scans-and-ingestion.md

Lines changed: 20 additions & 5 deletions
@@ -6,19 +6,24 @@ ms.author: shjia
 ms.service: purview
 ms.subservice: purview-data-map
 ms.topic: conceptual
-ms.date: 02/14/2023
+ms.date: 03/13/2023
 ms.custom: ignite-fall-2021
 ---

 # Scans and ingestion in Microsoft Purview

 This article provides an overview of the Scanning and Ingestion features in Microsoft Purview. These features connect your Microsoft Purview account to your sources to populate the data map and data catalog so you can begin exploring and managing your data through Microsoft Purview.

+- [**Scanning**](#scanning) captures metadata from [data sources](microsoft-purview-connector-overview.md) and brings it to Microsoft Purview.
+- [**Ingestion**](#ingestion) processes metadata and stores it in the data catalog from both:
+- Data source scans
+- Lineage connections
+
 ## Scanning

 After data sources are [registered](manage-data-sources.md) in your Microsoft Purview account, the next step is to scan the data sources. The scanning process establishes a connection to the data source and captures technical metadata like names, file size, columns, and so on. It also extracts schema for structured data sources, applies classifications on schemas, and [applies sensitivity labels if your Microsoft Purview Data Map is connected to a Microsoft Purview compliance portal](create-sensitivity-label.md). The scanning process can be triggered to run immediately or can be scheduled to run on a periodic basis to keep your Microsoft Purview account up to date.

-For each scan there are customizations you can apply so that you're only scanning your sources for the information you need.
+For each scan, there are customizations you can apply so that you're only scanning information you need, rather than the whole source.

 ### Choose an authentication method for your scans

@@ -48,15 +53,15 @@ There are [system scan rule sets](create-a-scan-rule-set.md#system-scan-rule-set

 ### Schedule your scan

-Microsoft Purview gives you a choice of scanning weekly or monthly at a specific time you choose. Weekly scans may be appropriate for data sources with structures that are actively under development or frequently change. Monthly scanning is more appropriate for data sources that change infrequently. A good best practice is to work with the administrator of the source you want to scan to identify a time when compute demands on the source are low.
+Microsoft Purview gives you a choice of scanning weekly or monthly at a specific time you choose. Weekly scans may be appropriate for data sources with structures that are actively under development or frequently change. Monthly scanning is more appropriate for data sources that change infrequently. Best practice is to work with the administrator of the source you want to scan to identify a time when compute demands on the source are low.

 ### How scans detect deleted assets

 A Microsoft Purview catalog is only aware of the state of a data store when it runs a scan. For the catalog to know if a file, table, or container was deleted, it compares the last scan output against the current scan output. For example, suppose that the last time you scanned an Azure Data Lake Storage Gen2 account, it included a folder named *folder1*. When the same account is scanned again, *folder1* is missing. Therefore, the catalog assumes the folder has been deleted.

 #### Detecting deleted files

-The logic for detecting missing files works for multiple scans by the same user as well as by different users. For example, suppose a user runs a one-time scan on a Data Lake Storage Gen2 data store on folders A, B, and C. Later, a different user in the same account runs a different one-time scan on folders C, D, and E of the same data store. Because folder C was scanned twice, the catalog checks it for possible deletions. Folders A, B, D, and E, however, were scanned only once, and the catalog won't check them for deleted assets.
+The logic for detecting missing files works for multiple scans by the same user and by different users. For example, suppose a user runs a one-time scan on a Data Lake Storage Gen2 data store on folders A, B, and C. Later, a different user in the same account runs a different one-time scan on folders C, D, and E of the same data store. Because folder C was scanned twice, the catalog checks it for possible deletions. Folders A, B, D, and E, however, were scanned only once, and the catalog won't check them for deleted assets.

 To keep deleted files out of your catalog, it's important to run regular scans. The scan interval is important, because the catalog can't detect deleted assets until another scan is run. So, if you run scans once a month on a particular store, the catalog can't detect any deleted data assets in that store until you run the next scan a month later.

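Note on the hunk above: the deleted-asset logic it describes (compare the last scan output of a folder against the current one, and only check folders that were scanned more than once) can be illustrated with a minimal Python sketch. This is not Microsoft Purview's implementation or API. The function `detect_deleted`, the dictionary layout (folder name mapped to the set of asset paths seen in that folder), and all file names are assumptions made up for illustration.

```python
from typing import Dict, Set


def detect_deleted(previous: Dict[str, Set[str]],
                   current: Dict[str, Set[str]]) -> Set[str]:
    """Return assets that an earlier scan saw in a folder but that are
    missing from the current scan of that same folder.

    Hypothetical model: keys are folder names, values are the asset
    paths captured for that folder by a scan.
    """
    deleted: Set[str] = set()
    for folder, current_assets in current.items():
        if folder not in previous:
            # Folder scanned for the first time: nothing to compare against.
            continue
        deleted |= previous[folder] - current_assets
    return deleted


# First one-time scan covered folders A, B, and C (hypothetical contents).
previous = {
    "A": {"A/a.csv"},
    "B": {"B/b.csv"},
    "C": {"C/c1.csv", "C/c2.csv"},
}
# A later one-time scan covered folders C, D, and E.
current = {
    "C": {"C/c1.csv"},
    "D": {"D/d.csv"},
    "E": {"E/e.csv"},
}

deleted = detect_deleted(previous, current)
```

In this sketch only folder C appears in both scans, so only `C/c2.csv` is reported. Folders A, B, D, and E were each scanned once and are never checked, which matches the example in the changed paragraph.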
@@ -67,7 +72,17 @@ When you enumerate large data stores like Data Lake Storage Gen2, there are mult

 ## Ingestion

-The technical metadata or classifications identified by the scanning process are then sent to Ingestion. The ingestion process is responsible for populating the data map and is managed by Microsoft Purview. Ingestion analyses the input from scan, [applies resource set patterns](concept-resource-sets.md#how-microsoft-purview-detects-resource-sets), populates available [lineage](concept-data-lineage.md) information, and then loads the data map automatically. Assets/schemas can be discovered or curated only after ingestion is complete. So, if your scan is completed but you haven't seen your assets in the data map or catalog, you'll need to wait for the ingestion process to finish.
+Ingestion is the process responsible for populating the data map with metadata gathered through its various processes.
+
+## Ingestion from scans
+
+The technical metadata or classifications identified by the scanning process are then sent to ingestion. Ingestion analyses the input from scan, [applies resource set patterns](concept-resource-sets.md#how-microsoft-purview-detects-resource-sets), populates available [lineage](concept-data-lineage.md) information, and then loads the data map automatically. Assets/schemas can be discovered or curated only after ingestion is complete. So, if your scan is completed but you haven't seen your assets in the data map or catalog, you'll need to wait for the ingestion process to finish.
+
+## Ingestion from lineage connections
+
+Resources like [Azure Data Factory](how-to-link-azure-data-factory.md) and [Azure Synapse](how-to-lineage-azure-synapse-analytics.md) can be connected to Microsoft Purview to bring lineage information into your Microsoft Purview Data Map. For example, when a copy pipeline runs in an Azure Data Factory that has been connected to Microsoft Purview, metadata about inputs, the activity, and outputs are ingested in Microsoft Purview and the information is added to the data map.
+
+For more information about the available lineage connections, see the [lineage user guide](catalog-lineage-user-guide.md).

 ## Next steps

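Note on the hunk above: the new "Ingestion from lineage connections" text says that a pipeline run contributes metadata about its inputs, the activity, and its outputs to the data map. A rough Python sketch of that idea follows. It is a toy in-memory model only and does not reflect Microsoft Purview's or Azure Data Factory's actual APIs or schemas. The classes `LineageEvent` and `DataMap`, the `ingest` method, and the asset and activity names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass(frozen=True)
class LineageEvent:
    """Hypothetical record emitted when a connected pipeline activity runs."""
    activity: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


@dataclass
class DataMap:
    """Toy data map: a set of known assets plus lineage edges between them."""
    assets: Set[str] = field(default_factory=set)
    # Each edge is (source asset, activity name, destination asset).
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)

    def ingest(self, event: LineageEvent) -> None:
        """Register the event's assets and link every input to every output."""
        self.assets.update(event.inputs)
        self.assets.update(event.outputs)
        for src in event.inputs:
            for dst in event.outputs:
                self.edges.add((src, event.activity, dst))


# Hypothetical copy activity: one source table copied to one lake folder.
data_map = DataMap()
data_map.ingest(LineageEvent(
    activity="CopyOrders",
    inputs=("sql://sales/orders",),
    outputs=("adls://raw/orders",),
))
```

After the call, `data_map.assets` holds both endpoints and `data_map.edges` holds one edge labelled with the activity name. The point of the sketch is that lineage ingestion in this model adds relationships between assets without running a scan.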