Commit 1a9c972

Merge pull request #206781 from linda33wj/purview
Add new HDFS source
2 parents a4eee07 + 5f9de09 commit 1a9c972

File tree

7 files changed

+327
-1
lines changed


articles/purview/microsoft-purview-connector-overview.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ ms.author: jingwang
66
ms.service: purview
77
ms.subservice: purview-data-map
88
ms.topic: conceptual
9-
ms.date: 06/17/2022
9+
ms.date: 08/03/2022
1010
ms.custom: ignite-fall-2021
1111
---
1212

@@ -50,6 +50,7 @@ The table below shows the supported capabilities for each data source. Select th
5050
|| SQL Server on Azure-Arc| No |No | No |[Yes (Preview)](how-to-data-owner-policies-arc-sql-server.md) | No |
5151
|| [Teradata](register-scan-teradata-source.md)| [Yes](register-scan-teradata-source.md#register)| [Yes](register-scan-teradata-source.md#scan)| [Yes*](register-scan-teradata-source.md#lineage) | No| No |
5252
|File|[Amazon S3](register-scan-amazon-s3.md)|[Yes](register-scan-amazon-s3.md)| [Yes](register-scan-amazon-s3.md)| Limited* | No| No |
53+
||[HDFS](register-scan-hdfs.md)|[Yes](register-scan-hdfs.md)| [Yes](register-scan-hdfs.md)| No | No| No |
5354
|Services and apps| [Erwin](register-scan-erwin-source.md)| [Yes](register-scan-erwin-source.md#register)| No | [Yes](register-scan-erwin-source.md#lineage)| No| No |
5455
|| [Looker](register-scan-looker-source.md)| [Yes](register-scan-looker-source.md#register)| No | [Yes](register-scan-looker-source.md#lineage)| No| No |
5556
|| [Power BI](register-scan-power-bi-tenant.md)| [Yes](register-scan-power-bi-tenant.md)| No | [Yes](how-to-lineage-powerbi.md)| No| No |
Lines changed: 323 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,323 @@
1+
---
2+
title: Connect to and manage HDFS
3+
description: This guide describes how to connect to HDFS in Microsoft Purview, and use Microsoft Purview's features to scan and manage your HDFS source.
4+
author: linda33wj
5+
ms.author: jingwang
6+
ms.service: purview
7+
ms.subservice: purview-data-map
8+
ms.topic: how-to #Required; leave this attribute/value as-is.
9+
ms.date: 08/03/2022
10+
ms.custom: template-how-to #Required; leave this attribute/value as-is.
11+
---
12+
13+
# Connect to and manage HDFS in Microsoft Purview
14+
15+
This article outlines how to register Hadoop Distributed File System (HDFS), and how to authenticate and interact with HDFS in Microsoft Purview. For more information about Microsoft Purview, read the [introductory article](overview.md).
16+
17+
## Supported capabilities
18+
19+
|**Metadata Extraction**|**Full Scan**|**Incremental Scan**|**Scoped Scan**|**Classification**|**Access Policy**|**Lineage**|**Data Sharing**|
|---|---|---|---|---|---|---|---|
| [Yes](#register)| [Yes](#scan)| [Yes](#scan) | [Yes](#scan) | [Yes](#scan) | No| No | No|
22+
23+
When scanning an HDFS source, Microsoft Purview supports extracting technical metadata for the following HDFS objects:
24+
25+
- Namenode
26+
- Folder
27+
- File
28+
- Resource set
29+
30+
When setting up a scan, you can choose to scan the entire HDFS or selected folders. Learn about the supported file formats [here](microsoft-purview-connector-overview.md#file-types-supported-for-scanning).
31+
32+
## Prerequisites
33+
34+
- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F).
35+
- An active [Microsoft Purview account](create-catalog-portal.md).
36+
- You need Data Source Administrator and Data Reader permissions to register a source and manage it in the Microsoft Purview governance portal. For more information about permissions, see [Access control in Microsoft Purview](catalog-permissions.md).
37+
- Set up the latest [self-hosted integration runtime](https://www.microsoft.com/download/details.aspx?id=39717). For more information, see [the create and configure a self-hosted integration runtime guide](manage-integration-runtimes.md). The minimal supported Self-hosted Integration Runtime version is 5.20.8235.2.
38+
39+
* Ensure Visual C++ Redistributable for Visual Studio 2012 Update 4 is installed on the self-hosted integration runtime machine. If you don't have this update installed, [you can download it here](https://www.microsoft.com/download/details.aspx?id=30679).
40+
* Ensure JRE or OpenJDK is installed on the self-hosted integration runtime machine for parsing Parquet and ORC files. Learn more from [here](manage-integration-runtimes.md#java-runtime-environment-installation).
41+
* To set up your environment to enable Kerberos authentication, see the [Use Kerberos authentication for the HDFS connector](#use-kerberos-authentication-for-the-hdfs-connector) section.
42+
43+
## Register
44+
45+
This section describes how to register HDFS in Microsoft Purview using the [Microsoft Purview governance portal](https://web.purview.azure.com/).
46+
47+
### Steps to register
48+
49+
To register a new HDFS source in your data catalog, follow these steps:
50+
51+
1. Navigate to your Microsoft Purview account in the [Microsoft Purview governance portal](https://web.purview.azure.com/resource/).
52+
1. Select **Data Map** on the left navigation.
53+
1. Select **Register**.
54+
1. On Register sources, select **HDFS**. Select **Continue**.
55+
56+
On the **Register sources (HDFS)** screen, follow these steps:
57+
58+
1. Enter a **Name** under which the data source will be listed in the Catalog.
59+
60+
1. Enter the **Cluster URL** of the HDFS NameNode in the form of `https://<namenode>:<port>` or `http://<namenode>:<port>`, e.g. `https://namenodeserver.com:50470` or `http://namenodeserver.com:50070`.
61+
62+
1. Select a collection or create a new one (Optional)
63+
64+
1. Finish registering the data source.
65+
66+
:::image type="content" source="media/register-scan-hdfs/register-sources.png" alt-text="Screenshot of HDFS source registration in Purview." border="true":::
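
Before registering, it can help to check that the value you plan to enter matches the **Cluster URL** form described above. The following is a minimal sketch, not part of Microsoft Purview: the function name `is_valid_cluster_url` is hypothetical, and the rule that nothing may follow the port except an optional trailing slash is an assumption drawn from the two example URLs.

```python
from urllib.parse import urlsplit


def is_valid_cluster_url(url: str) -> bool:
    """Return True if url has the form http(s)://<namenode>:<port>."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a non-numeric or out-of-range port
    except ValueError:
        return False
    return (
        parts.scheme in ("http", "https")   # only the two schemes named in the step
        and bool(parts.hostname)            # a NameNode host must be present
        and port is not None                # the port is required
        and parts.path in ("", "/")         # assumption: no path after the port
        and not parts.query
        and not parts.fragment
    )


if __name__ == "__main__":
    for candidate in ("https://namenodeserver.com:50470", "namenodeserver.com:50070"):
        print(candidate, is_valid_cluster_url(candidate))
```

The check is purely syntactic. It doesn't confirm that the NameNode is reachable or that WebHDFS is enabled on that port.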
67+
68+
## Scan
69+
70+
Follow the steps below to scan HDFS to automatically identify assets. For more information about scanning in general, see our [introduction to scans and ingestion](concept-scans-and-ingestion.md).
71+
72+
### Authentication for a scan
73+
74+
The supported authentication type for an HDFS source is **Kerberos authentication**.
75+
76+
### Create and run scan
77+
78+
To create and run a new scan, follow these steps:
79+
80+
1. Make sure a self-hosted integration runtime is set up. If it isn't set up, use the steps mentioned [here](./manage-integration-runtimes.md) to create a self-hosted integration runtime.
81+
82+
1. Navigate to **Sources**.
83+
84+
1. Select the registered HDFS source.
85+
86+
1. Select **+ New scan**.
87+
88+
1. On the **Scan *source_name*** page, provide the following details:
89+
90+
1. **Name**: The name of the scan
91+
92+
1. **Connect via integration runtime**: Select the configured self-hosted integration runtime. See setup requirements in [Prerequisites](#prerequisites) section.
93+
94+
1. **Credential**: Select the credential to connect to your data source. Make sure to:
95+
* Select **Kerberos Authentication** while creating a credential.
96+
* Provide the user name in the format of `<username>@<domain>.com` in the User name input field. Learn more from [Use Kerberos authentication for the HDFS connector](#use-kerberos-authentication-for-the-hdfs-connector).
97+
* Store the user password used to connect to HDFS in the secret key.
98+
99+
:::image type="content" source="media/register-scan-hdfs/scan.png" alt-text="Screenshot of HDFS scan configurations in Purview." border="true":::
100+
101+
1. Select **Test connection**.
102+
103+
1. Select **Continue**.
104+
105+
1. On the **Scope your scan** page, select the path(s) that you want to scan.
106+
107+
1. On the **Select a scan rule set** page, select the scan rule set you want to use for schema extraction and classification. You can choose between the system default, existing custom rule sets, or create a new rule set inline. Learn more from [Create a scan rule set](create-a-scan-rule-set.md).
108+
109+
1. On the **Set a scan trigger** page, choose your **scan trigger**. You can set up a schedule or run the scan once.
110+
111+
1. Review your scan and select **Save and Run**.
112+
113+
[!INCLUDE [create and manage scans](includes/view-and-manage-scans.md)]
114+
115+
## Use Kerberos authentication for the HDFS connector
116+
117+
There are two options for setting up the on-premises environment to use Kerberos authentication for the HDFS connector. You can choose the one that better fits your situation.
118+
* Option 1: [Join a self-hosted integration runtime machine in the Kerberos realm](#kerberos-join-realm)
119+
* Option 2: [Enable mutual trust between the Windows domain and the Kerberos realm](#kerberos-mutual-trust)
120+
121+
For either option, make sure you turn on WebHDFS for the Hadoop cluster:
122+
123+
1. Create the HTTP principal and keytab for webhdfs.
124+
125+
> [!IMPORTANT]
126+
> The HTTP Kerberos principal must start with "**HTTP/**" according to Kerberos HTTP SPNEGO specification. Learn more from [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#HDFS_Configuration_Options).
127+
128+
```bash
129+
Kadmin> addprinc -randkey HTTP/<namenode hostname>@<REALM.COM>
130+
Kadmin> ktadd -k /etc/security/keytab/spnego.service.keytab HTTP/<namenode hostname>@<REALM.COM>
131+
```
132+
133+
2. HDFS configuration options: add the following three properties in `hdfs-site.xml`.
134+
```xml
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
<property>
    <name>dfs.web.authentication.kerberos.principal</name>
    <value>HTTP/_HOST@<REALM.COM></value>
</property>
<property>
    <name>dfs.web.authentication.kerberos.keytab</name>
    <value>/etc/security/keytab/spnego.service.keytab</value>
</property>
```
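
After editing `hdfs-site.xml`, a quick check that the three properties are present can catch typos before you run a scan. The sketch below is illustrative only: the function name `check_webhdfs_properties` is hypothetical, and it assumes the file content is well-formed XML with the realm placeholder already replaced by a real realm name.

```python
import xml.etree.ElementTree as ET

# The three property names listed in the configuration step above.
REQUIRED = (
    "dfs.webhdfs.enabled",
    "dfs.web.authentication.kerberos.principal",
    "dfs.web.authentication.kerberos.keytab",
)


def check_webhdfs_properties(hdfs_site_xml: str) -> list:
    """Return a list of problems; an empty list means the three properties look right."""
    root = ET.fromstring(hdfs_site_xml)
    props = {
        p.findtext("name", "").strip(): p.findtext("value", "").strip()
        for p in root.iter("property")
    }
    problems = [f"missing property: {name}" for name in REQUIRED if name not in props]
    # Only report a wrong value when the property is present; absence is reported above.
    if props.get("dfs.webhdfs.enabled", "true").lower() != "true":
        problems.append("dfs.webhdfs.enabled is not true")
    principal = props.get("dfs.web.authentication.kerberos.principal", "HTTP/")
    if not principal.startswith("HTTP/"):
        problems.append("Kerberos principal does not start with HTTP/")
    return problems
```

Call it with the text of your `hdfs-site.xml`. The check only inspects the file content; it doesn't verify that the keytab exists or that the principal is registered in the KDC.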
148+
149+
### <a name="kerberos-join-realm"></a>Option 1: Join a self-hosted integration runtime machine in the Kerberos realm
150+
151+
#### Requirements
152+
153+
* The self-hosted integration runtime machine must join the Kerberos realm and can't be joined to any Windows domain.
154+
155+
#### How to configure
156+
157+
**On the KDC server:**
158+
159+
Create a principal, and specify the password.
160+
161+
> [!IMPORTANT]
162+
> The username should not contain the hostname.
163+
164+
```bash
165+
Kadmin> addprinc <username>@<REALM.COM>
166+
```
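
The naming rule above (a user name followed by the realm, with no hostname part) can be expressed as a small check. This is a sketch under assumptions: the helper `is_user_principal` is hypothetical, and it treats any `/` in the name as a host or instance component, as in the `HTTP/<namenode hostname>@<REALM.COM>` service principal.

```python
import re

# Assumed shape: <username>@<REALM>, with no "/" (which would introduce a
# host or instance component) and no whitespace or second "@".
_USER_PRINCIPAL = re.compile(r"[^@/\s]+@[^@/\s]+")


def is_user_principal(name: str) -> bool:
    """Return True if name looks like <username>@<REALM> without a hostname part."""
    return _USER_PRINCIPAL.fullmatch(name) is not None


if __name__ == "__main__":
    print(is_user_principal("hdfsuser@REALM.COM"))            # user principal
    print(is_user_principal("HTTP/namenode.realm.com@REALM.COM"))  # service principal
```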
167+
168+
**On the self-hosted integration runtime machine:**
169+
170+
1. Run the Ksetup utility to configure the Kerberos Key Distribution Center (KDC) server and realm.
171+
172+
The machine must be configured as a member of a workgroup, because a Kerberos realm is different from a Windows domain. You can achieve this configuration by setting the Kerberos realm and adding a KDC server by running the following commands. Replace *REALM.COM* with your own realm name.
173+
174+
```cmd
175+
C:> Ksetup /setdomain REALM.COM
176+
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
177+
```
178+
179+
After you run these commands, restart the machine.
180+
181+
2. Verify the configuration with the `Ksetup` command. The output should look like the following:
182+
183+
```cmd
184+
C:> Ksetup
185+
default realm = REALM.COM (external)
186+
REALM.com:
187+
kdc = <your_kdc_server_address>
188+
```
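
If you want to script the verification, the output can be parsed for the default realm and KDC entries. The sketch below is based only on the sample output shown above; real `Ksetup` output may contain additional lines, and the function name `parse_ksetup_output` and the choice to upper-case realm names are assumptions.

```python
def parse_ksetup_output(text: str) -> dict:
    """Extract the default realm and per-realm KDC addresses from Ksetup output."""
    result = {"default_realm": None, "kdcs": {}}
    current_realm = None
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith("default realm ="):
            # "default realm = REALM.COM (external)" -> "REALM.COM"
            tokens = line.split("=", 1)[1].split()
            if tokens:
                result["default_realm"] = tokens[0]
        elif line.endswith(":"):
            # A realm header such as "REALM.com:" (upper-cased; an assumption)
            current_realm = line[:-1].upper()
            result["kdcs"].setdefault(current_realm, [])
        elif lowered.startswith("kdc =") and current_realm:
            result["kdcs"][current_realm].append(line.split("=", 1)[1].strip())
    return result
```

On Windows, you could feed it the captured output of the command, for example from `subprocess.run(["ksetup"], capture_output=True, text=True).stdout`.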
189+
190+
**In your Purview account:**
191+
192+
* Configure a credential with the Kerberos authentication type, using your Kerberos principal name and password, to scan HDFS. For configuration details, see the credential settings in the [Scan section](#scan).
193+
194+
### <a name="kerberos-mutual-trust"></a>Option 2: Enable mutual trust between the Windows domain and the Kerberos realm
195+
196+
#### Requirements
197+
198+
* The self-hosted integration runtime machine must join a Windows domain.
199+
* You need permission to update the domain controller's settings.
200+
201+
#### How to configure
202+
203+
> [!NOTE]
204+
> Replace REALM.COM and AD.COM in the following tutorial with your own realm name and domain controller.
205+
206+
**On the KDC server:**
207+
208+
1. Edit the KDC configuration in the *krb5.conf* file to let KDC trust the Windows domain by referring to the following configuration template. By default, the configuration is located at */etc/krb5.conf*.
209+
210+
```config
[logging]
  default = FILE:/var/log/krb5libs.log
  kdc = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log

[libdefaults]
  default_realm = REALM.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h
  renew_lifetime = 7d
  forwardable = true

[realms]
  REALM.COM = {
    kdc = node.REALM.COM
    admin_server = node.REALM.COM
  }
  AD.COM = {
    kdc = windc.ad.com
    admin_server = windc.ad.com
  }

[domain_realm]
  .REALM.COM = REALM.COM
  REALM.COM = REALM.COM
  .ad.com = AD.COM
  ad.com = AD.COM

[capaths]
  AD.COM = {
    REALM.COM = .
  }
```
245+
246+
After you configure the file, restart the KDC service.
247+
248+
2. Prepare a principal named *krbtgt/REALM.COM\@AD.COM* in the KDC server with the following command:
249+
250+
```cmd
251+
Kadmin> addprinc krbtgt/REALM.COM@AD.COM
252+
```
253+
254+
3. In the HDFS service configuration, add `RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//` to the *hadoop.security.auth_to_local* property.
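
To see what this rule does, the following sketch emulates it in Python. It is a simplified illustration of the three parts of the rule (the `[1:$1@$0]` template, the `(.*\@AD.COM)` filter, and the `s/\@.*//` substitution), not Hadoop's actual implementation; the function name `apply_ad_rule` is hypothetical.

```python
import re
from typing import Optional


def apply_ad_rule(principal: str) -> Optional[str]:
    """Map a one-component principal in AD.COM to a short name, else return None."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or "/" in name:
        # [1:...] applies only to principals with exactly one name component.
        return None
    formatted = f"{name}@{realm}"  # the "$1@$0" template: first component, then realm
    # The (.*\@AD.COM) filter must match the whole formatted string.
    if re.fullmatch(r".*@AD.COM", formatted) is None:
        return None
    # The s/\@.*// substitution strips everything from the "@" onward.
    return re.sub(r"@.*", "", formatted, count=1)


if __name__ == "__main__":
    for p in ("alice@AD.COM", "alice@REALM.COM", "HTTP/namenode@AD.COM"):
        print(p, "->", apply_ad_rule(p))
```

In other words, a Windows domain user such as `alice@AD.COM` is mapped to the local short name `alice`, while principals from other realms are left for other rules to handle.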
255+
256+
**On the domain controller:**
257+
258+
1. Run the following `Ksetup` commands to add a realm entry:
259+
260+
```cmd
261+
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
262+
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
263+
```
264+
265+
2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal *krbtgt/REALM.COM\@AD.COM*.
266+
267+
```cmd
268+
C:> netdom trust REALM.COM /Domain: AD.COM /add /realm /password:[password]
269+
```
270+
271+
3. Select the encryption algorithm that's used in Kerberos.
272+
273+
1. Select **Server Manager** > **Group Policy Management** > **Domain** > **Group Policy Objects** > **Default or Active Domain Policy**, and then select **Edit**.
274+
275+
1. On the **Group Policy Management Editor** pane, select **Computer Configuration** > **Policies** > **Windows Settings** > **Security Settings** > **Local Policies** > **Security Options**, and then configure **Network security: Configure Encryption types allowed for Kerberos**.
276+
277+
1. Select the encryption algorithm you want to use when you connect to the KDC server. You can select all the options.
278+
279+
:::image type="content" source="media/register-scan-hdfs/config-encryption-types-for-kerberos.png" alt-text="Screenshot of the Network security: Configure encryption types allowed for Kerberos pane.":::
280+
281+
1. Use the `Ksetup` command to specify the encryption algorithm to be used on the specified realm.
282+
283+
```cmd
284+
C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96
285+
```
286+
287+
4. Create the mapping between the domain account and the Kerberos principal, so that you can use the Kerberos principal in the Windows domain.
288+
289+
1. Select **Administrative tools** > **Active Directory Users and Computers**.
290+
291+
1. Configure advanced features by selecting **View** > **Advanced Features**.
292+
293+
1. On the **Advanced Features** pane, right-click the account to which you want to create mappings and, on the **Name Mappings** pane, select the **Kerberos Names** tab.
294+
295+
1. Add a principal from the realm.
296+
297+
:::image type="content" source="media/register-scan-hdfs/map-security-identity.png" alt-text="Screenshot of the Security Identity Mapping pane.":::
298+
299+
**On the self-hosted integration runtime machine:**
300+
301+
* Run the following `Ksetup` commands to add a realm entry.
302+
303+
```cmd
304+
C:> Ksetup /addkdc REALM.COM <your_kdc_server_address>
305+
C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM
306+
```
307+
308+
**In your Purview account:**
309+
310+
* Configure a credential with the Kerberos authentication type, using your Kerberos principal name and password, to scan HDFS. For configuration details, see the credential settings in the [Scan section](#scan).
311+
312+
## Known limitations
313+
314+
Currently, the HDFS connector doesn't support custom resource set pattern rules for [advanced resource sets](concept-resource-sets.md#advanced-resource-sets); the built-in resource set patterns are applied instead.
315+
316+
[Sensitivity labels](create-sensitivity-label.md) aren't yet supported.
317+
318+
## Next steps
319+
320+
Now that you've registered your source, follow the guides below to learn more about Microsoft Purview and your data.
321+
322+
- [Search Data Catalog](how-to-search-catalog.md)
323+
- [Data Estate Insights in Microsoft Purview](concept-insights.md)

articles/purview/toc.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -197,6 +197,8 @@ items:
197197
href: register-scan-erwin-source.md
198198
- name: Google BigQuery
199199
href: register-scan-google-bigquery-source.md
200+
- name: HDFS
201+
href: register-scan-hdfs.md
200202
- name: Hive Metastore Database
201203
href: register-scan-hive-metastore-source.md
202204
- name: Looker
