|
| 1 | +--- |
| 2 | +title: Connect to and manage HDFS |
| 3 | +description: This guide describes how to connect to HDFS in Microsoft Purview, and use Microsoft Purview's features to scan and manage your HDFS source. |
| 4 | +author: linda33wj |
| 5 | +ms.author: jingwang |
| 6 | +ms.service: purview |
| 7 | +ms.subservice: purview-data-map |
| 8 | +ms.topic: how-to #Required; leave this attribute/value as-is. |
| 9 | +ms.date: 08/03/2022 |
| 10 | +ms.custom: template-how-to #Required; leave this attribute/value as-is. |
| 11 | +--- |
| 12 | + |
| 13 | +# Connect to and manage HDFS in Microsoft Purview |
| 14 | + |
| 15 | +This article outlines how to register Hadoop Distributed File System (HDFS), and how to authenticate and interact with HDFS in Microsoft Purview. For more information about Microsoft Purview, read the [introductory article](overview.md). |
| 16 | + |
| 17 | +## Supported capabilities |
| 18 | + |
| 19 | +|**Metadata Extraction**|**Full Scan**|**Incremental Scan**|**Scoped Scan**|**Classification**|**Access Policy**|**Lineage**|**Data Sharing**| |
| 20 | +|---|---|---|---|---|---|---|---| |
| 21 | +| [Yes](#register)| [Yes](#scan)| [Yes](#scan) | [Yes](#scan) | [Yes](#scan) | No| No | No| |
| 22 | + |
| 23 | +When scanning an HDFS source, Microsoft Purview supports extracting the following HDFS technical metadata: |
| 24 | + |
| 25 | +- Namenode |
| 26 | +- Folder |
| 27 | +- File |
| 28 | +- Resource set |
| 29 | + |
| 30 | +When setting up a scan, you can choose to scan the entire HDFS or selected folders. Learn about the supported file formats [here](microsoft-purview-connector-overview.md#file-types-supported-for-scanning). |
| 31 | + |
| 32 | +## Prerequisites |
| 33 | + |
| 34 | +- An Azure account with an active subscription. [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F). |
| 35 | +- An active [Microsoft Purview account](create-catalog-portal.md). |
| 36 | +- You need Data Source Administrator and Data Reader permissions to register a source and manage it in the Microsoft Purview governance portal. For more information about permissions, see [Access control in Microsoft Purview](catalog-permissions.md). |
| 37 | +- Set up the latest [self-hosted integration runtime](https://www.microsoft.com/download/details.aspx?id=39717). For more information, see [the create and configure a self-hosted integration runtime guide](manage-integration-runtimes.md). The minimal supported Self-hosted Integration Runtime version is 5.20.8235.2. |
| 38 | + |
| 39 | + * Ensure Visual C++ Redistributable for Visual Studio 2012 Update 4 is installed on the self-hosted integration runtime machine. If you don't have this update installed, [you can download it here](https://www.microsoft.com/download/details.aspx?id=30679). |
| 40 | + * Ensure JRE or OpenJDK is installed on the self-hosted integration runtime machine for parsing Parquet and ORC files. Learn more from [here](manage-integration-runtimes.md#java-runtime-environment-installation). |
| 41 | + * To set up your environment to enable Kerberos authentication, see the [Use Kerberos authentication for the HDFS connector](#use-kerberos-authentication-for-the-hdfs-connector) section. |
| 42 | + |
| 43 | +## Register |
| 44 | + |
| 45 | +This section describes how to register HDFS in Microsoft Purview using the [Microsoft Purview governance portal](https://web.purview.azure.com/). |
| 46 | + |
| 47 | +### Steps to register |
| 48 | + |
| 49 | +To register a new HDFS source in your data catalog, follow these steps: |
| 50 | + |
| 51 | +1. Navigate to your Microsoft Purview account in the [Microsoft Purview governance portal](https://web.purview.azure.com/resource/). |
| 52 | +1. Select **Data Map** on the left navigation. |
| 53 | +1. Select **Register**. |
| 54 | +1. On Register sources, select **HDFS**. Select **Continue**. |
| 55 | + |
| 56 | +On the **Register sources (HDFS)** screen, follow these steps: |
| 57 | + |
| 58 | +1. Enter a **Name** under which the data source will be listed in the catalog. |
| 59 | + |
| 60 | +1. Enter the **Cluster URL** of the HDFS NameNode in the form of `https://<namenode>:<port>` or `http://<namenode>:<port>`, e.g. `https://namenodeserver.com:50470` or `http://namenodeserver.com:50070`. |
| 61 | + |
| 62 | +1. Select a collection or create a new one (Optional) |
| 63 | + |
| 64 | +1. Finish registering the data source. |
| 65 | + |
| 66 | + :::image type="content" source="media/register-scan-hdfs/register-sources.png" alt-text="Screenshot of HDFS source registration in Purview." border="true"::: |
| 67 | + |
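The **Cluster URL** format described in the registration steps can be checked before you paste it into the portal. The following is a minimal sketch, not part of Microsoft Purview: the function name `check_cluster_url` is hypothetical, and it only checks the `http(s)://<namenode>:<port>` shape given above, not whether the NameNode is reachable.

```python
from urllib.parse import urlsplit


def check_cluster_url(url):
    """Return (scheme, host, port) if url looks like http(s)://<namenode>:<port>.

    Hypothetical helper for illustration only; raises ValueError otherwise.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("Cluster URL must start with http:// or https://")
    if not parts.hostname:
        raise ValueError("Cluster URL must include the NameNode host name")
    # Accessing .port raises ValueError on a non-numeric or out-of-range port.
    port = parts.port
    if port is None:
        raise ValueError("Cluster URL must include an explicit port")
    return parts.scheme, parts.hostname, port


if __name__ == "__main__":
    # Example values taken from the registration step above.
    print(check_cluster_url("https://namenodeserver.com:50470"))
    print(check_cluster_url("http://namenodeserver.com:50070"))
```

A URL without a scheme or without a port raises `ValueError`, which helps catch the most common copy-and-paste mistakes before registration.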
| 68 | +## Scan |
| 69 | + |
| 70 | +Follow the steps below to scan HDFS to automatically identify assets. For more information about scanning in general, see our [introduction to scans and ingestion](concept-scans-and-ingestion.md). |
| 71 | + |
| 72 | +### Authentication for a scan |
| 73 | + |
| 74 | +The supported authentication type for an HDFS source is **Kerberos authentication**. |
| 75 | + |
| 76 | +### Create and run scan |
| 77 | + |
| 78 | +To create and run a new scan, follow these steps: |
| 79 | + |
| 80 | +1. Make sure a self-hosted integration runtime is set up. If it isn't set up, use the steps mentioned [here](./manage-integration-runtimes.md) to create a self-hosted integration runtime. |
| 81 | + |
| 82 | +1. Navigate to **Sources**. |
| 83 | + |
| 84 | +1. Select the registered HDFS source. |
| 85 | + |
| 86 | +1. Select **+ New scan**. |
| 87 | + |
| 88 | +1. On the "**Scan *source_name***" page, provide the following details: |
| 89 | + |
| 90 | + 1. **Name**: The name of the scan |
| 91 | + |
| 92 | + 1. **Connect via integration runtime**: Select the configured self-hosted integration runtime. See setup requirements in [Prerequisites](#prerequisites) section. |
| 93 | + |
| 94 | + 1. **Credential**: Select the credential to connect to your data source. Make sure to: |
| 95 | + * Select **Kerberos Authentication** while creating a credential. |
| 96 | + * Provide the user name in the format of `<username>@<domain>.com` in the User name input field. Learn more from [Use Kerberos authentication for the HDFS connector](#use-kerberos-authentication-for-the-hdfs-connector). |
| 97 | + * Store the user password used to connect to HDFS in the secret key. |
| 98 | + |
| 99 | + :::image type="content" source="media/register-scan-hdfs/scan.png" alt-text="Screenshot of HDFS scan configurations in Purview." border="true"::: |
| 100 | + |
| 101 | +1. Select **Test connection**. |
| 102 | + |
| 103 | +1. Select **Continue**. |
| 104 | + |
| 105 | +1. On "**Scope your scan**" page, select the path(s) that you want to scan. |
| 106 | + |
| 107 | +1. On "**Select a scan rule set**" page, select the scan rule set you want to use for schema extraction and classification. You can choose between the system default, existing custom rule sets, or create a new rule set inline. Learn more from [Create a scan rule set](create-a-scan-rule-set.md). |
| 108 | + |
| 109 | +1. On the "**Set a scan trigger**" page, choose your **scan trigger**. You can set up a schedule or run the scan once. |
| 110 | + |
| 111 | +1. Review your scan and select **Save and Run**. |
| 112 | + |
| 113 | +[!INCLUDE [create and manage scans](includes/view-and-manage-scans.md)] |
| 114 | + |
| 115 | +## Use Kerberos authentication for the HDFS connector |
| 116 | + |
| 117 | +There are two options for setting up the on-premises environment to use Kerberos authentication for the HDFS connector. You can choose the one that better fits your situation. |
| 118 | +* Option 1: [Join a self-hosted integration runtime machine in the Kerberos realm](#kerberos-join-realm) |
| 119 | +* Option 2: [Enable mutual trust between the Windows domain and the Kerberos realm](#kerberos-mutual-trust) |
| 120 | + |
| 121 | +For either option, make sure you turn on WebHDFS for the Hadoop cluster: |
| 122 | + |
| 123 | +1. Create the HTTP principal and keytab for webhdfs. |
| 124 | + |
| 125 | + > [!IMPORTANT] |
| 126 | + > The HTTP Kerberos principal must start with "**HTTP/**" according to Kerberos HTTP SPNEGO specification. Learn more from [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#HDFS_Configuration_Options). |
| 127 | +
|
| 128 | + ```bash |
| 129 | + Kadmin> addprinc -randkey HTTP/<namenode hostname>@<REALM.COM> |
| 130 | + Kadmin> ktadd -k /etc/security/keytab/spnego.service.keytab HTTP/<namenode hostname>@<REALM.COM> |
| 131 | + ``` |
| 132 | + |
| 133 | +2. HDFS configuration options: add the following three properties in `hdfs-site.xml`. |
| 134 | + ```xml |
| 135 | + <property> |
| 136 | + <name>dfs.webhdfs.enabled</name> |
| 137 | + <value>true</value> |
| 138 | + </property> |
| 139 | + <property> |
| 140 | + <name>dfs.web.authentication.kerberos.principal</name> |
| 141 | + <value>HTTP/_HOST@<REALM.COM></value> |
| 142 | + </property> |
| 143 | + <property> |
| 144 | + <name>dfs.web.authentication.kerberos.keytab</name> |
| 145 | + <value>/etc/security/keytab/spnego.service.keytab</value> |
| 146 | + </property> |
| 147 | + ``` |
| 148 | + |
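After WebHDFS is enabled, you may want to confirm that the endpoint answers before you create a scan. The sketch below is an assumption-laden illustration rather than a Purview step: it builds the standard WebHDFS `LISTSTATUS` URL from a cluster URL and prints a `curl` command that uses SPNEGO (`--negotiate -u :`). The helper name `webhdfs_liststatus_url` is hypothetical, and the command assumes it is run from a machine that already holds a valid Kerberos ticket and has a `curl` build with SPNEGO support.

```python
from urllib.parse import quote


def webhdfs_liststatus_url(cluster_url, hdfs_path="/"):
    """Build a WebHDFS LISTSTATUS URL for the given cluster URL and HDFS path.

    Hypothetical helper for illustration; it performs no network access.
    """
    base = cluster_url.rstrip("/")
    # Normalize the path so that it always starts with exactly one slash.
    path = "/" + hdfs_path.strip("/")
    return f"{base}/webhdfs/v1{quote(path)}?op=LISTSTATUS"


if __name__ == "__main__":
    url = webhdfs_liststatus_url("http://namenodeserver.com:50070", "/")
    # Run the printed command on a host with a Kerberos ticket (assumption).
    print(f'curl -i --negotiate -u : "{url}"')
```

If the command returns a JSON `FileStatuses` listing, WebHDFS and SPNEGO are likely configured; an HTTP 401 response usually points to a missing ticket or a principal or keytab problem.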
| 149 | +### <a name="kerberos-join-realm"></a>Option 1: Join a self-hosted integration runtime machine in the Kerberos realm |
| 150 | + |
| 151 | +#### Requirements |
| 152 | + |
| 153 | +* The self-hosted integration runtime machine needs to join the Kerberos realm and can’t join any Windows domain. |
| 154 | + |
| 155 | +#### How to configure |
| 156 | + |
| 157 | +**On the KDC server:** |
| 158 | + |
| 159 | +Create a principal, and specify the password. |
| 160 | + |
| 161 | +> [!IMPORTANT] |
| 162 | +> The username should not contain the hostname. |
| 163 | + |
| 164 | +```bash |
| 165 | +Kadmin> addprinc <username>@<REALM.COM> |
| 166 | +``` |
| 167 | + |
| 168 | +**On the self-hosted integration runtime machine:** |
| 169 | + |
| 170 | +1. Run the Ksetup utility to configure the Kerberos Key Distribution Center (KDC) server and realm. |
| 171 | + |
| 172 | + The machine must be configured as a member of a workgroup, because a Kerberos realm is different from a Windows domain. You can achieve this configuration by setting the Kerberos realm and adding a KDC server by running the following commands. Replace *REALM.COM* with your own realm name. |
| 173 | + |
| 174 | + ```cmd |
| 175 | + C:> Ksetup /setdomain REALM.COM |
| 176 | + C:> Ksetup /addkdc REALM.COM <your_kdc_server_address> |
| 177 | + ``` |
| 178 | + |
| 179 | + After you run these commands, restart the machine. |
| 180 | + |
| 181 | +2. Verify the configuration with the `Ksetup` command. The output should be like: |
| 182 | + |
| 183 | + ```cmd |
| 184 | + C:> Ksetup |
| 185 | + default realm = REALM.COM (external) |
| 186 | + REALM.com: |
| 187 | + kdc = <your_kdc_server_address> |
| 188 | + ``` |
| 189 | + |
| 190 | +**In your Purview account:** |
| 191 | + |
| 192 | +* Configure a credential with Kerberos authentication type with your Kerberos principal name and password to scan the HDFS. For configuration details, check the credential setting part in [Scan section](#scan). |
| 193 | + |
| 194 | +### <a name="kerberos-mutual-trust"></a>Option 2: Enable mutual trust between the Windows domain and the Kerberos realm |
| 195 | + |
| 196 | +#### Requirements |
| 197 | + |
| 198 | +* The self-hosted integration runtime machine must join a Windows domain. |
| 199 | +* You need permission to update the domain controller's settings. |
| 200 | +
|
| 201 | +#### How to configure |
| 202 | +
|
| 203 | +> [!NOTE] |
| 204 | +> Replace REALM.COM and AD.COM in the following tutorial with your own realm name and domain controller. |
| 205 | +
|
| 206 | +**On the KDC server:** |
| 207 | +
|
| 208 | +1. Edit the KDC configuration in the *krb5.conf* file to let KDC trust the Windows domain by referring to the following configuration template. By default, the configuration is located at */etc/krb5.conf*. |
| 209 | +
|
| 210 | + ```config |
| 211 | + [logging] |
| 212 | + default = FILE:/var/log/krb5libs.log |
| 213 | + kdc = FILE:/var/log/krb5kdc.log |
| 214 | + admin_server = FILE:/var/log/kadmind.log |
| 215 | + |
| 216 | + [libdefaults] |
| 217 | + default_realm = REALM.COM |
| 218 | + dns_lookup_realm = false |
| 219 | + dns_lookup_kdc = false |
| 220 | + ticket_lifetime = 24h |
| 221 | + renew_lifetime = 7d |
| 222 | + forwardable = true |
| 223 | + |
| 224 | + [realms] |
| 225 | + REALM.COM = { |
| 226 | + kdc = node.REALM.COM |
| 227 | + admin_server = node.REALM.COM |
| 228 | + } |
| 229 | + AD.COM = { |
| 230 | + kdc = windc.ad.com |
| 231 | + admin_server = windc.ad.com |
| 232 | + } |
| 233 | + |
| 234 | + [domain_realm] |
| 235 | + .REALM.COM = REALM.COM |
| 236 | + REALM.COM = REALM.COM |
| 237 | + .ad.com = AD.COM |
| 238 | + ad.com = AD.COM |
| 239 | + |
| 240 | + [capaths] |
| 241 | + AD.COM = { |
| 242 | + REALM.COM = . |
| 243 | + } |
| 244 | + ``` |
| 245 | +
|
| 246 | + After you configure the file, restart the KDC service. |
| 247 | +
|
| 248 | +2. Prepare a principal named *krbtgt/REALM.COM\@AD.COM* in the KDC server with the following command: |
| 249 | +
|
| 250 | + ```cmd |
| 251 | + Kadmin> addprinc krbtgt/REALM.COM@AD.COM |
| 252 | + ``` |
| 253 | +
|
| 254 | +3. In the *hadoop.security.auth_to_local* HDFS service configuration file, add `RULE:[1:$1@$0](.*\@AD.COM)s/\@.*//`. |
| 255 | +
|
| 256 | +**On the domain controller:** |
| 257 | +
|
| 258 | +1. Run the following `Ksetup` commands to add a realm entry: |
| 259 | +
|
| 260 | + ```cmd |
| 261 | + C:> Ksetup /addkdc REALM.COM <your_kdc_server_address> |
| 262 | + C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM |
| 263 | + ``` |
| 264 | +
|
| 265 | +2. Establish trust from the Windows domain to the Kerberos realm. [password] is the password for the principal *krbtgt/REALM.COM\@AD.COM*. |
| 266 | +
|
| 267 | + ```cmd |
| 268 | + C:> netdom trust REALM.COM /Domain:AD.COM /add /realm /password:[password] |
| 269 | + ``` |
| 270 | +
|
| 271 | +3. Select the encryption algorithm that's used in Kerberos. |
| 272 | + |
| 273 | + 1. Select **Server Manager** > **Group Policy Management** > **Domain** > **Group Policy Objects** > **Default or Active Domain Policy**, and then select **Edit**. |
| 274 | + |
| 275 | + 1. On the **Group Policy Management Editor** pane, select **Computer Configuration** > **Policies** > **Windows Settings** > **Security Settings** > **Local Policies** > **Security Options**, and then configure **Network security: Configure Encryption types allowed for Kerberos**. |
| 276 | + |
| 277 | + 1. Select the encryption algorithm you want to use when you connect to the KDC server. You can select all the options. |
| 278 | + |
| 279 | + :::image type="content" source="media/register-scan-hdfs/config-encryption-types-for-kerberos.png" alt-text="Screenshot of the Network security: Configure encryption types allowed for Kerberos pane."::: |
| 280 | + |
| 281 | + 1. Use the `Ksetup` command to specify the encryption algorithm to be used on the specified realm. |
| 282 | + |
| 283 | + ```cmd |
| 284 | + C:> ksetup /SetEncTypeAttr REALM.COM DES-CBC-CRC DES-CBC-MD5 RC4-HMAC-MD5 AES128-CTS-HMAC-SHA1-96 AES256-CTS-HMAC-SHA1-96 |
| 285 | + ``` |
| 286 | + |
| 287 | +4. Create the mapping between the domain account and the Kerberos principal, so that you can use the Kerberos principal in the Windows domain. |
| 288 | + |
| 289 | + 1. Select **Administrative tools** > **Active Directory Users and Computers**. |
| 290 | + |
| 291 | + 1. Configure advanced features by selecting **View** > **Advanced Features**. |
| 292 | + |
| 293 | + 1. On the **Advanced Features** pane, right-click the account to which you want to create mappings and, on the **Name Mappings** pane, select the **Kerberos Names** tab. |
| 294 | + |
| 295 | + 1. Add a principal from the realm. |
| 296 | + |
| 297 | + :::image type="content" source="media/register-scan-hdfs/map-security-identity.png" alt-text="Screenshot of the Security Identity Mapping pane."::: |
| 298 | + |
| 299 | +**On the self-hosted integration runtime machine:** |
| 300 | + |
| 301 | +* Run the following `Ksetup` commands to add a realm entry. |
| 302 | + |
| 303 | + ```cmd |
| 304 | + C:> Ksetup /addkdc REALM.COM <your_kdc_server_address> |
| 305 | + C:> ksetup /addhosttorealmmap HDFS-service-FQDN REALM.COM |
| 306 | + ``` |
| 307 | + |
| 308 | +**In your Purview account:** |
| 309 | + |
| 310 | +* Configure a credential with Kerberos authentication type with your Kerberos principal name and password to scan the HDFS. For configuration details, check the credential setting part in [Scan section](#scan). |
| 311 | + |
| 312 | +## Known limitations |
| 313 | + |
| 314 | +Currently, the HDFS connector doesn't support custom resource set pattern rules for [advanced resource sets](concept-resource-sets.md#advanced-resource-sets); the built-in resource set patterns are applied instead. |
| 315 | +
|
| 316 | +[Sensitivity labels](create-sensitivity-label.md) aren't yet supported. |
| 317 | +
|
| 318 | +## Next steps |
| 319 | +
|
| 320 | +Now that you've registered your source, follow the below guides to learn more about Microsoft Purview and your data. |
| 321 | + |
| 322 | +- [Search Data Catalog](how-to-search-catalog.md) |
| 323 | +- [Data Estate Insights in Microsoft Purview](concept-insights.md) |