|
| 1 | +--- |
| 2 | +title: Plan for network isolation |
| 3 | +titleSuffix: Azure Machine Learning |
| 4 | +description: Demystify Azure Machine Learning network isolation with recommendations and automation templates |
| 5 | +services: machine-learning |
| 6 | +ms.service: machine-learning |
| 7 | +ms.subservice: enterprise-readiness |
| 8 | +ms.reviewer: larryfr |
| 9 | +ms.author: jhirono |
| 10 | +author: jhirono |
| 11 | +ms.date: 02/14/2023 |
| 12 | +ms.topic: how-to |
| 13 | +ms.custom: |
| 14 | +--- |
| 15 | + |
| 16 | +# Plan for network isolation |
| 17 | + |
| 18 | +In this article, you learn how to plan network isolation for Azure Machine Learning and review our recommendations. This article is for IT administrators who want to design a network architecture. |
| 19 | + |
| 20 | +## Key considerations |
| 21 | + |
| 22 | +### Azure Machine Learning has both IaaS and PaaS resources |
| 23 | + |
| 24 | +Azure Machine Learning's network isolation involves both Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) components. PaaS services, such as the Azure Machine Learning workspace, storage, key vault, container registry, and monitor, can be isolated using Private Link. IaaS computing services, such as compute instances/clusters for AI model training, and Azure Kubernetes Service (AKS) or managed online endpoints for AI model scoring, can be injected into your virtual network and communicate with PaaS services using Private Link. The following diagram is an example of this architecture. |
| 25 | + |
| 26 | +:::image type="content" source="media/how-to-network-isolation-planning/iaas-paas-network-diagram.png" alt-text="Diagram of IaaS and PaaS components."::: |
| 27 | + |
| 28 | +In this diagram, the compute instances, compute clusters, and AKS clusters are located within your virtual network. They can access the Azure Machine Learning workspace or storage using a private endpoint. Instead of a private endpoint, you can use a service endpoint for Azure Storage and Azure Key Vault. The other services don't support service endpoints. |
| 29 | + |
| 30 | +### Required inbound and outbound configurations |
| 31 | + |
| 32 | +Azure Machine Learning has [several required inbound and outbound configurations](how-to-access-azureml-behind-firewall.md) with your virtual network. If you have a standalone virtual network, the configuration is straightforward using a network security group. However, you may have a hub-spoke or mesh network architecture, firewall, network virtual appliance, proxy, and user defined routing. In either case, make sure your network security components allow the required inbound and outbound traffic. |
| 33 | + |
| 34 | +:::image type="content" source="media/how-to-network-isolation-planning/hub-spoke-network-diagram.png" alt-text="Diagram of hub-spoke network with outbound through firewall."::: |
| 35 | + |
| 36 | +In this diagram, you have a hub and spoke network architecture. The spoke VNet has resources for Azure Machine Learning. The hub VNet has a firewall that controls internet outbound traffic from your virtual networks. In this case, your firewall must allow outbound traffic to the required resources, and your compute resources in the spoke VNet must be able to reach your firewall. |
| 37 | + |
| 38 | +> [!TIP] |
| 39 | +> In the diagram, the compute instance and compute cluster are configured for no public IP. If you instead use a compute instance or cluster __with public IP__, you need to allow inbound from the Azure Machine Learning service tag using a Network Security Group (NSG) and user defined routing to skip your firewall. This inbound traffic would be from a Microsoft service (Azure Machine Learning). However, we recommend using the no public IP option to remove this inbound requirement. |
| 40 | + |
| 41 | +### DNS resolution of private link resources and application on compute instance |
| 42 | + |
| 43 | +If you have your own DNS server hosted in Azure or on-premises, you need to create a conditional forwarder in your DNS server. The conditional forwarder sends DNS requests to Azure DNS for all private link enabled PaaS services. For more information, see the [DNS configuration scenarios](/azure/private-link/private-endpoint-dns#dns-configuration-scenarios) and [Azure Machine Learning specific DNS configuration](how-to-custom-dns.md) articles. |
| 44 | + |
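The conditional forwarding decision described above can be sketched in a few lines of Python. This is only an illustration of the routing logic, not a DNS server configuration. The zone names in `FORWARDED_ZONES`, the function name `pick_upstream`, and the `10.0.0.4` address in the usage note are assumptions made for this example; confirm the actual zones for your cloud and region in the linked DNS articles. `168.63.129.16` is the Azure-provided DNS address, which is reachable only from inside an Azure virtual network, so an on-premises DNS server would typically forward to a DNS server hosted in the VNet instead.

```python
# Azure-provided DNS virtual IP (reachable only from inside an Azure VNet).
AZURE_DNS_IP = "168.63.129.16"

# Illustrative zones only (assumption); take the real list from the
# Azure Machine Learning custom DNS article for your cloud and region.
FORWARDED_ZONES = ("api.azureml.ms", "notebooks.azure.net")


def pick_upstream(query_name, default_upstream, forwarded_zones=FORWARDED_ZONES):
    """Return the resolver a conditional forwarder would send this query to."""
    name = query_name.lower().rstrip(".")
    for zone in forwarded_zones:
        # Match the zone itself or any name underneath it, on a label boundary.
        if name == zone or name.endswith("." + zone):
            return AZURE_DNS_IP
    return default_upstream
```

For example, `pick_upstream("myws.workspace.eastus.api.azureml.ms", "10.0.0.4")` returns the Azure DNS address, while a query for `example.com` stays with the default upstream resolver.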
| 45 | +### Data exfiltration protection |
| 46 | + |
| 47 | +There are two types of outbound traffic: read-only and read/write. Malicious actors can't exploit read-only outbound to exfiltrate data, but they can exploit read/write outbound. Azure Storage and Azure Front Door (the `frontdoor.frontend` service tag) are read/write outbound in our case. |
| 48 | + |
| 49 | +You can mitigate this data exfiltration risk using [our data exfiltration prevention solution](how-to-prevent-data-loss-exfiltration.md). We use a service endpoint policy with an Azure Machine Learning alias to allow outbound to only Azure Machine Learning managed storage accounts. You don't need to open outbound to Storage on your firewall. |
| 50 | + |
| 51 | +:::image type="content" source="media/how-to-network-isolation-planning/data-exfiltration-protection-diagram.png" alt-text="Diagram of network with exfiltration protection configuration."::: |
| 52 | + |
| 53 | +In this diagram, the compute instance and cluster need to access Azure Machine Learning managed storage accounts to get set-up scripts. Instead of opening outbound access to storage, you can use a service endpoint policy with an Azure Machine Learning alias to allow storage access only to Azure Machine Learning storage accounts. |
| 54 | + |
| 55 | +The following tables list the required outbound [Azure Service Tags](/azure/virtual-network/service-tags-overview) and fully qualified domain names (FQDN) when you use the data exfiltration protection setting: |
| 56 | + |
| 57 | +| Outbound service tag | Protocol | Port | |
| 58 | +| ---- | ---- | ---- | |
| 59 | +| `AzureActiveDirectory` | TCP | 80, 443 | |
| 60 | +| `AzureResourceManager` | TCP | 443 | |
| 61 | +| `AzureMachineLearning` | UDP | 5831 | |
| 62 | +| `BatchNodeManagement` | TCP | 443 | |
| 63 | + |
| 64 | +| Outbound FQDN | Protocol | Port | |
| 65 | +| ---- | ---- | ---- | |
| 66 | +| `mcr.microsoft.com` | TCP | 443 | |
| 67 | +| `*.data.mcr.microsoft.com` | TCP | 443 | |
| 68 | +| `ml.azure.com` | TCP | 443 | |
| 69 | +| `automlresources-prod.azureedge.net` | TCP | 443 | |
| 70 | + |
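The two tables above can also be kept as data, which makes it easy to review a firewall allow list in code. The following is a minimal sketch, assuming you manage the allow list yourself. The names `SERVICE_TAG_RULES`, `FQDN_RULES`, and `is_fqdn_allowed` are made up for this example, and the host names in the usage note are illustrative. The code doesn't call any Azure API, and it doesn't resolve service tags to IP ranges.

```python
from fnmatch import fnmatchcase

# Rows copied from the tables above: (destination, protocol, allowed ports).
SERVICE_TAG_RULES = [
    ("AzureActiveDirectory", "TCP", {80, 443}),
    ("AzureResourceManager", "TCP", {443}),
    ("AzureMachineLearning", "UDP", {5831}),
    ("BatchNodeManagement", "TCP", {443}),
]
FQDN_RULES = [
    ("mcr.microsoft.com", "TCP", {443}),
    ("*.data.mcr.microsoft.com", "TCP", {443}),
    ("ml.azure.com", "TCP", {443}),
    ("automlresources-prod.azureedge.net", "TCP", {443}),
]


def is_fqdn_allowed(host, protocol, port):
    """Check an outbound destination against the FQDN table.

    The '*' wildcard is matched with fnmatchcase, so it also matches
    names that contain more than one extra label.
    """
    host = host.lower().rstrip(".")
    return any(
        fnmatchcase(host, pattern) and protocol.upper() == proto and port in ports
        for pattern, proto, ports in FQDN_RULES
    )
```

With these rules, `is_fqdn_allowed("eastus.data.mcr.microsoft.com", "TCP", 443)` is allowed, while `is_fqdn_allowed("example.com", "TCP", 443)` and `is_fqdn_allowed("mcr.microsoft.com", "TCP", 80)` are not.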
| 71 | +### Managed online endpoint |
| 72 | + |
| 73 | +An Azure Machine Learning managed online endpoint uses an Azure Machine Learning managed VNet instead of your VNet. If you want to disallow public access to your endpoint, set the `public_network_access` flag to disabled. When this flag is disabled, your endpoint can be accessed via the private endpoint of your workspace, and it can't be reached from public networks. If you want to use a private storage account for your deployment, set the `egress_public_network_access` flag to disabled. This setting automatically creates private endpoints to access your private resources. |
| 74 | + |
| 75 | +> [!TIP] |
| 76 | +> The workspace default storage account is the only private storage account supported by managed online endpoint. |
| 77 | + |
| 78 | +:::image type="content" source="media/how-to-secure-online-endpoint/endpoint-network-isolation-ingress-egress.png" alt-text="Diagram of managed online endpoint configuration in a VNet."::: |
| 79 | + |
| 80 | +For more information, see the [Network isolation of managed online endpoints](how-to-secure-online-endpoint.md) article. |
| 81 | + |
| 82 | +### Private IP address shortage in your main network |
| 83 | + |
| 84 | +Azure Machine Learning requires private IPs: one IP per compute instance, compute cluster node, and private endpoint. You also need many IPs if you use AKS. Your hub-spoke network connected with your on-premises network might not have a large enough private IP address space. In this scenario, you can use isolated, not-peered VNets for your Azure Machine Learning resources. |
| 85 | + |
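To check whether a subnet is large enough, you can apply the counting rule above directly. The following sketch uses only the Python standard library. The function names and the example numbers in the usage note are assumptions for illustration. AKS is left out because its IP consumption depends on the cluster's network configuration. The constant reflects the five addresses that Azure reserves in every subnet.

```python
import ipaddress

# Azure reserves five addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def required_ips(compute_instances, max_cluster_nodes, private_endpoints):
    """One private IP per compute instance, cluster node, and private endpoint."""
    return compute_instances + max_cluster_nodes + private_endpoints


def usable_ips(cidr):
    """Addresses left in a subnet after Azure's reserved addresses."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


def fits(cidr, needed):
    """True if the subnet can hold the requested number of private IPs."""
    return needed <= usable_ips(cidr)
```

For example, 10 compute instances, a cluster that scales to 40 nodes, and 6 private endpoints need 56 addresses. That doesn't fit in a `/27` subnet (27 usable addresses), but it fits in a `/24` subnet (251 usable addresses).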
| 86 | +:::image type="content" source="media/how-to-network-isolation-planning/isolated-not-peered-network-diagram.png" alt-text="Diagram of networks connected by private endpoints instead of peering."::: |
| 87 | + |
| 88 | +In this diagram, your main VNet requires the IPs for private endpoints. You can have hub-spoke VNets for multiple Azure Machine Learning workspaces with large address spaces. A downside of this architecture is that it doubles the number of private endpoints. |
| 89 | + |
| 90 | +### Network policy enforcement |
| 91 | +You can use [built-in policies](how-to-integrate-azure-policy.md) if you want to control network isolation parameters while allowing self-service creation of workspaces and computing resources. |
| 92 | + |
| 93 | +### Other considerations |
| 94 | + |
| 95 | +#### Image build compute setting for ACR behind VNet |
| 96 | + |
| 97 | +If you put your Azure container registry (ACR) behind your private endpoint, your ACR can't build your Docker images. You need to use a compute instance or compute cluster to build images. For more information, see the [how to set image build compute](how-to-secure-workspace-vnet.md#enable-azure-container-registry-acr) article. |
| 98 | + |
| 99 | +#### Enablement of studio UI with private link enabled workspace |
| 100 | + |
| 101 | +If you plan on using the Azure Machine Learning studio, extra configuration steps are needed. These steps prevent data exfiltration scenarios. For more information, see the [how to use Azure Machine Learning studio in an Azure virtual network](how-to-enable-studio-virtual-network.md) article. |
| 102 | + |
| 103 | +<!-- ### Registry --> |
| 104 | + |
| 105 | +## Recommended architecture |
| 106 | + |
| 107 | +The following diagram is our recommended architecture to make all resources private but allow outbound internet access from your VNet. This diagram describes the following architecture: |
| 108 | +* Put all resources in the same region. |
| 109 | +* A hub VNet, which contains your firewall. |
| 110 | +* A spoke VNet, which contains the following resources: |
| 111 | + * A training subnet contains compute instances and clusters used for training ML models. These resources are configured for no public IP. |
| 112 | + * A scoring subnet contains an AKS cluster. |
| 113 | + * A 'pe' subnet contains private endpoints that connect to the workspace and private resources used by the workspace (storage, key vault, container registry, etc.) |
| 114 | +* Managed online endpoints use the private endpoint of the workspace to process incoming requests. A private endpoint is also used to allow managed online endpoint deployments to access private storage. |
| 115 | + |
| 116 | +This architecture balances your network security and your ML engineers' productivity. |
| 117 | + |
| 118 | +:::image type="content" source="media/how-to-network-isolation-planning/recommended-network-diagram.png" alt-text="Diagram of the recommended network architecture."::: |
| 119 | + |
| 120 | +You can automate the creation of this environment using [a template](tutorial-create-secure-workspace-template.md) without managed online endpoint or AKS. Managed online endpoint is the solution if you don't have an existing AKS cluster for your AI model scoring. For more information, see the [how to secure online endpoint](how-to-secure-online-endpoint.md) documentation. AKS with the Azure Machine Learning extension is the solution if you have an existing AKS cluster for your AI model scoring. For more information, see the [how to attach kubernetes](how-to-attach-kubernetes-anywhere.md) documentation. |
| 121 | + |
| 122 | +### Removing firewall requirement |
| 123 | + |
| 124 | +If you want to remove the firewall requirement, you can use network security groups and [Azure virtual network NAT](/azure/virtual-network/nat-gateway/nat-overview) to allow internet outbound from your private computing resources. |
| 125 | + |
| 126 | +:::image type="content" source="media/how-to-network-isolation-planning/recommended-network-diagram-no-firewall.png" alt-text="Diagram of the recommended network architecture without a firewall."::: |
| 127 | + |
| 128 | +### Using public workspace |
| 129 | + |
| 130 | +You can use a public workspace if you're OK with Azure AD authentication and authorization with conditional access. However, a public workspace has some features that show data in your private storage account, so we recommend using a private workspace. |
| 131 | + |
| 132 | +## Recommended architecture with data exfiltration prevention |
| 133 | + |
| 134 | +This diagram shows the recommended architecture to make all resources private and control outbound destinations to prevent data exfiltration. We recommend this architecture when using Azure Machine Learning with your sensitive data in production. This diagram describes the following architecture: |
| 135 | +* Put all resources in the same region. |
| 136 | +* A hub VNet, which contains your firewall. |
| 137 | + * In addition to service tags, the firewall uses FQDNs to prevent data exfiltration. |
| 138 | +* A spoke VNet, which contains the following resources: |
| 139 | + * A training subnet contains compute instances and clusters used for training ML models. These resources are configured for no public IP. Additionally, a service endpoint and service endpoint policy are in place to prevent data exfiltration. |
| 140 | + * A scoring subnet contains an AKS cluster. |
| 141 | + * A 'pe' subnet contains private endpoints that connect to the workspace and private resources used by the workspace (storage, key vault, container registry, etc.) |
| 142 | +* Managed online endpoints use the private endpoint of the workspace to process incoming requests. A private endpoint is also used to allow managed online endpoint deployments to access private storage. |
| 143 | + |
| 144 | +:::image type="content" source="media/how-to-network-isolation-planning/recommended-network-data-exfiltration.png" alt-text="Diagram of recommended network with data exfiltration protection configuration."::: |
| 145 | + |
| 146 | +The following tables list the required outbound [Azure Service Tags](/azure/virtual-network/service-tags-overview) and fully qualified domain names (FQDN) when you use the data exfiltration protection setting: |
| 147 | + |
| 148 | +| Outbound service tag | Protocol | Port | |
| 149 | +| ---- | ----- | ---- | |
| 150 | +| `AzureActiveDirectory` | TCP | 80, 443 | |
| 151 | +| `AzureResourceManager` | TCP | 443 | |
| 152 | +| `AzureMachineLearning` | UDP | 5831 | |
| 153 | +| `BatchNodeManagement` | TCP | 443 | |
| 154 | + |
| 155 | +| Outbound FQDN | Protocol | Port | |
| 156 | +| ---- | ---- | ---- | |
| 157 | +| `mcr.microsoft.com` | TCP | 443 | |
| 158 | +| `*.data.mcr.microsoft.com` | TCP | 443 | |
| 159 | +| `ml.azure.com` | TCP | 443 | |
| 160 | +| `automlresources-prod.azureedge.net` | TCP | 443 | |
| 161 | + |
| 162 | +### Using public workspace |
| 163 | + |
| 164 | +You can use a public workspace if you're OK with Azure AD authentication and authorization with conditional access. However, a public workspace has some features that show data in your private storage account, so we recommend using a private workspace. |
| 165 | + |
| 166 | +## Next steps |
| 167 | + |
| 168 | +* [Virtual network overview](how-to-network-security-overview.md) |
| 169 | +* [Secure the workspace resources](how-to-secure-workspace-vnet.md) |
| 170 | +* [Secure the training environment](how-to-secure-training-vnet.md) |
| 171 | +* [Secure the inference environment](how-to-secure-inferencing-vnet.md) |
| 172 | +* [Enable studio functionality](how-to-enable-studio-virtual-network.md) |
| 173 | +* [Configure inbound and outbound network traffic](how-to-access-azureml-behind-firewall.md) |
| 174 | +* [Use custom DNS](how-to-custom-dns.md) |