Add kubenet namenode topology plugin (#11)

kimoonkim · web-flow · commit 66a2138540c9 · 2017-06-21T14:07:39.000-07:00
* Add a namenode topology plugin for kubenet fixing data locality

* Clean up README

* Address review comments

* Address review comments

* Remove extra empty lines
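
The README added by this change explains that whether the namenode sees a pod IP or a node IP depends on whether the destination address falls inside `--non-masquerade-cidr`. The sketch below only illustrates that membership check with the addresses quoted in the README. It is not part of the plugin. The class and method names are hypothetical, and treating `10.128.0.4` (the datanode address from the README's sample log) as a GKE node IP is an assumption.

```java
// Illustrative sketch only; NonMasqueradeCheck and its methods are hypothetical names.
class NonMasqueradeCheck {

  // Parse a dotted-quad IPv4 address into a 32-bit integer.
  static int toInt(String ip) {
    String[] parts = ip.split("\\.");
    if (parts.length != 4) {
      throw new IllegalArgumentException("Not an IPv4 address: " + ip);
    }
    int value = 0;
    for (String part : parts) {
      int octet = Integer.parseInt(part);
      if (octet < 0 || octet > 255) {
        throw new IllegalArgumentException("Bad octet in: " + ip);
      }
      value = (value << 8) | octet;
    }
    return value;
  }

  // Return true if the address lies inside the CIDR block, e.g. "10.0.0.0/8".
  static boolean inCidr(String ip, String cidr) {
    String[] parts = cidr.split("/");
    int prefix = Integer.parseInt(parts[1]);
    // A shift by 32 is a no-op on Java ints, so handle /0 explicitly.
    int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
    return (toInt(ip) & mask) == (toInt(parts[0]) & mask);
  }

  // Pod traffic is masqueraded only when the destination is outside the CIDR.
  static boolean isMasqueraded(String destinationIp, String nonMasqueradeCidr) {
    return !inCidr(destinationIp, nonMasqueradeCidr);
  }

  public static void main(String[] args) {
    // kops on EC2: the node IP is outside 100.64.0.0/10, so the namenode sees a node IP.
    System.out.println(isMasqueraded("172.20.47.241", "100.64.0.0/10"));
    // GKE default: the node IP is inside 10.0.0.0/8, so the namenode sees the pod IP.
    System.out.println(isMasqueraded("10.128.0.4", "10.0.0.0/8"));
  }
}
```

In the second case the packet is not masqueraded, which is the situation the topology plugin is meant to handle.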
diff --git a/topology/README.md b/topology/README.md
@@ -0,0 +1,103 @@
+HDFS namenode topology plugins for various Kubernetes network providers.
+
+The HDFS namenode handles RPC requests from clients. The namenode often gets
+the IP addresses of clients from the remote endpoints of RPC connections.
+In Kubernetes, HDFS clients may run inside pods, so the client IP addresses can
+be virtual pod IP addresses. This can confuse the namenode when it runs
+the data locality optimization code, which compares client IP addresses
+against the IP addresses associated with datanodes. The latter are physical
+IP addresses of the cluster nodes that the datanodes are running on, so
+client pod virtual IP addresses would not match any datanode IP address.
+
+We can configure the namenode with the topology plugins in this directory to
+correct the namenode data locality code. So far, we have found that only
+Google Container Engine (GKE) suffers from the data locality issue caused
+by virtual pod IP addresses being exposed to the namenode. (See below.)
+GKE uses the native `kubenet` network provider.
+
+  - TODO: Currently, there is no easy way to launch the namenode helm chart
+    with topology plugins configured. Build a new Docker image with
+    topology plugins and support the configuration. See each plugin's README
+    for installation/configuration instructions.
+
+Many K8s network providers do not need any topology plugins. Most K8s network
+providers conduct IP masquerading or Network Address Translation (NAT) when pod
+packets head outside the pod IP subnet. They rewrite the headers of pod packets,
+putting in the physical IP addresses of the cluster nodes that pods are running on.
+The namenode and datanodes use `hostNetwork`, so their IP addresses are outside
+the pod IP subnet. As a result, the namenode will see the physical cluster node
+IP address on client RPC connections originating from pods. Data locality
+will work fine with these providers.
+
+Here is the list of network providers that conduct NAT:
+
+  - By design, overlay networks such as weave and flannel conduct NAT for any
+    pod packet heading outside a local pod network. This means packets heading to
+    a node IP also go through NAT. (In an overlay, pod packets heading to another
+    pod on a different node get their pod IPs put back once they are inside the
+    destination node.)
+  - Calico is a popular non-overlay network provider. Calico can also be
+    configured to do NAT between the pod subnet and the node subnet thanks to
+    the `nat-outgoing` option. The option can be easily turned on and is enabled
+    by default.
+  - In EC2, the standard tool kops can provision K8s clusters using the same
+    native kubenet that GKE uses. Unlike GKE, kubenet in EC2 does NAT from
+    the pod subnet to the host network. This is because kops sets the option
+    --non-masquerade-cidr=100.64.0.0/10 to cover only the pod IP subnet. Traffic
+    to IPs outside this range goes through NAT. In EC2, cluster hosts like
+    172.20.47.241 sit outside this CIDR, so pod packets heading to node IPs are
+    masqueraded. (Note that GKE kubenet uses the default value of
+    --non-masquerade-cidr, 10.0.0.0/8, which covers both the pod IP and node IP
+    subnets. GKE does not expose any way to override this value.)
+
+Over time, we will also check the behaviors of other network providers and
+document them here.
+
+Here's how one can check if data locality in the namenode works.
+  1. Launch an HDFS client pod and open a shell inside the pod.
+  ```
+  $ kubectl run -i --tty hadoop --image=uhopper/hadoop:2.7.2  \
+      --generator="run-pod/v1" --command -- /bin/bash
+  ```
+  2. Inside the pod, create a simple text file on HDFS.
+  ```
+  $ hadoop fs  \
+      -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local  \
+      -cp file:/etc/hosts /hosts
+  ```
+  3. Set the number of replicas for the file to the number of your cluster
+  nodes. This ensures that there will be a copy of the file on the cluster node
+  that your client pod is running on. Wait some time until this happens.
+  ```
+  $ hadoop fs -setrep NUM-REPLICAS /hosts
+  ```
+  4. Run the following `hadoop fs -cat` command. From the debug messages, see
+  which datanode is being used. Make sure it is your local datanode. (You can
+  get its IP from `$ kubectl get pods hadoop -o json | grep hostIP`. Run this
+  outside the pod.)
+  ```
+  $ hadoop --loglevel DEBUG fs  \
+      -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local  \
+      -cat /hosts
+  ...
+  17/04/24 20:51:28 DEBUG hdfs.DFSClient: Connecting to datanode 10.128.0.4:50010
+  ...
+  ```
+
+  If not, check whether your local datanode is even in the list from the
+  debug messages above. If it is not, step (3) has not finished yet,
+  so wait longer. (You can use a smaller cluster for this test if that
+  is possible.)
+  ```
+  17/04/24 20:51:28 DEBUG hdfs.DFSClient: newInfo = LocatedBlocks{
+    fileLength=199
+      underConstruction=false
+        blocks=[LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;
+        getBlockSize()=199; corrupt=false; offset=0;
+        locs=[DatanodeInfoWithStorage[10.128.0.4:50010,DS-d2de9d29-6962-4435-a4b4-aadf4ea67e46,DISK],
+        DatanodeInfoWithStorage[10.128.0.3:50010,DS-0728ffcf-f400-4919-86bf-af0f9af36685,DISK],
+        DatanodeInfoWithStorage[10.128.0.2:50010,DS-3a881114-af08-47de-89cf-37dec051c5c2,DISK]]}]
+          lastLocatedBlock=LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;
+  ```
+  5. Repeat the `hadoop fs -cat` command multiple times. Check if the same
+  datanode is being consistently used.
diff --git a/topology/pod-cidr/.gitignore b/topology/pod-cidr/.gitignore
@@ -0,0 +1,3 @@
+.idea
+*.iml
+target
diff --git a/topology/pod-cidr/README.md b/topology/pod-cidr/README.md
@@ -0,0 +1,25 @@
+A namenode topology plugin mapping pods to cluster nodes for a K8s cluster
+configured with pod CIDRs. Currently, this is known to work only with the
+`kubenet` network provider. For more details, see README.md of the parent directory.
+
+## Installation
+To use this plugin, add the following to hdfs-site.xml:
+
+```
+  <property>
+    <name>net.topology.node.switch.mapping.impl</name>
+    <value>org.apache.hadoop.net.PodCIDRToNodeMapping</value>
+  </property>
+  <property>
+    <name>net.topology.impl</name>
+    <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
+  </property>
+  <property>
+    <name>net.topology.nodegroup.aware</name>
+    <value>true</value>
+  </property>
+  <property>
+    <name>dfs.block.replicator.classname</name>
+    <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
+  </property>
+```
diff --git a/topology/pod-cidr/pom.xml b/topology/pod-cidr/pom.xml
@@ -0,0 +1,83 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
+                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <artifactId>pod-cidr-namenode-topology-plugin</artifactId>
+  <groupId>hdfs-k8s</groupId>
+  <version>0.1-SNAPSHOT</version>
+  <description>HDFS topology plugin using pod CIDR</description>
+  <name>pod CIDR namenode topology plugin</name>
+  <packaging>jar</packaging>
+  <properties>
+    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+  </properties>
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>3.5.1</version>
+        <configuration>
+          <source>1.7</source>
+          <target>1.7</target>
+        </configuration>
+      </plugin>
+    </plugins>
+  </build>
+  <dependencies>
+    <dependency>
+      <groupId>commons-cli</groupId>
+      <artifactId>commons-cli</artifactId>
+      <version>1.3.1</version>
+    </dependency>
+    <dependency>
+      <groupId>commons-logging</groupId>
+      <artifactId>commons-logging</artifactId>
+      <version>1.1</version>
+    </dependency>
+    <dependency>
+      <groupId>commons-net</groupId>
+      <artifactId>commons-net</artifactId>
+      <version>3.1</version>
+    </dependency>
+    <dependency>
+      <groupId>com.google.guava</groupId>
+      <artifactId>guava</artifactId>
+      <version>11.0.2</version>
+    </dependency>
+    <dependency>
+      <groupId>io.fabric8</groupId>
+      <artifactId>kubernetes-client</artifactId>
+      <version>2.2.1</version>
+    </dependency>
+    <dependency>
+      <groupId>log4j</groupId>
+      <artifactId>log4j</artifactId>
+      <version>1.2.17</version>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.commons</groupId>
+      <artifactId>commons-lang3</artifactId>
+      <version>3.5</version>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-common</artifactId>
+      <version>2.7.3</version>
+      <scope>provided</scope>
+    </dependency>
+  </dependencies>
+</project>
diff --git a/topology/pod-cidr/src/main/java/org/apache/hadoop/net/PodCIDRToNodeMapping.java b/topology/pod-cidr/src/main/java/org/apache/hadoop/net/PodCIDRToNodeMapping.java