Commit 66a2138

Add kubenet namenode topology plugin (#11)
* Add a namenode topology plugin for kubenet fixing data locality
* Clean up README
* Address review comments
* Address review comments
* Remove extra empty lines
1 parent e772378 commit 66a2138

File tree

5 files changed: +601 -0 lines changed


topology/README.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
HDFS namenode topology plugins for various Kubernetes network providers.

The HDFS namenode handles RPC requests from clients. The namenode often obtains
the IP addresses of clients from the remote endpoints of RPC connections.
In Kubernetes, HDFS clients may run inside pods. The client IP addresses can
be virtual pod IP addresses. This can confuse the namenode when it runs
the data locality optimization code, which requires comparing client
IP addresses against the IP addresses associated with datanodes. The latter
are physical IP addresses of the cluster nodes that the datanodes are running on.
The client pod virtual IP addresses would not match any datanode IP addresses.

We can configure the namenode with the topology plugins in this directory to
correct the namenode data locality code. So far, we have found that only
Google Container Engine (GKE) suffers from the data locality issue caused
by the virtual pod IP addresses exposed to the namenode (see below).
GKE uses the native `kubenet` network provider.

- TODO: Currently, there is no easy way to launch the namenode helm chart
  with a topology plugin configured. Build a new Docker image with the
  topology plugins and support the configuration. See the plugin README
  for installation/configuration instructions.

Many K8s network providers do not need any topology plugins. Most K8s network
providers conduct IP masquerading, or Network Address Translation (NAT), when pod
packets head outside the pod IP subnet. They rewrite the headers of pod packets,
putting in the physical IP addresses of the cluster nodes that the pods are running on.
The namenode and datanodes use `hostNetwork`, so their IP addresses are outside
the pod IP subnet. As a result, the namenode will see the physical cluster node
IP addresses on client RPC connections originating from pods. Data locality
will work fine with these providers.

Here is the list of network providers that conduct NAT:

- By design, overlay networks such as Weave and Flannel conduct NAT for any
  pod packet heading outside the local pod network. This means packets heading
  to a node IP are also NATed. (Within an overlay network, pod packets heading to
  a pod on a different node get their pod IPs put back once they arrive inside the
  destination node.)
- Calico is a popular non-overlay network provider. It turns out Calico can also
  be configured to do NAT between the pod subnet and the node subnet thanks to
  the `nat-outgoing` option. The option can be easily turned on and is enabled
  by default.
- In EC2, the standard tool kops can provision k8s clusters using the same
  native kubenet that GKE uses. Unlike GKE, it turns out kubenet in EC2 does
  NAT between the pod subnet and the host network. This is because kops sets the
  option `--non-masquerade-cidr=100.64.0.0/10` to cover only the pod IP subnet.
  Traffic to IPs outside this range gets NATed. In EC2, cluster hosts like
  172.20.47.241 sit outside this CIDR. This means pod packets heading to node IPs
  will be masqueraded. (Note GKE kubenet uses the default value of
  `--non-masquerade-cidr`, 10.0.0.0/8, which covers both the pod IP and node IP
  subnets. GKE does not expose any way to override this value.) A sketch of this
  containment check follows the list.
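
Whether a given destination falls inside `--non-masquerade-cidr` is a plain CIDR
containment check. The snippet below is only an illustration of that check (it is
not part of the plugin); it uses `SubnetUtils` from commons-net and the example
addresses mentioned above, and the class name is made up for the example.

```
// Illustration only: the masquerade decision reduces to CIDR containment.
import org.apache.commons.net.util.SubnetUtils;

public class NonMasqueradeCheck {
  public static void main(String[] args) {
    // kops default on EC2: only the pod IP subnet is exempt from NAT.
    SubnetUtils kopsCidr = new SubnetUtils("100.64.0.0/10");
    // kubenet default used by GKE: covers both pod IPs and node IPs.
    SubnetUtils gkeCidr = new SubnetUtils("10.0.0.0/8");

    // EC2 cluster host from the example above: outside the kops CIDR, so
    // pod packets heading to it are NATed to the sending node's IP.
    System.out.println(kopsCidr.getInfo().isInRange("172.20.47.241")); // false

    // GKE node IP (as seen in the hdfs cat example below): inside the default
    // CIDR, so pod packets heading to it keep their virtual pod source IPs.
    System.out.println(gkeCidr.getInfo().isInRange("10.128.0.4")); // true
  }
}
```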

Over time, we will also check the behaviors of other network providers and
document them here.

Here's how one can check whether data locality in the namenode works.

1. Launch an HDFS client pod and go inside the pod.
   ```
   $ kubectl run -i --tty hadoop --image=uhopper/hadoop:2.7.2 \
       --generator="run-pod/v1" --command -- /bin/bash
   ```
2. Inside the pod, create a simple text file on HDFS.
   ```
   $ hadoop fs \
       -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
       -cp file:/etc/hosts /hosts
   ```
3. Set the number of replicas for the file to the number of your cluster
   nodes. This ensures that there will be a copy of the file on the cluster node
   that your client pod is running on. Wait some time until this happens.
   ```
   $ hadoop fs -setrep NUM-REPLICAS /hosts
   ```
4. Run the following `hdfs cat` command. From the debug messages, see
   which datanode is being used. Make sure it is your local datanode. (You can
   get this from `$ kubectl get pods hadoop -o json | grep hostIP`. Do this
   outside the pod.)
   ```
   $ hadoop --loglevel DEBUG fs \
       -fs hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local \
       -cat /hosts
   ...
   17/04/24 20:51:28 DEBUG hdfs.DFSClient: Connecting to datanode 10.128.0.4:50010
   ...
   ```

   If not, check whether your local datanode is even in the list in the debug
   messages above. If it is not there, that is because step (3) has not finished
   yet. Wait longer. (You can use a smaller cluster for this test if that is
   possible.)
   ```
   17/04/24 20:51:28 DEBUG hdfs.DFSClient: newInfo = LocatedBlocks{
     fileLength=199
     underConstruction=false
     blocks=[LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;
     getBlockSize()=199; corrupt=false; offset=0;
     locs=[DatanodeInfoWithStorage[10.128.0.4:50010,DS-d2de9d29-6962-4435-a4b4-aadf4ea67e46,DISK],
     DatanodeInfoWithStorage[10.128.0.3:50010,DS-0728ffcf-f400-4919-86bf-af0f9af36685,DISK],
     DatanodeInfoWithStorage[10.128.0.2:50010,DS-3a881114-af08-47de-89cf-37dec051c5c2,DISK]]}]
     lastLocatedBlock=LocatedBlock{BP-347555225-10.128.0.2-1493066928989:blk_1073741825_1001;
   ```
5. Repeat the `hdfs cat` command multiple times. Check if the same datanode
   is being consistently used.

topology/pod-cidr/.gitignore

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
.idea
*.iml
target

topology/pod-cidr/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
A namenode topology plugin mapping pods to cluster nodes, for a K8s cluster configured
with pod CIDRs. Currently, this is known to work only with the `kubenet` network
provider. For more details, see the README.md of the parent directory.

## Installation
To use this plugin, add the following to hdfs-site.xml:

```
<property>
  <name>net.topology.node.switch.mapping.impl</name>
  <value>org.apache.hadoop.net.PodCIDRToNodeMapping</value>
</property>
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
```
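
For orientation: the class named in `net.topology.node.switch.mapping.impl` must
implement Hadoop's `DNSToSwitchMapping` interface. The sketch below is not the actual
plugin source; it is a minimal, hypothetical version showing the kind of resolution
such a plugin can perform with the fabric8 `kubernetes-client` and commons-net
declared in the pom: list the cluster nodes, then place each client address under the
cluster node whose addresses or `spec.podCIDR` it matches, so that client pods end up
in the same node group as the datanode on that cluster node.

```
// Hypothetical sketch only -- not the actual plugin source.
package org.apache.hadoop.net;

import java.util.ArrayList;
import java.util.List;

import io.fabric8.kubernetes.api.model.Node;
import io.fabric8.kubernetes.api.model.NodeAddress;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import org.apache.commons.net.util.SubnetUtils;

public class PodCIDRToNodeMapping implements DNSToSwitchMapping {
  private final KubernetesClient client = new DefaultKubernetesClient();

  @Override
  public List<String> resolve(List<String> names) {
    // Assumes `names` are IPv4 addresses; a real plugin would also handle
    // hostnames and cache the node list instead of fetching it every time.
    List<Node> nodes = client.nodes().list().getItems();
    List<String> locations = new ArrayList<String>();
    for (String name : names) {
      locations.add(networkLocationOf(name, nodes));
    }
    return locations;
  }

  private String networkLocationOf(String address, List<Node> nodes) {
    for (Node node : nodes) {
      // With NetworkTopologyWithNodeGroup, the extra path level is the node
      // group; using the cluster node name groups a client pod with the
      // datanode running on the same cluster node.
      String nodeGroup =
          NetworkTopology.DEFAULT_RACK + "/" + node.getMetadata().getName();
      // Datanodes and the namenode use hostNetwork, so they show up with the
      // cluster node's own addresses.
      for (NodeAddress nodeAddress : node.getStatus().getAddresses()) {
        if (address.equals(nodeAddress.getAddress())) {
          return nodeGroup;
        }
      }
      // Client pods carry virtual pod IPs drawn from the node's pod CIDR.
      String podCidr = node.getSpec().getPodCIDR();
      if (podCidr != null
          && new SubnetUtils(podCidr).getInfo().isInRange(address)) {
        return nodeGroup;
      }
    }
    return NetworkTopology.DEFAULT_RACK + "/default-nodegroup";
  }

  @Override
  public void reloadCachedMappings() { }

  @Override
  public void reloadCachedMappings(List<String> names) { }
}
```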

topology/pod-cidr/pom.xml

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <artifactId>pod-cidr-namenode-topology-plugin</artifactId>
  <groupId>hdfs-k8s</groupId>
  <version>0.1-SNAPSHOT</version>
  <description>HDFS topology plugin using pod CIDR</description>
  <name>pod CIDR namenode topology plugin</name>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>commons-cli</groupId>
      <artifactId>commons-cli</artifactId>
      <version>1.3.1</version>
    </dependency>
    <dependency>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
      <version>1.1</version>
    </dependency>
    <dependency>
      <groupId>commons-net</groupId>
      <artifactId>commons-net</artifactId>
      <version>3.1</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>11.0.2</version>
    </dependency>
    <dependency>
      <groupId>io.fabric8</groupId>
      <artifactId>kubernetes-client</artifactId>
      <version>2.2.1</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.5</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.3</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
