Commit fa377c6

Design namenode HA (#32)

* Add namenode HA design doc
* Switch images to png
* Refer to png images

1 parent a9f70b9 commit fa377c6

File tree: 3 files changed, +167 -0 lines changed


designs/journal-approach.png (91.4 KB)

designs/namenode-HA.md

Lines changed: 167 additions & 0 deletions

# Namenode HA for HDFS on K8s

## Goals

1. Adopt one of the existing namenode HA solutions and make it fit for HDFS on
   K8s: There are two HA solutions, an old NFS-based solution and a new one
   based on the Quorum Journal Service. We are leaning toward the journal-based
   solution. We'll discuss the details below.
2. Keep HDFS on K8s easy to use: The current HDFS on K8s is known to be easy to
   set up, thanks to the automation enabled by Kubernetes and Helm. We'd like to
   keep it that way even for the HA setup.

## Existing Namenode HA solutions
### Terminology

- Primary namenode: A central daemon used in a non-HA setup that maintains the
  file system metadata.
- Secondary namenode: The other namenode daemon instance used in a non-HA setup
  that runs along with the primary namenode. The secondary namenode creates new
  snapshots of the namenode metadata by merging in the incremental updates.
- Active namenode: A namenode instance used in an HA setup that is in charge of
  maintaining the file system metadata.
- Standby namenode: The other namenode instance used in an HA setup that runs
  along with the active namenode. The standby namenode listens to metadata
  updates made by the active namenode and gets ready to take over in case the
  active namenode crashes.

### Namenode metadata

The namenode daemon maintains the file system metadata, such as which
directories have which files, file ownership, and which datanode daemons have
blocks of those files.

The NN manipulates the metadata mostly in memory, but it has to persist the
metadata to disk for **crash safety**, i.e. to avoid losing metadata when the
NN crashes or restarts.

There are two disk files that the NN writes:

1. A snapshot of the metadata dumped at a point in time in the past. This is
   called the **fsimage**.
2. The incremental updates since the snapshot point. In a non-HA setup, the
   updates are appended to a local file called the **editlog**. (In
   journal-based HA, the editlog is stored on a shared network service.)

The editlog is later merged into a new fsimage snapshot, starting a new cycle.

![namenode metadata](namenode-metadata.png)

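To make this concrete, in a typical non-HA deployment the NN is pointed at a
metadata directory through `hdfs-site.xml`. Below is a minimal ConfigMap-style
sketch; the ConfigMap name and the path are illustrative assumptions, while
`dfs.namenode.name.dir` is the standard Hadoop property that controls where the
fsimage and editlog files are persisted.

```yaml
# Sketch only: a hypothetical ConfigMap fragment carrying hdfs-site.xml.
# The name and path are assumptions; dfs.namenode.name.dir is the real Hadoop
# property that decides where fsimage and editlog files are written.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hdfs-config                  # hypothetical name
data:
  hdfs-site.xml: |
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <!-- fsimage and editlog live under this directory -->
        <value>file:///hadoop/dfs/name</value>
      </property>
    </configuration>
```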

Another important piece of metadata, the mapping of which datanodes have which
file blocks, is *not* written to disk. After a restart, the NN rebuilds this
mapping from datanode heartbeat messages. This takes a while, and it is one of
the reasons why restarting the NN is slow.

### HA solution choices

In the HA setup, there are two NN instances: an active NN and a standby NN. The
active NN handles clients' requests and modifies the file system metadata. The
modifications go to the editlog. This editlog should be shared with the standby
NN so that it also has up-to-date metadata and can quickly become the active NN
when the prior active NN crashes.

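To make this concrete, the two NN instances are typically given logical names
under a single nameservice in `hdfs-site.xml`. A minimal ConfigMap-style sketch
follows; the nameservice name `hdfs-k8s`, the NN ids, and the pod hostnames are
illustrative assumptions, while the property keys are the standard Hadoop HA
keys.

```yaml
# Sketch only: hdfs-site.xml fragment naming two NNs under one nameservice.
# "hdfs-k8s", "nn0"/"nn1", and the hostnames are hypothetical.
hdfs-site.xml: |
  <configuration>
    <property>
      <name>dfs.nameservices</name>
      <value>hdfs-k8s</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.hdfs-k8s</name>
      <value>nn0,nn1</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.hdfs-k8s.nn0</name>
      <value>hdfs-namenode-0.hdfs-namenode:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.hdfs-k8s.nn1</name>
      <value>hdfs-namenode-1.hdfs-namenode:8020</value>
    </property>
  </configuration>
```
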
Hadoop has two HA solutions, distinguished mainly by how the editlog is shared
with the standby NN:

1. An old NFS-based solution, described at
   https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html.
   The editlog is placed in an NFS volume that both the active and standby NN
   have access to.
2. A new one based on the Quorum Journal Service, described at
   https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html.
   The Journal Service is a ZooKeeper-like service that has an odd number of
   backing servers. The active NN writes to the journal service, while the
   standby listens to the service. For each metadata update, the majority of
   the journal servers should agree on the change. (See the configuration
   sketch after this list.)

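The difference between the two approaches shows up mainly in one property,
`dfs.namenode.shared.edits.dir`. A hedged sketch of the two variants is below;
the mount path, the journal-node hostnames, and the journal id are illustrative
assumptions, while the key itself and the default journal RPC port 8485 come
from the Hadoop documentation.

```yaml
# Sketch only: the shared-edits property under each approach.
# Values are illustrative; dfs.namenode.shared.edits.dir is the real key.
hdfs-site.xml: |
  <configuration>
    <!-- NFS-based HA: an NFS directory mounted on both NN hosts
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>file:///mnt/shared-edits</value>
    </property>
    -->
    <!-- Quorum-journal HA: an odd number of journal nodes, default port 8485 -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn-0:8485;jn-1:8485;jn-2:8485/hdfs-k8s</value>
    </property>
  </configuration>
```
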
The NFS approach has a flaw around the split-brain scenario. When both NNs think
they are active, they will write to the editlog simultaneously, corrupting the
file. So the NFS approach relies on forcibly shutting down one of the NNs
(called fencing), and this requires special hardware support at the BIOS or
power-switch level. Most people don't like this extra requirement.

The Quorum Journal Service solves the split-brain issue at the service level.
The service honors only one writer at a given point in time, so there is no need
for special fencing hardware. (Some soft fencing is still recommended to prevent
the rogue NN from continuing to serve lingering read clients.) To use the
journal service, each NN host needs to run a client for it called the Quorum
Journal Manager. The journal manager of the active NN registers with the journal
servers using a unique epoch number. Write requests carry the epoch number and
are rejected if their epoch number is smaller than the one the servers expect.
This way, the servers can reject requests from a rogue, previously active NN
with an old epoch number. More details can be found at
http://johnjianfang.blogspot.com/2015/02/quorum-journal-manager-part-i-protocol.html.

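For the soft fencing mentioned above, Hadoop exposes the
`dfs.ha.fencing.methods` property. A minimal sketch follows; with the quorum
journal a trivial shell fence is a common placeholder, and `sshfence` (together
with `dfs.ha.fencing.ssh.private-key-files`) is the stricter built-in
alternative.

```yaml
# Sketch only: dfs.ha.fencing.methods is the real Hadoop key; the trivial
# shell(...) fence is one common choice when the quorum journal already
# guarantees a single writer.
hdfs-site.xml: |
  <configuration>
    <property>
      <name>dfs.ha.fencing.methods</name>
      <value>shell(/bin/true)</value>
    </property>
  </configuration>
```
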
For HDFS on K8s, we are leaning toward the journal manager approach.

![journal-approach](journal-approach.png)

### Other HA aspects

The standby NN does one more thing: it also merges the editlog into a new
fsimage snapshot and sends the new snapshot to the active NN via HTTP, so that
the active NN can drop the earlier updates in the editlog. (In a non-HA setup,
this can be done by another special NN instance, called the **secondary** NN.
But in HA, the standby NN does that for us.)

We said earlier that the block-to-datanode mapping is not persisted. So
datanodes actually send heartbeats with the block mapping to both NNs, so that
the standby NN can become active right away.

Clients are also aware of both NNs. There is a client-side library that will
figure out which one to talk to.

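Concretely, clients address the logical nameservice rather than a single NN and
rely on a failover proxy provider to pick the active one. A minimal sketch,
reusing the hypothetical `hdfs-k8s` nameservice from the earlier sketches; the
property keys and the provider class are the standard Hadoop ones.

```yaml
# Sketch only: client-side failover configuration.
core-site.xml: |
  <configuration>
    <property>
      <!-- clients talk to the nameservice, not a specific NN host -->
      <name>fs.defaultFS</name>
      <value>hdfs://hdfs-k8s</value>
    </property>
  </configuration>
hdfs-site.xml: |
  <configuration>
    <property>
      <name>dfs.client.failover.proxy.provider.hdfs-k8s</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
  </configuration>
```
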
Automatic failover to a new active NN requires a ZooKeeper service, which needs
an odd number of instances. (This is in addition to the journal service, which
is similar to ZooKeeper but not the same.) For this, the NN hosts should run an
extra ZooKeeper client called the ZooKeeper Failover Controller. The controller
monitors the health of the local NN and communicates with the ZooKeeper service
in the right way, so that a failing active NN releases the ZooKeeper lock to the
standby NN.

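A minimal sketch of the corresponding configuration; both property keys are the
standard Hadoop ones, and the ZooKeeper pod hostnames are illustrative
assumptions.

```yaml
# Sketch only: enabling automatic failover via ZooKeeper.
hdfs-site.xml: |
  <configuration>
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
  </configuration>
core-site.xml: |
  <configuration>
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk-0.zk-headless:2181,zk-1.zk-headless:2181,zk-2.zk-headless:2181</value>
    </property>
  </configuration>
```
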
## Namenode HA design for HDFS on K8s

So we need three K8s services for the HA setup:

1. A namenode service with two NNs.
2. A journal service with an odd number of journal servers.
3. A ZooKeeper service with an odd number of servers.

For each of these, we'll use a statefulset of a corresponding size. For
ZooKeeper, we already have a helm chart in
https://github.com/kubernetes/contrib/tree/master/statefulsets/zookeeper, so we
can reuse it. Each ZooKeeper server writes its data to a persistent volume.

For the journal servers, we need to write a new helm chart. This can be modeled
after the ZooKeeper helm chart and should be straightforward; a sketch is shown
below.

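A minimal sketch of what the journal-node statefulset could look like, modeled
loosely on the ZooKeeper chart. The names, image, and storage size are
assumptions; 8485 and 8480 are the default journal-node RPC and HTTP ports, and
`hdfs journalnode` is the standard launch command.

```yaml
# Sketch only: a hypothetical journal-node statefulset.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-journalnode
spec:
  serviceName: hdfs-journalnode    # headless service defined separately
  replicas: 3                      # odd number of journal servers
  selector:
    matchLabels:
      app: hdfs-journalnode
  template:
    metadata:
      labels:
        app: hdfs-journalnode
    spec:
      containers:
        - name: journalnode
          image: uhopper/hadoop:2.7.2          # hypothetical Hadoop image
          command: ["hdfs", "journalnode"]
          ports:
            - containerPort: 8485              # RPC
            - containerPort: 8480              # HTTP
          volumeMounts:
            - name: editdir
              mountPath: /hadoop/dfs/journal   # dfs.journalnode.edits.dir
  volumeClaimTemplates:
    - metadata:
        name: editdir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```
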
For the NN, we have a helm chart for the non-HA setup at
https://github.com/apache-spark-on-k8s/kubernetes-HDFS/tree/master/charts/hdfs-namenode-k8s,
which uses a statefulset of size 1. We can extend this to support the HA setup
as an option. We'll have to do the following work (a sketch of the resulting
statefulset follows the list):

1. The statefulset size is currently one. Extend it to two.
2. Add all the config options described at
   https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html.
   This includes the config key for using the Quorum Journal servers as the
   editlog destination.
3. Add a container to each NN pod for running the ZooKeeper Failover Controller.
4. Optionally, use persistent volumes for storing fsimage files.

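A minimal sketch of the shape the extended namenode statefulset could take,
annotated with the items above. The names, image, and storage size are
assumptions; `hdfs zkfc` is the standard command for the ZooKeeper Failover
Controller.

```yaml
# Sketch only: a hypothetical HA namenode statefulset.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-namenode
spec:
  serviceName: hdfs-namenode
  replicas: 2                                  # item (1): two NNs instead of one
  selector:
    matchLabels:
      app: hdfs-namenode
  template:
    metadata:
      labels:
        app: hdfs-namenode
    spec:
      containers:
        - name: namenode
          image: uhopper/hadoop-namenode:2.7.2 # hypothetical image
          # item (2): HA config options arrive via a mounted ConfigMap
          volumeMounts:
            - name: metadatadir
              mountPath: /hadoop/dfs/name      # dfs.namenode.name.dir
        - name: zkfc                           # item (3): failover controller sidecar
          image: uhopper/hadoop-namenode:2.7.2 # hypothetical image
          command: ["hdfs", "zkfc"]
  volumeClaimTemplates:                        # item (4): persistent fsimage storage
    - metadata:
        name: metadatadir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```
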
What is notably missing is support for the fencing that we discussed above. We
will leave this as an open problem that we may address in a later version.

Item (4) is significant because the NN pod in the non-HA setup stores the
fsimage file on a HostPath volume. We also pin the NN to a particular K8s node
using a K8s node label to make sure a restarted NN can find the right fsimage
file. Hopefully, we can remove the HostPath and node-pinning dependencies with
(4). But we want to keep the old behavior as an option, in case people want to
try HDFS on K8s in a very simple setup without persistent volumes and HA.

People occasionally have to upgrade the HDFS software version, e.g. from HDFS
2.7 to 2.8. Sometimes the metadata format changes, and the NNs need to convert
the metadata to the new format. Unfortunately, the format upgrade is done in a
non-symmetric way: the active NN should do the format conversion and write the
new metadata to the journal service, and then the standby NN should sync with it
upon start. The NN helm chart for the HA setup should support this in an
automated fashion. We think we can do that using an init container. We'll
address this in a later PR.
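
The upgrade automation itself is left for a later PR, but as a rough
illustration of the init-container idea, a standby NN pod could sync its
metadata from the shared state before its main container starts. A minimal
sketch, reusing the hypothetical statefulset names from the sketches above; the
ordinal check and the image are assumptions, while
`hdfs namenode -bootstrapStandby` is the standard command for syncing a standby
NN.

```yaml
# Sketch only: a hypothetical init container on the NN pod spec.
initContainers:
  - name: bootstrap-standby
    image: uhopper/hadoop-namenode:2.7.2   # hypothetical image
    command:
      - sh
      - -c
      - |
        # Only the second NN bootstraps; hdfs-namenode-0 is assumed to be the
        # active NN that has already written metadata to the journal service.
        if [ "$(hostname)" = "hdfs-namenode-1" ]; then
          hdfs namenode -bootstrapStandby -nonInteractive || true
        fi
    volumeMounts:
      - name: metadatadir
        mountPath: /hadoop/dfs/name
```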

designs/namenode-metadata.png (24 KB)
