# Hadoop Chart

**This is the README from the original Hadoop Helm chart (https://github.com/helm/charts/tree/master/stable/hadoop).**
**This version removes the YARN manager and provides advanced Hadoop configuration through environment variables.**

[Hadoop](https://hadoop.apache.org/) is a framework for running large-scale distributed applications.

This chart is primarily intended for YARN and MapReduce job execution, where HDFS is used only as a means to transport small artifacts within the framework, not as a distributed filesystem. Data should be read from cloud-based datastores such as Google Cloud Storage, S3, or Swift.

## Chart Details

## Installing the Chart

To install the chart with the release name `hadoop`, utilizing 50% of the available node resources:

```
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) stable/hadoop
```

> Note that you need at least 2GB of free memory per NodeManager pod; if your cluster isn't large enough, not all pods will be scheduled.

The optional [`calc_resources.sh`](./tools/calc_resources.sh) script is a convenience helper that sets `yarn.numNodes` and `yarn.nodeManager.resources` appropriately to utilize all nodes in the Kubernetes cluster at a given percentage of their resources. For example, on a 3-node `n1-standard-4` GKE cluster with an argument of `50`, this would create 3 NodeManager pods, each claiming 2 cores and 7.5Gi of memory.
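
As a rough illustration of the sizing arithmetic such a helper performs (a sketch with assumed node figures, not the actual script), an `n1-standard-4` node at 50% utilization works out as:

```shell
# Sketch only: illustrates the sizing math, not the real calc_resources.sh.
# Assumed figures: n1-standard-4 = 4 vCPUs (4000m) and 15GB (~15360Mi).
percent=50
node_cpu_m=4000
node_mem_mi=15360
cpu=$(( node_cpu_m * percent / 100 ))   # 2000m, i.e. 2 cores
mem=$(( node_mem_mi * percent / 100 ))  # 7680Mi, i.e. 7.5Gi
echo "--set yarn.nodeManager.resources.requests.cpu=${cpu}m,yarn.nodeManager.resources.requests.memory=${mem}Mi"
```

The real script also counts the cluster's nodes to set `yarn.numNodes`; the exact flags it emits may differ from the sketch above.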

### Persistence

To install the chart with persistent volumes:

```
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) \
  --set persistence.nameNode.enabled=true \
  --set persistence.nameNode.storageClass=standard \
  --set persistence.dataNode.enabled=true \
  --set persistence.dataNode.storageClass=standard \
  stable/hadoop
```

> Change the value of `storageClass` to match your volume driver. `standard` works for Google Kubernetes Engine (GKE) clusters.

## Configuration

The following table lists the configurable parameters of the Hadoop chart and their default values.

| Parameter | Description | Default |
| ------------------------------------------------- | ------------------------------- | ---------------------------------------------------------------- |
| `image.repository` | Hadoop image ([source](https://github.com/Comcast/kube-yarn/tree/master/image)) | `danisla/hadoop` |
| `image.tag` | Hadoop image tag | `2.9.0` |
| `image.pullPolicy` | Pull policy for the images | `IfNotPresent` |
| `hadoopVersion` | Version of the Hadoop libraries being used | `2.9.0` |
| `antiAffinity` | Pod anti-affinity, `hard` or `soft` | `hard` |
| `hdfs.nameNode.pdbMinAvailable` | PDB for the HDFS NameNode | `1` |
| `hdfs.nameNode.resources` | Resources for the HDFS NameNode | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `hdfs.dataNode.replicas` | Number of HDFS DataNode replicas | `1` |
| `hdfs.dataNode.pdbMinAvailable` | PDB for the HDFS DataNode | `1` |
| `hdfs.dataNode.resources` | Resources for the HDFS DataNode | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `yarn.resourceManager.pdbMinAvailable` | PDB for the YARN ResourceManager | `1` |
| `yarn.resourceManager.resources` | Resources for the YARN ResourceManager | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `yarn.nodeManager.pdbMinAvailable` | PDB for the YARN NodeManager | `1` |
| `yarn.nodeManager.replicas` | Number of YARN NodeManager replicas | `2` |
| `yarn.nodeManager.parallelCreate` | Create all NodeManager StatefulSet pods in parallel (K8s 1.7+) | `false` |
| `yarn.nodeManager.resources` | Resource limits and requests for YARN NodeManager pods | `requests:memory=2048Mi,cpu=1000m,limits:memory=2048Mi,cpu=1000m`|
| `persistence.nameNode.enabled` | Enable/disable the NameNode persistent volume | `false` |
| `persistence.nameNode.storageClass` | Name of the StorageClass to use, per your volume provider | `-` |
| `persistence.nameNode.accessMode` | Access mode for the volume | `ReadWriteOnce` |
| `persistence.nameNode.size` | Size of the volume | `50Gi` |
| `persistence.dataNode.enabled` | Enable/disable the DataNode persistent volume | `false` |
| `persistence.dataNode.storageClass` | Name of the StorageClass to use, per your volume provider | `-` |
| `persistence.dataNode.accessMode` | Access mode for the volume | `ReadWriteOnce` |
| `persistence.dataNode.size` | Size of the volume | `200Gi` |
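
The same parameters can be supplied through a values file instead of repeated `--set` flags. A minimal sketch (the filename is hypothetical; the keys are taken from the table above):

```
# values-hadoop.yaml (hypothetical filename)
hdfs:
  dataNode:
    replicas: 2
yarn:
  nodeManager:
    replicas: 4
persistence:
  dataNode:
    enabled: true
    storageClass: standard
    size: 200Gi
```

Then install with `helm install --name hadoop -f values-hadoop.yaml stable/hadoop`.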

## Related charts

The [Zeppelin Notebook](https://github.com/kubernetes/charts/tree/master/stable/zeppelin) chart can reuse this chart's Hadoop configuration and run its jobs through the YARN executor:

```
helm install --set hadoop.useConfigMap=true stable/zeppelin
```

## References

- This chart is a variation of the Hadoop chart in the stable Helm repository (https://github.com/helm/charts/tree/master/stable/hadoop).

- Original Kubernetes Hadoop adaptation this chart was derived from: https://github.com/Comcast/kube-yarn