A Terraform setup for provisioning an HDP (Hortonworks Data Platform) big data analytics server instance on AWS. 🔥🔥🔥
- terraform-hdp
⚠️ Before running the scripts, create a remote S3 bucket to store the Terraform state.
- By default, the name of the remote state bucket is terraform-hadoop.
- If you want to use a bucket with a different name, make sure you replace the default remote bucket name in state.tf.
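The state bucket can be created from the command line before the first terraform init. The following is a minimal sketch, assuming the default bucket name terraform-hadoop and the us-east-1 region (the region is an assumption; match it to your own setup). The run helper and the DRY_RUN variable are hypothetical conveniences, not part of this repository. By default the commands are only printed, so they can be reviewed first.

```shell
# Assumptions: BUCKET matches the bucket name in state.tf; REGION is a placeholder.
BUCKET="terraform-hadoop"
REGION="us-east-1"

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the bucket. Outside us-east-1, also pass
# --create-bucket-configuration LocationConstraint="$REGION".
run aws s3api create-bucket --bucket "$BUCKET" --region "$REGION"

# Enable versioning so earlier state files can be recovered.
run aws s3api put-bucket-versioning --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled
```

Set DRY_RUN=0 in the environment to actually execute the commands instead of printing them.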
To configure the public IP address, replace the HostIp variable found in env > dev.tfvars | prod.tfvars with the address returned by,
> curl https://checkip.amazonaws.com
💡 If you don't want to use global credentials, add AWS_PROFILE=<username> to each terraform and aws command given below.
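Because HostIp is pasted by hand into the tfvars files, it can help to sanity-check the value returned by checkip.amazonaws.com first. A minimal sketch, where is_ipv4 is a hypothetical helper (not part of this repository) that accepts only dotted-quad IPv4 addresses:

```shell
# Hypothetical helper: succeed only if $1 is a dotted-quad IPv4 address.
is_ipv4() {
  # Shape check: four groups of 1-3 digits separated by dots.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: every octet must be at most 255.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Usage (needs network access, so it is left commented out here):
# ip=$(curl -s https://checkip.amazonaws.com)
# is_ipv4 "$ip" && echo "HostIp candidate: $ip"
```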
Initialize Terraform,
> cd terraform/private_vpc
> terraform init
Create an AWS key pair that will be used to log in to the AWS instance. The same key pair is used for initializing the other instances too,
> cd terraform/scripts # generate keys inside scripts
> aws ec2 create-key-pair --key-name hwsndbx --query 'KeyMaterial' --output text > hwsndbx.pem
List the existing Terraform workspaces,
> terraform workspace list # created at terraform init
To create two new workspaces,
> terraform workspace new dev
> terraform workspace new prod
To provision the resources in the dev workspace, first select the dev workspace,
> terraform workspace select dev
> terraform apply
Apply the Terraform script,
> terraform plan
> terraform apply -auto-approve
Optional: apply the Terraform script with environment variables,
> terraform plan -var-file=./env/dev.tfvars
> terraform apply -auto-approve -var-file=./env/dev.tfvars
Since we need a proper way to access our server, and we can't tie the server to our local dynamic IP, which changes every time, we create a new EC2 instance with OpenVPN to act as the bastion host.
For OpenVPN setup refer to this video.
Change openvpn_ami_id to match your region,
> aws --region=us-east-1 ec2 describe-images --owners aws-marketplace --filters 'Name=name,Values=OpenVPN Access Server 2.7.5*'
> cd terraform/bastion_host_openvpn
> terraform init
> terraform plan
> terraform apply
Read through this for more setup.
Connect to the OpenVPN instance using the assigned Elastic IP,
> ssh -i ./scripts/hwsndbx.pem openvpnas@<elasticip>
Accept the default for all settings, then change the password,
> sudo passwd openvpn
Then go to the OpenVPN web UI at https://<elastic-ip>:943. Use openvpn as the username and the password configured in the terminal above.
- In Configuration > VPN Settings > Routing, enable "Should client Internet traffic be routed through the VPN?"
- With this configuration, the VPN client IP address is translated before being presented to resources inside the VPC. That means the client's original IP address is remapped to one belonging to the VPC IP address space.
To use a custom domain, add the nameservers shown in the terraform apply output to the domain's DNS settings.
Read more on adding SSL Cert.
Right now you should be able to access your VPN's admin GUI by going to https:///admin. However, your browser will show a warning because the SSL cert is not valid. You can bypass this warning to access the admin GUI, but we should set up a valid SSL cert.
- Use ZeroSSL to obtain your certificate for free.
Walk through the wizard to create a new Let's Encrypt certificate. You will be required to verify your domain as part of this process.
Copy the Certificate, CA Bundle and Private Key to files.
Log in to your VPN access server GUI using the user openvpn and the password created on the server. Navigate to Settings > Web Server. From there, upload the Certificate, CA Bundle and Private Key files. Click validate and save if there are no errors.
> ssh root@<host> "cat server.csr"|pbcopy
> ssh root@<host> "cat server.key"|pbcopy
Next, we will provision HDP as a spot instance. If you need it as a readily available instance, change directory to ``.
> cd terraform/hdp_instance
> terraform init
> terraform plan
> terraform apply
To connect using SSH, the key file needs permission 400, but by default it will be 644,
> ls -la # to see the permission of the pem file
> chmod 400 ./scripts/hwsndbx.pem # same key for all
> ssh -i ./scripts/hwsndbx.pem ec2-user@<output_instance_ip>
Install HDP through Docker,
> docker info
> cd /tmp/hdp-docker-sandbox/HDP_2.6.5
> sudo bash docker-deploy-hdp265.sh
> docker ps
> docker ps -a
To restart the containers,
> cd /tmp/hdp-docker-sandbox
> sudo bash restart_docker.sh
- After it finishes, access Ambari through http://elastic-public-ip:8080/.
- The default Ambari credentials are raj_ops:raj_ops and maria_dev:maria_dev. The default AmbariShell login credential is root:hadoop.
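The deploy script can take a while, and Ambari only answers on port 8080 once the containers are up. A minimal polling sketch follows. wait_for is a hypothetical helper, not something shipped with the sandbox, and the attempt count and delay in the usage line are arbitrary assumptions.

```shell
# wait_for <attempts> <delay_seconds> <command...>
# Re-run the command until it succeeds or the attempts run out.
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage (replace the placeholder with the instance's Elastic IP):
# wait_for 30 10 curl -sf -o /dev/null "http://<elastic-public-ip>:8080/"
```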
> sudo docker images
> sudo service docker restart
> sudo service docker status
Read Cloudera HDP sandbox and Apache Ambari shell commands for more information.
To peek into the Docker sandbox,
> docker exec -it <docker-sandbox-container-id> /bin/bash
> ssh root@localhost -p 2222 # or you can use this with password hadoop
> ambari-agent status
> ambari-agent start # if stopped start
> ambari-server restart
Hortonworks doesn't come with many resources out of the box for working with Python,
> sudo su -
> yum install python-pip -y
> pip install google-api-python-client==1.6.4
# > curl https://bootstrap.pypa.io/pip/2.7/get-pip.py | python
# > pip install --ignore-installed pyparsing
> pip install mrjob==0.5.11 # MRJob
> yum install nano -y
Example data files and scripts to play with,
> sudo su - maria_dev
> wget http://media.sundog-soft.com/hadoop/ml-100k/u.data
> wget http://media.sundog-soft.com/hadoop/RatingsBreakdown.py
> hadoop fs -mkdir -p /user/maria_dev/ml-100k # make sure the target directory exists
> hadoop fs -copyFromLocal u.data /user/maria_dev/ml-100k/u.data
> python RatingsBreakdown.py u.data
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data # mrjob copies the file to an HDFS temp location and executes it
> hostname -I | awk '{print $1}' # get the ip
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs://172.18.0.2:8020/user/maria_dev/ml-100k/u.data
> python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar hdfs:///user/maria_dev/ml-100k/u.data
Look into this script.
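To sanity-check the MapReduce output, the same breakdown can be computed locally. This sketch assumes that RatingsBreakdown.py counts how many times each rating value occurs (the script is not shown here, so that is an assumption) and that u.data is tab-separated as user id, movie id, rating, timestamp, as in the MovieLens 100k dataset. rating_breakdown is a hypothetical helper name.

```shell
# Count how often each rating (third tab-separated column) occurs in a file
# and print "rating<TAB>count" lines sorted by rating.
rating_breakdown() {
  awk -F '\t' '{ count[$3]++ } END { for (r in count) print r "\t" count[r] }' "$1" | sort -n
}

# Usage:
# rating_breakdown u.data
```

If the assumption about the script holds, the counts should match the job output line for line.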
Change the Ambari password once you create the instance,
> docker exec -it sandbox-hdp /bin/bash
> ambari-admin-password-reset
> ambari-agent restart
💡 The hosts file is C:\Windows\System32\drivers\etc\hosts on Windows or /etc/hosts on macOS.
In case you want a CNAME, you can add the line below to your hosts file. Add the host IP on the Mac to use it as a domain name locally. To save and exit the nano editor, press ctrl + o > enter > ctrl + x.
> sudo nano /etc/hosts # add the ip and map to a host
> sudo killall -HUP mDNSResponder # flush DNS cache
127.0.0.1 sandbox-hdp.hortonworks.com
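Editing the hosts file by hand is easy to repeat by accident. The sketch below appends the mapping only if the hostname is not already present. add_host_entry is a hypothetical helper, and it takes the file path as an argument so it can be tried on a copy before touching /etc/hosts (which requires sudo).

```shell
# add_host_entry <hosts_file> <ip> <hostname>
# Append "<ip> <hostname>" unless a non-comment line already lists that hostname.
add_host_entry() {
  file=$1
  ip=$2
  host=$3
  # Scan the hostname fields (2..NF) of every non-comment line.
  if awk -v h="$host" '
      !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
      END { exit !found }' "$file"; then
    return 0
  fi
  printf '%s %s\n' "$ip" "$host" >> "$file"
}

# Usage on a scratch copy first:
# cp /etc/hosts /tmp/hosts.test
# add_host_entry /tmp/hosts.test 127.0.0.1 sandbox-hdp.hortonworks.com
```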
⚠️ Keep in mind, though there aren't any charges for a stopped instance, you may still incur charges for EBS storage and Elastic IP addresses associated with the instances.
Once the instances are created, stop them by executing,
> cd /tmp/hdp-docker-sandbox
> bash pause_docker.sh # pause the instance
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf
> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 stop-instances --instance-ids <instance_id> --profile edutf
To start the instances again after a stop,
> cd bastion_host_openvpn
> terraform output # get the id from output for openvpn instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
> cd hdp_instance
> terraform output # get the id from output for hdp instance
> aws ec2 start-instances --instance-ids <instance_id> --profile edutf
> cd terraform/hdp_instance
> ssh -i ./scripts/hwsndbx.pem ec2-user@<instance_ip>
> cd /tmp/hdp-docker-sandbox
> bash resume_docker.sh # resume the instance
> ps -ef
> kill -HUP <PID>
> bash start_jupyter.sh spark
To destroy the Terraform instance,
> terraform destroy -auto-approve
- Installation guide for a single cluster HDP installation.
- Installation guide for multiple cluster nodes.
- To increase the storage instance type.
- Maven and Java setup
- DDP + Ambari 2.7.5 CentOS7.
- Starting and stopping ambari services using CURL command
- Look into terraform local-exec for stopping and starting server instances
- Ambari REST Api to restart all services
- Ambari REST Api commands
- Solve PigTez Failure on Ambari 2.6.5
MIT © Murshid Azher.