Projects using Hadoop and Spark MapReduce jobs.
For simplicity, the data was generated using this code.
- Start the containers in detached mode (background):
- docker-compose up -d
- Get a shell inside the namenode container:
- docker exec -it namenode /bin/bash
- Create a directory in HDFS:
- hdfs dfs -mkdir -p /foldername
- Look up the IP address of the namenode with ifconfig.
- Look up the mapped ports of the namenode with docker container ls.
- Copy the jar into the namenode container:
- docker cp /pathToJar namenode:/tmp
Then, inside the container, go to the Hadoop folder.
- Format the filesystem:
- bin/hdfs namenode -format
- Start the NameNode and DataNode daemons:
- sbin/start-dfs.sh
- Create a directory:
- hdfs dfs -mkdir /foldername
- Upload a file:
- hdfs dfs -put fullPath/data.txt /foldername/
- Delete a file:
- bin/hdfs dfs -rm -r /foldername/data.txt
Run the job:
path/hadoop jar path/project-0.jar WordCount /cs585/data.txt /cs585/output2.txt
e.g. bin/hadoop jar /home/twobeers/IdeaProjects/wordCount/out/artifacts/wordCount_jar/wordCount.jar /cs585/data.txt /cs585/output2.txt
Query 1: bin/hadoop jar /home/twobeers/Desktop/bigData/big-data-projects/out/artifacts/big_data_projects_jar/big-data-projects.jar /project1/customers.csv /project1/output-query1.txt
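For reference, the word-count logic that the WordCount job above computes can be sketched locally without a cluster. This is a minimal, Hadoop-free sketch assuming whitespace tokenization; the class name WordCountSketch is hypothetical and is not the class packaged in the jar:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {
    // Local stand-in for the two phases of the MapReduce job:
    // the "map" step tokenizes each line into words,
    // the "reduce" step sums the occurrences per word.
    static Map<String, Long> wordCount(Stream<String> lines) {
        return lines
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+"))) // map: tokenize
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce: count
    }

    public static void main(String[] args) {
        // Two "lines" of sample input, standing in for data.txt
        System.out.println(wordCount(Stream.of("big data big jobs", "data")));
    }
}
```

On the cluster, the same per-word sums end up in the part files under the output path passed as the last argument (e.g. /cs585/output2.txt).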