1,051 changes: 1,045 additions & 6 deletions README.md

40 changes: 40 additions & 0 deletions Technical Report.md
@@ -0,0 +1,40 @@
<div align=center>

## Introduction to Hadoop MapReduce and Spark, and a Comparison Between Them
</div>

### Introduction to Hadoop MapReduce
MapReduce became popular as a programming model in 2004, when Google published a paper describing it. Google used MapReduce to collect and analyze website data in order to improve search performance for users, and ran it on its proprietary file system, GFS (Google File System). Around the same time, Apache Nutch, an open source web search engine, adopted MapReduce as well.


Hadoop MapReduce is a core part of the Hadoop distributed system. It is a software framework for processing large amounts of data, typically stored in the Hadoop Distributed File System (HDFS). The data can be structured or unstructured and can run to terabytes or petabytes. A MapReduce job consists of two main tasks: Map, which turns input records into intermediate key-value pairs, and Reduce, which combines the values that share a key.
<div align=center>
<img src = "https://github.com/gowarrior/dist-sys-practice/blob/master/technical-report/image.png" width="700" height="300" >
</div>
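
The two tasks can be illustrated with a word count, the usual introductory example. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the grouping step that Hadoop performs between Map and Reduce is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data big cluster", "data moves to compute"]
    print(word_count(docs))
```

In a real cluster each call to `map_phase` could run on a different node, which is what makes the model parallel by nature.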

MapReduce offers several benefits. It supports multiple programming languages, including Java, C++, and Python, so developers can choose according to their needs. Scalability is one of its most important properties: it is designed to handle petabytes of data stored in a single cluster. It is open source, so the community can work together to make it more efficient and to add other improvements. MapReduce programs are parallel by nature, which lets them process data much faster. Failure handling, which matters a great deal in a distributed system, is built in: data is stored as multiple copies, and the JobTracker monitors running tasks and reschedules any that fail. Finally, because a MapReduce program sends the computation to where the data resides (the Hadoop Distributed File System), it requires minimal data movement, which is more reliable and incurs less overhead.

<div align=center>
<img src = "https://github.com/gowarrior/dist-sys-practice/blob/master/technical-report/1.jpg" >
</div>
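
The replication and data-locality ideas above can be sketched as a toy scheduler. This is an illustration under stated assumptions, not Hadoop's actual placement or scheduling logic: the helper names `place_replicas` and `schedule` are hypothetical, and replicas are placed round-robin for simplicity, whereas real HDFS placement also takes racks into account.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (assumed policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def schedule(placement, live_nodes):
    """For each block, pick a live node that already holds a replica of it."""
    assignment = {}
    for block, holders in placement.items():
        candidates = [node for node in holders if node in live_nodes]
        if not candidates:
            raise RuntimeError(f"all replicas of {block} are lost")
        # The task runs where the data already is, so no block is copied.
        assignment[block] = candidates[0]
    return assignment


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2"], nodes, replication=2)
    print(schedule(placement, live_nodes={"n1", "n2", "n3", "n4"}))
    # Simulate the failure of n2: tasks move to another node holding a copy.
    print(schedule(placement, live_nodes={"n1", "n3", "n4"}))
```

When a node fails, the task for each affected block is simply assigned to another node that holds a copy, so the job can continue without moving data first.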

### Introduction to Apache Spark
Spark was originally developed at UC Berkeley's AMPLab in 2009, later than MapReduce. In 2013, Spark was donated to the Apache Software Foundation, which is why it is now called Apache Spark. It is an open source, unified analytics engine that is widely used by corporations around the world.

Spark has a variety of useful features and suits a wide range of workloads. It includes libraries for SQL (Spark SQL), machine learning (MLlib), graph computation (GraphX), and stream processing (Spark Streaming). It also supports multiple programming languages, including Java, Python, Scala, and R. One of its distinguishing aspects is its "in-memory" technology, which makes it a very fast data processing system. Spark keeps data in the memory of the cluster while working on it and can write data to disk when needed. This way, a user can keep part of the processed data in memory and leave the rest on disk.
<div align=center>
<img src = "https://github.com/gowarrior/dist-sys-practice/blob/master/technical-report/3.png" >
</div>
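
Why keeping data in memory helps, especially for iterative jobs, can be shown with a toy model. This is a plain-Python sketch, not Spark code: the `ToyDataset` class and the `expensive_load` and `iterate` helpers are hypothetical names. Its `cache()` method only loosely mirrors the idea behind the `cache()` and `persist()` methods that Spark provides.

```python
class ToyDataset:
    """A lazily evaluated dataset that can optionally keep its result in memory."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._cached = None       # in-memory copy, filled on first use
        self._use_cache = False
        self.computations = 0     # how many times the data was (re)computed

    def cache(self):
        """Mark the dataset to be kept in memory after it is first computed."""
        self._use_cache = True
        return self

    def collect(self):
        """Return the data, recomputing it unless a cached copy exists."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data


def expensive_load():
    """Stand-in for reading and transforming a large input."""
    return [x * x for x in range(5)]


def iterate(dataset, rounds):
    """An iterative job that reads the same dataset once per round."""
    total = 0
    for _ in range(rounds):
        total += sum(dataset.collect())
    return total


if __name__ == "__main__":
    plain = ToyDataset(expensive_load)
    cached = ToyDataset(expensive_load).cache()
    print(iterate(plain, 3), plain.computations)
    print(iterate(cached, 3), cached.computations)
```

Both runs produce the same total, but the uncached dataset is rebuilt on every round while the cached one is built only once, which is the effect that makes in-memory processing attractive for iterative algorithms.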

### Hadoop MapReduce vs Apache Spark
<div align=center>
<img src = "https://github.com/gowarrior/dist-sys-practice/blob/master/technical-report/4.png" >
</div>

* Hadoop MapReduce is better suited to linear processing of large datasets and is economical if you do not need the outcome immediately; Apache Spark is known for fast processing of huge datasets, iterative processing, near real-time processing, graph processing, machine learning, and joining datasets. Apache Spark appears to be the more modern solution, with more of the cutting-edge functionality that corporations need.

* Fault tolerance: both provide good fault handling, but through different approaches. The comparison suggests that MapReduce has slightly better fault tolerance.

* Compatibility: Spark’s compatibility with data types and data sources is the same as that of Hadoop MapReduce. Apache Spark can run standalone, on top of Hadoop YARN, or on Mesos, either on-premises or in the cloud. It supports data sources that implement Hadoop InputFormat, so it can integrate with all the data sources and file formats that Hadoop supports. According to the Spark website, it also works with BI tools via JDBC and ODBC. Hive and Pig integration are on the way.

* Security: Spark security is still less mature, while Hadoop MapReduce has more security features and related projects.
Binary file added technical-report/1.jpg
Binary file added technical-report/3.png
Binary file added technical-report/4.png
Binary file added technical-report/image.png