<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 431/631 451/651 Data-Intensive Distributed Computing (Winter 2019) at the University of Waterloo">
<meta name="author" content="Adam Roegiest">
<title>Data-Intensive Distributed Computing</title>
<!-- Bootstrap core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<script src="js/ie-emulation-modes-warning.js"></script>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li><a href="assignments.html">Assignments</a></li>
<li class="active"><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img width="250" src="images/waterloo_logo.png" alt="University of Waterloo logo"/></div>
<h1>Software <br/><small>Data-Intensive Distributed Computing (Winter 2019)</small></h1>
</div>
<div>
<h3>Bespin</h3>
<p><a href="http://bespin.io">Bespin</a> is a software library that
contains reference implementations of "big data" algorithms in
MapReduce and Spark. It provides sample code for many of the
algorithms we'll be discussing in class and also provides starting
points for the assignments. You'll want to familiarize yourself
with the library.</p>
<h3>Single-Node Hadoop: Linux Student CS Environment</h3>
<p>A single-node Hadoop cluster (also called "local" mode) comes
pre-configured in the <code>linux.student.cs.uwaterloo.ca</code>
environment. We will ensure that everything works correctly in this
environment.</p>
<p><b>TL;DR.</b> Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):</p>
<pre>
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
</pre>
<p>Note that we <b>do not</b> advise you to add the above lines to
your shell config file (e.g., <code>.bash_profile</code>), but rather
to set up your environment <i>explicitly</i> every time you log
in. The reason for this is to reduce the possibility of conflicts when
you start using the Datasci cluster (see below).</p>
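<p>One way to follow this advice without retyping the exports, sketched
here under the assumption that you keep the two <code>export</code> lines
in a separate file (the name <code>~/cs451-env.sh</code> is made up for
illustration and is not something the course provides), is to source that
file by hand after each login:</p>
<pre>
# ~/cs451-env.sh is a hypothetical file holding the two export lines shown above.
# Run this manually in each new session on linux.student.cs.uwaterloo.ca;
# it is deliberately NOT referenced from .bash_profile or any other shell config.
source ~/cs451-env.sh
</pre>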
<p><b>Details.</b> For the course we need Java, Scala, Hadoop, Spark,
and Maven. Java is already available in the default user environment
(but we need to point to the right version). The rest of the packages
are installed in <code>/u3/cs451/packages/</code>. The
directories <code>scala</code>, <code>hadoop</code>, <code>spark</code>,
and <code>maven</code> are actually symlinks to specific
versions. This is so that we can transparently change the links to
point to different versions if necessary without affecting downstream
users. Currently, the versions are:</p>
<ul>
<li>Java: 1.8.0</li>
<li>Scala: 2.11.8</li>
<li>Hadoop: 3.0.3</li>
<li>Spark: 2.3.1</li>
<li>Maven: 3.3.9</li>
</ul>
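<p>To check that your shell is picking up these versions after setting
up your environment, a quick sanity check along the following lines may
help. The output format differs from tool to tool, so none is shown
here; compare the reported version numbers against the list above.</p>
<pre>
# Each command prints the version of the tool that is first on your PATH.
java -version
scala -version
hadoop version
spark-submit --version
mvn -version

# Show which binaries are being resolved; on the student environment these
# would be expected to live under /u3/cs451/packages/.
which hadoop spark-submit mvn scala
</pre>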
</div>
<div>
<h3>Single-Node Hadoop: Personal Install</h3>
<p>As an alternative to using the single-node Hadoop cluster
on <code>linux.student.cs.uwaterloo.ca</code>, you may wish to install
all necessary software packages locally on your own machine. We
provide basic installation instructions here, but the course staff
cannot provide technical support due to the size of the class and the
idiosyncrasies of individual systems. We will be responsible for
making sure everything works properly in the Linux Student CS
Environment (above), but if you want to install everything on your own
machine for convenience, you're on your own.</p>
<p>Both Hadoop and Spark work fine on Mac OS X and Linux, but may be
difficult to get working on Windows. To run Hadoop and Spark
comfortably on your local machine, you'll need at least 4 GB of memory
and at least 10 GB of free disk space.</p>
<p>You'll also need Java (JDK 1.8), Scala (use Scala 2.11), and
Maven (any reasonably recent version).</p>
<p>The versions of the packages installed
on <code>linux.student.cs.uwaterloo.ca</code> are as follows:</p>
<ul>
<li><a href="http://www-us.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz">Hadoop 3.0.3</a></li>
<li><a href="http://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz">Spark 2.3.1</a></li>
</ul>
<p>Download the above packages, unpack the tarballs, add their
respective <code>bin/</code> directories to your path (and your shell
config), and you should be good to go.</p>
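<p>As a rough sketch of those steps in bash, assuming both tarballs have
already been downloaded into the current directory and that you want to
install under <code>$HOME/packages</code> (an arbitrary location chosen
for illustration; any directory works):</p>
<pre>
# Assumption: hadoop-3.0.3.tar.gz and spark-2.3.1-bin-hadoop2.7.tgz are in
# the current directory. $HOME/packages is a made-up install location.
mkdir -p "$HOME/packages"
tar -xzf hadoop-3.0.3.tar.gz -C "$HOME/packages"
tar -xzf spark-2.3.1-bin-hadoop2.7.tgz -C "$HOME/packages"

# Put both bin/ directories at the front of PATH for the current session.
# Add the same line to your shell config to make it permanent.
export PATH="$HOME/packages/hadoop-3.0.3/bin:$HOME/packages/spark-2.3.1-bin-hadoop2.7/bin:$PATH"
</pre>
<p>You will likely also need <code>JAVA_HOME</code> to point at your
local JDK 1.8 install; its location depends on your system, so it is
not shown here.</p>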
<p>Alternatively, you can also install the various packages using a
package manager, e.g., <code>apt-get</code>, MacPorts, etc. However,
make sure you get the right version.</p>
</div>
<div>
<h3>Distributed Hadoop Cluster: Datasci</h3>
<p>In addition to running "toy" Hadoop on a single node (which
obviously defeats the point of a distributed framework), we're going
to be using the school's modest Hadoop teaching cluster, called Datasci.</p>
<p>Accounts are already set up for students enrolled in the
course. You should be able to log into the cluster as follows:</p>
<pre>
ssh -D 1080 datasci.cs.uwaterloo.ca
</pre>
<p>The <code>-D</code> option specifies dynamic port forwarding, which
you'll need for accessing the Hadoop UIs through a SOCKS proxy. The
simplest approach is via the Firefox browser: go to preferences and
access "Network Proxy" settings: your settings should look
something <a href="images/datasci-proxy.png">like this</a>. You
should then be able to access the Resource Manager (RM) webapp
at <a href="http://datasci.datasci-domain.cs.uwaterloo.ca:8088/cluster"><code>http://datasci.datasci-domain.cs.uwaterloo.ca:8088/cluster</code></a>.
It's important that you get the proxy working, because the RM webapp
is the primary point of access for examining and debugging jobs on the
cluster.</p>
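<p>If the page does not load in the browser, one way to test the proxy
on its own is from the command line, assuming <code>curl</code> is
installed on your machine. This is only a sketch: it requests the RM
webapp through the SOCKS proxy opened by <code>ssh -D 1080</code>, so a
response containing HTML suggests the tunnel works and the problem is in
the browser settings.</p>
<pre>
# Run in a second terminal while the "ssh -D 1080" session is still open.
# --socks5-hostname sends the request through the SOCKS proxy on local port
# 1080 and lets the proxy side resolve the hostname, which matters if the
# name is only resolvable from inside the cluster network (an assumption).
curl --socks5-hostname localhost:1080 http://datasci.datasci-domain.cs.uwaterloo.ca:8088/cluster
</pre>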
<p>With Firefox, the proxy setup limits your ability to access other
sites; turn off the proxy once you're done with the cluster. One
helpful tip while working on assignments is to access the cluster
webapp in Firefox, and use another browser for accessing other
sites. There are many other equivalent ways to set up your proxy
(different OSes, different browsers, etc.) as well as alternative
workflows. Feel free to share tips, experiences, etc. on Piazza.</p>
</div>
<div style="padding-bottom: 100px"></div>
</div><!-- /.container -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>