Commit ccd2250

Merge pull request #5 from riptano/DSP-5934
Dsp 5934
2 parents 85abb0c + 1c0a4eb commit ccd2250

File tree: 62 files changed (+2957, -651 lines)


README.md

Lines changed: 115 additions & 27 deletions
@@ -12,24 +12,29 @@ See [Troubleshooting Tips](doc/troubleshooting.md) as well as [Yarn tips](doc/ya

 (Please add yourself to this list!)

-- Ooyala
-- Netflix
-- Avenida.com
+- [Ooyala](http://www.ooyala.com)
+- [Netflix](http://www.netflix.com)
+- [Avenida.com](http://www.avenida.com)
 - GumGum
 - Fuse Elements
 - Frontline Solvers
 - Aruba Networks
-- [Zed Worldwide](www.zed.com)
+- [Zed Worldwide](http://www.zed.com)
+- [KNIME](https://www.knime.org/)
+- [Azavea](http://azavea.com)
+- [Maana](http://maana.io/)

 ## Features

-- *"Spark as a Service"*: Simple REST interface for all aspects of job, context management
+- *"Spark as a Service"*: Simple REST interface (including HTTPS) for all aspects of job, context management
 - Support for Spark SQL, Hive, Streaming Contexts/jobs and custom job contexts! See [Contexts](doc/contexts.md).
+- LDAP Auth support via Apache Shiro integration
 - Supports sub-second low-latency jobs via long-running job contexts
 - Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
-- Kill running jobs via stop context
+- Kill running jobs via stop context and delete job
 - Separate jar uploading step for faster job startup
 - Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
+- Preliminary support for Java (see `JavaSparkJob`)
 - Works with Standalone Spark as well as Mesos and yarn-client
 - Job and jar info is persisted via a pluggable DAO interface
 - Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
@@ -44,12 +49,18 @@ See [Troubleshooting Tips](doc/troubleshooting.md) as well as [Yarn tips](doc/ya
 | 0.4.1 | 1.1.0 |
 | 0.5.0 | 1.2.0 |
 | 0.5.1 | 1.3.0 |
+| 0.5.2 | 1.3.1 |
+| master | 1.4.1 |

 For release notes, look in the `notes/` directory. They should also be up on [ls.implicit.ly](http://ls.implicit.ly/spark-jobserver/spark-jobserver).

-## Quick start / development mode
+## Quick Start

-NOTE: This quick start guide uses SBT to run the job server and the included test jar, but the normal development process is to create a separate project for Job Server jobs and to deploy the job server to a Spark cluster. Please see the deployment section below for more details.
+The easiest way to get started is to try the [Docker container](doc/docker.md) which prepackages a Spark distribution with the job server and lets you start and deploy it.
+
+## Development mode
+
+The example walk-through below shows you how to use the job server with an included example job, by running the job server in local development mode in SBT. This is not an example of usage in production.

 You need to have [SBT](http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html) installed.

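A minimal sketch of the development-mode flow described in the hunk above, assuming the server is launched from the interactive SBT prompt (the note added further down states that `reStart` cannot be passed to `sbt` from the OS shell):

```
$ sbt
> reStart
```

Typing `reStart` again at the same prompt recompiles your changes and restarts the job server, per the Revolver notes below.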
@@ -67,6 +78,8 @@ Note that reStart (SBT Revolver) forks the job server in a separate process. If
 type reStart again at the SBT shell prompt, it will compile your changes and restart the jobserver. It enables
 very fast turnaround cycles.

+**NOTE2**: You cannot do `sbt reStart` from the OS shell. SBT will start job server and immediately kill it.
+
 For example jobs see the job-server-tests/ project / folder.

 When you use `reStart`, the log file goes to `job-server/job-server-local.log`. There is also an environment variable
@@ -80,7 +93,7 @@ Then go ahead and start the job server using the instructions above.

 Let's upload the jar:

-curl --data-binary @job-server-tests/target/job-server-tests-$VER.jar localhost:8090/jars/test
+curl --data-binary @job-server-tests/target/scala-2.10/job-server-tests-$VER.jar localhost:8090/jars/test
 OK⏎

 #### Ad-hoc Mode - Single, Unrelated Jobs (Transient Context)
@@ -150,11 +163,11 @@ In your `build.sbt`, add this to use the job server jar:

 resolvers += "Job Server Bintray" at "https://dl.bintray.com/spark-jobserver/maven"

-libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.5.1" % "provided"
+libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.5.2" % "provided"

 If a SQL or Hive job/context is desired, you also want to pull in `job-server-extras`:

-libraryDependencies += "spark.jobserver" %% "job-server-extras" % "0.5.1" % "provided"
+libraryDependencies += "spark.jobserver" %% "job-server-extras" % "0.5.2" % "provided"

 For most use cases it's better to have the dependencies be "provided" because you don't want SBT assembly to include the whole job server jar.

@@ -168,9 +181,9 @@ object SampleJob extends SparkJob {
 ```

 - `runJob` contains the implementation of the Job. The SparkContext is managed by the JobServer and will be provided to the job through this method.
-This releaves the developer from the boiler-plate configuration management that comes with the creation of a Spark job and allows the Job Server to
+This relieves the developer from the boiler-plate configuration management that comes with the creation of a Spark job and allows the Job Server to
 manage and re-use contexts.
-- `validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning `spark.jobserver.SparkJobValid` will let the job execute, otherwise returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and provides means to convey the reason of failure. In this case, the call immediatly returns an `HTTP/1.1 400 Bad Request` status code.
+- `validate` allows for an initial validation of the context and any provided configuration. If the context and configuration are OK to run the job, returning `spark.jobserver.SparkJobValid` will let the job execute, otherwise returning `spark.jobserver.SparkJobInvalid(reason)` prevents the job from running and provides means to convey the reason of failure. In this case, the call immediately returns an `HTTP/1.1 400 Bad Request` status code.
 `validate` helps you preventing running jobs that will eventually fail due to missing or wrong configuration and save both time and resources.

 Let's try running our sample job with an invalid configuration:
@@ -229,12 +242,72 @@ def validate(sc:SparkContext, config: Contig): SparkJobValidation = {
 }
 ```

+### HTTPS / SSL Configuration
+To activate ssl communication, set these flags in your application.conf file (Section 'spray.can.server'):
+```
+ssl-encryption = on
+# absolute path to keystore file
+keystore = "/some/path/sjs.jks"
+keystorePW = "changeit"
+```
+
+You will need a keystore that contains the server certificate. The bare minimum is achieved with this command which creates a self-signed certificate:
+```
+keytool -genkey -keyalg RSA -alias jobserver -keystore ~/sjs.jks -storepass changeit -validity 360 -keysize 2048
+```
+You may place the keystore anywhere.
+Here is an example of a simple curl command that utilizes ssl:
+```
+curl -k https://localhost:8090/contexts
+```
+The ```-k``` flag tells curl to "Allow connections to SSL sites without certs". Export your server certificate and import it into the client's truststore to fully utilize ssl security.
+
+### Authentication
+
+Authentication uses the [Apache Shiro](http://shiro.apache.org/index.html) framework. Authentication is activated by setting this flag (Section 'shiro'):
+```
+authentication = on
+# absolute path to shiro config file, including file name
+config.path = "/some/path/shiro.ini"
+```
+Shiro-specific configuration options should be placed into a file named 'shiro.ini' in the directory as specified by the config option 'config.path'.
+Here is an example that configures LDAP with user group verification:
+```
+# use this for basic ldap authorization, without group checking
+# activeDirectoryRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
+# use this for checking group membership of users based on the 'member' attribute of the groups:
+activeDirectoryRealm = spark.jobserver.auth.LdapGroupRealm
+# search base for ldap groups (only relevant for LdapGroupRealm):
+activeDirectoryRealm.contextFactory.environment[ldap.searchBase] = dc=xxx,dc=org
+# allowed groups (only relevant for LdapGroupRealm):
+activeDirectoryRealm.contextFactory.environment[ldap.allowedGroups] = "cn=group1,ou=groups", "cn=group2,ou=groups"
+activeDirectoryRealm.contextFactory.environment[java.naming.security.credentials] = password
+activeDirectoryRealm.contextFactory.url = ldap://localhost:389
+activeDirectoryRealm.userDnTemplate = cn={0},ou=people,dc=xxx,dc=org
+
+cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
+
+securityManager.cacheManager = $cacheManager
+```
+
+Make sure to edit the url, credentials, userDnTemplate, ldap.allowedGroups and ldap.searchBase settings in accordance with your local setup.
+
+Here is an example of a simple curl command that authenticates a user and uses ssl (you may want to use -H to hide the
+credentials, this is just a simple example to get you started):
+```
+curl -k --basic --user 'user:pw' https://localhost:8090/contexts
+```
+
 ## Deployment

-1. Copy `config/local.sh.template` to `<environment>.sh` and edit as appropriate.
-2. `bin/server_deploy.sh <environment>` -- this packages the job server along with config files and pushes
+### Manual steps
+
+1. Copy `config/local.sh.template` to `<environment>.sh` and edit as appropriate. NOTE: be sure to set SPARK_VERSION if you need to compile against a different version, ie. 1.4.1 for job server 0.5.2
+2. Copy `config/shiro.ini.template` to `shiro.ini` and edit as appropriate. NOTE: only required when `authentication = on`
+3. Copy `config/local.conf.template` to `<environment>.conf` and edit as appropriate.
+4. `bin/server_deploy.sh <environment>` -- this packages the job server along with config files and pushes
 it to the remotes you have configured in `<environment>.sh`
-3. On the remote server, start it in the deployed directory with `server_start.sh` and stop it with `server_stop.sh`
+5. On the remote server, start it in the deployed directory with `server_start.sh` and stop it with `server_stop.sh`

 The `server_start.sh` script uses `spark-submit` under the hood and may be passed any of the standard extra arguments from `spark-submit`.

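The SSL section added above suggests exporting the server certificate and importing it into the client's truststore instead of relying on `-k`. A rough sketch of that step, reusing the alias, keystore path and password from the keytool example in the hunk (the exported file name and the client truststore path are placeholders, and the curl call assumes the certificate's CN matches the host you connect to):

```
# export the self-signed server certificate in PEM form
keytool -exportcert -rfc -alias jobserver -keystore ~/sjs.jks -storepass changeit -file sjs.pem
# import it into a Java client truststore, for JVM-based clients
keytool -importcert -alias jobserver -file sjs.pem -keystore client-truststore.jks -storepass changeit -noprompt
# curl can then verify the server certificate instead of skipping verification with -k
curl --cacert sjs.pem https://localhost:8090/contexts
```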
@@ -243,10 +316,14 @@ NOTE: by default the assembly jar from `job-server-extras`, which includes suppo
 Note: to test out the deploy to a local staging dir, or package the job server for Mesos,
 use `bin/server_package.sh <environment>`.

+### Chef
+
+There is also a [Chef cookbook](https://github.com/spark-jobserver/chef-spark-jobserver) which can be used to deploy Spark Jobserver.
+
 ## Architecture

 The job server is intended to be run as one or more independent processes, separate from the Spark cluster
-(though it very well may be colocated with say the Master).
+(though it very well may be collocated with say the Master).

 At first glance, it seems many of these functions (eg job management) could be integrated into the Spark standalone master. While this is true, we believe there are many significant reasons to keep it separate:

@@ -266,8 +343,8 @@ Flow diagrams are checked in in the doc/ subdirectory. .diagram files are for w

 ### Contexts

-GET /contexts - lists all current contexts
-POST /contexts/<name> - creates a new context
+GET /contexts - lists all current contexts
+POST /contexts/<name> - creates a new context
 DELETE /contexts/<name> - stops a context and all jobs running in it

 ### Jobs
@@ -284,6 +361,24 @@ the REST API.

 For details on the Typesafe config format used for input (JSON also works), see the [Typesafe Config docs](https://github.com/typesafehub/config).

+### Data
+
+It is sometime necessary to programmatically upload files to the server. Use these paths to manage such files:
+
+GET /data - Lists previously uploaded files that were not yet deleted
+POST /data/<prefix> - Uploads a new file, the full path of the file on the server is returned, the
+prefix is the prefix of the actual filename used on the server (a timestamp is
+added to ensure uniqueness)
+DELETE /data/<filename> - Deletes the specified file (only if under control of the JobServer)
+
+These files are uploaded to the server and are stored in a local temporary
+directory on the server where the JobServer runs. The POST command returns the full
+pathname and filename of the uploaded file so that later jobs can work with this
+just the same as with any other server-local file. A job could therefore add this file to HDFS or distribute
+it to worker nodes via the SparkContext.addFile command.
+For files that are larger than a few hundred MB, it is recommended to manually upload these files to the server or
+to directly add them to your HDFS.
+
 ### Context configuration

 A number of context-specific settings can be controlled when creating a context (POST /contexts) or running an
@@ -359,17 +454,13 @@ for instance: `sbt ++2.11.6 job-server/compile`

 ### Publishing packages

-- Be sure you are in the master project
-- Run `+test` to ensure all tests pass for all scala versions
-- Now just run `+publish` and package will be published to bintray
+In the root project, do `release cross`.

 To announce the release on [ls.implicit.ly](http://ls.implicit.ly/), use
 [Herald](https://github.com/n8han/herald#install) after adding release notes in
 the `notes/` dir. Also regenerate the catalog with `lsWriteVersion` SBT task
 and `lsync`, in project job-server.

-TODO: Automate the above steps with `sbt-release`.
-
 ## Contact

 For user/dev questions, we are using google group for discussions:
@@ -381,11 +472,8 @@ Please report bugs/problems to:
 ## License
 Apache 2.0, see LICENSE.md

-Copyright(c) 2014, Ooyala, Inc.
-
 ## TODO

-- Have server_start.sh use spark-submit (#155, others) - would help resolve classpath/dependency issues.
 - More debugging for classpath issues
 - Update .g8 template, consider creating Activator template for sample job
 - Add Swagger support. See the spray-swagger project.
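Taken together, the context endpoints and the new Data API documented in the hunks above can be exercised with plain curl. A hedged sketch, not a definitive usage guide: the context name, the sample file name, and the use of `--data-binary` for the upload are assumptions, and the DELETE on `/data` must use the server-generated filename returned by the POST:

```
# contexts
curl -X POST localhost:8090/contexts/test-context
curl localhost:8090/contexts
curl -X DELETE localhost:8090/contexts/test-context

# data files
curl --data-binary @input.csv localhost:8090/data/input.csv
curl localhost:8090/data
curl -X DELETE localhost:8090/data/<filename-returned-by-the-POST>
```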

akka-app/src/ooyala.common.akka/web/WebService.scala

Lines changed: 4 additions & 1 deletion
@@ -2,6 +2,8 @@ package ooyala.common.akka.web

 import akka.actor.ActorSystem
 import spray.routing.{Route, SimpleRoutingApp}
+import javax.net.ssl.SSLContext
+import spray.io.ServerSSLEngineProvider

 /**
  * Contains methods for starting an embedded Spray web server.
@@ -17,7 +19,8 @@ object WebService extends SimpleRoutingApp {
   * @param port The port number to bind to
   */
  def start(route: Route, system: ActorSystem,
-           host: String = "0.0.0.0", port: Int = 8080) {
+           host: String = "0.0.0.0", port: Int = 8080)(implicit sslContext: SSLContext,
+            sslEngineProvider: ServerSSLEngineProvider) {
    implicit val actorSystem = system
    startServer(host, port)(route)
  }

bin/server_deploy.sh

Lines changed: 2 additions & 0 deletions
@@ -41,7 +41,9 @@ fi
 FILES="job-server-extras/target/scala-$majorVersion/spark-job-server.jar
        bin/server_start.sh
        bin/server_stop.sh
+       bin/kill-process-tree.sh
        $CONFIG_DIR/$ENV.conf
+       config/shiro.ini
        config/log4j-server.properties"

 ssh_key_to_use=""
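The new entries above ship `bin/kill-process-tree.sh` and `config/shiro.ini` with the deploy bundle, which lines up with the manual deployment steps added to the README. A sketch of that flow, assuming the environment files live under `config/` and using `prod` as a placeholder environment name:

```
cp config/local.sh.template config/prod.sh       # edit SPARK_VERSION and the deploy hosts
cp config/shiro.ini.template config/shiro.ini    # only needed when authentication = on
cp config/local.conf.template config/prod.conf
bin/server_deploy.sh prod
```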

bin/server_start.sh

Lines changed: 36 additions & 15 deletions
@@ -2,6 +2,11 @@
 # Script to start the job server
 # Extra arguments will be spark-submit options, for example
 # ./server_start.sh --jars cassandra-spark-connector.jar
+#
+# Environment vars (note settings.sh overrides):
+#   JOBSERVER_MEMORY - defaults to 1G, the amount of memory (eg 512m, 2G) to give to job server
+#   JOBSERVER_CONFIG - alternate configuration file to use
+#   JOBSERVER_FG - launches job server in foreground; defaults to forking in background
 set -e

 get_abs_script_path() {
@@ -18,18 +23,26 @@ GC_OPTS="-XX:+UseConcMarkSweepGC
          -XX:MaxPermSize=512m
          -XX:+CMSClassUnloadingEnabled "

-JAVA_OPTS="-XX:MaxDirectMemorySize=512M
-           -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true
-           -Dcom.sun.management.jmxremote.port=9999
-           -Dcom.sun.management.jmxremote.authenticate=false
+# To truly enable JMX in AWS and other containerized environments, also need to set
+# -Djava.rmi.server.hostname equal to the hostname in that environment. This is specific
+# depending on AWS vs GCE etc.
+JAVA_OPTS="-XX:MaxDirectMemorySize=512M \
+           -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true \
+           -Dcom.sun.management.jmxremote.port=9999 \
+           -Dcom.sun.management.jmxremote.rmi.port=9999 \
+           -Dcom.sun.management.jmxremote.authenticate=false \
            -Dcom.sun.management.jmxremote.ssl=false"

 MAIN="spark.jobserver.JobServer"

-conffile="$(ls -1 "$appdir"/*.conf | head -1)"
-if [ -z "$conffile" ]; then
-  echo "No configuration file found"
-  exit 1
+if [ -f "$JOBSERVER_CONFIG" ]; then
+  conffile="$JOBSERVER_CONFIG"
+else
+  conffile=$(ls -1 $appdir/*.conf | head -1)
+  if [ -z "$conffile" ]; then
+    echo "No configuration file found"
+    exit 1
+  fi
 fi

 if [ -f "$appdir/settings.sh" ]; then
@@ -71,12 +84,14 @@ if [ "$PORT" != "" ]; then
   CONFIG_OVERRIDES+="-Dspark.jobserver.port=$PORT "
 fi

-if [ -z "$DRIVER_MEMORY" ]; then
-  DRIVER_MEMORY=1G
+if [ -z "$JOBSERVER_MEMORY" ]; then
+  JOBSERVER_MEMORY=1G
 fi

 # This needs to be exported for standalone mode so drivers can connect to the Spark cluster
 export SPARK_HOME
+export YARN_CONF_DIR
+export HADOOP_CONF_DIR

 # Identify location of dse command
 DSE="/usr/bin/dse"
@@ -95,8 +110,14 @@ if [ ! -e "$DSE" ]; then
 fi

 # Submit the job server
-"$DSE" spark-submit --class "$MAIN" --driver-memory 5G \
-  --conf "spark.executor.extraJavaOptions=$LOGGING_OPTS" \
-  --driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES" \
-  "$@" "$appdir/spark-job-server.jar" "$conffile" 2>&1 &
-echo "$!" > "$pidFilePath"
+cmd='$DSE spark-submit --class $MAIN --driver-memory $JOBSERVER_MEMORY
+  --conf "spark.executor.extraJavaOptions=$LOGGING_OPTS"
+  --driver-java-options "$GC_OPTS $JAVA_OPTS $LOGGING_OPTS $CONFIG_OVERRIDES"
+  $@ $appdir/spark-job-server.jar $conffile'
+
+if [ -z "$JOBSERVER_FG" ]; then
+  eval $cmd 2>&1 &
+  echo $! > $pidFilePath
+else
+  eval $cmd
+fi
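The header comments added above document three new environment variables, and the script only tests `JOBSERVER_FG` for non-emptiness, so any non-empty value keeps the server in the foreground. Hypothetical invocations, with memory size and config path as placeholders:

```
JOBSERVER_MEMORY=2G ./server_start.sh
JOBSERVER_CONFIG=/etc/job-server/other.conf ./server_start.sh
JOBSERVER_FG=1 ./server_start.sh --jars extra-lib.jar
```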

build.sbt

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+enablePlugins(DockerPlugin)
