
Commit c00186f

Thomas Graves (tgravescs) authored and committed

[SPARK-25023] Clarify Spark security documentation

## What changes were proposed in this pull request?

Clarify documentation about security.

## How was this patch tested?

None, just documentation.

Closes apache#22852 from tgravescs/SPARK-25023.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>

1 parent e91b607

File tree

7 files changed: +45 −2 lines changed

docs/index.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -10,6 +10,11 @@ It provides high-level APIs in Java, Scala, Python and R,
 and an optimized engine that supports general execution graphs.
 It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](ml-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) before downloading and running Spark.
+
 # Downloading
 
 Get Spark from the [downloads page](https://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. Spark uses Hadoop's client libraries for HDFS and YARN. Downloads are pre-packaged for a handful of popular Hadoop versions.
```

docs/quick-start.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -17,6 +17,11 @@ you can download a package for any version of Hadoop.
 
 Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs are replaced by Dataset, which is strongly-typed like an RDD, but with richer optimizations under the hood. The RDD interface is still supported, and you can get a more detailed reference at the [RDD programming guide](rdd-programming-guide.html). However, we highly recommend you to switch to use Dataset, which has better performance than RDD. See the [SQL programming guide](sql-programming-guide.html) to get more information about Dataset.
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) before running Spark.
+
 # Interactive Analysis with the Spark Shell
 
 ## Basics
```

docs/running-on-kubernetes.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -12,6 +12,11 @@ Kubernetes scheduler that has been added to Spark.
 In future versions, there may be behavioral changes around configuration,
 container images and entrypoints.**
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) and the specific security sections in this doc before running Spark.
+
 # Prerequisites
 
 * A runnable distribution of Spark 2.3 or above.
```

docs/running-on-mesos.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -13,6 +13,11 @@ The advantages of deploying Spark with Mesos include:
 [frameworks](https://mesos.apache.org/documentation/latest/frameworks/)
 - scalable partitioning between multiple instances of Spark
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) and the specific security sections in this doc before running Spark.
+
 # How it Works
 
 In a standalone cluster deployment, the cluster manager in the below diagram is a Spark master
```

docs/running-on-yarn.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -9,6 +9,11 @@ Support for running on [YARN (Hadoop
 NextGen)](http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html)
 was added to Spark in version 0.6.0, and improved in subsequent releases.
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) and the specific security sections in this doc before running Spark.
+
 # Launching Spark on YARN
 
 Ensure that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` points to the directory which contains the (client side) configuration files for the Hadoop cluster.
```
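As a minimal illustration of the security note this patch adds for YARN deployments: RPC authentication can be switched on with a single property, since on YARN Spark generates and distributes the shared secret automatically. A sketch of a `spark-defaults.conf` fragment (not a complete hardening guide; see the security doc for the full set of options):

```properties
# On YARN, enabling authentication is enough; the shared secret
# is generated and securely distributed by Spark automatically.
spark.authenticate  true
```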

docs/security.md

Lines changed: 15 additions & 2 deletions
```diff
@@ -6,7 +6,20 @@ title: Security
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-# Spark RPC
+# Spark Security: Things You Need To Know
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Spark supports multiple deployments types and each one supports different levels of security. Not
+all deployment types will be secure in all environments and none are secure by default. Be
+sure to evaluate your environment, what Spark supports, and take the appropriate measure to secure
+your Spark deployment.
+
+There are many different types of security concerns. Spark does not necessarily protect against
+all things. Listed below are some of the things Spark supports. Also check the deployment
+documentation for the type of deployment you are using for deployment specific settings. Anything
+not documented, Spark does not support.
+
+# Spark RPC (Communication protocol between Spark processes)
 
 ## Authentication
 
@@ -123,7 +136,7 @@ The following table describes the different options available for configuring th
 Spark supports encrypting temporary data written to local disks. This covers shuffle files, shuffle
 spills and data blocks stored on disk (for both caching and broadcast variables). It does not cover
 encrypting output data generated by applications with APIs such as `saveAsHadoopFile` or
-`saveAsTable`.
+`saveAsTable`. It also may not cover temporary files created explicitly by the user.
 
 The following settings cover enabling encryption for data written to disk:
```
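As a sketch of the local-disk encryption settings the amended paragraph refers to, a `spark-defaults.conf` fragment might look like the following; `spark.io.encryption.enabled` is the main switch, and the key size shown is the default (128, 192, and 256 bits are supported):

```properties
# Encrypt shuffle files, shuffle spills, and disk-stored blocks
# (caching and broadcast). Does NOT cover application output such
# as saveAsHadoopFile / saveAsTable.
spark.io.encryption.enabled      true
spark.io.encryption.keySizeBits  128
```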

docs/spark-standalone.md

Lines changed: 5 additions & 0 deletions
```diff
@@ -8,6 +8,11 @@ title: Spark Standalone Mode
 
 In addition to running on the Mesos or YARN cluster managers, Spark also provides a simple standalone deploy mode. You can launch a standalone cluster either manually, by starting a master and workers by hand, or use our provided [launch scripts](#cluster-launch-scripts). It is also possible to run these daemons on a single machine for testing.
 
+# Security
+
+Security in Spark is OFF by default. This could mean you are vulnerable to attack by default.
+Please see [Spark Security](security.html) and the specific security sections in this doc before running Spark.
+
 # Installing Spark Standalone to a Cluster
 
 To install Spark Standalone mode, you simply place a compiled version of Spark on each node on the cluster. You can obtain pre-built versions of Spark with each release or [build it yourself](building-spark.html).
```
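To illustrate the security note above for standalone mode: unlike YARN, there is no external service to distribute an authentication secret, so the same shared secret must be configured everywhere. A minimal `spark-defaults.conf` sketch, assuming the identical configuration is applied to the daemons and to every application (the secret value is a placeholder):

```properties
# Standalone mode has no service to distribute secrets; set the
# same shared secret on every node and distribute it securely.
spark.authenticate         true
spark.authenticate.secret  change-me-placeholder
```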

0 commit comments