This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

liyinan926 (Member)

This is the same PR as apache#19954, but against our fork for triggering integration tests.

@kimoonkim @foxish

Marcelo Vanzin and others added 26 commits December 20, 2017 11:31
…ort.

Still look at the old one in case any Spark user is setting it
explicitly, though.
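
A minimal sketch of the fallback pattern, with hypothetical config keys (not the keys touched by this commit):

```scala
import org.apache.spark.SparkConf

// Hypothetical keys: prefer the new config, but still honor the old one
// if a Spark user sets it explicitly.
val conf = new SparkConf()
val value: Option[String] =
  conf.getOption("spark.new.key").orElse(conf.getOption("spark.old.key"))
```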

Author: Marcelo Vanzin <[email protected]>

Closes apache#19983 from vanzin/SPARK-22788.
## What changes were proposed in this pull request?

Some users depend on source compatibility with the org.apache.spark.sql.execution.streaming.Offset class. Although this is not a stable interface, we can keep it in place for now to simplify upgrades to 2.3.

Author: Jose Torres <[email protected]>

Closes apache#20012 from joseph-torres/binary-compat.
## What changes were proposed in this pull request?
unpersist unused datasets
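
A minimal sketch of the pattern, with a hypothetical cached dataset:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
// Cache an intermediate dataset while it is reused...
val training = spark.range(1000).toDF("id").persist()
training.count()  // materialize the cache
// ...and release the cached blocks once it is no longer needed.
training.unpersist()
```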

## How was this patch tested?
Existing tests and a local check in spark-shell.

Author: Zheng RuiFeng <[email protected]>

Closes apache#20017 from zhengruifeng/bkm_unpersist.
## What changes were proposed in this pull request?
In the previous PR apache#5755 (comment), we dropped `(-[classifier])` from the retrieval pattern. We should add it back; otherwise, per the Ivy documentation:
> If this pattern for instance doesn't have the [type] or [classifier] token, Ivy will download the source/javadoc artifacts to the same file as the regular jar.
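
For illustration, a retrieval pattern carrying the `(-[classifier])` token could look like the following; this is a sketch, not necessarily the exact string used by SparkSubmit:

```scala
// With "(-[classifier])", a sources jar resolves to e.g.
// "org_artifact-1.0-sources.jar" instead of clobbering "org_artifact-1.0.jar".
val retrievePattern = "[organization]_[artifact]-[revision](-[classifier]).[ext]"
```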

## How was this patch tested?
The existing tests

Author: gatorsmile <[email protected]>

Closes apache#20037 from gatorsmile/addClassifier.
## What changes were proposed in this pull request?

* Under the Spark Scala examples, some of the code was written in a Java style; it has been rewritten per the Scala style guide.
* Most of the changes involve `println()` statements (see the sketch below).
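
An illustrative before/after, not taken verbatim from the diff:

```scala
val name = "Spark"
println("Hello " + name + "!")  // Java-style string concatenation
println(s"Hello $name!")        // idiomatic Scala string interpolation
```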

## How was this patch tested?

Since all proposed changes rewrite println statements in the Scala way, a manual run was used to verify the output.

Author: chetkhatri <[email protected]>

Closes apache#20016 from chetkhatri/scala-style-spark-examples.
…assigning schedulingPool for stage

## What changes were proposed in this pull request?

In AppStatusListener's onStageSubmitted(event: SparkListenerStageSubmitted) method, there is duplicated code:
```
// schedulingPool was assigned twice with the same code
stage.schedulingPool = Option(event.properties).flatMap { p =>
      Option(p.getProperty("spark.scheduler.pool"))
    }.getOrElse(SparkUI.DEFAULT_POOL_NAME)
...
...
...
stage.schedulingPool = Option(event.properties).flatMap { p =>
      Option(p.getProperty("spark.scheduler.pool"))
    }.getOrElse(SparkUI.DEFAULT_POOL_NAME)

```
But it does not make any sense to do this, and there is no comment explaining it.

## How was this patch tested?
N/A

Author: wuyi <[email protected]>

Closes apache#20033 from Ngone51/dev-spark-22847.
…ay to take time instead of int

## What changes were proposed in this pull request?

Fix a configuration that took an int but should take a time value. Discussion in apache#19946 (comment).
The granularity is milliseconds rather than seconds, since there is a use case for sub-second reactions to scale up rapidly, especially with dynamic allocation.
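
A sketch of the resulting declaration, assuming Spark's internal `ConfigBuilder` DSL:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

// Declared as a time config in milliseconds instead of a plain int, so
// users can write sub-second values such as "500ms".
val KUBERNETES_ALLOCATION_BATCH_DELAY =
  ConfigBuilder("spark.kubernetes.allocation.batch.delay")
    .doc("Time to wait between each round of executor allocation.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")
```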

## How was this patch tested?

TODO: manual run of integration tests against this PR.
PTAL

cc/ mccheah liyinan926 kimoonkim vanzin mridulm jiangxb1987 ueshin

Author: foxish <[email protected]>

Closes apache#20032 from foxish/fix-time-conf.
…h huber loss.

## What changes were proposed in this pull request?
Expose Python API for _LinearRegression_ with _huber_ loss.
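
The Python API mirrors the existing Scala one; a minimal Scala sketch of the feature being exposed:

```scala
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLoss("huber")  // robust loss instead of the default "squaredError"
  .setEpsilon(1.35)  // shape parameter controlling robustness to outliers
```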

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#19994 from yanboliang/spark-22810.
…e options

## What changes were proposed in this pull request?

Introduce a new interface `SessionConfigSupport` for `DataSourceV2`, it can help to propagate session configs with the specified key-prefix to all data source operations in this session.
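
A minimal sketch of an implementation (class name and prefix hypothetical):

```scala
import org.apache.spark.sql.sources.v2.{DataSourceV2, SessionConfigSupport}

// With keyPrefix "mysource", session configs of the form
// spark.datasource.mysource.* are propagated to this source's operations.
class MySourceV2 extends DataSourceV2 with SessionConfigSupport {
  override def keyPrefix(): String = "mysource"
}
```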

## How was this patch tested?

Add new test suite `DataSourceV2UtilsSuite`.

Author: Xingbo Jiang <[email protected]>

Closes apache#19861 from jiangxb1987/datasource-configs.
…orized summarizer

## What changes were proposed in this pull request?

Make several improvements in dataframe vectorized summarizer.

1. Make the summarizer return `Vector` type for all metrics (except "count").
Previously it returned `WrappedArray`, which was not very convenient.

2. Make `MetricsAggregate` inherit the `ImplicitCastInputTypes` trait, so it can check and implicitly cast input values.

3. Add a "weight" parameter to all single-metric methods (see the sketch after this list).

4. Update doc and improve the example code in doc.

5. Simplify test cases.
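
A sketch of the resulting API, assuming a DataFrame `df` with a Vector column "features" and a Double column "weight":

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Multiple weighted metrics at once; the result fields are Vectors.
val summary = df.select(
  Summarizer.metrics("mean", "variance")
    .summary(col("features"), col("weight")).as("summary"))
// A single weighted metric.
val means = df.select(Summarizer.mean(col("features"), col("weight")))
```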

## How was this patch tested?

Test added and simplified.

Author: WeichenXu <[email protected]>

Closes apache#19156 from WeichenXu123/improve_vec_summarizer.
## What changes were proposed in this pull request?

This PR eliminates mutable states from the generated code for `Stack`.

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#20035 from kiszk/SPARK-22848.
## What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

* Java refactoring for more simple API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python

## How was this patch tested?

Existing tests

Author: Bryan Cutler <[email protected]>
Author: Shixiong Zhu <[email protected]>

Closes apache#19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.
## What changes were proposed in this pull request?

Moves the -Xlint:unchecked flag in the sbt build configuration from Compile to (Compile, compile) scope, allowing publish and publishLocal commands to work.
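
A sketch of the scoping change; the exact settings in SparkBuild.scala may differ:

```scala
// Scoping the flag to the compile task keeps it away from the javadoc
// task, which rejects -Xlint:unchecked and broke publish/publishLocal.
javacOptions in (Compile, compile) += "-Xlint:unchecked"
```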

## How was this patch tested?

Successfully published the spark-launcher subproject from within sbt, which fails without this patch.

Author: Erik LaBianca <[email protected]>

Closes apache#20040 from easel/javadoc-xlint.
Prevents Scala 2.12 scaladoc from blowing up when attempting to parse Java comments.

## What changes were proposed in this pull request?

Adds -no-java-comments to docs/scalacOptions under Scala 2.12. Also
moves scaladoc configs out of the TestSettings and into the standard sharedSettings
section in SparkBuild.scala.
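
A sketch of the added setting, assuming sbt's standard keys:

```scala
// Only Scala 2.12's scaladoc needs -no-java-comments; earlier versions
// do not understand the flag.
scalacOptions in (Compile, doc) ++= {
  if (scalaBinaryVersion.value == "2.12") Seq("-no-java-comments")
  else Seq.empty
}
```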

## How was this patch tested?

```
SBT_OPTS=-Dscala-2.12 sbt
++2.12.4
tags/publishLocal
```

Author: Erik LaBianca <[email protected]>

Closes apache#20042 from easel/scaladoc-212.
…split by CodegenContext.splitExpressions()

## What changes were proposed in this pull request?

Passing global variables to the split method is dangerous, as any mutation of them inside the method is ignored and may lead to unexpected behavior.

To prevent this, one approach is to make sure no expression outputs global variables: localizing the lifetime of mutable states in expressions.

Another approach is, when calling `ctx.splitExpression`, make sure we don't use children's output as parameter names.

Approach 1 is actually hard to do, as we need to check all expressions and operators that support whole-stage codegen. Approach 2 is easier as the callers of `ctx.splitExpressions` are not too many.

Besides, approach 2 is more flexible, as children's output may be something that cannot serve as a parameter name: a literal, an inlined statement (a + 1), etc.

close apache#19865
close apache#19938

## How was this patch tested?

existing tests

Author: Wenchen Fan <[email protected]>

Closes apache#20021 from cloud-fan/codegen.
## What changes were proposed in this pull request?

In apache#19681 we introduced a new interface called `AppStatusPlugin` to register listeners and set up the UI for both the live UI and the history UI.

However, I think it's overkill for the live UI. For example, we should not register `SQLListener` if users are not using SQL functions. Previously we registered the `SQLListener` and set up the SQL tab when a `SparkSession` was first created, which indicates users are going to use SQL functions. But in apache#19681 we do this during `SparkContext` creation. The same applies to streaming too.

I think we should keep the previous behavior, and only use this new interface for the history server.

To reflect this change, I also rename the new interface to `SparkHistoryUIPlugin`.

This PR also refines the tests for sql listener.

## How was this patch tested?

existing tests

Author: Wenchen Fan <[email protected]>

Closes apache#19981 from cloud-fan/listener.
…ecision

## What changes were proposed in this pull request?

Test Coverage for `WindowFrameCoercion` and `DecimalPrecision`, this is a Sub-tasks for [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).

## How was this patch tested?

N/A

Author: Yuming Wang <[email protected]>

Closes apache#20008 from wangyum/SPARK-22822.
…ild's partitioning is not decided

## What changes were proposed in this pull request?

This is a follow-up PR of apache#19257, where gatorsmile left a couple of comments regarding code style.

## How was this patch tested?

Doesn't change any functionality; relies on the build to verify that no checkstyle rules are violated.

Author: Tejas Patil <[email protected]>

Closes apache#20041 from tejasapatil/followup_19257.
When one execution has multiple jobs, we need to append to the set of
stages, not replace them on every job.
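
A minimal sketch of the fix, with hypothetical names:

```scala
// Tracks which stages belong to one SQL execution.
class ExecutionData {
  var stages: Set[Int] = Set.empty
}

def onJobStart(exec: ExecutionData, stageIds: Seq[Int]): Unit = {
  // Before the fix: exec.stages = stageIds.toSet  // dropped earlier jobs' stages
  exec.stages ++= stageIds  // append, so stages from every job are kept
}
```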

Added a unit test and ran existing tests on Jenkins.

Author: Imran Rashid <[email protected]>

Closes apache#20047 from squito/SPARK-22861.
## What changes were proposed in this pull request?

This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by apache#19717 and apache#19468 which have merged already.

## How was this patch tested?
The script has been in use for releases on our fork; the rest is documentation.

cc rxin mateiz (shepherd)
k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko
reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. (apache#20007)
- [x] Change references to docker to instead say "container" (apache#19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (apache#20032)

Author: foxish <[email protected]>

Closes apache#19946 from foxish/update-k8s-docs.
The code was ignoring SparkListenerLogStart, which was added
somewhat recently to record the Spark version used to generate
an event log.

Author: Marcelo Vanzin <[email protected]>

Closes apache#20049 from vanzin/SPARK-22854.
## What changes were proposed in this pull request?

The PR introduces a new method `addImmutableStateIfNotExists` to `CodeGenerator` to allow reusing and sharing the same global variable between different expressions. This helps reduce the number of global variables needed, which is important to limit the impact on the constant pool.
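
A toy model of the mechanism, with assumed names (not Spark's actual code):

```scala
import scala.collection.mutable

class ToyCodegenContext {
  // Keyed by (javaType, variableName): the first caller creates the global
  // variable; later callers asking for the same pair share it.
  private val immutableStates = mutable.Map.empty[(String, String), String]

  def addImmutableStateIfNotExists(
      javaType: String, name: String, initCode: String): Unit = {
    immutableStates.getOrElseUpdate((javaType, name), initCode)
  }

  def declarations: Seq[String] =
    immutableStates.keys.map { case (t, n) => s"private $t $n;" }.toSeq
}
```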

## How was this patch tested?

added UTs

Author: Marco Gaido <[email protected]>
Author: Marco Gaido <[email protected]>

Closes apache#19940 from mgaido91/SPARK-22750.
…- LabeledPoint/VectorWithNorm/TreePoint

## What changes were proposed in this pull request?
Register the following classes in Kryo (a user-side sketch of equivalent registration follows the list):
`org.apache.spark.mllib.regression.LabeledPoint`
`org.apache.spark.mllib.clustering.VectorWithNorm`
`org.apache.spark.ml.feature.LabeledPoint`
`org.apache.spark.ml.tree.impl.TreePoint`
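
A user-side sketch of equivalent registration for the two public classes; the internal ones are registered inside Spark itself:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MLKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[org.apache.spark.mllib.regression.LabeledPoint])
    kryo.register(classOf[org.apache.spark.ml.feature.LabeledPoint])
  }
}
// Enable with: --conf spark.kryo.registrator=MLKryoRegistrator
```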

`org.apache.spark.ml.tree.impl.BaggedPoint` seems to need registration as well, but I don't know how to do that safely.
WeichenXu123 cloud-fan

## How was this patch tested?
added tests

Author: Zheng RuiFeng <[email protected]>

Closes apache#19950 from zhengruifeng/labeled_kryo.
## What changes were proposed in this pull request?

The path was recently changed in apache#19946, but the dockerfile was not updated.
This is a trivial one-line fix.

## How was this patch tested?

`./sbin/build-push-docker-images.sh -r spark-repo -t latest build`

cc/ vanzin mridulm rxin jiangxb1987 liyinan926

Author: Anirudh Ramanathan <[email protected]>
Author: foxish <[email protected]>

Closes apache#20051 from foxish/patch-1.
…oder

This behavior has confused some users, so let's clarify it.

Author: Michael Armbrust <[email protected]>

Closes apache#20048 from marmbrus/datasetAsDocs.
…seVersion.

## What changes were proposed in this pull request?

Currently we check the pandas version by catching whether an `ImportError` is raised for the specific imports, but we can compare `LooseVersion` of the version strings instead, the same way we check the pyarrow version.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <[email protected]>

Closes apache#20054 from ueshin/issues/SPARK-22874.
@foxish (Member) commented Dec 22, 2017

The checkstyle failures (in the Full Build test on this PR) seem unrelated to our code; maybe it's broken in upstream/master?

@kimoonkim (Member)

rerun integration test please

1 similar comment
@kimoonkim (Member)

rerun integration test please

@kimoonkim (Member)

We are running the new integration test repo code. I modified other builds, like unit tests, to exclude the master branch, so going forward only the integration test will trigger from the master branch.

@kimoonkim (Member)

> if the PR is against master, it will always run the new setup?

Correct.

@kimoonkim (Member)


The failed Jenkins build with "default" label is actually from the new integration test Jenkins job. I'll find a way to change the label.

@kimoonkim (Member)

rerun integration test please

1 similar comment
@kimoonkim (Member)

rerun integration test please

@kimoonkim (Member)

The latest two Jenkins jobs ran the new integration tests. The "Make Distribution" job built a distro tarball off this PR, and the "Integration Tests" job ran tests against the tarball. It failed because of a config issue that I just fixed.

@foxish (Member) commented Dec 22, 2017

@kimoonkim, should we see the make-distribution and integration tests pass now?

@kimoonkim (Member)

I am hoping the next runs will pass. Getting there.

@kimoonkim (Member)

OK, it seems the latest test failure is genuine. @liyinan926, can you please take a look? Maybe your branch is outdated and needs apache/spark#20051 merged in.

From http://spark-k8s-jenkins.pepperdata.org:8080/job/pr-spark-integration/5/:

```
Discovery starting.
Discovery completed in 145 milliseconds.
Run starting. Expected test count is: 2
KubernetesSuite:
*** RUN ABORTED ***
com.spotify.docker.client.exceptions.DockerException: ProgressMessage{id=null, status=null, stream=null, error=lstat dockerfiles/spark-base/entrypoint.sh: no such file or directory, progress=null, progressDetail=null}
  at com.spotify.docker.client.LoggingBuildHandler.progress(LoggingBuildHandler.java:33)
  at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1157)
  at org.apache.spark.deploy.k8s.integrationtest.docker.SparkDockerImageBuilder.buildImage(SparkDockerImageBuilder.scala:70)
  at org.apache.spark.deploy.k8s.integrationtest.docker.SparkDockerImageBuilder.buildSparkDockerImages(SparkDockerImageBuilder.scala:64)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend.initialize(MinikubeTestBackend.scala:31)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:42)
  at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:33)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:33)
  ...
```

@foxish (Member) commented Dec 22, 2017 via email

@liyinan926 (Member, Author)

Rebased onto latest upstream/master.

@kimoonkim (Member)

Integration test has passed now!

@kimoonkim (Member)

rerun integration tests please

1 similar comment
@liyinan926 (Member, Author)

rerun integration tests please

@liyinan926 (Member, Author)

Closing as the upstream has been merged.

@liyinan926 closed this Jan 2, 2018