This repository was archived by the owner on Jan 9, 2020. It is now read-only.

Conversation

liyinan926 (Member)

This is the same PR as apache#19954, but against our fork for triggering integration tests.

@kimoonkim @foxish

Marcelo Vanzin and others added 26 commits December 20, 2017 11:31
…ort.

Still look at the old one in case any Spark user is setting it
explicitly, though.
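
A minimal sketch of the fallback pattern, with hypothetical config keys (not the keys touched by this commit):

```scala
import org.apache.spark.SparkConf

// Hypothetical keys: prefer the new config, but still honor the old one
// if a Spark user sets it explicitly.
val conf = new SparkConf()
val value: Option[String] =
  conf.getOption("spark.new.key").orElse(conf.getOption("spark.old.key"))
```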

Author: Marcelo Vanzin <[email protected]>

Closes apache#19983 from vanzin/SPARK-22788.
## What changes were proposed in this pull request?

Some users depend on source compatibility with the org.apache.spark.sql.execution.streaming.Offset class. Although this is not a stable interface, we can keep it in place for now to simplify upgrades to 2.3.

Author: Jose Torres <[email protected]>

Closes apache#20012 from joseph-torres/binary-compat.
## What changes were proposed in this pull request?
unpersist unused datasets
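
A minimal sketch of the pattern, with a hypothetical cached dataset:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
// Cache an intermediate dataset while it is reused...
val training = spark.range(1000).toDF("id").persist()
training.count()  // materialize the cache
// ...and release the cached blocks once it is no longer needed.
training.unpersist()
```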

## How was this patch tested?
Existing tests and a local check in spark-shell.

Author: Zheng RuiFeng <[email protected]>

Closes apache#20017 from zhengruifeng/bkm_unpersist.
## What changes were proposed in this pull request?
In the previous PR apache#5755 (comment), we dropped `(-[classifier])` from the retrieval pattern. We should add it back; otherwise, per the Ivy documentation:
> If this pattern for instance doesn't have the [type] or [classifier] token, Ivy will download the source/javadoc artifacts to the same file as the regular jar.
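
For illustration, a retrieval pattern carrying the `(-[classifier])` token could look like the following; this is a sketch, not necessarily the exact string used by SparkSubmit:

```scala
// With "(-[classifier])", a sources jar resolves to e.g.
// "org_artifact-1.0-sources.jar" instead of clobbering "org_artifact-1.0.jar".
val retrievePattern = "[organization]_[artifact]-[revision](-[classifier]).[ext]"
```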

## How was this patch tested?
The existing tests

Author: gatorsmile <[email protected]>

Closes apache#20037 from gatorsmile/addClassifier.
## What changes were proposed in this pull request?

* Under the Spark Scala examples, some of the code was written in a Java style; it has been rewritten per the Scala style guide.
* Most of the changes involve `println()` statements (see the sketch below).
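
An illustrative before/after, not taken verbatim from the diff:

```scala
val name = "Spark"
println("Hello " + name + "!")  // Java-style string concatenation
println(s"Hello $name!")        // idiomatic Scala string interpolation
```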

## How was this patch tested?

Since all proposed changes rewrite println statements in the Scala way, a manual run was used to verify the output.

Author: chetkhatri <[email protected]>

Closes apache#20016 from chetkhatri/scala-style-spark-examples.
…assigning schedulingPool for stage

## What changes were proposed in this pull request?

In AppStatusListener's onStageSubmitted(event: SparkListenerStageSubmitted) method, there is duplicated code:
```
// schedulingPool was assigned twice with the same code
stage.schedulingPool = Option(event.properties).flatMap { p =>
      Option(p.getProperty("spark.scheduler.pool"))
    }.getOrElse(SparkUI.DEFAULT_POOL_NAME)
...
...
...
stage.schedulingPool = Option(event.properties).flatMap { p =>
      Option(p.getProperty("spark.scheduler.pool"))
    }.getOrElse(SparkUI.DEFAULT_POOL_NAME)

```
But it does not make any sense to do this, and there is no comment explaining it.

## How was this patch tested?
N/A

Author: wuyi <[email protected]>

Closes apache#20033 from Ngone51/dev-spark-22847.
…ay to take time instead of int

## What changes were proposed in this pull request?

Fix a configuration that took an int but should take a time value. Discussion in apache#19946 (comment).
The granularity is milliseconds rather than seconds, since there is a use case for sub-second reactions to scale up rapidly, especially with dynamic allocation.
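
A sketch of the resulting declaration, assuming Spark's internal `ConfigBuilder` DSL:

```scala
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.config.ConfigBuilder

// Declared as a time config in milliseconds instead of a plain int, so
// users can write sub-second values such as "500ms".
val KUBERNETES_ALLOCATION_BATCH_DELAY =
  ConfigBuilder("spark.kubernetes.allocation.batch.delay")
    .doc("Time to wait between each round of executor allocation.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")
```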

## How was this patch tested?

TODO: manual run of integration tests against this PR.
PTAL

cc/ mccheah liyinan926 kimoonkim vanzin mridulm jiangxb1987 ueshin

Author: foxish <[email protected]>

Closes apache#20032 from foxish/fix-time-conf.
…h huber loss.

## What changes were proposed in this pull request?
Expose Python API for _LinearRegression_ with _huber_ loss.
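
The Python API mirrors the existing Scala one; a minimal Scala sketch of the feature being exposed:

```scala
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLoss("huber")  // robust loss instead of the default "squaredError"
  .setEpsilon(1.35)  // shape parameter controlling robustness to outliers
```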

## How was this patch tested?
Unit test.

Author: Yanbo Liang <[email protected]>

Closes apache#19994 from yanboliang/spark-22810.
…e options

## What changes were proposed in this pull request?

Introduce a new interface `SessionConfigSupport` for `DataSourceV2`, it can help to propagate session configs with the specified key-prefix to all data source operations in this session.
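
A minimal sketch of an implementation (class name and prefix hypothetical):

```scala
import org.apache.spark.sql.sources.v2.{DataSourceV2, SessionConfigSupport}

// With keyPrefix "mysource", session configs of the form
// spark.datasource.mysource.* are propagated to this source's operations.
class MySourceV2 extends DataSourceV2 with SessionConfigSupport {
  override def keyPrefix(): String = "mysource"
}
```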

## How was this patch tested?

Add new test suite `DataSourceV2UtilsSuite`.

Author: Xingbo Jiang <[email protected]>

Closes apache#19861 from jiangxb1987/datasource-configs.
…orized summarizer

## What changes were proposed in this pull request?

Make several improvements in dataframe vectorized summarizer.

1. Make the summarizer return `Vector` type for all metrics (except "count").
Previously it returned `WrappedArray`, which was not very convenient.

2. Make `MetricsAggregate` inherit the `ImplicitCastInputTypes` trait, so it can check and implicitly cast input values.

3. Add a "weight" parameter to all single-metric methods (see the sketch after this list).

4. Update doc and improve the example code in doc.

5. Simplify test cases.
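
A sketch of the resulting API, assuming a DataFrame `df` with a Vector column "features" and a Double column "weight":

```scala
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Multiple weighted metrics at once; the result fields are Vectors.
val summary = df.select(
  Summarizer.metrics("mean", "variance")
    .summary(col("features"), col("weight")).as("summary"))
// A single weighted metric.
val means = df.select(Summarizer.mean(col("features"), col("weight")))
```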

## How was this patch tested?

Test added and simplified.

Author: WeichenXu <[email protected]>

Closes apache#19156 from WeichenXu123/improve_vec_summarizer.
## What changes were proposed in this pull request?

This PR eliminates mutable states from the generated code for `Stack`.

## How was this patch tested?

Existing test suites

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#20035 from kiszk/SPARK-22848.
## What changes were proposed in this pull request?

Upgrade Spark to Arrow 0.8.0 for Java and Python.  Also includes an upgrade of Netty to 4.1.17 to resolve dependency requirements.

The highlights that pertain to Spark for the update from Arrow version 0.4.1 to 0.8.0 include:

* Java refactoring for more simple API
* Java reduced heap usage and streamlined hot code paths
* Type support for DecimalType, ArrayType
* Improved type casting support in Python
* Simplified type checking in Python

## How was this patch tested?

Existing tests

Author: Bryan Cutler <[email protected]>
Author: Shixiong Zhu <[email protected]>

Closes apache#19884 from BryanCutler/arrow-upgrade-080-SPARK-22324.
## What changes were proposed in this pull request?

Moves the -Xlint:unchecked flag in the sbt build configuration from Compile to (Compile, compile) scope, allowing publish and publishLocal commands to work.
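
A sketch of the scoping change; the exact settings in SparkBuild.scala may differ:

```scala
// Scoping the flag to the compile task keeps it away from the javadoc
// task, which rejects -Xlint:unchecked and broke publish/publishLocal.
javacOptions in (Compile, compile) += "-Xlint:unchecked"
```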

## How was this patch tested?

Successfully published the spark-launcher subproject from within sbt, which fails without this patch.

Author: Erik LaBianca <[email protected]>

Closes apache#20040 from easel/javadoc-xlint.
Prevents Scala 2.12 scaladoc from blowing up when attempting to parse Java comments.

## What changes were proposed in this pull request?

Adds -no-java-comments to docs/scalacOptions under Scala 2.12. Also
moves scaladoc configs out of the TestSettings and into the standard sharedSettings
section in SparkBuild.scala.
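
A sketch of the added setting, assuming sbt's standard keys:

```scala
// Only Scala 2.12's scaladoc needs -no-java-comments; earlier versions
// do not understand the flag.
scalacOptions in (Compile, doc) ++= {
  if (scalaBinaryVersion.value == "2.12") Seq("-no-java-comments")
  else Seq.empty
}
```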

## How was this patch tested?

```
SBT_OPTS=-Dscala-2.12 sbt
++2.12.4
tags/publishLocal
```

Author: Erik LaBianca <[email protected]>

Closes apache#20042 from easel/scaladoc-212.
…split by CodegenContext.splitExpressions()

## What changes were proposed in this pull request?

Passing global variables to the split method is dangerous, as any mutation of them inside the method is ignored and may lead to unexpected behavior.

To prevent this, one approach is to make sure no expression outputs global variables: localizing the lifetime of mutable states in expressions.

Another approach is, when calling `ctx.splitExpression`, make sure we don't use children's output as parameter names.

Approach 1 is actually hard to do, as we need to check all expressions and operators that support whole-stage codegen. Approach 2 is easier as the callers of `ctx.splitExpressions` are not too many.

Besides, approach 2 is more flexible, as children's output may be something that cannot serve as a parameter name: a literal, an inlined statement (a + 1), etc.

close apache#19865
close apache#19938

## How was this patch tested?

existing tests

Author: Wenchen Fan <[email protected]>

Closes apache#20021 from cloud-fan/codegen.
## What changes were proposed in this pull request?

In apache#19681 we introduced a new interface called `AppStatusPlugin` to register listeners and set up the UI for both the live UI and the history UI.

However, I think it's overkill for the live UI. For example, we should not register `SQLListener` if users are not using SQL functions. Previously we registered the `SQLListener` and set up the SQL tab when a `SparkSession` was first created, which indicates users are going to use SQL functions. But in apache#19681 we do this during `SparkContext` creation. The same applies to streaming too.

I think we should keep the previous behavior, and only use this new interface for the history server.

To reflect this change, I also rename the new interface to `SparkHistoryUIPlugin`.

This PR also refines the tests for sql listener.

## How was this patch tested?

existing tests

Author: Wenchen Fan <[email protected]>

Closes apache#19981 from cloud-fan/listener.
…ecision

## What changes were proposed in this pull request?

Test Coverage for `WindowFrameCoercion` and `DecimalPrecision`, this is a Sub-tasks for [SPARK-22722](https://issues.apache.org/jira/browse/SPARK-22722).

## How was this patch tested?

N/A

Author: Yuming Wang <[email protected]>

Closes apache#20008 from wangyum/SPARK-22822.
…ild's partitioning is not decided

## What changes were proposed in this pull request?

This is a follow-up PR of apache#19257, where gatorsmile left a couple of comments regarding code style.

## How was this patch tested?

Doesn't change any functionality; relies on the build to verify that no checkstyle rules are violated.

Author: Tejas Patil <[email protected]>

Closes apache#20041 from tejasapatil/followup_19257.
When one execution has multiple jobs, we need to append to the set of
stages, not replace them on every job.
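
A minimal sketch of the fix, with hypothetical names:

```scala
// Tracks which stages belong to one SQL execution.
class ExecutionData {
  var stages: Set[Int] = Set.empty
}

def onJobStart(exec: ExecutionData, stageIds: Seq[Int]): Unit = {
  // Before the fix: exec.stages = stageIds.toSet  // dropped earlier jobs' stages
  exec.stages ++= stageIds  // append, so stages from every job are kept
}
```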

Added a unit test and ran existing tests on Jenkins.

Author: Imran Rashid <[email protected]>

Closes apache#20047 from squito/SPARK-22861.
## What changes were proposed in this pull request?

This PR contains documentation on the usage of Kubernetes scheduler in Spark 2.3, and a shell script to make it easier to build docker images required to use the integration. The changes detailed here are covered by apache#19717 and apache#19468 which have merged already.

## How was this patch tested?
The script has been in use for releases on our fork; the rest is documentation.

cc rxin mateiz (shepherd)
k8s-big-data SIG members & contributors: foxish ash211 mccheah liyinan926 erikerlandson ssuchter varunkatta kimoonkim tnachen ifilonenko
reviewers: vanzin felixcheung jiangxb1987 mridulm

TODO:
- [x] Add dockerfiles directory to built distribution. (apache#20007)
- [x] Change references to docker to instead say "container" (apache#19995)
- [x] Update configuration table.
- [x] Modify spark.kubernetes.allocation.batch.delay to take time instead of int (apache#20032)

Author: foxish <[email protected]>

Closes apache#19946 from foxish/update-k8s-docs.
The code was ignoring SparkListenerLogStart, which was added
somewhat recently to record the Spark version used to generate
an event log.

Author: Marcelo Vanzin <[email protected]>

Closes apache#20049 from vanzin/SPARK-22854.
## What changes were proposed in this pull request?

The PR introduces a new method `addImmutableStateIfNotExists` to `CodeGenerator` to allow reusing and sharing the same global variable between different expressions. This helps reduce the number of global variables needed, which is important to limit the impact on the constant pool.
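
A toy model of the mechanism, with assumed names (not Spark's actual code):

```scala
import scala.collection.mutable

class ToyCodegenContext {
  // Keyed by (javaType, variableName): the first caller creates the global
  // variable; later callers asking for the same pair share it.
  private val immutableStates = mutable.Map.empty[(String, String), String]

  def addImmutableStateIfNotExists(
      javaType: String, name: String, initCode: String): Unit = {
    immutableStates.getOrElseUpdate((javaType, name), initCode)
  }

  def declarations: Seq[String] =
    immutableStates.keys.map { case (t, n) => s"private $t $n;" }.toSeq
}
```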

## How was this patch tested?

added UTs

Author: Marco Gaido <[email protected]>
Author: Marco Gaido <[email protected]>

Closes apache#19940 from mgaido91/SPARK-22750.
…- LabeledPoint/VectorWithNorm/TreePoint

## What changes were proposed in this pull request?
Register the following classes in Kryo (a user-side sketch of equivalent registration follows the list):
`org.apache.spark.mllib.regression.LabeledPoint`
`org.apache.spark.mllib.clustering.VectorWithNorm`
`org.apache.spark.ml.feature.LabeledPoint`
`org.apache.spark.ml.tree.impl.TreePoint`
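
A user-side sketch of equivalent registration for the two public classes; the internal ones are registered inside Spark itself:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MLKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[org.apache.spark.mllib.regression.LabeledPoint])
    kryo.register(classOf[org.apache.spark.ml.feature.LabeledPoint])
  }
}
// Enable with: --conf spark.kryo.registrator=MLKryoRegistrator
```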

`org.apache.spark.ml.tree.impl.BaggedPoint` seems to need registration as well, but I don't know how to do that safely.
WeichenXu123 cloud-fan

## How was this patch tested?
added tests

Author: Zheng RuiFeng <[email protected]>

Closes apache#19950 from zhengruifeng/labeled_kryo.
## What changes were proposed in this pull request?

The path was recently changed in apache#19946, but the dockerfile was not updated.
This is a trivial one-line fix.

## How was this patch tested?

`./sbin/build-push-docker-images.sh -r spark-repo -t latest build`

cc/ vanzin mridulm rxin jiangxb1987 liyinan926

Author: Anirudh Ramanathan <[email protected]>
Author: foxish <[email protected]>

Closes apache#20051 from foxish/patch-1.
…oder

This behavior has confused some users, so let's clarify it.

Author: Michael Armbrust <[email protected]>

Closes apache#20048 from marmbrus/datasetAsDocs.
…seVersion.

## What changes were proposed in this pull request?

Currently we check the pandas version by catching whether an `ImportError` is raised for the specific imports, but we can compare `LooseVersion` of the version strings instead, the same way we check the pyarrow version.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <[email protected]>

Closes apache#20054 from ueshin/issues/SPARK-22874.
@foxish (Member) commented Dec 22, 2017

The checkstyle failures (in the Full Build test on this PR) seem unrelated to our code; maybe it's broken in upstream/master?

@kimoonkim (Member)

rerun integration test please

1 similar comment
@kimoonkim (Member)

rerun integration test please

@kimoonkim (Member)

We are running the new integration test repo code. I modified other builds, like unit tests, to exclude the master branch, so going forward only the integration test will trigger from the master branch.

@kimoonkim (Member)

> if the PR is against master, it will always run the new setup?

Correct.

@kimoonkim (Member)


The failed Jenkins build with "default" label is actually from the new integration test Jenkins job. I'll find a way to change the label.

@kimoonkim (Member)

rerun integration test please

1 similar comment
@kimoonkim (Member)

rerun integration test please

@kimoonkim (Member)

The latest two Jenkins jobs ran the new integration tests. The "Make Distribution" job built a distro tarball off this PR, and the "Integration Tests" job ran tests against the tarball. It failed because of a config issue that I just fixed.

@foxish (Member) commented Dec 22, 2017

@kimoonkim, should we see the make-distribution and integration tests pass now?

@kimoonkim (Member)

I am hoping the next runs will pass. Getting there.

@kimoonkim (Member)

OK, it seems the latest test failure is genuine. @liyinan926, can you please take a look? Maybe your branch is outdated and needs apache/spark#20051 merged in.

From http://spark-k8s-jenkins.pepperdata.org:8080/job/pr-spark-integration/5/:

```
Discovery starting.
Discovery completed in 145 milliseconds.
Run starting. Expected test count is: 2
KubernetesSuite:
*** RUN ABORTED ***
com.spotify.docker.client.exceptions.DockerException: ProgressMessage{id=null, status=null, stream=null, error=lstat dockerfiles/spark-base/entrypoint.sh: no such file or directory, progress=null, progressDetail=null}
  at com.spotify.docker.client.LoggingBuildHandler.progress(LoggingBuildHandler.java:33)
  at com.spotify.docker.client.DefaultDockerClient.build(DefaultDockerClient.java:1157)
  at org.apache.spark.deploy.k8s.integrationtest.docker.SparkDockerImageBuilder.buildImage(SparkDockerImageBuilder.scala:70)
  at org.apache.spark.deploy.k8s.integrationtest.docker.SparkDockerImageBuilder.buildSparkDockerImages(SparkDockerImageBuilder.scala:64)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend.initialize(MinikubeTestBackend.scala:31)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:42)
  at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:33)
  at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:33)
  ...
```

@foxish (Member) commented Dec 22, 2017 via email

@liyinan926 (Member, Author)

Rebased onto latest upstream/master.

@kimoonkim (Member)

Integration test has passed now!

@kimoonkim (Member)

rerun integration tests please

1 similar comment
@liyinan926 (Member, Author)

rerun integration tests please

@liyinan926 (Member, Author)

Closing as the upstream has been merged.

@liyinan926 closed this Jan 2, 2018