
Commit 6bb1615

Merge pull request #15 from Affirm/hossein/rebase-master-from-upstream
Hossein/rebase master from upstream
2 parents bd7510b + 530c8ad commit 6bb1615

6,057 files changed: 674,942 additions and 291,286 deletions


.asf.yaml

Lines changed: 29 additions & 0 deletions
```diff
@@ -0,0 +1,29 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# https://cwiki.apache.org/confluence/display/INFRA/git+-+.asf.yaml+features
+---
+github:
+  description: "Apache Spark - A unified analytics engine for large-scale data processing"
+  homepage: https://spark.apache.org/
+  labels:
+    - python
+    - scala
+    - r
+    - java
+    - big-data
+    - jdbc
+    - sql
+    - spark
```

.gitattributes

Lines changed: 5 additions & 0 deletions
```diff
@@ -1,2 +1,7 @@
 *.bat text eol=crlf
 *.cmd text eol=crlf
+*.java text eol=lf
+*.scala text eol=lf
+*.xml text eol=lf
+*.py text eol=lf
+*.R text eol=lf
```

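The new `.gitattributes` rules normalize source files to LF line endings in the repository. As a sketch (assuming `git` is available), `git check-attr` shows which `eol` rule applies to a given path; the scratch repo and file names here are hypothetical:

```shell
# Hypothetical scratch repo to exercise eol rules like the ones added above.
tmpdir=$(mktemp -d)
cd "$tmpdir"
git init -q
printf '%s\n' '*.bat text eol=crlf' '*.scala text eol=lf' > .gitattributes
# git check-attr reports "<path>: eol: <value>"
attr_scala=$(git check-attr eol -- Main.scala)
attr_bat=$(git check-attr eol -- run.bat)
echo "$attr_scala"   # Main.scala: eol: lf
echo "$attr_bat"     # run.bat: eol: crlf
```

The paths need not exist in the working tree; `check-attr` evaluates the patterns alone, which makes it handy for verifying attribute changes before committing them.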
.github/PULL_REQUEST_TEMPLATE

Lines changed: 7 additions & 1 deletion
```diff
@@ -6,6 +6,10 @@ Thanks for sending a pull request! Here are some tips for you:
 4. Be sure to keep the PR description updated to reflect all changes.
 5. Please write your PR title to summarize what this PR proposes.
 6. If possible, provide a concise example to reproduce the issue for a faster review.
+7. If you want to add a new configuration, please read the guideline first for naming configurations in
+   'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
+8. If you want to add or modify an error type or message, please read the guideline first in
+   'core/src/main/resources/error/README.md'.
 -->
 
 ### What changes were proposed in this pull request?
@@ -27,9 +31,11 @@ Please clarify why the changes are needed. For instance,
 -->
 
 
-### Does this PR introduce any user-facing change?
+### Does this PR introduce _any_ user-facing change?
 <!--
+Note that it means *any* user-facing change including all aspects such as the documentation fix.
 If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
+If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
 If no, write 'No'.
 -->
```

.github/labeler.yml

Lines changed: 152 additions & 0 deletions
```diff
@@ -0,0 +1,152 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+#
+# Pull Request Labeler Github Action Configuration: https://github.com/marketplace/actions/labeler
+#
+# Note that we currently cannot use the negation operator (i.e. `!`) for miniglob matches as they
+# would match any file that doesn't touch them. What's needed is the concept of `any`, which takes a
+# list of constraints / globs and then matches all of the constraints for either `any` of the files or
+# `all` of the files in the change set.
+#
+# However, `any`/`all` are not supported in a released version and testing off of the `main` branch
+# resulted in some other errors when testing.
+#
+# An issue has been opened upstream requesting that a release be cut that has support for all/any:
+#   - https://github.com/actions/labeler/issues/111
+#
+# While we wait for this issue to be handled upstream, we can remove
+# the negated / `!` matches for now and at least have labels again.
+#
+INFRA:
+  - ".github/**/*"
+  - "appveyor.yml"
+  - "tools/**/*"
+  - "dev/create-release/**/*"
+  - ".asf.yaml"
+  - ".gitattributes"
+  - ".gitignore"
+  - "dev/github_jira_sync.py"
+  - "dev/merge_spark_pr.py"
+  - "dev/run-tests-jenkins*"
+BUILD:
+  # Can be supported when a stable release with correct all/any is released
+  #- any: ['dev/**/*', '!dev/github_jira_sync.py', '!dev/merge_spark_pr.py', '!dev/.rat-excludes']
+  - "dev/**/*"
+  - "build/**/*"
+  - "project/**/*"
+  - "assembly/**/*"
+  - "**/*pom.xml"
+  - "bin/docker-image-tool.sh"
+  - "bin/find-spark-home*"
+  - "scalastyle-config.xml"
+  # These can be added in the above `any` clause (and the /dev/**/* glob removed) when
+  # `any`/`all` support is released
+  # - "!dev/github_jira_sync.py"
+  # - "!dev/merge_spark_pr.py"
+  # - "!dev/run-tests-jenkins*"
+  # - "!dev/.rat-excludes"
+DOCS:
+  - "docs/**/*"
+  - "**/README.md"
+  - "**/CONTRIBUTING.md"
+EXAMPLES:
+  - "examples/**/*"
+  - "bin/run-example*"
+# CORE needs to be updated when all/any are released upstream.
+CORE:
+  # - any: ["core/**/*", "!**/*UI.scala", "!**/ui/**/*"] # If any file matches all of the globs defined in the list started by `any`, label is applied.
+  - "core/**/*"
+  - "common/kvstore/**/*"
+  - "common/network-common/**/*"
+  - "common/network-shuffle/**/*"
+  - "python/pyspark/**/*.py"
+  - "python/pyspark/tests/**/*.py"
+SPARK SUBMIT:
+  - "bin/spark-submit*"
+SPARK SHELL:
+  - "repl/**/*"
+  - "bin/spark-shell*"
+SQL:
+  #- any: ["**/sql/**/*", "!python/pyspark/sql/avro/**/*", "!python/pyspark/sql/streaming.py", "!python/pyspark/sql/tests/test_streaming.py"]
+  - "**/sql/**/*"
+  - "common/unsafe/**/*"
+  #- "!python/pyspark/sql/avro/**/*"
+  #- "!python/pyspark/sql/streaming.py"
+  #- "!python/pyspark/sql/tests/test_streaming.py"
+  - "bin/spark-sql*"
+  - "bin/beeline*"
+  - "sbin/*thriftserver*.sh"
+  - "**/*SQL*.R"
+  - "**/DataFrame.R"
+  - "**/*WindowSpec.R"
+  - "**/*catalog.R"
+  - "**/*column.R"
+  - "**/*functions.R"
+  - "**/*group.R"
+  - "**/*schema.R"
+  - "**/*types.R"
+AVRO:
+  - "external/avro/**/*"
+  - "python/pyspark/sql/avro/**/*"
+DSTREAM:
+  - "streaming/**/*"
+  - "data/streaming/**/*"
+  - "external/kinesis*"
+  - "external/kafka*"
+  - "python/pyspark/streaming/**/*"
+GRAPHX:
+  - "graphx/**/*"
+  - "data/graphx/**/*"
+ML:
+  - "**/ml/**/*"
+  - "**/*mllib_*.R"
+MLLIB:
+  - "**/spark/mllib/**/*"
+  - "mllib-local/**/*"
+  - "python/pyspark/mllib/**/*"
+STRUCTURED STREAMING:
+  - "**/sql/**/streaming/**/*"
+  - "external/kafka-0-10-sql/**/*"
+  - "python/pyspark/sql/streaming.py"
+  - "python/pyspark/sql/tests/test_streaming.py"
+  - "**/*streaming.R"
+PYTHON:
+  - "bin/pyspark*"
+  - "**/python/**/*"
+R:
+  - "**/r/**/*"
+  - "**/R/**/*"
+  - "bin/sparkR*"
+YARN:
+  - "resource-managers/yarn/**/*"
+MESOS:
+  - "resource-managers/mesos/**/*"
+  - "sbin/*mesos*.sh"
+KUBERNETES:
+  - "resource-managers/kubernetes/**/*"
+WINDOWS:
+  - "**/*.cmd"
+  - "R/pkg/tests/fulltests/test_Windows.R"
+WEB UI:
+  - "**/ui/**/*"
+  - "**/*UI.scala"
+DEPLOY:
+  - "sbin/**/*"
+
```

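The labeler config maps minimatch globs over changed paths to PR labels. As a rough sketch of that idea (a hypothetical shell function covering a handful of the globs above; note shell `case` patterns let `*` cross `/`, so this only approximates minimatch semantics):

```shell
# Hypothetical label lookup mimicking a few of the labeler globs.
# In shell patterns `*` also matches `/`, unlike minimatch's single `*`.
label_for() {
  case "$1" in
    .github/*|appveyor.yml|.asf.yaml) echo INFRA ;;
    docs/*|*README.md) echo DOCS ;;
    core/*) echo CORE ;;
    bin/pyspark*|*python/*) echo PYTHON ;;
    *) echo UNLABELED ;;
  esac
}
label_for ".github/labeler.yml"    # INFRA
label_for "docs/building-spark.md" # DOCS
label_for "python/pyspark/rdd.py"  # PYTHON
```

Order matters in both versions: the first matching clause wins here, while the real action applies every label whose glob list matches, which is why broad patterns like `dev/**/*` under BUILD can overlap with the more specific INFRA entries.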
.github/workflows/benchmark.yml

Lines changed: 102 additions & 0 deletions
```diff
@@ -0,0 +1,102 @@
+name: Run benchmarks
+
+on:
+  workflow_dispatch:
+    inputs:
+      class:
+        description: 'Benchmark class'
+        required: true
+        default: '*'
+      jdk:
+        description: 'JDK version: 8 or 11'
+        required: true
+        default: '8'
+      failfast:
+        description: 'Failfast: true or false'
+        required: true
+        default: 'true'
+      num-splits:
+        description: 'Number of job splits'
+        required: true
+        default: '1'
+
+jobs:
+  matrix-gen:
+    name: Generate matrix for job splits
+    runs-on: ubuntu-20.04
+    outputs:
+      matrix: ${{ steps.set-matrix.outputs.matrix }}
+    env:
+      SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
+    steps:
+    - name: Generate matrix
+      id: set-matrix
+      run: echo "::set-output name=matrix::["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]"
+
+  benchmark:
+    name: "Run benchmarks: ${{ github.event.inputs.class }} (JDK ${{ github.event.inputs.jdk }}, ${{ matrix.split }} out of ${{ github.event.inputs.num-splits }} splits)"
+    needs: matrix-gen
+    # Ubuntu 20.04 is the latest LTS. The next LTS is 22.04.
+    runs-on: ubuntu-20.04
+    strategy:
+      fail-fast: false
+      matrix:
+        split: ${{fromJSON(needs.matrix-gen.outputs.matrix)}}
+    env:
+      SPARK_BENCHMARK_FAILFAST: ${{ github.event.inputs.failfast }}
+      SPARK_BENCHMARK_NUM_SPLITS: ${{ github.event.inputs.num-splits }}
+      SPARK_BENCHMARK_CUR_SPLIT: ${{ matrix.split }}
+      SPARK_GENERATE_BENCHMARK_FILES: 1
+      SPARK_LOCAL_IP: localhost
+      # To prevent spark.test.home not being set. See more detail in SPARK-36007.
+      SPARK_HOME: ${{ github.workspace }}
+    steps:
+    - name: Checkout Spark repository
+      uses: actions/checkout@v2
+      # In order to get diff files
+      with:
+        fetch-depth: 0
+    - name: Cache Scala, SBT and Maven
+      uses: actions/cache@v2
+      with:
+        path: |
+          build/apache-maven-*
+          build/scala-*
+          build/*.jar
+          ~/.sbt
+        key: build-${{ hashFiles('**/pom.xml', 'project/build.properties', 'build/mvn', 'build/sbt', 'build/sbt-launch-lib.bash', 'build/spark-build-info') }}
+        restore-keys: |
+          build-
+    - name: Cache Coursier local repository
+      uses: actions/cache@v2
+      with:
+        path: ~/.cache/coursier
+        key: benchmark-coursier-${{ github.event.inputs.jdk }}-${{ hashFiles('**/pom.xml', '**/plugins.sbt') }}
+        restore-keys: |
+          benchmark-coursier-${{ github.event.inputs.jdk }}
+    - name: Install Java ${{ github.event.inputs.jdk }}
+      uses: actions/setup-java@v1
+      with:
+        java-version: ${{ github.event.inputs.jdk }}
+    - name: Run benchmarks
+      run: |
+        ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pspark-ganglia-lgpl test:package
+        # Make less noisy
+        cp conf/log4j.properties.template conf/log4j.properties
+        sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' conf/log4j.properties
+        # In benchmark, we use local as master so set driver memory only. Note that GitHub Actions has 7 GB memory limit.
+        bin/spark-submit \
+          --driver-memory 6g --class org.apache.spark.benchmark.Benchmarks \
+          --jars "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
+          "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
+          "${{ github.event.inputs.class }}"
+        # To keep the directory structure and file permissions, tar them
+        # See also https://github.com/actions/upload-artifact#maintaining-file-permissions-and-case-sensitive-files
+        echo "Preparing the benchmark results:"
+        tar -cvf benchmark-results-${{ github.event.inputs.jdk }}.tar `git diff --name-only` `git ls-files --others --exclude-standard`
+    - name: Upload benchmark results
+      uses: actions/upload-artifact@v2
+      with:
+        name: benchmark-results-${{ github.event.inputs.jdk }}-${{ matrix.split }}
+        path: benchmark-results-${{ github.event.inputs.jdk }}.tar
+
```

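The matrix-gen job turns the `num-splits` input into a JSON array of split indices, which the benchmark job consumes via `fromJSON` to fan out into parallel runs. A minimal sketch of what the `seq` expression in that step emits, using a hypothetical value of 3 splits:

```shell
# Emulate the "Generate matrix" step for num-splits=3.
SPARK_BENCHMARK_NUM_SPLITS=3
# seq -s, joins 1..N with commas; wrapping in brackets yields a JSON array.
matrix="[`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`]"
echo "$matrix"   # [1,2,3]
```

Each benchmark job then receives one element of that array as `matrix.split`, and the `SPARK_BENCHMARK_CUR_SPLIT` / `SPARK_BENCHMARK_NUM_SPLITS` environment variables tell the benchmark harness which slice of the benchmark classes to run.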