
Commit 0984ee7

Merge pull request #2227 from pareenaverma/content_review
Tech review of Spark on GCP
2 parents b253985 + d59ddb2 commit 0984ee7

File tree

6 files changed: +66 −49 lines changed

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/_index.md

Lines changed: 7 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,26 +1,25 @@
11
---
2-
title: Deploy Apache Spark on Google Axion C4A virtual machine
2+
title: Deploy Apache Spark on Google Axion processors
33

44
draft: true
55
cascade:
66
draft: true
77

88
minutes_to_complete: 60
99

10-
who_is_this_for: This is an introductory topic for the software developers who are willing to migrate their Apache Spark workloads from the x86_64 platforms to Arm-based platforms, or on Google Axion-based C4A virtual machines specifically.
10+
who_is_this_for: This is an introductory topic for software developers interested in migrating their Apache Spark workloads from x86_64 platforms to Arm-based platforms, specifically Google Axion-based C4A virtual machines.
1111

1212
learning_objectives:
13-
- Provision an Arm virtual machine on the Google Cloud Platform using the C4A Google Axion instance family, and RHEL 9 as the base image.
14-
- Understand how to install and configure Apache Spark on Arm-based GCP C4A instances.
13+
- Start an Arm virtual machine on the Google Cloud Platform using the C4A Google Axion instance family with RHEL 9 as the base image.
14+
- Learn how to install and configure Apache Spark on Arm-based GCP C4A instances.
1515
- Validate the functionality of Spark through baseline testing.
16-
- Perform benchmarking to evaluate Apache Spark’s performance on Arm.
16+
- Benchmark Apache Spark’s performance on Arm.
1717

1818
prerequisites:
1919
- A [Google Cloud Platform (GCP)](https://cloud.google.com/free?utm_source=google&hl=en) account with billing enabled.
20-
- Basic understanding of Linux command line.
2120
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
2221

23-
author: Jason Andrews
22+
author: Pareena Verma
2423

2524
##### Tags
2625
skilllevels: Advanced
@@ -53,7 +52,7 @@ further_reading:
5352

5453
- resource:
5554
title: The Scala programming language official website
56-
link: scala-lang.org
55+
link: https://scala-lang.org
5756
type: website
5857

5958

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/background.md

Lines changed: 5 additions & 5 deletions
@@ -1,18 +1,18 @@
11
---
2-
title: "About Google Axion C4A series and Apache Spark"
2+
title: "Google Axion C4A and Apache Spark"
33

44
weight: 2
55

66
layout: "learningpathall"
77
---
88

9-
## Google Axion C4A series
9+
## Google Axion C4A instances
1010

11-
The Google Axion C4A series is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance and energy-efficient computing, these virtual machine offer strong performance ideal for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.
11+
Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.
1212

13-
The C4A series provides a cost-effective alternative to x86 virtual machine while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.
13+
The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.
1414

15-
To learn more about Google Axion, refer to the blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).
15+
To learn more about Google Axion, refer to the [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu) blog.
1616

1717
## Apache Spark
1818

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/baseline.md

Lines changed: 8 additions & 9 deletions
@@ -7,17 +7,17 @@ layout: learningpathall
77
---
88

99

10-
Since Apache Spark is installed successfully on your GCP C4A Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
10+
With Apache Spark installed successfully on your GCP C4A Arm-based virtual machine, you can now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
1111

1212
## Spark Baseline Test
1313

14-
Create a simple Spark job file:
14+
Using a file editor of your choice, create a simple Spark job file:
1515
```console
1616
nano ~/spark_baseline_test.scala
1717
```
18-
Below is this content of **spark_baseline_test.scala** file:
18+
Copy the content below into `spark_baseline_test.scala`:
1919

20-
```scala
20+
```scala
2121
val data = Seq(1, 2, 3, 4, 5)
2222
val distData = spark.sparkContext.parallelize(data)
2323
@@ -26,10 +26,9 @@ val squared = distData.map(x => x * x).collect()
2626
2727
println("Squared values: " + squared.mkString(", "))
2828
```
29-
Code Explanation:
30-
This code is a basic Apache Spark example in Scala, demonstrating how to create an RDD (Resilient Distributed Dataset), perform a transformation, and collect results.
29+
This is a basic Apache Spark example in Scala, demonstrating how to create an RDD (Resilient Distributed Dataset), perform a transformation, and collect results.
3130

32-
What it does, step by step:
31+
Let's walk through the code, step by step:
3332

3433
- **val data = Seq(1, 2, 3, 4, 5)** : Creates a local Scala sequence of integers.
3534
- **val distData = spark.sparkContext.parallelize(data)** : Uses parallelize to convert the local sequence into a distributed RDD (so Spark can operate on it in parallel across cluster nodes or CPU cores).
@@ -39,11 +38,11 @@ What it does, step by step:
3938

4039
### Run the Test in Spark Shell
4140

42-
Run the test in the interactive shell:
41+
Run the test you created in the interactive shell:
4342
```console
4443
spark-shell < ~/spark_baseline_test.scala
4544
```
46-
You should see an output similar to:
45+
The output should look similar to:
4746
```output
4847
Squared values: 1, 4, 9, 16, 25
4948
```
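The transformation in `spark_baseline_test.scala` can be sketched in plain Scala, without a Spark cluster: `Seq#map` stands in for `RDD.map`, so the same squared values are produced locally. The object and method names below are illustrative, not part of the Learning Path.

```scala
// Plain-Scala sketch of the baseline test: Seq#map stands in for RDD.map,
// so no SparkContext is required. Names here are illustrative only.
object SquaredValues {
  // The map(x => x * x) transformation from the Spark example.
  def square(xs: Seq[Int]): Seq[Int] = xs.map(x => x * x)

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    println("Squared values: " + square(data).mkString(", "))
    // Prints: Squared values: 1, 4, 9, 16, 25
  }
}
```

The difference in the Spark version is only where the work runs: `parallelize` distributes the sequence so `map` executes across cores or cluster nodes, while `collect` gathers the results back to the driver.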

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/benchmarking.md

Lines changed: 12 additions & 10 deletions
@@ -1,20 +1,21 @@
11
---
2-
title: Spark Internal Benchmarking
2+
title: Run Spark Benchmarks
33
weight: 6
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Apache Spark Internal Benchmarking
10-
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 vs Arm64.
11-
Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
9+
## Apache Spark Benchmarking
10+
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing performance on x86_64 vs Arm64 platforms.
11+
12+
Follow the steps outlined to run Spark’s built-in SQL benchmarks using the SBT-based framework.
1213

1314
1. Clone the Apache Spark source code
1415
```console
1516
git clone https://github.com/apache/spark.git
1617
```
17-
This downloads the full Spark source including internal test suites and the benchmarking tools.
18+
This clones the full Spark source code including internal test suites and the benchmarking tools.
1819

1920
2. Checkout the desired Spark version
2021
```console
@@ -32,9 +33,9 @@ This compiles Spark and its dependencies, enabling the benchmarks build profile
3233
```console
3334
./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark"
3435
```
35-
This executes the AggregateBenchmark, which compares performance of SQL aggregation operations (e.g., SUM, STDDEV) with and without WholeStageCodegen. WholeStageCodegen is an optimization technique used by Spark SQL to improve the performance of query execution by generating Java bytecode for entire query stages (aka whole stages) instead of interpreting them step-by-step.
36+
This executes the `AggregateBenchmark`, which compares performance of SQL aggregation operations (e.g., SUM, STDDEV) with and without `WholeStageCodegen`. `WholeStageCodegen` is an optimization technique used by Spark SQL to improve the performance of query execution by generating Java bytecode for entire query stages (aka whole stages) instead of interpreting them step-by-step.
3637

37-
You should see an output similar to:
38+
You should see output similar to:
3839
```output
3940
[info] Running benchmark: agg w/o group
4041
[info] Running case: agg w/o group wholestage off
@@ -243,8 +244,8 @@ You should see an output similar to:
243244
- **Per Row (ns):** Average time taken per row (in nanoseconds).
244245
- **Relative Speed comparison:** baseline (1.0X) is the slower version.
245246

246-
### Benchmark summary on x86_64:
247-
The following benchmark results are collected on a c3-standard-4 (4 vCPU, 2 core, 16 GB Memory) x86_64 environment, running RHEL 9.
247+
### Benchmark summary on `x86_64`:
248+
The following benchmark results were collected by running the same benchmark on a `c3-standard-4` (4 vCPU, 2 core, 16 GB Memory) x86_64 virtual machine in GCP, running RHEL 9.
248249

249250
| **Benchmark Case** | **Sub-Case / Config** | **Best Time (ms)** | **Avg Time (ms)** | **Stdev (ms)** | **Rate (M/s)** | **Per Row (ns)** | **Relative** |
250251
|---------------------------|--------------------------------------|--------------------|-------------------|----------------|----------------|------------------|--------------|
@@ -330,7 +331,8 @@ The following benchmark results are collected on a c4a-standard-4 (4 vCPU, 16 G
330331
| BytesToBytesMap | fast hash | 42 | 42 | 0 | 499.2 | 2.0 | 3.3X |
331332
| BytesToBytesMap |Aggregate HashMap | 23 | 23 | 0 | 913.0 | 1.1 | 5.9X |
332333

333-
### **Highlights from GCP C4A Arm virtual machine**
334+
### Benchmarking comparison summary
335+
When you compare the benchmarking results, you will notice that on the Google Axion C4A instances:
334336

335337
- **Whole-stage code generation significantly boosts performance**, improving execution by up to **38×** (e.g., `agg w/o group` from 33.4s to 0.86s).
336338
- **Vectorized and row-based hash maps** consistently outperform non-codegen and traditional hashmap approaches, especially for aggregation with keys and complex data types (e.g., decimal keys: **6.8× faster** with vectorized hashmap).
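The **Relative** column in the benchmark tables is simply the baseline's best time divided by the case's best time. A small Scala sketch (the `Speedup` helper is our own, not part of Spark) reproduces the ~38× figure quoted for `agg w/o group`:

```scala
// Sketch of how the Relative column can be derived: baseline best time
// divided by the case's best time. Helper names are ours, not Spark's.
object Speedup {
  def relative(baselineMs: Double, caseMs: Double): Double =
    baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    // Figures quoted above: agg w/o group took roughly 33.4 s without
    // codegen vs 0.86 s with whole-stage codegen.
    println(f"${relative(33400.0, 860.0)}%.1fX") // ~38.8X
  }
}
```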

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/instance.md

Lines changed: 4 additions & 6 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Create Google Axion C4A Arm virtual machine
2+
title: Create a Google Axion C4A Arm virtual machine
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
@@ -8,9 +8,7 @@ layout: learningpathall
88

99
## Introduction
1010

11-
This guide walks you through provisioning **Google Axion C4A Arm virtual machine** on GCP with the **c4a-standard-4 (4 vCPUs, 16 GB Memory)** machine type, using the **Google Cloud Console**.
12-
13-
If you are new to Google Cloud, it is recommended to follow the [GCP Quickstart Guide to Create a virtual machine](https://cloud.google.com/compute/docs/instances/create-start-instance).
11+
In this section you will learn how to provision a **Google Axion C4A Arm virtual machine** on GCP with the **c4a-standard-4 (4 vCPUs, 16 GB Memory)** machine type, using the **Google Cloud Console**.
1412

1513
For more details, follow the Learning Path on [Getting Started with Google Cloud Platform](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/google/).
1614

@@ -24,6 +22,6 @@ To create a virtual machine based on the C4A Arm architecture:
2422
- Choose the **Series** as `C4A`.
2523
- Select a machine type such as `c4a-standard-4`.
2624
![Instance Screenshot](./image1.png)
27-
4. Under the **OS and Storage**, click on **Change**, and select Arm64 based OS Image of your choice. For this Learning Path, we pick **Red Hat Enterprise Linux** as the Operating System with **Red Hat Enterprise Linux 9** as the Version. Make sure you pick the version of image for Arm.
28-
5. Under **Networking**, enable **Allow HTTP traffic** to allow HTTP communications.
25+
4. Under **OS and Storage**, click **Change**, and select an Arm64-based OS image of your choice. For this Learning Path, choose **Red Hat Enterprise Linux** as the Operating System with **Red Hat Enterprise Linux 9** as the Version. Make sure you pick the Arm64 version of the image. Then click **Select**.
26+
5. Under **Networking**, enable **Allow HTTP traffic** to allow HTTP communication.
2927
6. Click on **Create**, and the instance will launch.
Lines changed: 30 additions & 11 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Deploy Apache Spark on Google Axion C4A virtual machine
2+
title: Deploy Apache Spark on a Google Axion C4A virtual machine
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
@@ -9,46 +9,65 @@ layout: learningpathall
99

1010
## Deploy Apache Spark on Google Axion C4A virtual machine
1111

12-
This Learning Path shows how to deploy Apache Spark on a Google Cloud C4A Arm virtual machine running Red Hat Enterprise Linux. It covers installing Java, Scala, Maven, and Spark, followed by functional validation through baseline testing.
13-
Finally, it includes benchmarking to compare Spark’s performance on Arm64 versus x86 architectures—optimizing data processing workloads on cost-efficient Arm-based infrastructure.
12+
In this section you will learn how to deploy Apache Spark on a Google Cloud C4A Arm virtual machine running Red Hat Enterprise Linux. You will install Java, Scala, Maven, and Spark. In the following sections you will run functional tests to validate your installation and benchmarking to compare Spark’s performance on Arm64 versus x86 architectures.
13+
14+
First, SSH into the Google Cloud C4A VM you created in the previous section.
15+
16+
On your running VM, install Java, Maven and the other dependencies needed for deploying Spark:
17+
1418

1519
### Install Required Packages
1620

1721
```console
18-
sudo tdnf update -y
19-
sudo tdnf install -y java-17-openjdk java-17-openjdk-devel git maven wget nano curl unzip tar
22+
sudo dnf update -y
23+
sudo dnf install -y java-17-openjdk java-17-openjdk-devel git maven wget nano curl unzip tar
2024
```
2125
Verify Java installation:
2226
```console
2327
java -version
2428
```
29+
The output will look like:
30+
31+
```output
32+
openjdk 17.0.16 2025-07-15 LTS
33+
OpenJDK Runtime Environment (Red_Hat-17.0.16.0.8-1) (build 17.0.16+8-LTS)
34+
OpenJDK 64-Bit Server VM (Red_Hat-17.0.16.0.8-1) (build 17.0.16+8-LTS, mixed mode, sharing)
35+
```
2536

2637
### Install Apache Spark on Arm
38+
39+
You can now download and install Apache Spark on your running Arm-based virtual machine:
40+
2741
```console
2842
wget https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
2943
tar -xzf spark-3.5.6-bin-hadoop3.tgz
3044
sudo mv spark-3.5.6-bin-hadoop3 /opt/spark
3145
```
32-
### Set Environment Variables
33-
Add this line to ~/.bashrc or ~/.zshrc to make the change persistent across terminal sessions.
46+
### Set Environment Variables
47+
Set up the environment variables required to use Spark.
48+
49+
Add the lines below to your shell configuration scripts to make the change persistent across terminal sessions:
3450

35-
```cosole
51+
```console
3652
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
53+
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk' >> ~/.bashrc
3754
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
3855

3956
```
40-
Apply changes immediately
57+
Apply changes immediately by sourcing the script:
4158

4259
```console
4360
source ~/.bashrc
4461
```
4562

4663
### Verify Spark Installation
4764

65+
You can now verify your Spark installation:
66+
4867
```console
4968
spark-shell --version
5069
```
51-
You should see an output similar to:
70+
The output should look similar to:
5271

5372
```output
5473
Welcome to
@@ -60,4 +79,4 @@ Welcome to
6079
6180
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 17.0.15
6281
```
63-
Spark installation is complete. You can now proceed with the baseline testing.
82+
Spark installation is complete. You can now proceed to the next section where you perform baseline testing of Spark.
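After sourcing `~/.bashrc`, the exported variables are visible to any JVM process. A small Scala sketch (the `CheckEnv` object and its method names are illustrative, not part of Spark) that reports whether they are set:

```scala
// Sanity-check that SPARK_HOME and JAVA_HOME exported in ~/.bashrc are
// visible to JVM processes. Object and method names are illustrative.
object CheckEnv {
  // Builds a status line for one variable from an environment map.
  def report(env: Map[String, String], name: String): String =
    env.get(name) match {
      case Some(value) => s"$name=$value"
      case None        => s"$name is NOT set - re-run `source ~/.bashrc`"
    }

  def main(args: Array[String]): Unit =
    Seq("SPARK_HOME", "JAVA_HOME").foreach(n => println(report(sys.env, n)))
}
```

If either variable is reported as not set, re-check the `echo ... >> ~/.bashrc` lines above and source the file again before running `spark-shell`.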
