title: Deploy Apache Spark on Google Axion processors

minutes_to_complete: 60

who_is_this_for: This introductory topic is for software developers interested in migrating their Apache Spark workloads from x86_64 platforms to Arm-based platforms, specifically on Google Axion-based C4A virtual machines.

learning_objectives:
- Start an Arm virtual machine on Google Cloud Platform (GCP) using the C4A Google Axion instance family with RHEL 9 as the base image
- Install and configure Apache Spark on Arm-based GCP C4A instances
- Validate Spark functionality through baseline testing
- Benchmark Apache Spark performance on Arm

prerequisites:
- A [Google Cloud Platform (GCP)](https://cloud.google.com/free?utm_source=google&hl=en) account with billing enabled
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/)

author: Pareena Verma
subjects: Performance and Architecture

File: content/learning-paths/servers-and-cloud-computing/spark-on-gcp/background.md

---
title: Getting started with Apache Spark on Google Axion C4A (Arm Neoverse-V2)

weight: 2

layout: "learningpathall"
---

## Google Axion C4A Arm instances in Google Cloud

Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance and energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.

The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.
To learn more about Google Axion, refer to the [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu) blog.

## Apache Spark for big data processing on Arm

Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.

---
title: Apache Spark baseline testing on Google Axion C4A Arm VM
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Validate Apache Spark installation with a baseline test

With Apache Spark installed successfully on your GCP C4A Arm-based virtual machine, you can now perform simple baseline testing to validate that Spark runs correctly and produces the expected output.

## Run a baseline test for Apache Spark on Arm

Use a text editor of your choice to create a simple Spark job file:

```console
nano ~/spark_baseline_test.scala
```

Copy the content below into `spark_baseline_test.scala`:

```scala
val data = Seq(1, 2, 3, 4, 5)
val distData = spark.sparkContext.parallelize(data)
val squared = distData.map(x => x * x).collect()
println("Squared values: " + squared.mkString(", "))
```

This Scala example shows how to create an RDD (Resilient Distributed Dataset), apply a transformation, and collect results. Here is a step-by-step breakdown of the code:

- **`val data = Seq(1, 2, 3, 4, 5)`**: creates a local Scala sequence of integers
- **`val distData = spark.sparkContext.parallelize(data)`**: converts the local sequence into a distributed RDD, so Spark can process it in parallel across CPU cores or cluster nodes
- **`val squared = distData.map(x => x * x).collect()`**: squares each element using `map`, then gathers the results back to the driver program with `collect`
- **`println("Squared values: " + squared.mkString(", "))`**: prints the squared values as a comma-separated list
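
For intuition, the same square-and-collect pipeline can be sketched with plain Scala collections, with no Spark involved; the RDD version performs exactly this computation, just partitioned across cores or nodes:

```scala
// Plain-Scala analogue of the Spark baseline job: square each element of a
// sequence, then print the results. In the Spark version, parallelize()
// distributes `data` across workers and collect() gathers the squared
// values back to the driver.
object SquareDemo {
  def main(args: Array[String]): Unit = {
    val data    = Seq(1, 2, 3, 4, 5)
    val squared = data.map(x => x * x)   // local stand-in for the RDD map + collect
    println("Squared values: " + squared.mkString(", "))
    // prints: Squared values: 1, 4, 9, 16, 25
  }
}
```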

## Run the Apache Spark baseline test in Spark shell

Run the test file in the interactive Spark shell:

```console
spark-shell < ~/spark_baseline_test.scala
```

Alternatively, you can start the Spark shell and then load the file from inside the shell:

```console
spark-shell
```

```scala
:load spark_baseline_test.scala
```

You should see output similar to:

```output
Squared values: 1, 4, 9, 16, 25
```

This confirms that Spark is running correctly in local mode with its driver, executor, and cluster manager.

File: content/learning-paths/servers-and-cloud-computing/spark-on-gcp/benchmarking.md

---
title: Apache Spark performance benchmarks on Arm64 and x86_64 in Google Cloud
weight: 6
### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How to run Apache Spark benchmarks on Arm64 in GCP

Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing performance on x86_64 vs Arm64 platforms.
Follow the steps outlined to run Spark’s built-in SQL benchmarks using the SBT-based framework.

This compiles Spark and its dependencies, enabling the benchmarks build profile.
This executes the `AggregateBenchmark`, which compares the performance of SQL aggregation operations (for example, SUM and STDDEV) with and without `WholeStageCodegen`. `WholeStageCodegen` is an optimization technique used by Spark SQL to improve query execution performance by generating Java bytecode for entire query stages instead of interpreting them step by step.
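
If you want a quick feel for this optimization outside the benchmark suite, a rough sketch you can paste into `spark-shell` is shown below. The `spark.sql.codegen.wholeStage` configuration key is standard Spark SQL; the timing helper and row count are illustrative assumptions, and the numbers will not match the benchmark's warmed-up, multi-iteration measurements:

```scala
// Rough spark-shell sketch (assumes the shell's built-in `spark` session):
// time one simple aggregation with whole-stage codegen on, then off.
def timeIt[T](label: String)(body: => T): T = {
  val start  = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

spark.conf.set("spark.sql.codegen.wholeStage", "true")
timeIt("sum(id), wholestage on") { spark.range(100000000L).selectExpr("sum(id)").collect() }

spark.conf.set("spark.sql.codegen.wholeStage", "false")
timeIt("sum(id), wholestage off") { spark.range(100000000L).selectExpr("sum(id)").collect() }
```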
## Example Apache Spark benchmark output (Arm64)
You should see output similar to:
```output
[info] Running benchmark: agg w/o group
[info] Running case: agg w/o group wholestage off
...
[success] Total time: 669 s (11:09), completed Jul 24, 2025, 5:41:24 AM
```

## Understanding Apache Spark benchmark metrics and results

- **Best Time (ms):** Fastest execution time observed (in milliseconds).
- **Avg Time (ms):** Average time across all iterations.
- **Per Row (ns):** Average time taken per row (in nanoseconds).
- **Relative:** Speed comparison; the baseline (1.0X) is the slower version.
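
The table's **Rate (M/s)** and **Per Row (ns)** columns are two views of the same measurement: at a rate of R million rows per second, each row costs 1000 / R nanoseconds. A small sanity check of that relationship (the example rates below are made up, not taken from the tables):

```scala
// Rate (millions of rows per second) vs Per Row (nanoseconds per row):
// 1 s = 1e9 ns and Rate is in 1e6 rows/s, so ns/row = 1e9 / (rate * 1e6) = 1000 / rate.
object RateCheck {
  def perRowNs(rateMillionsPerSec: Double): Double = 1000.0 / rateMillionsPerSec

  def main(args: Array[String]): Unit = {
    println(perRowNs(250.0)) // 4.0  -> 250 M rows/s is 4 ns per row
    println(perRowNs(50.0))  // 20.0 -> 50 M rows/s is 20 ns per row
  }
}
```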

## Apache Spark performance benchmark results on x86_64

The following benchmark results were collected by running the same benchmark on a `c3-standard-4` (4 vCPU, 2 core, 16 GB Memory) x86_64 virtual machine in GCP, running RHEL 9.
|**Benchmark Case**|**Sub-Case / Config**|**Best Time (ms)**|**Avg Time (ms)**|**Stdev (ms)**|**Rate (M/s)**|**Per Row (ns)**|**Relative**|

## Apache Spark performance benchmark results on Arm64

Results from the earlier run on the `c4a-standard-4` (4 vCPU, 16 GB memory) Arm64 VM in GCP (RHEL 9):
| Benchmark Case | Sub-Case / Config | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
## Apache Spark performance benchmarking comparison on Arm64 and x86_64

When you compare the benchmarking results, you will notice that on the Google Axion C4A Arm-based instances:

- **Whole-stage code generation significantly boosts performance**, improving execution by up to **3×** (for example, `agg w/o group` improves from 2728 ms to 856 ms).
- **Aggregation with keys**, across row-based and non-hashmap variants, delivers ~1.7–5.4× speedups.
- **Arm-based Spark shows strong hash performance**: `murmur3` and `UnsafeRowhash` on Arm-based instances are ~3×–5× faster, with the aggregate hashmap ~6× faster; the `fast hash` path is roughly on par.

Overall, when whole-stage codegen and vectorized hashmap paths are used, you should see multi-fold speedups on the Google Axion C4A Arm-based instances.

---
title: How to create a Google Axion C4A Arm virtual machine on GCP
weight: 3
### FIXED, DO NOT MODIFY
layout: learningpathall
---

## How to create a Google Axion C4A Arm VM on Google Cloud

In this section, you learn how to provision a **Google Axion C4A Arm virtual machine** on Google Cloud Platform (GCP) using the **c4a-standard-4 (4 vCPUs, 16 GB memory)** machine type in the **Google Cloud Console**.

For background on GCP setup, see the Learning Path [Getting started with Google Cloud Platform](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/google/).

### Create a Google Axion C4A Arm VM in Google Cloud Console

To create a virtual machine based on the C4A Arm architecture:
1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/).
2. Go to **Compute Engine > VM Instances** and select **Create Instance**.
3. Under **Machine configuration**:
   - Enter details such as **Instance name**, **Region**, and **Zone**.
   - Set **Series** to `C4A`.
   - Select a machine type such as `c4a-standard-4`.

   

4. Under **OS and Storage**, select **Change**, then choose an Arm64-based OS image. For this Learning Path, use **Red Hat Enterprise Linux 9**. Ensure you select the **Arm image** variant. Click **Select**.
5. Under **Networking**, enable **Allow HTTP traffic**.
6. Click **Create** to launch the instance.