
Commit 0984ee7

Merge pull request #2227 from pareenaverma/content_review
Tech review of Spark on GCP
2 parents b253985 + d59ddb2 commit 0984ee7

File tree

6 files changed: +66 −49 lines changed

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/_index.md

Lines changed: 7 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -1,26 +1,25 @@
11
---
2-
title: Deploy Apache Spark on Google Axion C4A virtual machine
2+
title: Deploy Apache Spark on Google Axion processors
33

44
draft: true
55
cascade:
66
draft: true
77

88
minutes_to_complete: 60
99

10-
who_is_this_for: This is an introductory topic for the software developers who are willing to migrate their Apache Spark workloads from the x86_64 platforms to Arm-based platforms, or on Google Axion-based C4A virtual machines specifically.
10+
who_is_this_for: This is an introductory topic for software developers interested in migrating their Apache Spark workloads from x86_64 platforms to Arm-based platforms, specifically Google Axion-based C4A virtual machines.
1111

1212
learning_objectives:
13-
- Provision an Arm virtual machine on the Google Cloud Platform using the C4A Google Axion instance family, and RHEL 9 as the base image.
14-
- Understand how to install and configure Apache Spark on Arm-based GCP C4A instances.
13+
- Start an Arm virtual machine on the Google Cloud Platform using the C4A Google Axion instance family with RHEL 9 as the base image.
14+
- Learn how to install and configure Apache Spark on Arm-based GCP C4A instances.
1515
- Validate the functionality of Spark through baseline testing.
16-
- Perform benchmarking to evaluate Apache Spark’s performance on Arm.
16+
- Benchmark Apache Spark’s performance on Arm.
1717

1818
prerequisites:
1919
- A [Google Cloud Platform (GCP)](https://cloud.google.com/free?utm_source=google&hl=en) account with billing enabled.
20-
- Basic understanding of Linux command line.
2120
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
2221

23-
author: Jason Andrews
22+
author: Pareena Verma
2423

2524
##### Tags
2625
skilllevels: Advanced
@@ -53,7 +52,7 @@ further_reading:
5352

5453
- resource:
5554
title: The Scala programming language official website
56-
link: scala-lang.org
55+
link: https://scala-lang.org
5756
type: website
5857

5958

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/background.md

Lines changed: 5 additions & 5 deletions
@@ -1,18 +1,18 @@
11
---
2-
title: "About Google Axion C4A series and Apache Spark"
2+
title: "Google Axion C4A and Apache Spark"
33

44
weight: 2
55

66
layout: "learningpathall"
77
---
88

9-
## Google Axion C4A series
9+
## Google Axion C4A instances
1010

11-
The Google Axion C4A series is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance and energy-efficient computing, these virtual machine offer strong performance ideal for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.
11+
Google Axion C4A is a family of Arm-based virtual machines built on Google’s custom Axion CPU, which is based on Arm Neoverse-V2 cores. Designed for high-performance, energy-efficient computing, these virtual machines offer strong performance for modern cloud workloads such as CI/CD pipelines, microservices, media processing, and general-purpose applications.
1212

13-
The C4A series provides a cost-effective alternative to x86 virtual machine while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.
13+
The C4A series provides a cost-effective alternative to x86 virtual machines while leveraging the scalability and performance benefits of the Arm architecture in Google Cloud.
1414

15-
To learn more about Google Axion, refer to the blog [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu).
15+
To learn more about Google Axion, refer to the [Introducing Google Axion Processors, our new Arm-based CPUs](https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu) blog.
1616

1717
## Apache Spark
1818

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/baseline.md

Lines changed: 8 additions & 9 deletions
@@ -7,17 +7,17 @@ layout: learningpathall
77
---
88

99

10-
Since Apache Spark is installed successfully on your GCP C4A Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
10+
With Apache Spark installed successfully on your GCP C4A Arm-based virtual machine, you can now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
1111

1212
## Spark Baseline Test
1313

14-
Create a simple Spark job file:
14+
Using a file editor of your choice, create a simple Spark job file:
1515
```console
1616
nano ~/spark_baseline_test.scala
1717
```
18-
Below is this content of **spark_baseline_test.scala** file:
18+
Copy the content below into `spark_baseline_test.scala`:
1919

20-
```scala
20+
```scala
2121
val data = Seq(1, 2, 3, 4, 5)
2222
val distData = spark.sparkContext.parallelize(data)
2323
@@ -26,10 +26,9 @@ val squared = distData.map(x => x * x).collect()
2626
2727
println("Squared values: " + squared.mkString(", "))
2828
```
29-
Code Explanation:
30-
This code is a basic Apache Spark example in Scala, demonstrating how to create an RDD (Resilient Distributed Dataset), perform a transformation, and collect results.
29+
This is a basic Apache Spark example in Scala, demonstrating how to create an RDD (Resilient Distributed Dataset), perform a transformation, and collect results.
3130

32-
What it does, step by step:
31+
Let's walk through the code, step by step:
3332

3433
- **val data = Seq(1, 2, 3, 4, 5)** : Creates a local Scala sequence of integers.
3534
- **val distData = spark.sparkContext.parallelize(data)** : Uses parallelize to convert the local sequence into a distributed RDD (so Spark can operate on it in parallel across cluster nodes or CPU cores).
@@ -39,11 +38,11 @@ What it does, step by step:
3938

4039
### Run the Test in Spark Shell
4140

42-
Run the test in the interactive shell:
41+
Run the test you created in the interactive shell:
4342
```console
4443
spark-shell < ~/spark_baseline_test.scala
4544
```
46-
You should see an output similar to:
45+
The output should look similar to:
4746
```output
4847
Squared values: 1, 4, 9, 16, 25
4948
```
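The transformation in `spark_baseline_test.scala` can be sketched in plain Scala, without a Spark cluster: `Seq#map` stands in for `RDD.map`, so the same squared values are produced locally. The object and method names below are illustrative, not part of the Learning Path.

```scala
// Plain-Scala sketch of the baseline test: Seq#map stands in for RDD.map,
// so no SparkContext is required. Names here are illustrative only.
object SquaredValues {
  // The map(x => x * x) transformation from the Spark example.
  def square(xs: Seq[Int]): Seq[Int] = xs.map(x => x * x)

  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4, 5)
    println("Squared values: " + square(data).mkString(", "))
    // Prints: Squared values: 1, 4, 9, 16, 25
  }
}
```

The difference in the Spark version is only where the work runs: `parallelize` distributes the sequence so `map` executes across cores or cluster nodes, while `collect` gathers the results back to the driver.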

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/benchmarking.md

Lines changed: 12 additions & 10 deletions
@@ -1,20 +1,21 @@
11
---
2-
title: Spark Internal Benchmarking
2+
title: Run Spark Benchmarks
33
weight: 6
44

55
### FIXED, DO NOT MODIFY
66
layout: learningpathall
77
---
88

9-
## Apache Spark Internal Benchmarking
10-
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 vs Arm64.
11-
Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
9+
## Apache Spark Benchmarking
10+
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing performance on x86_64 vs Arm64 platforms.
11+
12+
Follow the steps outlined to run Spark’s built-in SQL benchmarks using the SBT-based framework.
1213

1314
1. Clone the Apache Spark source code
1415
```console
1516
git clone https://github.com/apache/spark.git
1617
```
17-
This downloads the full Spark source including internal test suites and the benchmarking tools.
18+
This clones the full Spark source code including internal test suites and the benchmarking tools.
1819

1920
2. Checkout the desired Spark version
2021
```console
@@ -32,9 +33,9 @@ This compiles Spark and its dependencies, enabling the benchmarks build profile
3233
```console
3334
./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark"
3435
```
35-
This executes the AggregateBenchmark, which compares performance of SQL aggregation operations (e.g., SUM, STDDEV) with and without WholeStageCodegen. WholeStageCodegen is an optimization technique used by Spark SQL to improve the performance of query execution by generating Java bytecode for entire query stages (aka whole stages) instead of interpreting them step-by-step.
36+
This executes the `AggregateBenchmark`, which compares performance of SQL aggregation operations (e.g., SUM, STDDEV) with and without `WholeStageCodegen`. `WholeStageCodegen` is an optimization technique used by Spark SQL to improve the performance of query execution by generating Java bytecode for entire query stages (aka whole stages) instead of interpreting them step-by-step.
3637

37-
You should see an output similar to:
38+
You should see output similar to:
3839
```output
3940
[info] Running benchmark: agg w/o group
4041
[info] Running case: agg w/o group wholestage off
@@ -243,8 +244,8 @@ You should see an output similar to:
243244
- **Per Row (ns):** Average time taken per row (in nanoseconds).
244245
- **Relative Speed comparison:** baseline (1.0X) is the slower version.
245246

246-
### Benchmark summary on x86_64:
247-
The following benchmark results are collected on a c3-standard-4 (4 vCPU, 2 core, 16 GB Memory) x86_64 environment, running RHEL 9.
247+
### Benchmark summary on `x86_64`:
248+
The following benchmark results were collected by running the same benchmark on a `c3-standard-4` (4 vCPU, 2 core, 16 GB Memory) x86_64 virtual machine in GCP, running RHEL 9.
248249

249250
| **Benchmark Case** | **Sub-Case / Config** | **Best Time (ms)** | **Avg Time (ms)** | **Stdev (ms)** | **Rate (M/s)** | **Per Row (ns)** | **Relative** |
250251
|---------------------------|--------------------------------------|--------------------|-------------------|----------------|----------------|------------------|--------------|
@@ -330,7 +331,8 @@ The following benchmark results are collected on a c4a-standard-4 (4 vCPU, 16 G
330331
| BytesToBytesMap | fast hash | 42 | 42 | 0 | 499.2 | 2.0 | 3.3X |
331332
| BytesToBytesMap |Aggregate HashMap | 23 | 23 | 0 | 913.0 | 1.1 | 5.9X |
332333

333-
### **Highlights from GCP C4A Arm virtual machine**
334+
### Benchmarking comparison summary
335+
When you compare the benchmarking results, you will notice that on the Google Axion C4A instances:
334336

335337
- **Whole-stage code generation significantly boosts performance**, improving execution by up to **38×** (e.g., `agg w/o group` from 33.4s to 0.86s).
336338
- **Vectorized and row-based hash maps** consistently outperform non-codegen and traditional hashmap approaches, especially for aggregation with keys and complex data types (e.g., decimal keys: **6.8× faster** with vectorized hashmap).
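The **Relative** column in the benchmark tables is simply the baseline's best time divided by the case's best time. A small Scala sketch (the `Speedup` helper is our own, not part of Spark) reproduces the ~38× figure quoted for `agg w/o group`:

```scala
// Sketch of how the Relative column can be derived: baseline best time
// divided by the case's best time. Helper names are ours, not Spark's.
object Speedup {
  def relative(baselineMs: Double, caseMs: Double): Double =
    baselineMs / caseMs

  def main(args: Array[String]): Unit = {
    // Figures quoted above: agg w/o group took roughly 33.4 s without
    // codegen vs 0.86 s with whole-stage codegen.
    println(f"${relative(33400.0, 860.0)}%.1fX") // ~38.8X
  }
}
```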

content/learning-paths/servers-and-cloud-computing/spark-on-gcp/instance.md

Lines changed: 4 additions & 6 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Create Google Axion C4A Arm virtual machine
2+
title: Create a Google Axion C4A Arm virtual machine
33
weight: 3
44

55
### FIXED, DO NOT MODIFY
@@ -8,9 +8,7 @@ layout: learningpathall
88

99
## Introduction
1010

11-
This guide walks you through provisioning **Google Axion C4A Arm virtual machine** on GCP with the **c4a-standard-4 (4 vCPUs, 16 GB Memory)** machine type, using the **Google Cloud Console**.
12-
13-
If you are new to Google Cloud, it is recommended to follow the [GCP Quickstart Guide to Create a virtual machine](https://cloud.google.com/compute/docs/instances/create-start-instance).
11+
In this section you will learn how to provision a **Google Axion C4A Arm virtual machine** on GCP with the **c4a-standard-4 (4 vCPUs, 16 GB Memory)** machine type, using the **Google Cloud Console**.
1412

1513
For more details, follow the Learning Path on [Getting Started with Google Cloud Platform](https://learn.arm.com/learning-paths/servers-and-cloud-computing/csp/google/).
1614

@@ -24,6 +22,6 @@ To create a virtual machine based on the C4A Arm architecture:
2422
- Choose the **Series** as `C4A`.
2523
- Select a machine type such as `c4a-standard-4`.
2624
![Instance Screenshot](./image1.png)
27-
4. Under the **OS and Storage**, click on **Change**, and select Arm64 based OS Image of your choice. For this Learning Path, we pick **Red Hat Enterprise Linux** as the Operating System with **Red Hat Enterprise Linux 9** as the Version. Make sure you pick the version of image for Arm.
28-
5. Under **Networking**, enable **Allow HTTP traffic** to allow HTTP communications.
25+
4. Under **OS and Storage**, click **Change**, and select an Arm64-based OS image of your choice. For this Learning Path, choose **Red Hat Enterprise Linux** as the Operating System with **Red Hat Enterprise Linux 9** as the Version. Make sure you pick the Arm64 version of the image. Then click **Select**.
26+
5. Under **Networking**, enable **Allow HTTP traffic** to allow HTTP communication.
2927
6. Click on **Create**, and the instance will launch.
Lines changed: 30 additions & 11 deletions
@@ -1,5 +1,5 @@
11
---
2-
title: Deploy Apache Spark on Google Axion C4A virtual machine
2+
title: Deploy Apache Spark on a Google Axion C4A virtual machine
33
weight: 4
44

55
### FIXED, DO NOT MODIFY
@@ -9,46 +9,65 @@ layout: learningpathall
99

1010
## Deploy Apache Spark on Google Axion C4A virtual machine
1111

12-
This Learning Path shows how to deploy Apache Spark on a Google Cloud C4A Arm virtual machine running Red Hat Enterprise Linux. It covers installing Java, Scala, Maven, and Spark, followed by functional validation through baseline testing.
13-
Finally, it includes benchmarking to compare Spark’s performance on Arm64 versus x86 architectures—optimizing data processing workloads on cost-efficient Arm-based infrastructure.
12+
In this section you will learn how to deploy Apache Spark on a Google Cloud C4A Arm virtual machine running Red Hat Enterprise Linux. You will install Java, Scala, Maven, and Spark. In the following sections you will run functional tests to validate your installation and benchmarking to compare Spark’s performance on Arm64 versus x86 architectures.
13+
14+
First, SSH into the Google Cloud C4A VM you created in the previous section.
15+
16+
On your running VM, install Java, Maven and the other dependencies needed for deploying Spark:
17+
1418

1519
### Install Required Packages
1620

1721
```console
18-
sudo tdnf update -y
19-
sudo tdnf install -y java-17-openjdk java-17-openjdk-devel git maven wget nano curl unzip tar
22+
sudo dnf update -y
23+
sudo dnf install -y java-17-openjdk java-17-openjdk-devel git maven wget nano curl unzip tar
2024
```
2125
Verify Java installation:
2226
```console
2327
java -version
2428
```
29+
The output will look like:
30+
31+
```output
32+
openjdk 17.0.16 2025-07-15 LTS
33+
OpenJDK Runtime Environment (Red_Hat-17.0.16.0.8-1) (build 17.0.16+8-LTS)
34+
OpenJDK 64-Bit Server VM (Red_Hat-17.0.16.0.8-1) (build 17.0.16+8-LTS, mixed mode, sharing)
35+
```
2536

2637
### Install Apache Spark on Arm
38+
39+
You can now download and install Apache Spark on your running Arm-based virtual machine:
40+
2741
```console
2842
wget https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
2943
tar -xzf spark-3.5.6-bin-hadoop3.tgz
3044
sudo mv spark-3.5.6-bin-hadoop3 /opt/spark
3145
```
32-
### Set Environment Variables
33-
Add this line to ~/.bashrc or ~/.zshrc to make the change persistent across terminal sessions.
46+
### Set Environment Variables
47+
Set up the environment variables required to use Spark.
48+
49+
Add the lines below to your shell configuration scripts to make the change persistent across terminal sessions:
3450

35-
```cosole
51+
```console
3652
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
53+
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk' >> ~/.bashrc
3754
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
3855

3956
```
40-
Apply changes immediately
57+
Apply changes immediately by sourcing the script:
4158

4259
```console
4360
source ~/.bashrc
4461
```
4562

4663
### Verify Spark Installation
4764

65+
You can now verify your Spark installation:
66+
4867
```console
4968
spark-shell --version
5069
```
51-
You should see an output similar to:
70+
The output should look similar to:
5271

5372
```output
5473
Welcome to
@@ -60,4 +79,4 @@ Welcome to
6079
6180
Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 17.0.15
6281
```
63-
Spark installation is complete. You can now proceed with the baseline testing.
82+
Spark installation is complete. You can now proceed to the next section where you perform baseline testing of Spark.
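After sourcing `~/.bashrc`, the exported variables are visible to any JVM process. A small Scala sketch (the `CheckEnv` object and its method names are illustrative, not part of Spark) that reports whether they are set:

```scala
// Sanity-check that SPARK_HOME and JAVA_HOME exported in ~/.bashrc are
// visible to JVM processes. Object and method names are illustrative.
object CheckEnv {
  // Builds a status line for one variable from an environment map.
  def report(env: Map[String, String], name: String): String =
    env.get(name) match {
      case Some(value) => s"$name=$value"
      case None        => s"$name is NOT set - re-run `source ~/.bashrc`"
    }

  def main(args: Array[String]): Unit =
    Seq("SPARK_HOME", "JAVA_HOME").foreach(n => println(report(sys.env, n)))
}
```

If either variable is reported as not set, re-check the `echo ... >> ~/.bashrc` lines above and source the file again before running `spark-shell`.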
