`content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md`
draft: false
featured: true
weight: 1
---
Have you ever wondered how companies like Netflix recommend your favorite movies or how e-commerce platforms handle vast amounts of data to personalize your shopping experience 🤔?
Behind the scenes, these capabilities often rely on **Apache Spark**, a powerful distributed computing system designed for big data processing. Spark simplifies working with massive datasets by enabling fast and scalable data processing across clusters of computers.
Let’s dive into Spark to understand its core concepts.
## Introduction
Apache Spark is a unified, multi-language (Python, Java, Scala, and R) computing engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters, together with a set of libraries for parallel data processing.
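To make this concrete, here is a minimal sketch using the Python API (PySpark); the application name and the tiny dataset are purely illustrative, and the same idea can be expressed in Java, Scala, or R.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession -- the entry point to a Spark application.
spark = (
    SparkSession.builder
    .appName("hello-spark")   # illustrative application name
    .master("local[*]")       # run locally, using all available CPU cores
    .getOrCreate()
)

# A tiny DataFrame, just to confirm the engine is working.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()
```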
At a high level, Spark provides several libraries that extend its functionality and are used in specialized data processing tasks.
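For example, with Spark SQL (one of those libraries) you can register a DataFrame as a temporary view and query it with plain SQL. The sketch below is only illustrative; the table and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A made-up sales dataset, registered as a temporary view.
sales = spark.createDataFrame(
    [("books", 120.0), ("games", 75.5), ("books", 30.0)],
    ["category", "amount"],
)
sales.createOrReplaceTempView("sales")

# Query it with plain SQL; Spark plans and executes this in parallel.
spark.sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
).show()
```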
At the heart of all these specialized libraries is **Spark Core**. Spark Core is responsible for basic functionalities like task scheduling, memory management, fault tolerance, and interactions with storage systems.
### RDDs
**RDDs (Resilient Distributed Datasets)** are the fundamental building blocks of Spark Core. They represent an immutable, distributed collection of objects that can be processed in parallel across a cluster. More about RDDs is discussed later.
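As a quick, illustrative taste (the data and session name below are made up), here is how an RDD can be created and processed in parallel with PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext   # the low-level entry point for working with RDDs

# Distribute a local collection across (up to) 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transform the data in parallel: keep even numbers, then square them.
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(squares_of_evens.collect())   # [4, 16, 36, 64, 100]
```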
### 1\. Spark Driver
The Spark driver (a process) is like the “brain” of your Spark application. It’s responsible for controlling everything: it decides which tasks to run, keeps track of the application’s progress, and talks to the cluster manager to get the computing power it needs. Essentially, it orchestrates the entire job and checks on the tasks being handled by the worker nodes (executors). In short, the driver manages the lifecycle of the Spark application.
### 2\. Spark Executors
Executors (also processes) are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send the results back. Every Spark application gets its own set of executors, which run on different machines in the cluster. They are responsible for completing tasks, storing data, reporting results, and re-running any tasks that fail.
### 3\. Cluster Manager
The cluster manager is like a “resource manager.” It manages the machines that make up your cluster and ensures that the Spark driver and executors have enough resources (like CPU and memory) to do their jobs. Spark can work with several types of cluster managers, such as YARN, Mesos, or Kubernetes, or it can use its built-in manager.
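To see how the driver, the executors, and the cluster manager come together, here is an illustrative configuration sketch: the master URL tells Spark which cluster manager to use, while the `spark.executor.*` settings size the executors. The values are made up, and the executor settings only really matter when you run on an actual cluster rather than in local mode.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-config-demo")
    .master("local[*]")                      # or "yarn", "k8s://https://<host>:<port>", ...
    .config("spark.executor.memory", "4g")   # memory per executor
    .config("spark.executor.cores", "2")     # CPU cores per executor
    .getOrCreate()
)
```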
In a **wide transformation**, data from multiple parent RDD/DataFrame partitions must be shuffled (redistributed) to form new partitions. These operations involve **network communication**, making them more expensive.
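For instance, in the illustrative PySpark sketch below, `mapValues` is a narrow transformation, while `reduceByKey` is a wide one, because summing the values for each key forces a shuffle across partitions.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("shuffle-demo").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

doubled = pairs.mapValues(lambda v: v * 2)       # narrow: each partition is processed independently
totals = pairs.reduceByKey(lambda x, y: x + y)   # wide: values for each key are shuffled together

print(totals.collect())   # [('a', 4), ('b', 6)] (order may vary)
```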
**Actions** are operations that trigger the execution of transformations and return results to the driver program. They are the point where Spark actually evaluates the lazy transformations applied to an RDD, DataFrame, or Dataset.
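A short illustrative sketch: the `filter` below is a lazy transformation that does nothing on its own, while `count`, `take`, and `collect` are actions that actually kick off the computation and return results to the driver.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("actions-demo").getOrCreate().sparkContext

# Transformations are lazy: nothing runs yet.
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation and bring results back to the driver.
print(evens.count())     # 5
print(evens.take(3))     # [0, 2, 4]
print(evens.collect())   # [0, 2, 4, 6, 8]
```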