
Commit f95a3a1

Fix headings
1 parent bcf2728 commit f95a3a1


content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md

Lines changed: 8 additions & 11 deletions
@@ -6,13 +6,6 @@ draft: false
 featured: true
 weight: 1
 ---
-
-Have you ever wondered how companies like Netflix recommend your favorite movies or how e-commerce platforms handle vast amounts of data to personalize your shopping experience 🤔?
-
-Behind the scenes, these capabilities often rely on **Apache Spark**, a powerful distributed computing system designed for big data processing. Spark simplifies working with massive datasets by enabling fast and scalable data processing across clusters of computers.
-
-Let’s dive into Spark to understand its core,
-
 ## Introduction
 
 Apache Spark is a unified, multi-language (Python, Java, Scala, and R) computing engine for executing data engineering, data science, and machine learning on single-node machines or clusters, and a set of libraries for parallel data processing.
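
To make the definition above concrete, here is a minimal PySpark sketch (not from the original post; the app name and data are illustrative, and it assumes `pip install pyspark`):

```python
from pyspark.sql import SparkSession

# Entry point for a Spark application; "local[*]" runs Spark on all
# cores of a single machine instead of a cluster.
spark = (SparkSession.builder
         .appName("intro-example")   # illustrative name
         .master("local[*]")
         .getOrCreate())

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()  # prints the two rows as a small table

spark.stop()
```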
@@ -60,6 +53,7 @@ Let’s break down our description:
 <p align="center">
 <img width="500px" src="/images/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond/spark-components.png" alt="Spark Components">
 </p>
+
 ## High-Level Components (Spark Applications)
 
 At a high level, Spark provides several libraries that extend its functionality and are used in specialized data processing tasks.
@@ -76,7 +70,7 @@ At a high level, Spark provides several libraries that extend its functionality
 
 At the heart of all these specialized libraries is **Spark Core**. Spark Core is responsible for basic functionalities like task scheduling, memory management, fault tolerance, and interactions with storage systems.
 
-### The Core of Spark: RDDs
+### RDDs
 
 **RDDs (Resilient Distributed Datasets)** are the fundamental building blocks of Spark Core. They represent an immutable, distributed collection of objects that can be processed in parallel across a cluster. More about RDDs is discussed later.
 
@@ -99,15 +93,16 @@ Spark's ability to interact with these diverse storage systems allows users to w
 <p align="center">
 <img width="500px" src="/images/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond/spark-architecture.png" alt="Spark Basic Architecture">
 </p>
-### 1\. The Spark Driver
+
+### 1\. Spark Driver
 
 The Spark driver (process) is like the “brain” of your Spark application; it is responsible for controlling everything. The driver decides which tasks to run, keeps track of the application’s progress, and talks to the cluster manager to get the computing power needed. It also checks on the tasks being handled by the worker nodes (executors). In short, the driver manages the entire lifecycle of the Spark application.
 
-### 2\. The Spark Executors
+### 2\. Spark Executors
 
 Executors (processes) are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send back the results. Every Spark application gets its own set of executors, which run on different machines. They are responsible for completing the tasks, saving data, reporting results, and re-running any tasks that fail.
 
-### 3\. The Cluster Manager
+### 3\. Cluster Manager
 
 The cluster manager is like a “resource manager.” It manages the machines that make up your cluster and ensures that the Spark driver and executors have enough resources (like CPU and memory) to do their jobs. Spark can work with several types of cluster managers, such as YARN, Mesos, or Kubernetes, or it can use its built-in manager.
 
@@ -200,6 +195,7 @@ Examples: `map` `filter`
 <p align="center">
 <img width="400px" src="/images/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond/narrow-transformation.png" alt="Spark Narrow Transformation">
 </p>
+
 ### Wide Transformations
 
 In a **wide transformation**, data from multiple parent RDD/DataFrame partitions must be shuffled (redistributed) to form new partitions. These operations involve **network communication**, making them more expensive.
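
A small sketch of the difference using the PySpark RDD API (data invented): `mapValues` is narrow because each output partition depends on exactly one input partition, while `reduceByKey` is wide because matching keys must be shuffled together.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations").master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], numSlices=2)

# Narrow: each partition is transformed independently, no shuffle.
doubled = pairs.mapValues(lambda v: v * 2)

# Wide: values for the same key may sit in different partitions,
# so Spark shuffles data across the network to group them.
totals = pairs.reduceByKey(lambda a, b: a + b)

# Both lines above are lazy; nothing executes until an action is called.
```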
@@ -209,6 +205,7 @@ Examples: `groupByKey` `reduceByKey` `join`
 <p align="center">
 <img width="500px" src="/images/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond/wide-transformation.png" alt="Spark Wide Transformation">
 </p>
+
 ## Actions
 
 Actions are operations that trigger the execution of transformations and return results to the driver program. Actions are the point where Spark evaluates the lazy transformations applied to an RDD, DataFrame, or Dataset.
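
For instance, in a PySpark sketch (values invented), the `filter` below only builds up a lazy plan; the actions `count()` and `collect()` are what force Spark to run it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions").master("local[*]").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(10))
evens = nums.filter(lambda x: x % 2 == 0)  # transformation: lazy, nothing runs yet

print(evens.count())    # action: triggers execution, prints 5
print(evens.collect())  # action: [0, 2, 4, 6, 8] returned to the driver

spark.stop()
```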
