`content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md`
draft: false
featured: true
weight: 1
---
Have you ever wondered how companies like Netflix recommend your favorite movies or how e-commerce platforms handle vast amounts of data to personalize your shopping experience 🤔?
Behind the scenes, these capabilities often rely on **Apache Spark**, a powerful distributed computing system designed for big data processing. Spark simplifies working with massive datasets by enabling fast and scalable data processing across clusters of computers.
Let’s dive into Spark to understand its core concepts.
## Introduction
Apache Spark is a unified, multi-language (Python, Java, Scala, and R) computing engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters, together with a set of libraries for parallel data processing.
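To make this concrete, here is a minimal sketch using the Python API (PySpark); the application name and the tiny dataset are purely illustrative, and the same idea can be expressed in Java, Scala, or R.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession -- the entry point to a Spark application.
spark = (
    SparkSession.builder
    .appName("hello-spark")   # illustrative application name
    .master("local[*]")       # run locally, using all available CPU cores
    .getOrCreate()
)

# A tiny DataFrame, just to confirm the engine is working.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()
```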
At a high level, Spark provides several libraries that extend its functionality and are used in specialized data processing tasks.
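For example, with Spark SQL (one of those libraries) you can register a DataFrame as a temporary view and query it with plain SQL. The sketch below is only illustrative; the table and column names are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A made-up sales dataset, registered as a temporary view.
sales = spark.createDataFrame(
    [("books", 120.0), ("games", 75.5), ("books", 30.0)],
    ["category", "amount"],
)
sales.createOrReplaceTempView("sales")

# Query it with plain SQL; Spark plans and executes this in parallel.
spark.sql(
    "SELECT category, SUM(amount) AS total FROM sales GROUP BY category"
).show()
```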
At the heart of all these specialized libraries is **Spark Core**. Spark Core is responsible for basic functionalities like task scheduling, memory management, fault tolerance, and interactions with storage systems.
### RDDs
**RDDs (Resilient Distributed Datasets)** are the fundamental building blocks of Spark Core. They represent an immutable, distributed collection of objects that can be processed in parallel across a cluster. More about RDDs is discussed later.
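As a quick, illustrative taste (the data and session name below are made up), here is how an RDD can be created and processed in parallel with PySpark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext   # the low-level entry point for working with RDDs

# Distribute a local collection across (up to) 4 partitions.
numbers = sc.parallelize(range(1, 11), numSlices=4)

# Transform the data in parallel: keep even numbers, then square them.
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(squares_of_evens.collect())   # [4, 16, 36, 64, 100]
```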
### 1\. Spark Driver
The Spark driver (a process) is like the “brain” of your Spark application. It’s responsible for controlling everything: it decides which tasks to run, keeps track of the application’s progress, and talks to the cluster manager to get the computing power it needs. Essentially, it orchestrates the entire job and checks on the tasks being handled by the worker nodes (executors). In short, the driver manages the lifecycle of the Spark application.
### 2\. Spark Executors
Executors (also processes) are the “workers” that actually do the processing. They take instructions from the driver, execute the tasks, and send the results back. Every Spark application gets its own set of executors, which run on different machines in the cluster. They are responsible for completing tasks, storing data, reporting results, and re-running any tasks that fail.
### 3\. Cluster Manager
The cluster manager is like a “resource manager.” It manages the machines that make up your cluster and ensures that the Spark driver and executors have enough resources (like CPU and memory) to do their jobs. Spark can work with several types of cluster managers, such as YARN, Mesos, or Kubernetes, or it can use its built-in manager.
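To see how the driver, the executors, and the cluster manager come together, here is an illustrative configuration sketch: the master URL tells Spark which cluster manager to use, while the `spark.executor.*` settings size the executors. The values are made up, and the executor settings only really matter when you run on an actual cluster rather than in local mode.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-config-demo")
    .master("local[*]")                      # or "yarn", "k8s://https://<host>:<port>", ...
    .config("spark.executor.memory", "4g")   # memory per executor
    .config("spark.executor.cores", "2")     # CPU cores per executor
    .getOrCreate()
)
```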
In a **wide transformation**, data from multiple parent RDD/DataFrame partitions must be shuffled (redistributed) to form new partitions. These operations involve **network communication**, making them more expensive.
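For instance, in the illustrative PySpark sketch below, `mapValues` is a narrow transformation, while `reduceByKey` is a wide one, because summing the values for each key forces a shuffle across partitions.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("shuffle-demo").getOrCreate().sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

doubled = pairs.mapValues(lambda v: v * 2)       # narrow: each partition is processed independently
totals = pairs.reduceByKey(lambda x, y: x + y)   # wide: values for each key are shuffled together

print(totals.collect())   # [('a', 4), ('b', 6)] (order may vary)
```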
**Actions** are operations that trigger the execution of transformations and return results to the driver program. They are the point where Spark actually evaluates the lazy transformations applied to an RDD, DataFrame, or Dataset.
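A short illustrative sketch: the `filter` below is a lazy transformation that does nothing on its own, while `count`, `take`, and `collect` are actions that actually kick off the computation and return results to the driver.

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("actions-demo").getOrCreate().sparkContext

# Transformations are lazy: nothing runs yet.
evens = sc.parallelize(range(10)).filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation and bring results back to the driver.
print(evens.count())     # 5
print(evens.take(3))     # [0, 2, 4]
print(evens.collect())   # [0, 2, 4, 6, 8]
```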