At a high level, Spark provides several libraries that extend its core functionality for specialized data-processing tasks.
1. **Spark SQL**: Spark SQL allows users to run SQL queries on large datasets using Spark’s distributed infrastructure. Whether the data is structured or semi-structured, Spark SQL makes querying it easy through either SQL syntax or the DataFrame API (a minimal example appears after this list).
2. **MLlib**: MLlib provides distributed algorithms for a variety of machine-learning tasks such as classification, regression, clustering, and recommendation.
3. **GraphX**: GraphX is Spark’s API for graph-based computations. Whether you're working with social networks or recommendation systems, GraphX allows you to process and analyze graph data efficiently using distributed processing.
4. **Spark Streaming**: Spark Streaming enables the processing of live data streams from sources such as Kafka or TCP sockets, turning streaming data into real-time analytics.
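
To make the first item concrete, here is a minimal sketch of asking the same question through SQL syntax and through the DataFrame API. The session setup, the `people.json` file, and the `name`/`age` columns are assumptions for illustration, not part of the original text.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; in production the builder would point at a cluster.
spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Hypothetical input file with "name" and "age" columns.
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view so it can be queried with SQL.
people.createOrReplaceTempView("people")

# Same query, two interfaces: SQL syntax ...
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

# ... and the DataFrame API.
adults_df = people.where("age >= 18").select("name", "age")

adults_sql.show()
adults_df.show()
```

Both forms compile down to the same underlying execution plan, so the choice between them is largely a matter of style.
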
## Spark Core
At the heart of all these specialized libraries is **Spark Core**. Spark Core is the underlying execution engine: it handles task scheduling, memory management, fault recovery, and interaction with storage systems.
### DAG Scheduler and Task Scheduler
* **DAG Scheduler**: Spark breaks down complex workflows into smaller stages by creating a Directed Acyclic Graph (DAG). The DAG Scheduler optimizes this execution plan by determining which operations can be performed in parallel and orchestrating how the tasks should be executed.
* **Task Scheduler**: After the DAG is scheduled, the Task Scheduler assigns tasks to worker nodes in the cluster. It interacts with the Cluster Manager to distribute tasks across the available resources (see the sketch below).
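
The schedulers themselves run inside Spark, but you can inspect the plan they will execute before any task starts. A small sketch, assuming an existing `SparkSession` named `spark`; `explain()` prints the physical plan, and each `Exchange` in its output marks a shuffle boundary where a new stage begins.

```python
# A DataFrame with a single "number" column (illustrative).
df = spark.range(1_000_000).withColumnRenamed("id", "number")

# Grouping forces rows with the same key onto the same partition, i.e. a shuffle.
counts = df.groupBy((df.number % 10).alias("bucket")).count()

# Nothing has run yet; explain() only prints the plan the schedulers will execute.
counts.explain()
```
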
### Cluster Managers and Storage Systems
Spark's ability to interact with these diverse storage systems allows users to work with data wherever it is stored.
The cluster manager is like a “resource manager.” It manages the machines that make up the cluster and decides how their resources are shared among Spark applications.
Spark can run in different ways, depending on how you want to set it up:
* **Cluster Mode**: In this mode, both the driver and the executors run on the cluster. This is the most common way to run Spark in production.
* **Client Mode**: The driver runs on your local machine (the client) from which the Spark application is submitted, while the executors run on the cluster. This is often used for testing and development.
* **Local Mode**: Everything runs on a single machine. Spark uses multiple threads for parallel processing to simulate a cluster. This is useful for learning, testing, or development, but not for big production jobs (see the sketch after this list).
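
Cluster and client mode are normally chosen when the application is submitted (for example with `spark-submit --deploy-mode cluster` or `--deploy-mode client`), not from the code itself. Local mode, however, can be requested directly when building the session. A minimal sketch; the application name is arbitrary:

```python
from pyspark.sql import SparkSession

# "local[*]" runs driver and executors in one process, with one worker thread per CPU core.
spark = (
    SparkSession.builder
    .appName("local-mode-example")
    .master("local[*]")
    .getOrCreate()
)

print(spark.sparkContext.master)  # -> local[*]
```
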
## Spark’s Low-Level APIs
An RDD represents a distributed collection of immutable records that can be processed in parallel across the nodes of a cluster.
**Key properties of RDDs**
* **Fault Tolerance:** RDDs maintain a lineage graph that tracks the transformations applied to the data. If a partition is lost due to a node failure, Spark can recompute that partition by reapplying the transformations from the original dataset.
* **In-Memory Computation:** RDDs are designed for in-memory computation, which allows Spark to process data much faster than traditional disk-based systems. By keeping data in memory, Spark minimizes disk I/O and reduces latency. (Both properties are illustrated in the sketch under *Creating RDDs* below.)
**Creating RDDs**
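
The original examples for this subsection are not shown in this excerpt, so here is a minimal sketch of the two most common ways to create an RDD: parallelizing an in-memory collection and reading a text file. The file path is a placeholder, and the `persist()`/`toDebugString()` calls at the end illustrate the in-memory computation and lineage properties described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-example").getOrCreate()
sc = spark.sparkContext

# 1. Parallelize an existing Python collection into an RDD with 4 partitions.
numbers = sc.parallelize(range(1, 101), numSlices=4)

# 2. Read a text file into an RDD of lines (placeholder path).
# lines = sc.textFile("hdfs:///data/logs.txt")

# Transformations build up a lineage; persist() asks Spark to keep the result in memory.
squares = numbers.map(lambda x: x * x).persist()

# toDebugString() prints the lineage graph Spark would replay to rebuild lost partitions.
print(squares.toDebugString().decode())
print(squares.sum())
```
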
## DataFrames

The Spark DataFrame is one of the most widely used APIs in Spark, offering a higher-level abstraction than the RDD.
It is a powerful Structured API that represents data in a tabular format, similar to a spreadsheet, with named columns defined by a schema. Unlike a traditional spreadsheet, which exists on a single machine, a Spark DataFrame can be distributed across thousands of computers. This distribution is essential for handling large datasets that cannot fit on one machine or for speeding up computations.
While the DataFrame concept is not unique to Spark (R and Python also include DataFrames), those implementations are typically limited to a single machine's resources. Fortunately, Spark’s language interfaces allow for easy conversion of Pandas DataFrames in Python and R DataFrames to Spark DataFrames, enabling users to leverage distributed computing for enhanced performance.
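
A minimal sketch of that conversion path, using a throwaway pandas DataFrame as the starting point; the column names are made up for illustration:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# A single-machine pandas DataFrame ...
pdf = pd.DataFrame({"name": ["Ada", "Grace", "Alan"], "age": [36, 45, 41]})

# ... becomes a distributed Spark DataFrame with a named, typed schema.
sdf = spark.createDataFrame(pdf)
sdf.printSchema()
sdf.show()

# Going the other way collects all of the distributed data back onto the driver.
pdf_again = sdf.toPandas()
```
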
Below is a comparison of distributed versus single-machine analysis.
> Note: Spark also provides the Dataset API, which combines the benefits of RDDs and DataFrames by offering both compile-time type safety and query optimization. However, the Dataset API is only supported in Scala and Java, not in Python.
## Partitions
Spark breaks up data into chunks called partitions, allowing executors to work in parallel. A partition is a collection of rows that reside on a single machine in the cluster. By default, partitions are sized at 128 MB, though this can be adjusted. The number of partitions determines the available parallelism: too few partitions leave executors idle even on a large cluster, while too few executors leave partitions queued even on well-partitioned data.
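
A small sketch of inspecting and changing the partition count; the numbers are arbitrary. Note that `repartition` is itself a wide transformation, because it shuffles rows between machines.

```python
# Assumes an existing SparkSession named `spark`.
df = spark.range(0, 1_000_000)

# How many partitions did Spark choose by default?
print(df.rdd.getNumPartitions())

# Increase parallelism with a full shuffle ...
wider = df.repartition(16)

# ... or reduce it without one, e.g. before writing a handful of output files.
narrower = wider.coalesce(4)
print(narrower.rdd.getNumPartitions())  # -> 4
```
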
In Spark, the core data structures are immutable, meaning once they’re created they cannot be changed. To “change” a DataFrame, you instead describe a **transformation** that produces a new one.
For example, to find the even numbers in a DataFrame, you would use:
```python
divisBy2 = myRange.where("number % 2 = 0")  # myRange is a DataFrame with a "number" column
```
This code performs a transformation but produces no immediate output. That’s because transformations are **lazy**: they do not execute immediately. Instead, Spark builds a Directed Acyclic Graph (DAG) of transformations that is executed only when an **action** is triggered. Transformations are the heart of Spark’s business logic and come in two types: narrow and wide.
In a **narrow transformation**, each partition of the parent RDD/DataFrame contributes to at most one partition of the child, so no data needs to move between machines. Examples include `filter` and `map`.
In a **wide transformation**, data from multiple parent RDD/DataFrame partitions must be shuffled (redistributed) to form new partitions. These operations involve **network communication**, making them more expensive.
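
A short sketch contrasting the two and showing when execution actually happens. The definition of `myRange` is not included in this excerpt, so a plausible one is shown; the row count and grouping key are arbitrary.

```python
# Assumes an existing SparkSession named `spark`.
myRange = spark.range(1000).toDF("number")

# Narrow transformation: each input partition contributes to at most one output partition.
evens = myRange.where("number % 2 = 0")

# Wide transformation: rows with the same key must be shuffled onto the same partition.
byRemainder = evens.groupBy((evens.number % 5).alias("remainder")).count()

# Up to this point Spark has only built a DAG of transformations.
# The action below triggers the actual job, including the shuffle.
print(byRemainder.count())
```
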