Commit ea031c8

committed
Reformat
1 parent 5d91154 commit ea031c8

File tree: 1 file changed (+24, -33 lines)

content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md

Lines changed: 24 additions & 33 deletions
@@ -28,55 +28,48 @@ Let’s break down our description:

## Where to Run Spark?

1. **Run Spark Locally**

    * Install Java (required as Spark is written in Scala and runs on the JVM) and Python (if using the Python API).

    * Visit [Spark's download page](http://spark.apache.org/downloads.html), choose "Pre-built for Hadoop 2.7 and later," and download the TAR file.

    * Extract the TAR file and navigate to the directory.

    * Launch consoles in the preferred language (a quick smoke test is sketched after this list):

        * Python: `./bin/pyspark`

        * Scala: `./bin/spark-shell`

        * SQL: `./bin/spark-sql`

2. **Run Spark in the Cloud**

    * No installation required; provides a web-based interactive notebook environment.

    * **Option**: Use [Databricks Community Edition \[free\]](https://www.databricks.com/try-databricks#account)

3. **Building Spark from Source**

    * **Source**: Download the source code from the [Apache Spark download page](http://spark.apache.org/downloads.html).

    * **Instructions**: Follow the README file in the source package for building Spark.
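
Once a local console is running, a quick smoke test might look like the following. This is a minimal sketch: inside `./bin/pyspark` a `SparkSession` is already exposed as `spark`, so nothing else needs to be configured.

```python
# Run inside ./bin/pyspark, where `spark` (a SparkSession) is predefined.
myRange = spark.range(1000).toDF("number")      # a single-column DataFrame with the numbers 0..999
print(myRange.where("number % 2 = 0").count())  # 500 even numbers
```
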
## Spark Components

![](<https://cdn.hashnode.com/res/hashnode/image/upload/v1732356652608/fe1bec32-214f-4784-aab3-8c2d1798c01a.png> align="center")

## High-Level Components (Spark Applications)

At a high level, Spark provides several libraries that extend its functionality and are used in specialized data processing tasks.

1. **SparkSQL**: SparkSQL allows users to run SQL queries on large datasets using Spark’s distributed infrastructure. Whether interacting with structured or semi-structured data, SparkSQL makes querying data easy, using either SQL syntax or the DataFrame API (a small example follows this list).

2. **MLlib**: It provides distributed algorithms for a variety of machine learning tasks such as classification, regression, clustering, recommendation systems, etc.

3. **GraphX**: GraphX is Spark’s API for graph-based computations. Whether you're working with social networks or recommendation systems, GraphX allows you to process and analyze graph data efficiently using distributed processing.

4. **Spark Streaming**: It enables the processing of live data streams from sources like Kafka or TCP sockets, turning streaming data into real-time analytics.
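
As a taste of SparkSQL, here is a minimal sketch. It assumes a `SparkSession` named `spark` and a hypothetical `people.json` file; the file name and columns are illustrative, not from the original post.

```python
# Load data, register it as a temporary view, and query it with plain SQL.
people = spark.read.json("people.json")   # hypothetical input file
people.createOrReplaceTempView("people")

adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

# The same query through the DataFrame API:
people.where("age >= 18").select("name", "age").show()
```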

## Spark Core

@@ -89,9 +82,8 @@ At the heart of all these specialized libraries is **Spark Core**. Spark Core is
### DAG Scheduler and Task Scheduler

* **DAG Scheduler**: Spark breaks down complex workflows into smaller stages by creating a Directed Acyclic Graph (DAG). The DAG Scheduler optimizes this execution plan by determining which operations can be performed in parallel and orchestrating how the tasks should be executed (see the snippet after this list).

* **Task Scheduler**: After the DAG is scheduled, the Task Scheduler assigns tasks to worker nodes in the cluster. It interacts with the Cluster Manager to distribute tasks across the available resources.
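
One way to peek at the plan these schedulers end up executing is `explain()`; the resulting stage and task breakdown is visible in the Spark UI (by default at `http://localhost:4040`). A minimal sketch, assuming a `SparkSession` named `spark`:

```python
from pyspark.sql import functions as F

df = spark.range(1_000_000)
agg = df.groupBy((F.col("id") % 10).alias("bucket")).count()

# Prints the logical and physical plans that the DAG Scheduler splits into stages;
# the Exchange (shuffle) step is where a new stage boundary appears.
agg.explain(True)
```
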
### Cluster Managers and Storage Systems

@@ -103,7 +95,7 @@ Spark's ability to interact with these diverse storage systems allows users to w

## Spark’s Basic Architecture

![](<https://cdn.hashnode.com/res/hashnode/image/upload/v1732357077877/d3ac6b43-9b95-48ed-8720-8cedc9c6550b.png> align="center")

### 1\. The Spark Driver

@@ -122,11 +114,10 @@ The cluster manager is like a “resource manager.” It manages the machines th

Spark can run in different ways, depending on how you want to set it up:

* **Cluster Mode**: In this mode, both the driver and executors run on the cluster. This is the most common way to run Spark in production.

* **Client Mode**: The driver runs on your local machine (the client) from where the Spark application is submitted, but the executors run on the cluster. This is often used when you're testing or developing.

* **Local Mode**: Everything runs on a single machine. Spark uses multiple threads for parallel processing to simulate a cluster. This is useful for learning, testing, or development, but not for big production jobs (a local-mode sketch follows this list).
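
As a concrete illustration of local mode, the sketch below builds a `SparkSession` that runs everything in threads on one machine. This is only an illustration; cluster and client mode are normally chosen when submitting a job (for example via `spark-submit --deploy-mode cluster` or `--deploy-mode client`) against a real cluster manager.

```python
from pyspark.sql import SparkSession

# Local mode: the driver and the "executors" are just threads on this machine.
# local[*] means "use as many worker threads as there are CPU cores".
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("local-mode-demo")
    .getOrCreate()
)

print(spark.range(10).count())  # 10
spark.stop()
```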

## Spark’s Low-Level APIs

@@ -139,9 +130,8 @@ An RDD represents a distributed collection of immutable records that can be proc

**Key properties of RDDs**

* **Fault Tolerance:** RDDs maintain a lineage graph that tracks the transformations applied to the data. If a partition is lost due to a node failure, Spark can recompute that partition by reapplying the transformations from the original dataset.

* **In-Memory Computation:** RDDs are designed for in-memory computation, which allows Spark to process data much faster than traditional disk-based systems. By keeping data in memory, Spark minimizes disk I/O and reduces latency. Both properties are sketched below.
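
A minimal sketch of both properties, assuming a `SparkContext` named `sc` (in the interactive shells it is available as `spark.sparkContext`):

```python
rdd = sc.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)

# Lineage: the chain of transformations Spark would replay to rebuild a lost partition.
print(rdd.toDebugString().decode())

# In-memory computation: cache the result so later actions reuse it instead of recomputing it.
rdd.cache()
print(rdd.count())  # the first action materializes and caches the partitions
print(rdd.sum())    # subsequent actions read the cached data from memory
```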

**Creating RDDs**

@@ -172,13 +162,14 @@ The Spark DataFrame is one of the most widely used APIs in Spark, offering a hig

It is a powerful Structured API that represents data in a tabular format, similar to a spreadsheet, with named columns defined by a schema. Unlike a traditional spreadsheet, which exists on a single machine, a Spark DataFrame can be distributed across thousands of computers. This distribution is essential for handling large datasets that cannot fit on one machine or for speeding up computations.

The DataFrame concept is not unique to Spark: R and Python (Pandas) also include DataFrames, but those are typically limited to a single machine's resources. Fortunately, Spark’s language interfaces allow for easy conversion of Pandas DataFrames in Python and R DataFrames to Spark DataFrames, enabling users to leverage distributed computing for enhanced performance.
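
For example, an existing Pandas DataFrame can be handed to Spark directly. A minimal sketch, assuming a `SparkSession` named `spark`; the column names and values are made up for illustration:

```python
import pandas as pd

pdf = pd.DataFrame({"name": ["Ada", "Linus"], "age": [36, 54]})  # single-machine Pandas DataFrame

sdf = spark.createDataFrame(pdf)  # distribute it as a Spark DataFrame
sdf.show()

small_result = sdf.toPandas()     # collect back to the driver as Pandas (only for small results)
```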

Below is a comparison of distributed versus single-machine analysis.

![](<https://cdn.hashnode.com/res/hashnode/image/upload/v1732358269819/a38be331-b107-4065-bdda-ab40b0bcbff9.png> align="center")

> Note: Spark also provides the Dataset API, which combines the benefits of RDDs and DataFrames by offering both compile-time type safety and query optimization. However, the Dataset API is only supported in Scala and Java, not in Python.

## Partitions

Spark breaks up data into chunks called partitions, allowing executors to work in parallel. A partition is a collection of rows that reside on a single machine in the cluster. By default, partitions are sized at 128 MB, though this can be adjusted. The number of partitions affects parallelism—fewer partitions can limit performance, even with many executors, and vice versa.
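
To make this concrete, you can inspect and change the partitioning of a DataFrame. A small sketch, assuming a `SparkSession` named `spark`:

```python
df = spark.range(0, 1_000_000)

print(df.rdd.getNumPartitions())   # how many partitions Spark chose by default

df8 = df.repartition(8)            # redistribute the data into 8 partitions
print(df8.rdd.getNumPartitions())  # 8
```
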
@@ -192,7 +183,7 @@ In Spark, the core data structures are immutable, meaning once they’re created

For example, to keep only the even numbers in a DataFrame, you would use:

```python
# myRange is a DataFrame of numbers; it could be created with spark.range(1000).toDF("number")
divisBy2 = myRange.where("number % 2 = 0")
```

This code performs a transformation but produces no immediate output. That’s because transformations are **lazy**: they do not execute immediately; instead, Spark builds a Directed Acyclic Graph (DAG) of transformations that is executed only when an **action** is triggered. Transformations are the heart of Spark’s business logic and can be of two types: narrow and wide.
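
To see the laziness, trigger an action on the result of the transformation above (a small sketch continuing that example):

```python
# `where` only recorded a transformation; calling an action executes the DAG.
print(divisBy2.count())  # 500, if myRange held the numbers 0..999
```
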
@@ -203,15 +194,15 @@ In a **narrow transformation**, each partition of the parent RDD/DataFrame contr

Examples: `map`, `filter`

![](<https://cdn.hashnode.com/res/hashnode/image/upload/v1732344376978/13ba890a-95c4-4f21-b35f-ac19570daff1.png> align="center")
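
A minimal sketch of narrow transformations on an RDD, assuming a `SparkContext` named `sc` (in the shells, `spark.sparkContext`):

```python
nums = sc.parallelize(range(10), numSlices=4)  # 4 partitions

# Narrow: each output partition depends on exactly one input partition, so no shuffle is needed.
squares_of_evens = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.collect())  # [0, 4, 16, 36, 64]
```
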
### Wide Transformations

In a **wide transformation**, data from multiple parent RDD/DataFrame partitions must be shuffled (redistributed) to form new partitions. These operations involve **network communication**, making them more expensive.

Examples: `groupByKey`, `reduceByKey`, `join`

![](<https://cdn.hashnode.com/res/hashnode/image/upload/v1732348673655/115212c0-c124-47fa-8c5b-de1f659c0866.png> align="center")
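
And a wide transformation with the same assumed `sc`, where a shuffle is required:

```python
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 3), ("b", 2)], 4)

# Wide: rows with the same key must be moved across the network to the same
# partition before they can be combined, so a shuffle happens here.
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())  # [('a', 4), ('b', 3)] (order may vary)
```
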
## Actions