Commit 2a5aa8b

Fix grammar issue
1 parent cbe3cb9 commit 2a5aa8b


content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ weight: 1
 
 Before I knew Spark, I mostly used only Pandas and NumPy for data processing. They worked well for small datasets, and I never really had any issues—until I had to process a really large dataset—millions of rows. The processing slowed down, and eventually, I ran into an out of memory error. I tried splitting the data into chunks and tweaking my code, but nothing really worked well.
 
-On top of that, if my process failed due to a network issue, I had to start over. Running these jobs was expensive(long-running ec2 instances), and the time wasted on reruns was frustrating. That’s when I started looking for a better solution.
+On top of that, if my process failed due to a network issue, I had to start over. Running these jobs was expensive(long-running EC2 instances), and the time wasted on reruns was frustrating. That’s when I started looking for a better solution.
 
 That’s when I came across Apache Spark. It handles large datasets by distributing the work across multiple machines.
 
@@ -103,7 +103,7 @@ Spark can run in different ways, depending on how you want to set it up:
 
 They are the fundamental building block of Spark's older API, introduced in the Spark 1.x series. While RDDs are still available in Spark 2.x and beyond, they are no longer the default API due to the introduction of higher-level abstractions like DataFrames and Datasets. However, every operation in Spark ultimately gets compiled down to RDDs, making it important to understand their basics. The Spark UI also displays job execution details in terms of RDDs, so having a working knowledge of them is essential for debugging and optimization.
 
-An RDD represents a distributed collection of immutable records that can be processed in parallel across a cluster. Unlike DataFrames(High-Level API), where records are structured and organized into rows with known schemas, RDDs are more flexible. They allow developers to store and manipulate data in any format—whether Java, Scala, or Python objects. This flexibility gives you a lot of control but requires more manual effort compared to using higher-level APIs like DataFrames.
+An RDD represents a distributed collection of immutable records that can be processed in parallel across a cluster. Unlike DataFrames (High-Level API), where records are structured and organized into rows with known schemas, RDDs are more flexible. They allow developers to store and manipulate data in any format—whether Java, Scala, or Python objects. This flexibility gives you a lot of control but requires more manual effort compared to using higher-level APIs like DataFrames.
 
 ### Key properties of RDDs
 
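For context, the RDD paragraph touched by the second hunk can be illustrated with a minimal PySpark sketch. This is not from the blog post or the commit; the app name and sample data are made up, and it only demonstrates the properties the paragraph describes (a distributed, immutable collection of arbitrary Python objects processed in parallel):

```python
from pyspark.sql import SparkSession

# Illustrative session setup; "rdd-sketch" is an arbitrary app name.
spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext

# parallelize() distributes a local collection across the cluster as an RDD.
# The elements are plain Python tuples — no schema, unlike a DataFrame.
records = sc.parallelize([("spark", 3), ("pandas", 1), ("spark", 2)])

# Transformations build a new RDD; the original stays immutable.
totals = records.reduceByKey(lambda a, b: a + b)

# Actions trigger the actual distributed computation.
print(totals.collect())  # e.g. [('spark', 5), ('pandas', 1)]

spark.stop()
```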