content/blog/apache-spark-unleashing-big-data-with-rdds-dataframes-and-beyond.md
6 additions & 6 deletions
@@ -23,11 +23,11 @@ Let’s break down our description:
`Computing Engine`: It focuses on computation rather than storage, allowing it to work with various storage systems like Hadoop, Amazon S3, and Apache Cassandra. This flexibility makes Spark suitable for diverse environments, including cloud and streaming applications.
- `Libraries`: It provides a unified API for common data analysis tasks. It supports both standard libraries that ship with the engine as well as external libraries published as third-party packages by the open-source communities. The standard libraries includes libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Structured Streaming), and graph analytics (GraphX).
+ `Libraries`: It provides a unified API for common data analysis tasks. It supports both standard libraries that ship with the engine as well as external libraries published as third-party packages by the open-source communities. The standard libraries include libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Structured Streaming), and graph analytics (GraphX).
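To give a feel for what this unified API looks like in practice, here is a minimal sketch using the Python API; the session name, sample data, and view name are purely illustrative:

```python
from pyspark.sql import SparkSession

# One SparkSession is the entry point to the DataFrame API, Spark SQL,
# and the other standard libraries that ship with the engine.
spark = SparkSession.builder.appName("unified-api-sketch").getOrCreate()

# A tiny DataFrame, registered as a temporary view so it can be queried with SQL.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.createOrReplaceTempView("items")

# The same session runs Spark SQL queries against that view.
spark.sql("SELECT label, count(*) AS n FROM items GROUP BY label").show()

spark.stop()
```

MLlib (`pyspark.ml`) and Structured Streaming plug into the same session and engine in the same way.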
## Where to Run Spark?
- 1.**Run Spark Locally**
+ ### Run Spark Locally
* Install Java (required as Spark is written in Scala and runs on the JVM) and Python (if using the Python API); a quick verification script is sketched after this list.
@@ -43,13 +43,13 @@ Let’s break down our description:
* SQL: `./bin/spark-sql`
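Once Java and Python are in place and one of the shells above starts, a short standalone script is another way to sanity-check a local install. This is a rough sketch using the Python API; the `local[*]` master string and the app name are arbitrary illustrative choices:

```python
from pyspark.sql import SparkSession

# Run Spark locally, using as many worker threads as there are CPU cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("local-install-check") \
    .getOrCreate()

# A trivial job: if this prints five rows, the local setup works end to end.
spark.range(5).show()

spark.stop()
```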
- 2.**Run Spark in the Cloud**
+ ### Run Spark in the Cloud
* No installation required; provides a web-based interactive notebook environment.
* **Option**: Use [Databricks Community Edition \[free\]](https://www.databricks.com/try-databricks#account)
- 3.**Building Spark from Source**
+ ### Building Spark from Source
* **Source**: Download the source code from the [Apache Spark download page](http://spark.apache.org/downloads.html).
@@ -127,13 +127,13 @@ They are the fundamental building block of Spark's older API, introduced in the
An RDD represents a distributed collection of immutable records that can be processed in parallel across a cluster. Unlike DataFrames (the high-level API), where records are structured and organized into rows with known schemas, RDDs are more flexible. They allow developers to store and manipulate data in any format—whether Java, Scala, or Python objects. This flexibility gives you a lot of control but requires more manual effort compared to using higher-level APIs like DataFrames.
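As a rough sketch of what this looks like with the Python API (the data and variable names here are purely illustrative), an RDD can be built from ordinary Python objects and transformed in parallel:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-basics-sketch").getOrCreate()
sc = spark.sparkContext  # the entry point for the RDD API

# An RDD can hold arbitrary Python objects; here, plain (user, score) tuples.
events = sc.parallelize([("user1", 3), ("user2", 5), ("user1", 2)])

# Transformations return new, immutable RDDs; the original RDD is never modified.
totals = events.reduceByKey(lambda a, b: a + b)

# Actions such as collect() trigger the actual distributed computation.
print(totals.collect())  # e.g. [('user1', 5), ('user2', 5)]

spark.stop()
```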
- **Key properties of RDDS**
+ ### Key properties of RDDs
* **Fault Tolerance:** RDDs maintain a lineage graph that tracks the transformations applied to the data. If a partition is lost due to a node failure, Spark can recompute that partition by reapplying the transformations from the original dataset (see the sketch after this list).
* **In-Memory Computation:** RDDs are designed for in-memory computation, which allows Spark to process data much faster than traditional disk-based systems. By keeping data in memory, Spark minimizes disk I/O and reduces latency.
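To make these two properties a little more concrete, here is a small sketch in the Python API (the dataset and names are invented for illustration): `toDebugString()` prints the lineage Spark would use to recompute a lost partition, and `cache()` asks Spark to keep the computed partitions in memory for reuse:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-properties-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 100001))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# Fault tolerance: the lineage graph records how this RDD was derived,
# so a lost partition can be rebuilt by replaying these transformations.
print(evens_squared.toDebugString())

# In-memory computation: cache the result so later actions reuse it
# instead of recomputing the map/filter chain from scratch.
evens_squared.cache()
print(evens_squared.count())  # first action computes and caches the partitions
print(evens_squared.sum())    # second action is served from the cached data

spark.stop()
```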
**Creating RDDs**
136
+
### Creating RDDs
137
Now that we've discussed some key RDD properties, let's begin applying them so that you can better understand how to use them.