PySpark is the Python API for Apache Spark, enabling distributed big data processing directly from Python. While Spark is written in Scala, PySpark leverages the Py4j library for interoperability so you can use Python instead of Scala or Java.
- Combines Python's simplicity with Spark's power
- Highly scalable for petabyte-scale data
- Up to 100x faster than Hadoop MapReduce for in-memory workloads
- Keeps intermediate data in RAM to minimize disk I/O
```bash
pip install pyspark
```
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Example") \
    .getOrCreate()

data = ["Spark", "is", "awesome"]
rdd = spark.sparkContext.parallelize(data)
print(rdd.collect())
```
- Databricks Cluster: Bundle of compute resources for running Spark jobs/notebooks.
- All-purpose Cluster: For collaborative explorations and notebooks.
- Job Cluster: Spins up for a job and terminates when it finishes; cost-efficient.
Feature | Driver Node | Worker Node |
---|---|---|
Function | Runs main Spark logic, schedules tasks | Executes tasks in parallel |
Storage | Maintains job metadata/state | Handles data read/write |
- `map()`: Applies a function to each element
- `flatMap()`: Like `map()`, but flattens the results
- `filter()`: Keeps elements matching a condition
- `groupBy()`: Groups elements by key
```python
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
squared_rdd = rdd.map(lambda x: x*x)
print(squared_rdd.collect())  # [1, 4, 9, 16, 25]
```
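For comparison, here is a minimal sketch of `flatMap()` and `filter()`, reusing the `spark` session created earlier (the sample data is made up):

```python
lines = spark.sparkContext.parallelize(["hello world", "spark is fast"])

# flatMap: split each line into words, then flatten into a single RDD of words
words = lines.flatMap(lambda line: line.split(" "))
print(words.collect())  # ['hello', 'world', 'spark', 'is', 'fast']

# filter: keep only the even numbers
nums = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(nums.filter(lambda x: x % 2 == 0).collect())  # [2, 4]
```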
- `count()`: Number of elements
- `collect()`: All elements as a list
- `take(n)`: First n elements
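A tiny illustration of these actions, again reusing the `spark` session from above (the sample data is arbitrary):

```python
nums = spark.sparkContext.parallelize([10, 20, 30, 40, 50])

print(nums.count())    # 5 -> number of elements
print(nums.collect())  # [10, 20, 30, 40, 50] -> all elements as a list
print(nums.take(3))    # [10, 20, 30] -> first 3 elements
```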
Aspect | Window Functions | GroupBy Aggregates |
---|---|---|
Use | Row-based calc (ranking, moving avg) | Aggregation (sum, avg, count) |
Scope | Subset (window) within group | Entire group |
Example | ROW_NUMBER(), LAG(), LEAD() | SUM(), AVG(), COUNT() |
- Spark Core: Distributed task scheduling
- Spark SQL: Structured data & DataFrame API (SQL-like)
- Spark Streaming: Real-time streaming analytics
- MLlib: Machine learning algorithms
- GraphX: Graph analytics
Mode | Description |
---|---|
Batch Processing | Data in fixed chunks (large, periodic uploads) |
Real-time | Process as data arrives (live dashboards, alerts) |
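For the real-time side, Structured Streaming reuses the DataFrame API. A minimal sketch, assuming JSON files land in an input directory; the path and schema below are illustrative, not from this article:

```python
# Streaming read: pick up JSON files as they arrive (streaming sources need an explicit schema)
stream_df = (
    spark.readStream
         .schema("user STRING, amount DOUBLE")
         .json("/data/incoming/")
)

# Write incremental results to the console; useful for local experimentation
query = stream_df.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```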
An end-to-end Extract → Transform → Load (ETL) workflow:
```python
# Extract: read raw CSV data
df = spark.read.csv("input.csv", header=True)

# Transform: derive a new column
df = df.withColumn("new_col", df["existing_col"] * 10)

# Load: write the result out as Parquet
df.write.format("parquet").save("output.parquet")
```
Feature | Data Warehouse | Data Lake |
---|---|---|
Data Type | Structured | All formats (structured/unstructured) |
Processing | Batch | Batch & Streaming |
Use Case | Analytics/BI | Data science, ML, analytics |
Feature | Data Warehouse | Data Mart |
---|---|---|
Scope | Entire enterprise | Project/department |
Data Volume | Large | Smaller |
Focus | Aggregate analytics | Departmental reporting |
Feature | Delta Lake | Data Lake |
---|---|---|
ACID Support | Yes | No |
Schema Enforcement | Yes | No |
Metadata Mgmt | Advanced | Basic |
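To make the Delta Lake column concrete, here is a small sketch of writing and reading a Delta table. It assumes a Delta-enabled Spark session (for example Databricks or the `delta-spark` package) and an illustrative path:

```python
# Write a DataFrame as a Delta table: writes are ACID, and the schema is enforced on later appends
people = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
people.write.format("delta").mode("overwrite").save("/mnt/delta/people")

# Read it back like any other source
spark.read.format("delta").load("/mnt/delta/people").show()
```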
Bringing data together from various sources for a unified analytics view (joins, merges, pipeline design).
Manage time travel and rollbacks:
```sql
DESCRIBE HISTORY employee1;
SELECT * FROM employee1@v1;
RESTORE TABLE employee1 TO VERSION AS OF 1;
```
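The same time travel is available from the PySpark reader; a sketch assuming the table is stored at an illustrative Delta path:

```python
# Read an older snapshot of a Delta table by version number...
v1_df = (spark.read.format("delta")
              .option("versionAsOf", 1)
              .load("/mnt/delta/employee1"))

# ...or by timestamp
old_df = (spark.read.format("delta")
               .option("timestampAsOf", "2024-01-01")
               .load("/mnt/delta/employee1"))
```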
- Temporary View: Within current session
- Global Temp View: Accessible across all sessions of the same Spark application (via the `global_temp` database)
```python
df.createOrReplaceTempView("temp_view")
df.createGlobalTempView("global_view")
```
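Note that global temporary views live in the reserved `global_temp` database, so they are queried with that prefix:

```python
# Temporary view: visible only in the current SparkSession
spark.sql("SELECT * FROM temp_view").show()

# Global temp view: shared across sessions of the same application,
# always accessed through the global_temp database
spark.sql("SELECT * FROM global_temp.global_view").show()
```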
Example: Assigning row numbers by department and salary order.
```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("department").orderBy("salary")
df = df.withColumn("row_number", row_number().over(window_spec))
Apply custom Python logic to DataFrame columns.
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def custom_function(value):
    return value.upper()

uppercase_udf = udf(custom_function, StringType())
df = df.withColumn("uppercase_column", uppercase_udf(df["existing_column"]))
```
```python
df = df.na.fill({"age": 0, "name": "Unknown"})  # Fill missing values
df = df.na.drop()                               # Drop rows with any nulls
```
```python
df1.join(df2, df1.id == df2.id, "inner").show()
```
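The third argument selects the join type; a few common variants with the same `df1`/`df2` (assumed to share an `id` column):

```python
# Left outer join: keep every row of df1, with nulls where df2 has no match
df1.join(df2, df1.id == df2.id, "left").show()

# Left anti join: rows of df1 that have no match in df2
df1.join(df2, df1.id == df2.id, "left_anti").show()

# Joining on the column name avoids duplicate id columns in the result
df1.join(df2, "id", "inner").show()
```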
Mimics SQL's `CASE WHEN` for conditional transformations.
- Use `when()` and `otherwise()` from `pyspark.sql.functions`
- More efficient than UDFs for this kind of logic
```python
from pyspark.sql.functions import when, col

df = df.withColumn(
    "age_group",
    when(col("age") < 18, "Minor")
    .when((col("age") >= 18) & (col("age") < 60), "Adult")
    .otherwise("Senior")
)
```
```python
# General pattern: conditions are checked in order, otherwise() supplies the default
when(cond1, val1).when(cond2, val2).otherwise(default)
```
name | age | age_group |
---|---|---|
Alice | 17 | Minor |
Bob | 25 | Adult |
Cathy | 62 | Senior |
- Use `when()` over a UDF for better speed and optimization
- Reference columns inside `when()` with `col()`
- Order matters: the first matching condition applies
- Spark's static (pre-execution) query optimizer.
- Builds the most efficient plan it can before the query runs.
- Uses rules and estimated stats to:
- Push filters down
- Remove unused columns
- Reorder joins for speed
- Introduced in Spark 3.0. Runs during query execution.
- Improves the plan as data is processed (runtime).
- Can:
- Change join type (e.g., sort-merge to broadcast)
- Merge small partitions or split big ones
- Fix skewed data issues
Catalyst | AQE |
---|---|
Before execution | During execution |
Uses static stats | Uses real-time stats |
Makes initial plan | Adjusts plan as it runs |
- Catalyst makes the first good plan.
- AQE tweaks and improves it as the job runs based on real data.
- Catalyst is good for most queries.
- AQE is great for big or unpredictable data where things might change as the query runs.
Remember: Catalyst = BEFORE running (static). AQE = WHILE running (dynamic). Both aim to make Spark SQL run as fast as possible.
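To see both optimizers at work: `explain()` prints the plan Catalyst produced, and AQE is controlled by a configuration flag (enabled by default since Spark 3.2). The DataFrame here is just the example `df` used earlier:

```python
# Print the logical and physical plans Catalyst produced for a query
df.groupBy("department").count().explain(True)

# AQE can be toggled explicitly; it is on by default in recent Spark versions
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```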
- Broadcast Joins: Use `broadcast()` from `pyspark.sql.functions` on small tables to optimize joins (see the sketch after this list)
- Cache/Persist/Unpersist: Use `.cache()` or `.persist()` to reuse DataFrames/RDDs in memory, and free that memory with `.unpersist()` when you are done
- Partitioning: Repartition with `.repartition()` or `.coalesce()` for performance tuning
- Spark SQL Optimization: Use `explain()` to view and optimize execution plans
- Reading Different Formats: Supports CSV, Parquet, Avro, ORC, JSON, Delta, etc.
- Logging: Use Spark's built-in logger to trace issues
- Job Monitoring: Monitor runs via the Spark UI or Databricks UI
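A compact sketch of the first few techniques; `large_df` and `small_df` are placeholder DataFrames, not names from this article:

```python
from pyspark.sql.functions import broadcast

# Broadcast join: ship the small lookup table to every executor instead of shuffling the big one
joined = large_df.join(broadcast(small_df), "id")

# Cache a DataFrame that is reused several times, then release the memory
joined.cache()
joined.count()      # first action materializes the cache
joined.unpersist()

# Repartition (full shuffle) vs. coalesce (narrow, only reduces partition count)
repartitioned = joined.repartition(200, "id")
fewer_parts = joined.coalesce(10)

# Inspect the physical plan Spark will execute
repartitioned.explain()
```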
PySpark empowers you to build robust, highly scalable data pipelines with Python and distributed Spark compute. Mastering its API, DataFrames, RDDs, SQL features, and best practices equips you for large-scale data engineering, analytics, and machine learning.
Keep exploring: window functions, performance tuning, streaming, advanced joins, and integrations with cloud platforms for real-world, production ETL and analytics!
- What is Z-Ordering?
- It arranges (sorts) your data files by values in certain columns, clustering similar values together inside Delta Lake tables.
- Makes queries much faster when you filter by those columns, because Spark can skip over lots of irrelevant data files.
- Why use it?
- To make queries on big tables run faster and cheaper when filtering by high-cardinality columns (columns with many unique values).
- Especially helpful when you often query on more than one column (multi-column filters).
- How to use it?
- Run the `OPTIMIZE` command and specify columns using `ZORDER BY`.
- Works well with columns you filter on most often.
Suppose you have an events table and most queries filter on `userId` and `eventDate`. You can Z-Order the data on those columns:
```sql
OPTIMIZE delta.`/mnt/delta/events`
ZORDER BY (userId, eventDate)
```
Or in PySpark:
spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (userId, eventDate)")
Imagine data scattered randomly vs. data organized by Z-order. Z-ordering clusters related values together so Spark can efficiently skip unrelated data blocks:
Diagram: On the left, files are mixed. On the right, files are organized so that queries filtering on z-ordered columns only scan a few files, skipping the rest.
- Use Z-Ordering on high-cardinality columns often filtered in WHERE queries.
- Typically, Z-Order on 1β3 columns.
- Re-run OPTIMIZE after adding lots of new data.
- Works best alongside partitioning (partition for broad categories, Z-Order for frequently filtered columns).
In summary: Z-Ordering makes your queries faster and cheaper by organizing rows in files so Spark easily finds only the data you need. Use it for large Delta Lake tables where query performance matters, especially with multi-column filters.