PySpark Detailed Notes & Best Practices

🔥 What is PySpark?

PySpark is the Python API for Apache Spark, enabling distributed big data processing directly from Python. While Spark itself is written in Scala, PySpark uses the Py4J library for interoperability, so you can work in Python instead of Scala or Java.

  • Combines Python's simplicity with Spark's power
  • Highly scalable, handling datasets up to petabyte scale
  • Can be up to ~100x faster than Hadoop MapReduce for in-memory workloads
  • Keeps intermediate data in memory (RAM) to minimize disk I/O

🚀 Getting Started with PySpark

Installation

pip install pyspark

Initialize a Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Example") \
    .getOrCreate()

Creating an RDD

data = ["Spark", "is", "awesome"]
rdd = spark.sparkContext.parallelize(data)
print(rdd.collect())

βš™οΈ Databricks Cluster Essentials

  • Databricks Cluster: A bundle of compute resources for running Spark jobs and notebooks.
    • All-purpose Cluster: For interactive, collaborative notebook work.
    • Job Cluster: Created for a scheduled job and terminated when the job finishes, which keeps costs down.

🖥️ Driver Node vs Worker Node

Feature  | Driver Node                                  | Worker Node
Function | Runs the main Spark program, schedules tasks | Executes tasks in parallel
Storage  | Maintains job metadata/state                 | Handles data read/write

🔄 RDD Operations

Transformations (Lazy, Build a DAG)

  • map(): Apply function to each element
  • flatMap(): Like map, but flattens results
  • filter(): Keeps elements matching a condition
  • groupBy(): Groups elements by the result of a key function

data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
squared_rdd = rdd.map(lambda x: x*x)
print(squared_rdd.collect())  # [1, 4, 9, 16, 25]

Actions (Trigger Execution)

  • count(): Number of elements
  • collect(): All elements as a list
  • take(n): First n elements
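
A quick sketch of these actions, reusing the RDD pattern from earlier (assumes the same spark session):

data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

print(rdd.count())    # 5 -- number of elements
print(rdd.collect())  # [1, 2, 3, 4, 5] -- all elements as a Python list
print(rdd.take(2))    # [1, 2] -- first two elements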

πŸ—οΈ Window Functions vs GroupBy

Aspect  | Window Functions                                       | GroupBy Aggregates
Use     | Row-based calculations (ranking, moving averages)      | Aggregation (sum, avg, count)
Scope   | A window of rows within each group; every row is kept  | The entire group collapses to one row
Example | ROW_NUMBER(), LAG(), LEAD()                            | SUM(), AVG(), COUNT()
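
A minimal sketch of the contrast, assuming a hypothetical df with department and salary columns: groupBy collapses each department to one row, while a window keeps every row and attaches the aggregate to it.

from pyspark.sql.window import Window
from pyspark.sql.functions import avg

# GroupBy: one output row per department
dept_avg = df.groupBy("department").agg(avg("salary").alias("avg_salary"))

# Window: every input row is kept, with the department average added to it
w = Window.partitionBy("department")
df_with_avg = df.withColumn("dept_avg_salary", avg("salary").over(w))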

📊 Spark Ecosystem Components

  • Spark Core: Distributed task scheduling
  • Spark SQL: Structured data & DataFrame API (SQL-like)
  • Spark Streaming: Real-time streaming analytics
  • MLlib: Machine learning algorithms
  • GraphX: Graph analytics

🔄 Data Ingestion Modes

Mode      | Description
Batch     | Process data in fixed chunks (large, periodic loads)
Real-time | Process data as it arrives (live dashboards, alerts)
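
A rough sketch of the two modes using Structured Streaming, assuming a hypothetical directory of JSON event files (streaming file sources need an explicit schema):

# Batch: read a fixed snapshot of files in one pass
batch_df = spark.read.json("/data/events/2024-01-01/")

# Real-time: continuously pick up new files as they land in the directory
stream_df = (spark.readStream
             .schema(batch_df.schema)   # streaming file sources require a schema
             .json("/data/events/"))

query = (stream_df.writeStream
         .format("console")
         .outputMode("append")
         .start())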

⚡ Building an ETL Pipeline in PySpark

An end-to-end Extract → Transform → Load (ETL) workflow:

# Extract: read the raw CSV (inferSchema makes numeric columns usable in arithmetic)
df = spark.read.csv("input.csv", header=True, inferSchema=True)
# Transform: derive a new column from an existing one
df = df.withColumn("new_col", df["existing_col"] * 10)
# Load: write the result out as Parquet
df.write.format("parquet").save("output.parquet")

πŸ—οΈ Data Lake vs Data Warehouse

Feature    | Data Warehouse | Data Lake
Data Type  | Structured     | All formats (structured/unstructured)
Processing | Batch          | Batch & streaming
Use Case   | Analytics/BI   | Data science, ML, analytics

🏢 Data Warehouse vs Data Mart

Feature     | Data Warehouse      | Data Mart
Scope       | Entire enterprise   | Project/department
Data Volume | Large               | Smaller
Focus       | Aggregate analytics | Departmental reporting

🔄 Delta Lake vs Data Lake

Feature             | Delta Lake | Data Lake
ACID Support        | Yes        | No
Schema Enforcement  | Yes        | No
Metadata Management | Advanced   | Basic

🔗 Data Integration

Bringing data together from various sources for a unified analytics view (joins, merges, pipeline design).
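
A small sketch of the idea, assuming hypothetical customer (CSV) and order (Parquet) sources joined into one analytics view:

customers = spark.read.csv("customers.csv", header=True, inferSchema=True)
orders = spark.read.parquet("orders.parquet")

# Join the sources on a shared key and expose the result for downstream queries
unified = customers.join(orders, on="customer_id", how="left")
unified.createOrReplaceTempView("customer_orders")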

πŸ“ Delta Lake Version Control

Manage time travel and rollbacks:

-- Show the table's commit history
DESCRIBE HISTORY employee1;
-- Query an earlier version of the table (time travel)
SELECT * FROM employee1@v1;
-- Roll the table back to version 1
RESTORE TABLE employee1 TO VERSION AS OF 1;
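
The same time travel is available from PySpark; a sketch assuming the table's files live at a hypothetical Delta path:

# Read version 1 of the table (time travel); timestampAsOf works the same way
df_v1 = (spark.read.format("delta")
         .option("versionAsOf", 1)
         .load("/mnt/delta/employee1"))

# The commit history can also be inspected through SQL from Python
spark.sql("DESCRIBE HISTORY employee1").show()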

📑 Views in Spark

  • Temporary View: Within current session
  • Global Temp View: Accessible across all sessions within the same Spark application (via the global_temp database)

df.createOrReplaceTempView("temp_view")
df.createGlobalTempView("global_view")
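
Both views are queried with spark.sql; note that global temp views live in the reserved global_temp database:

spark.sql("SELECT * FROM temp_view").show()
spark.sql("SELECT * FROM global_temp.global_view").show()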

πŸ” Window Functions in PySpark

Example: Assigning row numbers by department and salary order.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("department").orderBy("salary")
df = df.withColumn("row_number", row_number().over(window_spec))

📌 Advanced PySpark Topics

User Defined Functions (UDFs)

Apply custom Python logic to DataFrame columns.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def custom_function(value):
    return value.upper()

uppercase_udf = udf(custom_function, StringType())
df = df.withColumn("uppercase_column", uppercase_udf(df["existing_column"]))

Handling Missing Values

df = df.na.fill({"age": 0, "name": "Unknown"})  # Fill missing values
df = df.na.drop()  # Drop rows with any nulls

DataFrame Joins

df1.join(df2, df1.id == df2.id, "inner").show()
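
A couple of variations on the same join, as a sketch: passing the column name instead of an expression avoids a duplicate id column in the result, and the how argument selects the join type.

# Single id column in the output; a left join keeps unmatched rows from df1
df1.join(df2, on="id", how="left").show()

# Anti join: rows of df1 that have no match in df2
df1.join(df2, on="id", how="left_anti").show()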

πŸ” PySpark when and otherwise – Conditional Logic

Overview

Mimics SQL's CASE WHEN for conditional transformations.

  • Use when() and otherwise() from pyspark.sql.functions
  • More efficient than UDFs for such logic

Syntax

from pyspark.sql.functions import when, col

df = df.withColumn(
    "age_group",
    when(col("age") < 18, "Minor")
    .when((col("age") >= 18) & (col("age") < 60), "Adult")
    .otherwise("Senior")
)

Chaining Conditions

when(cond1, val1).when(cond2, val2).otherwise(default)

Sample Output

name  | age | age_group
Alice | 17  | Minor
Bob   | 25  | Adult
Cathy | 62  | Senior

Best Practices

  • Use when over UDF for better speed and optimization
  • Reference columns inside when with col()
  • Order matters: first match applies

Catalyst Optimizer vs. AQE in Spark

Catalyst Optimizer

  • Spark's static (pre-execution) query optimizer.
  • Builds the initial plan, chosen as the most efficient one it can estimate before the query runs.
  • Uses rules and estimated stats to:
    • Push filters down
    • Remove unused columns
    • Reorder joins for speed

Adaptive Query Execution (AQE)

  • Introduced in Spark 3.0. Runs during query execution.
  • Improves the plan as data is processed (runtime).
  • Can:
    • Change join type (e.g., sort-merge to broadcast)
    • Merge small partitions or split big ones
    • Fix skewed data issues

Key Differences

Catalyst               | AQE
Before execution       | During execution
Uses static stats      | Uses runtime stats
Makes the initial plan | Adjusts the plan as it runs

How do they work together?

  • Catalyst makes the first good plan.
  • AQE tweaks and improves it as the job runs based on real data.

Why use both?

  • Catalyst is good for most queries.
  • AQE is great for big or unpredictable data where things might change as the query runs.

Remember: Catalyst = BEFORE running (static). AQE = WHILE running (dynamic). Both aim to make Spark SQL run as fast as possible.
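
A small sketch of the runtime side, assuming a hypothetical df: AQE and its sub-features are controlled through Spark SQL configuration, and explain() shows the plan Catalyst produced (AQE then refines it while the job runs).

# AQE is enabled by default since Spark 3.2; the flags can still be set explicitly
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Show the logical and physical plans chosen by Catalyst for this query
df.groupBy("department").count().explain(True)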


🦾 Additional PySpark Features & Best Practices

  • Broadcast Joins: Use broadcast() from pyspark.sql.functions on small tables to avoid shuffles in joins (see the sketch after this list)
  • Cache/Persist/Unpersist: Use .cache() or .persist() to reuse DataFrames/RDDs in memory, and free that memory with .unpersist() when you are done
  • Partitioning: Repartition with .repartition() or .coalesce() for performance tuning
  • Spark SQL Optimization: Use explain() to view and optimize execution plans
  • Reading Different Formats: Supports CSV, Parquet, Avro, ORC, JSON, Delta, etc.
  • Logging: Use Spark's built-in logger to trace issues
  • Job Monitoring: Monitor runs via Spark UI or Databricks UI
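
A minimal sketch tying several of these together, assuming two hypothetical Parquet datasets (a large fact table and a small dimension table):

from pyspark.sql.functions import broadcast

large_df = spark.read.parquet("/data/transactions")
small_df = spark.read.parquet("/data/countries")

# Broadcast join: ship the small table to every executor and avoid a shuffle
joined = large_df.join(broadcast(small_df), on="country_code", how="left")

# Cache a result that is reused several times, then release the memory
joined.cache()
joined.count()      # the first action materializes the cache
joined.unpersist()

# Reduce the number of output files without a full shuffle, then write
joined.coalesce(8).write.mode("overwrite").parquet("/data/enriched")

# Inspect the physical plan to confirm the broadcast is applied
joined.explain()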

🏁 Conclusion

PySpark empowers you to build robust, highly scalable data pipelines with Python and distributed Spark compute. Mastering its API, DataFrames, RDDs, SQL features, and best practices equips you for large-scale data engineering, analytics, and machine learning.

Keep exploring: window functions, performance tuning, streaming, advanced joins, and integrations with cloud platforms for real-world, production ETL and analytics!


Z-Ordering in Databricks

  • What is Z-Ordering?
    • It arranges (sorts) your data files by values in certain columns, clustering similar values together inside Delta Lake tables.
    • Makes queries much faster when you filter by those columns, because Spark can skip over lots of irrelevant data files.
  • Why use it?
    • To make queries on big tables run faster and cheaper when filtering by high-cardinality columns (columns with many unique values).
    • Especially helpful when you often query on more than one column (multi-column filters).
  • How to use it?
    • Run the OPTIMIZE command and specify columns using ZORDER BY.
    • Works well with columns you filter on most often.

Example

Suppose you have an events table and most queries filter on userId and eventDate. You can Z-Order the data on those columns:

OPTIMIZE delta.`/mnt/delta/events`
ZORDER BY (userId, eventDate)

Or in PySpark:

spark.sql("OPTIMIZE delta.`/mnt/delta/events` ZORDER BY (userId, eventDate)")

How Z-Ordering Works

Imagine data scattered randomly versus data organized by Z-order: Z-Ordering clusters related values together so Spark can efficiently skip unrelated data blocks.

Diagram: on the left, values are mixed across files, so a filtered query has to scan every file; on the right, files are organized by the Z-ordered columns, so queries filtering on them scan only a few files and skip the rest.

Quick Facts

  • Use Z-Ordering on high-cardinality columns often filtered in WHERE queries.
  • Typically, Z-Order on 1–3 columns.
  • Re-run OPTIMIZE after adding lots of new data.
  • Works best alongside partitioning (partition for broad categories, Z-Order for frequently filtered columns).

In summary: Z-Ordering makes your queries faster and cheaper by organizing rows in files so Spark easily finds only the data you need. Use it for large Delta Lake tables where query performance matters, especially with multi-column filters.
