Users need to configure their SparkSession for localhost development so computations run fast and don't run out of memory.
Here are some examples I ran on my local machine (64 GB of RAM) on the h2o groupby 1e9 dataset, which has 1 billion rows of data.
Here's the "better config":
```python
import pyspark
from delta import configure_spark_with_delta_pip

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.executor.memory", "10G")
    .config("spark.driver.memory", "25G")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config("spark.sql.shuffle.partitions", "2")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```
Here's the default config:
```python
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
```
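For reference, here are the stock values the "better config" overrides (per the Spark configuration docs; worth double-checking against your Spark version):

```python
# Stock Spark defaults that the "better config" overrides
# (values per the Spark configuration documentation; verify for your version).
SPARK_DEFAULTS = {
    "spark.driver.memory": "1g",
    "spark.executor.memory": "1g",
    "spark.sql.shuffle.partitions": "200",
}
```

In local mode everything runs in the driver JVM, so the 1g driver default is the setting that bites first.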
## groupby query
This query takes 104 seconds with the "better config":
```python
import delta
from pathlib import Path
from pyspark.sql import functions as F

delta_table = delta.DeltaTable.forPath(
    spark, f"{Path.home()}/data/deltalake/G1_1e9_1e2_0_0"
)
delta_table.toDF().groupby("id3").agg(F.sum("v1"), F.mean("v3")).limit(10).collect()
```
This same query errors out with the default config.
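A rough back-of-envelope suggests why. The group count and per-entry overhead below are my assumptions, not measurements from the issue:

```python
# Back-of-envelope: hash-aggregation state for a high-cardinality groupby.
# Assumed numbers (not from the issue): 1e7 distinct id3 groups, and roughly
# 100 bytes per hash-map entry (key string, two accumulators, JVM object
# headers).
n_groups = 10_000_000
bytes_per_entry = 100
state_gb = n_groups * bytes_per_entry / 1e9  # ~1 GB of aggregation state
```

Aggregation state of that order, plus the data being scanned, has little headroom inside the 1g driver-memory default, so an OOM is unsurprising.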
## join query
This query takes 69 seconds with the "better config", but 111 seconds with the default config:
```python
from pathlib import Path

x = spark.read.format("delta").load(f"{Path.home()}/data/deltalake/J1_1e9_1e9_0_0")
small = spark.read.format("parquet").load(f"{Path.home()}/data/J1_1e9_1e3_0_0.parquet")

# Register the DataFrames as temp views so spark.sql can reference them by name
x.createOrReplaceTempView("x")
small.createOrReplaceTempView("small")

spark.sql(
    "select x.id2, sum(small.v2) from x join small using (id1) group by x.id2"
).show()
```
## Conclusion
SparkSession configurations significantly impact the localhost Spark runtime experience.
How can we make it easy for Spark users to get optimal configurations for localhost development?
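One possible direction, sketched below: derive the memory settings from the machine itself instead of hardcoding them. This helper is hypothetical (name, fractions, and heuristics are mine, not an existing API):

```python
import os

def local_spark_memory_config(total_ram_gb, cores=None):
    """Suggest localhost SparkSession settings from machine specs.

    Hypothetical helper: the fractions below are rough heuristics,
    not values from any Spark documentation.
    """
    cores = cores or os.cpu_count() or 4
    # Leave roughly half the RAM for the OS and other processes, and give
    # most of the Spark share to the driver, since local mode runs
    # everything in the driver JVM.
    driver_gb = max(1, int(total_ram_gb * 0.4))
    executor_gb = max(1, int(total_ram_gb * 0.15))
    return {
        "spark.driver.memory": f"{driver_gb}G",
        "spark.executor.memory": f"{executor_gb}G",
        # A handful of large shuffle partitions tends to beat the
        # 200-partition default on a single machine.
        "spark.sql.shuffle.partitions": str(max(2, cores // 4)),
    }
```

On a 64 GB machine this yields settings close to the "better config" above (25G driver, 9G executor, 2 shuffle partitions for 8 cores); the resulting dict can be fed into the builder via repeated `.config(...)` calls.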