Better SparkSession settings for localhost

Users need to configure their SparkSession for localhost development so computations run fast and so that they don't run out of memory.

Here are some examples I ran on my local machine that has 64GB of RAM on the 1e9 h2o groupby dataset (has 1 billion rows of data).

Here's the "better config":

```
builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.executor.memory", "10G")
    .config("spark.driver.memory", "25G")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .config("spark.sql.shuffle.partitions", "2")
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

Here's the default config:

```

builder = (
    pyspark.sql.SparkSession.builder.appName("MyApp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)

spark = configure_spark_with_delta_pip(builder).getOrCreate()
```

## groupby query

This query takes 104 seconds with the "better config":

```
delta_table = delta.DeltaTable.forPath(
    spark, f"{Path.home()}/data/deltalake/G1_1e9_1e2_0_0"
)

delta_table.toDF().groupby("id3").agg(F.sum("v1"), F.mean("v3")).limit(10).collect()
```

This same query errors out with the default config.

## join query

This query takes 69 seconds with the "better config", but 111 seconds with the default config:

```
x = spark.read.format("delta").load(f"{Path.home()}/data/deltalake/J1_1e9_1e9_0_0")
small = spark.read.format("parquet").load(f"{Path.home()}/data/J1_1e9_1e3_0_0.parquet")

spark.sql('select x.id2, sum(small.v2) from x join small using (id1) group by x.id2').show()
```

## Conclusion

`SparkSession` configurations significantly impact the localhost Spark runtime experience.

How can we make it easy for Spark users to get optimal configurations for localhost development?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better SparkSession settings for localhost #143

groupby query

join query

Conclusion

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Better SparkSession settings for localhost #143

Description

groupby query

join query

Conclusion

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions