Feature Type

- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
I wish I could use Pandas to handle large datasets efficiently without running into memory issues. Pandas is great for data analysis, but it struggles with datasets that don't fit in memory. This feature would allow seamless integration between Pandas and PySpark, letting users process large datasets with Spark's distributed computing while keeping the familiar Pandas syntax.
Feature Description
Seamlessly integrate Pandas with PySpark by automatically converting large Pandas DataFrames into Spark DataFrames while preserving Pandas-like syntax for efficient distributed computing. 🚀
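As a rough illustration of the requested behavior, here is a minimal sketch built on the existing `pyspark.pandas` bridge. The helper `promote_if_large` is hypothetical (it is not an existing pandas or PySpark API); `ps.from_pandas` is a documented `pyspark.pandas` function:

```python
import pandas as pd
import pyspark.pandas as ps

# Hypothetical helper sketching the requested behavior: promote a pandas
# DataFrame to a distributed pyspark.pandas DataFrame once it crosses a
# row-count threshold, so pandas-style syntax keeps working either way.
def promote_if_large(pdf: pd.DataFrame, max_rows: int = 1_000_000):
    if len(pdf) > max_rows:
        return ps.from_pandas(pdf)  # documented pyspark.pandas API
    return pdf

pdf = pd.DataFrame({"id": range(2_000_000), "value": range(2_000_000)})
df = promote_if_large(pdf)  # now a pyspark.pandas DataFrame
result = df.groupby("id")["value"].mean()  # same pandas-style call
```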
Alternative Solutions
Several libraries already provide a pandas-like API over distributed or out-of-core backends:

```python
# pyspark.pandas: the pandas API on Spark
import pyspark.pandas as ps
psdf = ps.DataFrame({'id': range(1000000), 'value': range(1000000)})

# Dask: lazy, partitioned pandas-like DataFrames
import dask.dataframe as dd
ddf = dd.read_csv("large_dataset.csv")

# Modin: drop-in pandas replacement backed by Ray or Dask
import modin.pandas as mpd
df = mpd.read_csv("large_file.csv")

# Vaex: memory-mapped, out-of-core DataFrames
import vaex
df = vaex.open("large_file.csv")
```
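For context, `pyspark.pandas` already supports explicit conversion in both directions; the request above is essentially to make this promotion automatic. A small example using documented `pyspark.pandas` APIs (`from_pandas` and `DataFrame.to_pandas`):

```python
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})
psdf = ps.from_pandas(pdf)  # pandas -> distributed pyspark.pandas
back = psdf.to_pandas()     # collect back into a local pandas DataFrame
```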
Additional Context
No response