ENH: Integrate PySpark in pandas #60961

@asifmohammed1

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to handle large datasets efficiently without running into memory issues. pandas is great for data analysis but struggles with datasets that don't fit in memory. This feature would allow seamless integration between pandas and PySpark, letting users process large datasets with Spark's distributed computing while keeping the familiar pandas syntax.

Feature Description

Seamlessly integrate pandas with PySpark by automatically converting large pandas DataFrames into Spark DataFrames while preserving pandas-like syntax for efficient distributed computing. 🚀
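To make the proposal concrete, here is a minimal sketch of what "automatic" conversion could look like: a hypothetical wrapper (the function name and the size threshold are illustrative, not part of any existing pandas API) that keeps small DataFrames in pandas and hands large ones off to the existing pandas API on Spark (`pyspark.pandas`, available in PySpark 3.2+).

```python
import pandas as pd

def maybe_to_spark(df: pd.DataFrame, threshold_bytes: int = 1_000_000_000):
    """Hypothetical dispatch sketch: return the pandas DataFrame unchanged
    if it is small, otherwise convert it to a pandas-on-Spark DataFrame.

    The 1 GB threshold is an arbitrary placeholder for illustration.
    """
    if df.memory_usage(deep=True).sum() < threshold_bytes:
        # Small enough to stay in-memory: keep using plain pandas.
        return df
    # Too large for one machine: hand off to Spark while keeping
    # pandas-like syntax via the pandas API on Spark.
    import pyspark.pandas as ps
    return ps.from_pandas(df)
```

A user would keep calling familiar pandas methods on the returned object either way, since `pyspark.pandas` mirrors much of the pandas API; the open design question is whether such dispatch should happen implicitly inside pandas or stay an explicit opt-in as it is today.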

Alternative Solutions

# 1. pandas API on Spark (bundled with PySpark)
import pyspark.pandas as ps
psdf = ps.DataFrame({'id': range(1000000), 'value': range(1000000)})

# 2. Dask
import dask.dataframe as dd
ddf = dd.read_csv("large_dataset.csv")

# 3. Modin
import modin.pandas as mpd
df = mpd.read_csv("large_file.csv")

# 4. Vaex
import vaex
df = vaex.open("large_file.csv")

Additional Context

No response

Metadata

    Labels

    Enhancement, Needs Triage (issue that has not been reviewed by a pandas team member)