Commit db61986

Ranjana Babu authored and committed
add contribution_plan.md file
1 parent a76c8cd commit db61986

1 file changed: +98 -31 lines changed

contribution_plan.md

@@ -1,3 +1,4 @@
1+
12
## 1. Basic Information
23

34
- **Project Name:** pandas
@@ -19,7 +20,12 @@
1920
## 2. Contribution Guidelines
2021

2122
- **Are there clear steps in a CONTRIBUTING.md file?**
22-
No
23+
❌ No. The project uses a `contributing.rst` file instead of `CONTRIBUTING.md`. This file provides comprehensive guidelines for contributing to pandas, including:
24+
1. Accepted contribution types such as bug fixes, documentation updates, feature enhancements, and suggestions.
25+
2. Steps to identify suitable tasks by selecting issues labeled as "good first issue" or "Docs".
26+
3. A version control workflow that involves: Forking the repository → Cloning it locally → Creating a new branch → Making changes → Submitting a Pull Request (PR).
27+
4. Instructions for setting up the development environment using conda and regularly syncing with the upstream main branch.
28+
5. Best practices for writing meaningful commit messages, referencing related issues, and ensuring all tests pass before submission.
2329

2430
- **Is there a Code of Conduct?**
2531
✅ Yes, the project follows a [Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) based on the Contributor Covenant to ensure a welcoming and respectful community.
@@ -33,35 +39,96 @@
3339
- Offers clear contribution steps
3440
- Encourages community interaction on GitHub discussions and issues
3541

36-
---
37-
3842
## 3. Environment Setup
3943

40-
- **How do you set up the project locally?**
41-
42-
1. **Install Anaconda**
43-
- Download from [https://www.anaconda.com](https://www.anaconda.com)
44-
45-
2. **Create a conda environment**
46-
conda create -n pandas-dev python=3.10 -y
47-
conda activate pandas-dev
48-
49-
50-
3. **Clone the GitHub repository**
51-
git clone https://github.com/pandas-dev/pandas.git
52-
cd pandas
53-
54-
4. **Install dependencies**
55-
pip install -r requirements-dev.txt
56-
57-
5. **(Optional) Build pandas from source**
58-
python setup.py build_ext --inplace
59-
60-
6. **(Optional) Run tests**
61-
pytest pandas
62-
63-
- **Any dependencies or setup steps?**
64-
Yes — dependencies are managed through `requirements-dev.txt` and include:
65-
- `numpy`, `cython`
66-
- `pytest`, `mypy`, `black`, `flake8`
67-
- `isort`, `versioneer`, and others required for linting, testing, and building
44+
### Steps to Set Up Locally:
45+
46+
1. Fork the repository on GitHub to your account.
47+
2. Clone the repository locally:
48+
```bash
49+
git clone https://github.com/<your-username>/pandas.git
50+
cd pandas
51+
```
52+
3. Create and activate a development environment using conda:
53+
```bash
54+
conda create -n devenv python=3.10
55+
conda activate devenv
56+
```
57+
4. Install development dependencies:
58+
```bash
59+
pip install -r requirements-dev.txt
60+
```
61+
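5. Build the C extensions required by pandas (recent pandas versions build with Meson, so consult the current contributing guide if this command fails):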
62+
```bash
63+
python setup.py build_ext --inplace
64+
```
65+
6. (Optional but recommended) Run the test suite to validate your environment:
66+
```bash
67+
pytest pandas/tests/
68+
```
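
Before running the full test suite, a quick check that the key packages are importable in the active environment can save time. The snippet below is a minimal sketch using only the standard library; the package names passed in are illustrative assumptions, not an authoritative list of pandas development dependencies.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative package names (assumption); adjust to match requirements-dev.txt.
report = installed_versions(["numpy", "cython", "pytest", "pandas"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing suggests the environment was not activated or the dependency install step did not complete.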
69+
70+
## 4. Making a Contribution
71+
72+
- **Open Issue Chosen:**
73+
[BUG: Groupby aggregate coercion of outputs inconsistency for pyarrow dtypes #61636](https://github.com/pandas-dev/pandas/issues/61636)
74+
75+
- **Issue Summary:**
76+
When `groupby(...).agg()` is used on PyArrow-backed DataFrames, the output is sometimes coerced to pandas-native dtypes such as `float64` instead of preserving the original PyArrow dtypes. This leads to unexpected results and breaks downstream workflows that rely on dtype stability.
77+
78+
### Steps to Resolve the Issue:
79+
80+
1. Reproduce the issue locally by creating a DataFrame backed by PyArrow dtypes and performing `groupby(...).agg()` with functions like `'sum'`, `'first'`, etc.
81+
2. Implement a fix to ensure aggregation on PyArrow-backed DataFrames:
82+
- Maintain the original PyArrow dtypes wherever applicable.
83+
- Avoid coercion unless required by the aggregation operation.
84+
3. Modify the logic in the relevant pandas core modules (likely under `pandas/core/groupby/`, for example `generic.py`, `ops.py`, or `groupby.py`).
85+
4. Add targeted unit tests under `pandas/tests/groupby/` to cover aggregation behavior on PyArrow-backed data.
86+
5. Run all tests to ensure that the issue is resolved and no regressions are introduced.
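
The reproduction in step 1 can be sketched roughly as follows. This is an illustrative sketch, not the reproducer from the issue: the frame contents, the column names, and the helper names `lost_arrow_backing` and `reproduce` are all hypothetical. It reports columns that were Arrow-backed on input but came back with a non-Arrow dtype after aggregation.

```python
def lost_arrow_backing(input_dtypes, output_dtypes):
    """Return columns that were Arrow-backed before aggregation but not after.

    Both arguments map column name -> dtype string, e.g. {"val": "int32[pyarrow]"}.
    Columns absent from the output (such as the grouping key) are skipped.
    A change between two Arrow dtypes (e.g. int32 -> int64) is not reported,
    since some aggregations may legitimately widen the type.
    """
    return {
        col: (before, output_dtypes[col])
        for col, before in input_dtypes.items()
        if col in output_dtypes
        and before.endswith("[pyarrow]")
        and not output_dtypes[col].endswith("[pyarrow]")
    }


def reproduce(aggfunc="sum"):
    """Aggregate a small PyArrow-backed frame and report lost Arrow dtypes."""
    # Imported lazily so the helper above works without pandas installed.
    # Requires pandas with pyarrow available.
    import pandas as pd

    # Hypothetical example data; the issue may use different inputs.
    df = pd.DataFrame(
        {
            "key": ["a", "b", "a"],
            "val": pd.array([1, 2, 3], dtype="int32[pyarrow]"),
        }
    )
    result = df.groupby("key").agg(aggfunc)
    before = {col: str(dtype) for col, dtype in df.dtypes.items()}
    after = {col: str(dtype) for col, dtype in result.dtypes.items()}
    return lost_arrow_backing(before, after)
```

Calling `reproduce("sum")` or `reproduce("first")` in the development environment returns an empty dict when Arrow backing is preserved, and a mapping of column name to `(before, after)` dtype strings otherwise.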
87+
88+
## 5. Create a Pull Request Plan
89+
90+
### Pull Request Workflow:
91+
92+
1. Create a new feature branch:
93+
```bash
94+
git checkout -b fix-groupby-coercion-pyarrow
95+
```
96+
97+
2. Make the required code changes in the appropriate files.
98+
99+
3. Add and commit the changes:
100+
```bash
101+
git add .
102+
git commit -m "BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636)"
103+
```
104+
105+
4. Push the changes to your fork:
106+
```bash
107+
git push origin fix-groupby-coercion-pyarrow
108+
```
109+
110+
5. Open a Pull Request on GitHub from your branch to `pandas-dev/pandas:main`.
111+
112+
### Example PR Title:
113+
```
114+
BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636)
115+
```
116+
117+
### PR Description:
118+
```
119+
This PR addresses [#61636](https://github.com/pandas-dev/pandas/issues/61636), which reports inconsistent dtype coercion during groupby aggregation on PyArrow-backed DataFrames. Specifically, aggregations like 'sum' or 'first' on columns with Arrow dtypes (e.g., int32, uint64) may return outputs with unexpected pandas-native dtypes like float64.
120+
121+
The fix ensures that aggregation operations on Arrow-backed columns preserve the original Arrow dtypes wherever possible, improving consistency and reliability for downstream workflows.
122+
123+
New unit tests have been added to validate aggregation outputs for PyArrow-backed DataFrames and confirm dtype stability.
124+
125+
Closes #61636.
126+
```
127+
128+
### Testing the Fix:
129+
130+
- Run the test suite using:
131+
```bash
132+
pytest pandas/tests/groupby/
133+
```
134+
- Confirm that all tests pass and that the new tests adequately cover the issue scenario involving Arrow dtypes.
