|
| 1 | + |
1 | 2 | ## 1. Basic Information
|
2 | 3 |
|
3 | 4 | - **Project Name:** pandas
|
|
19 | 20 | ## 2. Contribution Guidelines
|
20 | 21 |
|
21 | 22 | - **Are there clear steps in a CONTRIBUTING.md file?**
|
22 |
| - No |
| 23 | + ❌ No. The project uses a `contributing.rst` file instead of `CONTRIBUTING.md`. This file provides comprehensive guidelines for contributing to pandas, including: |
| 24 | + 1. Accepted contribution types such as bug fixes, documentation updates, feature enhancements, and suggestions. |
| 25 | + 2. Steps to identify suitable tasks by selecting issues labeled as "good first issue" or "Docs". |
| 26 | + 3. A version control workflow that involves: Forking the repository → Cloning it locally → Creating a new branch → Making changes → Submitting a Pull Request (PR). |
| 27 | + 4. Instructions for setting up the development environment using conda and regularly syncing with the upstream main branch. |
| 28 | + 5. Best practices for writing meaningful commit messages, referencing related issues, and ensuring all tests pass before submission. |
23 | 29 |
|
24 | 30 | - **Is there a Code of Conduct?**
|
25 | 31 | ✅ Yes, the project follows a [Code of Conduct](https://github.com/pandas-dev/pandas/blob/main/.github/CODE_OF_CONDUCT.md) based on the Contributor Covenant to ensure a welcoming and respectful community.
|
|
33 | 39 | - Offers clear contribution steps
|
34 | 40 | - Encourages community interaction on GitHub discussions and issues
|
35 | 41 |
|
36 |
| ---- |
37 |
| - |
38 | 42 | ## 3. Environment Setup
|
39 | 43 |
|
40 |
| -- **How do you set up the project locally?** |
41 |
| - |
42 |
| - 1. **Install Anaconda** |
43 |
| - - Download from [https://www.anaconda.com](https://www.anaconda.com) |
44 |
| - |
45 |
| - 2. **Create a conda environment** |
46 |
| - conda create -n pandas-dev python=3.10 -y |
47 |
| - conda activate pandas-dev |
48 |
| - |
49 |
| - |
50 |
| - 3. **Clone the GitHub repository** |
51 |
| - git clone https://github.com/pandas-dev/pandas.git |
52 |
| - cd pandas |
53 |
| - |
54 |
| - 4. **Install dependencies** |
55 |
| - pip install -r requirements-dev.txt |
56 |
| - |
57 |
| - 5. **(Optional) Build pandas from source** |
58 |
| - python setup.py build_ext --inplace |
59 |
| - |
60 |
| - 6. **(Optional) Run tests** |
61 |
| - pytest pandas |
62 |
| - |
63 |
| -- **Any dependencies or setup steps?** |
64 |
| - Yes — dependencies are managed through `requirements-dev.txt` and include: |
65 |
| - - `numpy`, `cython` |
66 |
| - - `pytest`, `mypy`, `black`, `flake8` |
67 |
| - - `isort`, `versioneer`, and others required for linting, testing, and building |
| 44 | +### Steps to Set Up Locally: |
| 45 | + |
| 46 | +1. Fork the repository on GitHub to your account. |
| 47 | +2. Clone the repository locally: |
| 48 | + ```bash |
| 49 | + git clone https://github.com/<your-username>/pandas.git |
| 50 | + cd pandas |
| 51 | + ``` |
| 52 | +3. Create and activate a development environment using conda: |
| 53 | + ```bash |
| 54 | + conda create -n devenv python=3.10 |
| 55 | + conda activate devenv |
| 56 | + ``` |
| 57 | +4. Install development dependencies: |
| 58 | + ```bash |
| 59 | + pip install -r requirements-dev.txt |
| 60 | + ``` |
| 61 | +5. Build the C extensions required by pandas: |
| 62 | + ```bash |
| 63 | + python -m pip install -ve . --no-build-isolation |
| 64 | + ``` |
| 65 | +6. (Optional but recommended) Run the test suite to validate your environment: |
| 66 | + ```bash |
| 67 | + pytest pandas/tests/ |
| 68 | + ``` |
| 69 | + |
| 70 | +## 4. Making a Contribution |
| 71 | + |
| 72 | +- **Open Issue Chosen:** |
| 73 | + [BUG: Groupby aggregate coercion of outputs inconsistency for pyarrow dtypes #61636](https://github.com/pandas-dev/pandas/issues/61636) |
| 74 | + |
| 75 | +- **Issue Summary:** |
| 76 | + When using `groupby(...).agg()` on PyArrow-backed DataFrames, the output types are sometimes inconsistently coerced to pandas-native dtypes like `float64`, rather than preserving the original PyArrow dtypes. This leads to unexpected results and breaks downstream workflows that rely on dtype stability. |
| 77 | + |
| 78 | +### Steps to Resolve the Issue: |
| 79 | + |
| 80 | +1. Reproduce the issue locally by creating a DataFrame backed by PyArrow dtypes and performing `groupby(...).agg()` with functions like `'sum'`, `'first'`, etc. |
| 81 | +2. Implement a fix to ensure aggregation on PyArrow-backed DataFrames: |
| 82 | + - Maintain the original PyArrow dtypes wherever applicable. |
| 83 | + - Avoid coercion unless required by the aggregation operation. |
| 84 | +3. Modify the logic in the relevant pandas core modules (likely `pandas/core/groupby/generic.py`, `pandas/core/groupby/ops.py`, or `pandas/core/groupby/groupby.py`). |
| 85 | +4. Add targeted unit tests under `pandas/tests/groupby/` to cover aggregation behavior on PyArrow-backed data. |
| 86 | +5. Run all tests to ensure that the issue is resolved and no regressions are introduced. |
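The reproduction in step 1 could be sketched roughly as follows. This is a minimal illustration, not the reproducer from the issue: the helper names `dtype_changes` and `make_arrow_frame`, the column names, and the sample values are all assumptions made for the example, and `make_arrow_frame` needs the optional `pyarrow` package.

```python
import pandas as pd


def dtype_changes(df: pd.DataFrame, by: str, how: str) -> dict:
    """Return {column: (input dtype, output dtype)} for columns whose dtype changed.

    Hypothetical helper for this write-up; it is not part of pandas.
    """
    result = df.groupby(by).agg(how)
    changes = {}
    for col in result.columns:
        before, after = str(df[col].dtype), str(result[col].dtype)
        # Compare the string forms so NumPy and Arrow dtypes are handled alike.
        if before != after:
            changes[col] = (before, after)
    return changes


def make_arrow_frame() -> pd.DataFrame:
    """Build a small PyArrow-backed frame (assumed sample data, requires pyarrow)."""
    return pd.DataFrame(
        {
            "key": pd.array(["a", "a", "b"], dtype="string[pyarrow]"),
            "small": pd.array([1, 2, 3], dtype="int32[pyarrow]"),
            "big": pd.array([10, 20, 30], dtype="uint64[pyarrow]"),
        }
    )
```

Calling `dtype_changes(make_arrow_frame(), "key", "sum")` and then repeating with `"first"` lists any columns whose dtype did not survive the aggregation. An empty dictionary means every dtype was preserved for that operation.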
| 87 | + |
| 88 | +## 5. Create a Pull Request Plan |
| 89 | + |
| 90 | +### Pull Request Workflow: |
| 91 | + |
| 92 | +1. Create a new feature branch: |
| 93 | + ```bash |
| 94 | + git checkout -b fix-groupby-coercion-pyarrow |
| 95 | + ``` |
| 96 | + |
| 97 | +2. Make the required code changes in the appropriate files. |
| 98 | + |
| 99 | +3. Add and commit the changes: |
| 100 | + ```bash |
| 101 | + git add . |
| 102 | + git commit -m "BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636)" |
| 103 | + ``` |
| 104 | + |
| 105 | +4. Push the changes to your fork: |
| 106 | + ```bash |
| 107 | + git push origin fix-groupby-coercion-pyarrow |
| 108 | + ``` |
| 109 | + |
| 110 | +5. Open a Pull Request in GitHub from your branch to `pandas-dev/pandas:main`. |
| 111 | + |
| 112 | +### Example PR Title: |
| 113 | +``` |
| 114 | +BUG: Fix aggregate dtype coercion on pyarrow-backed GroupBy (#61636) |
| 115 | +``` |
| 116 | + |
| 117 | +### PR Description: |
| 118 | +``` |
| 119 | +This PR addresses [#61636](https://github.com/pandas-dev/pandas/issues/61636), which reports inconsistent dtype coercion during groupby aggregation on PyArrow-backed DataFrames. Specifically, aggregations like 'sum' or 'first' on columns with Arrow dtypes (e.g., int32, uint64) may return outputs with unexpected pandas-native dtypes like float64. |
| 120 | +
|
| 121 | +The fix ensures that aggregation operations on Arrow-backed columns preserve the original Arrow dtypes wherever possible, improving consistency and reliability for downstream workflows. |
| 122 | +
|
| 123 | +New unit tests have been added to validate aggregation outputs for PyArrow-backed DataFrames and confirm dtype stability. |
| 124 | +
|
| 125 | +Closes #61636. |
| 126 | +``` |
| 127 | + |
| 128 | +### Testing the Fix: |
| 129 | + |
| 130 | +- Run the test suite using: |
| 131 | + ```bash |
| 132 | + pytest pandas/tests/groupby/ |
| 133 | + ``` |
| 134 | +- Confirm that all tests pass and that the new tests adequately cover the issue scenario involving Arrow dtypes. |