Fix DataFrame fragmentation PerformanceWarning#355

Open
nihaase wants to merge 1 commit into main from fix/354

nihaase (Collaborator) commented Mar 11, 2026

Summary

  • Defragment the DataFrame with .copy() in _create_row_id_col before inserting the row_id column, eliminating the pandas PerformanceWarning about high fragmentation.
  • Prior preparation steps (_standardize_null_values, _standardize_inf_values, _transform_bool_to_int) fragment the DataFrame via repeated column-slice assignments; the .copy() consolidates blocks before the final insert.

Defragment the DataFrame with `.copy()` before inserting the `row_id`
column. Prior preparation steps fragment the internal block structure
via repeated column-slice assignments, triggering a PerformanceWarning
on the next insert.
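
A minimal sketch of the effect described above, assuming a pandas release in which adding many columns one at a time fragments the frame. The helper names `build_fragmented_frame` and `add_row_id` are hypothetical stand-ins for the preparation steps and `_create_row_id_col`; they are not the project's code.

```python
import warnings

import numpy as np
import pandas as pd


def build_fragmented_frame(n_rows=5, n_cols=150):
    """Add columns one at a time, which tends to leave one block per column."""
    df = pd.DataFrame(index=range(n_rows))
    with warnings.catch_warnings():
        # Silence the warnings raised while deliberately fragmenting the frame.
        warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
        for i in range(n_cols):
            df[f"col_{i}"] = np.zeros(n_rows)
    return df


def add_row_id(df):
    """Hypothetical stand-in for _create_row_id_col: defragment, then insert."""
    df = df.copy()  # a deep copy consolidates the internal blocks
    df["row_id"] = np.arange(len(df))
    return df


fragmented = build_fragmented_frame()

# Record every warning emitted while the row_id column is added.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    prepared = add_row_id(fragmented)

fragmentation_warnings = [
    w for w in caught if issubclass(w.category, pd.errors.PerformanceWarning)
]
print(len(fragmentation_warnings))
```

Because `add_row_id` works on a copy, the frame passed in is left without a `row_id` column, which matters for any caller that keeps a reference to the original object.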

Fixes #354

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR addresses the pandas warning `PerformanceWarning: DataFrame is highly fragmented`, raised during data preparation, by defragmenting the working DataFrame immediately before adding the autogenerated row_id column in OctoDataPreparator.

Changes:

  • Defragment OctoDataPreparator.data via self.data.copy() when row_id_col is not provided.
  • Add the row_id column after defragmentation to avoid triggering pandas’ fragmentation warning on wide datasets.

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Comment on lines +109 to 110
self.data = self.data.copy()
self.data["row_id"] = list(range(len(self.data)))

Copilot AI Mar 11, 2026

self.data = self.data.copy() performs a deep copy of the entire DataFrame, which can be very memory/time expensive on large datasets and may negate the performance win from defragmentation. Consider using an approach that avoids copying the full underlying data (e.g., building a new DataFrame via a non-deep copy / assign of the new column) while still preventing fragmentation warnings, and document the intended tradeoff here if the deep copy is required.

Suggested change
- self.data = self.data.copy()
- self.data["row_id"] = list(range(len(self.data)))
+ # Use a shallow copy to avoid the cost of a full deep copy of the DataFrame
+ # while still ensuring we don't mutate any external references to the original object.
+ self.data = self.data.copy(deep=False)
+ self.data["row_id"] = np.arange(len(self.data))
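
Whether a shallow copy still silences the warning is worth checking before adopting the suggestion, since the pandas warning text itself recommends `newframe = frame.copy()` for defragmenting. The sketch below is one way to compare both variants on a deliberately fragmented frame; `insert_warns` is a hypothetical helper, and the result for the shallow copy may depend on the installed pandas version, so no output is asserted here.

```python
import warnings

import numpy as np
import pandas as pd


def insert_warns(df):
    """Return True if adding one column to df emits a PerformanceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        df["row_id"] = np.arange(len(df))
    return any(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )


# Build a frame column by column so that it ends up with many blocks.
base = pd.DataFrame(index=range(3))
with warnings.catch_warnings():
    warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
    for i in range(150):
        base[f"col_{i}"] = np.ones(3)

results = {
    "deep": insert_warns(base.copy()),              # consolidating copy
    "shallow": insert_warns(base.copy(deep=False)),  # new object, shared data
}
print(results)
```

If the shallow variant reports a warning, the deep copy is the price of defragmentation and the trade-off is worth documenting next to the call.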

def _create_row_id_col(self):
    """Create a unique row identifier if not provided."""
    if not self.row_id_col:
        self.data = self.data.copy()

Copilot AI Mar 11, 2026

This copy() rebinds self.data to a new DataFrame only when row_id_col is not provided. In OctoStudy.fit, the original data object is later persisted as data_raw.parquet, so this change will stop the auto-generated row_id column from appearing in data_raw.parquet (previously it did). If data_raw.parquet is expected to include row_id, you’ll need to adjust how/when the copy happens (or how raw vs prepared data is written) to keep outputs consistent.

Suggested change
- self.data = self.data.copy()
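
The rebinding effect described in this comment can be illustrated in isolation. The two classes below are hypothetical and much simpler than `OctoDataPreparator`; they only show how copying before the insert changes what the caller's original frame contains.

```python
import numpy as np
import pandas as pd


class InPlacePreparator:
    """Hypothetical preparator that mutates the frame it was given."""

    def __init__(self, data):
        self.data = data

    def create_row_id(self):
        self.data["row_id"] = np.arange(len(self.data))


class CopyingPreparator(InPlacePreparator):
    """Hypothetical preparator that rebinds self.data to a copy first."""

    def create_row_id(self):
        self.data = self.data.copy()
        super().create_row_id()


raw_a = pd.DataFrame({"x": [1.0, 2.0]})
InPlacePreparator(raw_a).create_row_id()

raw_b = pd.DataFrame({"x": [1.0, 2.0]})
prep = CopyingPreparator(raw_b)
prep.create_row_id()

print(list(raw_a.columns))      # ['x', 'row_id']  caller's frame was mutated
print(list(raw_b.columns))      # ['x']            caller's frame is untouched
print(list(prep.data.columns))  # ['x', 'row_id']  only the copy has row_id
```

Any code that later writes the caller's original frame to disk would therefore see `row_id` in the first case but not in the second.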

    """Create a unique row identifier if not provided."""
    if not self.row_id_col:
        self.data = self.data.copy()
        self.data["row_id"] = list(range(len(self.data)))

Copilot AI Mar 11, 2026

list(range(len(self.data))) materializes a Python list of all row ids, which is avoidable overhead for large datasets. Prefer generating row ids with a vectorized range (e.g., via NumPy/Pandas range types) to reduce memory pressure and speed up column creation.

Suggested change
- self.data["row_id"] = list(range(len(self.data)))
+ self.data["row_id"] = pd.RangeIndex(len(self.data))
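
For comparison, the three ways of generating row ids mentioned in this review produce the same values. This is a small sketch on a toy frame with a default index, not a benchmark; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]})

# Original approach: materialize a Python list first.
via_list = df.copy()
via_list["row_id"] = list(range(len(via_list)))

# NumPy range: builds the integer array directly.
via_arange = df.copy()
via_arange["row_id"] = np.arange(len(via_arange))

# pandas RangeIndex: assigned as array-like values.
via_range_index = df.copy()
via_range_index["row_id"] = pd.RangeIndex(len(via_range_index))

print(via_list["row_id"].tolist())
print(via_arange["row_id"].tolist())
print(via_range_index["row_id"].tolist())
```

The choice between them mainly affects how much intermediate Python-level data is created before pandas stores the column.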


Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

