Fix DataFrame fragmentation PerformanceWarning by nihaase · Pull Request #355 · emdgroup/octopus

nihaase · 2026-03-11T19:08:06Z

Summary

Defragment the DataFrame with .copy() in _create_row_id_col before inserting the row_id column, eliminating the pandas PerformanceWarning about high fragmentation.
Prior preparation steps (_standardize_null_values, _standardize_inf_values, _transform_bool_to_int) fragment the DataFrame via repeated column-slice assignments; the .copy() consolidates blocks before the final insert.
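
The effect described above can be reproduced outside the project with a minimal sketch. It assumes that inserting columns one at a time is an adequate stand-in for the fragmentation caused by the preparation steps; `build_fragmented` and `warnings_on_row_id_insert` are hypothetical helper names, not part of octopus.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning


def build_fragmented(n_rows: int = 5, n_cols: int = 150) -> pd.DataFrame:
    """Build a frame one column at a time, leaving one block per column."""
    frame = pd.DataFrame(index=range(n_rows))
    with warnings.catch_warnings():
        # Silence the warnings raised while deliberately fragmenting the frame.
        warnings.simplefilter("ignore", PerformanceWarning)
        for i in range(n_cols):
            frame[f"col_{i}"] = np.zeros(n_rows)
    return frame


def warnings_on_row_id_insert(frame: pd.DataFrame) -> int:
    """Insert row_id and count the PerformanceWarnings pandas emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame["row_id"] = list(range(len(frame)))
    return sum(issubclass(w.category, PerformanceWarning) for w in caught)


# Insert into the fragmented frame directly, then into a defragmented copy.
fragmented_count = warnings_on_row_id_insert(build_fragmented())
defragmented_count = warnings_on_row_id_insert(build_fragmented().copy())
print(fragmented_count, defragmented_count)
```

The first count should be non-zero and the second zero, because the default deep copy consolidates the blocks before the insert.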

Defragment the DataFrame with `.copy()` before inserting the `row_id` column. Prior preparation steps fragment the internal block structure via repeated column-slice assignments, triggering a PerformanceWarning on the next insert.

Fixes #354

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR addresses the pandas "PerformanceWarning: DataFrame is highly fragmented" message emitted during data preparation. It defragments the working DataFrame immediately before adding the autogenerated row_id column in OctoDataPreparator.

Changes:

Defragment OctoDataPreparator.data via self.data.copy() when row_id_col is not provided.
Add the row_id column after defragmentation to avoid triggering pandas’ fragmentation warning on wide datasets.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Copilot · 2026-03-11T20:53:43Z

octopus/study/data_preparator.py

+            self.data = self.data.copy()
            self.data["row_id"] = list(range(len(self.data)))


self.data = self.data.copy() performs a deep copy of the entire DataFrame, which can be very memory/time expensive on large datasets and may negate the performance win from defragmentation. Consider using an approach that avoids copying the full underlying data (e.g., building a new DataFrame via a non-deep copy / assign of the new column) while still preventing fragmentation warnings, and document the intended tradeoff here if the deep copy is required.

Suggested change

-            self.data = self.data.copy()
-            self.data["row_id"] = list(range(len(self.data)))
+            # Use a shallow copy to avoid the cost of a full deep copy of the DataFrame
+            # while still ensuring we don't mutate any external references to the original object.
+            self.data = self.data.copy(deep=False)
+            self.data["row_id"] = np.arange(len(self.data))
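
Whether a shallow copy still prevents the warning is worth checking before adopting this suggestion. The sketch below compares `copy(deep=False)` with the default deep copy on a deliberately fragmented frame. It relies on the assumption that pandas consolidates blocks only during a deep copy, which is an implementation detail that may differ between versions; the helper names are hypothetical.

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning


def build_fragmented(n_rows: int = 5, n_cols: int = 150) -> pd.DataFrame:
    """Build a frame one column at a time, leaving one block per column."""
    frame = pd.DataFrame(index=range(n_rows))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", PerformanceWarning)
        for i in range(n_cols):
            frame[f"col_{i}"] = np.zeros(n_rows)
    return frame


def count_warnings(frame: pd.DataFrame) -> int:
    """Insert row_id and count the PerformanceWarnings pandas emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame["row_id"] = np.arange(len(frame))
    return sum(issubclass(w.category, PerformanceWarning) for w in caught)


# Shallow copy: shares the existing blocks, so fragmentation is expected to remain.
shallow_count = count_warnings(build_fragmented().copy(deep=False))
# Deep copy (the default): blocks are consolidated before the insert.
deep_count = count_warnings(build_fragmented().copy())
print("shallow:", shallow_count, "deep:", deep_count)
```

If the shallow count is non-zero in the target pandas version, the shallow copy saves memory but does not remove the warning this PR is meant to fix.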

Copilot · 2026-03-11T20:53:44Z

octopus/study/data_preparator.py

    def _create_row_id_col(self):
        """Create a unique row identifier if not provided."""
        if not self.row_id_col:
+            self.data = self.data.copy()


This copy() rebinds self.data to a new DataFrame only when row_id_col is not provided. In OctoStudy.fit, the original data object is later persisted as data_raw.parquet, so this change will stop the auto-generated row_id column from appearing in data_raw.parquet (previously it did). If data_raw.parquet is expected to include row_id, you’ll need to adjust how/when the copy happens (or how raw vs prepared data is written) to keep outputs consistent.

Suggested change

-            self.data = self.data.copy()
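
The rebinding concern can be illustrated with a small sketch. `PreparatorSketch` is a hypothetical stand-in, not the project's `OctoDataPreparator`, and it assumes the caller keeps a reference to the same DataFrame object it passed in, as the comment says `OctoStudy.fit` does.

```python
import pandas as pd


class PreparatorSketch:
    """Hypothetical stand-in for a preparator that holds a caller's frame."""

    def __init__(self, data: pd.DataFrame, copy_first: bool) -> None:
        self.data = data
        self.copy_first = copy_first

    def create_row_id_col(self) -> None:
        if self.copy_first:
            # Rebinds self.data; the caller's frame is no longer the one mutated.
            self.data = self.data.copy()
        self.data["row_id"] = list(range(len(self.data)))


# Without the copy, the column is added to the caller's object in place.
raw_in_place = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
PreparatorSketch(raw_in_place, copy_first=False).create_row_id_col()

# With the copy, only the preparator's new frame receives the column.
raw_rebound = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
prep = PreparatorSketch(raw_rebound, copy_first=True)
prep.create_row_id_col()

print("row_id" in raw_in_place.columns)
print("row_id" in raw_rebound.columns)
print("row_id" in prep.data.columns)
```

Any code that later writes the caller's original object would therefore see `row_id` only in the in-place variant.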

Copilot · 2026-03-11T20:53:44Z

octopus/study/data_preparator.py

        """Create a unique row identifier if not provided."""
        if not self.row_id_col:
+            self.data = self.data.copy()
            self.data["row_id"] = list(range(len(self.data)))


list(range(len(self.data))) materializes a Python list of all row ids, which is avoidable overhead for large datasets. Prefer generating row ids with a vectorized range (e.g., via NumPy/Pandas range types) to reduce memory pressure and speed up column creation.

Suggested change

-            self.data["row_id"] = list(range(len(self.data)))
+            self.data["row_id"] = pd.RangeIndex(len(self.data))
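
As a quick check that the alternatives are interchangeable, the sketch below assigns row ids three ways on a small frame with a default index and compares the resulting values. The frame and column names are made up for illustration. Only values are compared, since the integer dtype produced by `np.arange` can depend on the platform and NumPy version.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [0.5, 1.5, 2.5, 3.5]})
n = len(frame)

# Current approach: materializes a Python list of ints first.
frame["from_list"] = list(range(n))
# Vectorized alternatives mentioned in the review comments.
frame["from_arange"] = np.arange(n)
frame["from_range_index"] = pd.RangeIndex(n)

same_values = (
    frame["from_list"].tolist()
    == frame["from_arange"].tolist()
    == frame["from_range_index"].tolist()
)
print(same_values)
print(frame.dtypes)
```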

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

nihaase requested a review from Copilot March 11, 2026 19:08

Copilot started reviewing on behalf of nihaase March 11, 2026 19:09

Copilot AI reviewed Mar 11, 2026

nihaase requested a review from Copilot March 11, 2026 20:50

Copilot started reviewing on behalf of nihaase March 11, 2026 20:50

Copilot AI reviewed Mar 11, 2026

anwurl requested a review from Copilot March 14, 2026 19:47

Copilot started reviewing on behalf of anwurl March 14, 2026 19:48

Copilot AI reviewed Mar 14, 2026

nihaase wants to merge 1 commit into main from fix/354

2 participants
