Defragment the DataFrame with `.copy()` before inserting the `row_id` column. Prior preparation steps fragment the internal block structure via repeated column-slice assignments, triggering a `PerformanceWarning` on the next insert.

Fixes #354

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR addresses the pandas `PerformanceWarning: DataFrame is highly fragmented` raised during data preparation by defragmenting the working DataFrame immediately before adding the autogenerated `row_id` column in `OctoDataPreparator`.
Changes:
- Defragment `OctoDataPreparator.data` via `self.data.copy()` when `row_id_col` is not provided.
- Add the `row_id` column after defragmentation to avoid triggering pandas' fragmentation warning on wide datasets.
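The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `make_fragmented` and `insert_warnings` are hypothetical helpers, and the fragmentation is produced by single-column inserts as a stand-in for the column-slice assignments in the real preparation steps.

```python
import warnings

import numpy as np
import pandas as pd


def make_fragmented(n_cols: int = 150, n_rows: int = 5) -> pd.DataFrame:
    """Build a frame one column at a time so each column lands in its own block."""
    df = pd.DataFrame(index=range(n_rows))
    with warnings.catch_warnings():
        # Silence the warning while deliberately fragmenting the frame.
        warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
        for i in range(n_cols):
            df[f"feature_{i}"] = float(i)
    return df


def insert_warnings(df: pd.DataFrame) -> list:
    """Insert a row_id column and return any PerformanceWarnings that were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        df["row_id"] = np.arange(len(df))
    return [w for w in caught if issubclass(w.category, pd.errors.PerformanceWarning)]


fragmented = make_fragmented()
defragmented = make_fragmented().copy()  # the default deep copy consolidates blocks

print(len(insert_warnings(fragmented)))    # warnings raised on the fragmented frame
print(len(insert_warnings(defragmented)))  # warnings raised after .copy()
```

The 150-column width is an arbitrary choice that is comfortably wide enough to trigger the warning; real datasets hit it whenever enough separate blocks accumulate.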
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
    self.data = self.data.copy()
    self.data["row_id"] = list(range(len(self.data)))
`self.data = self.data.copy()` performs a deep copy of the entire DataFrame, which can be expensive in both memory and time on large datasets and may negate the performance win from defragmentation. Consider an approach that avoids copying the full underlying data (e.g., building a new DataFrame via a non-deep copy or an `assign` of the new column) while still preventing fragmentation warnings, and document the intended tradeoff here if the deep copy is required.
Suggested change:

    - self.data = self.data.copy()
    - self.data["row_id"] = list(range(len(self.data)))
    + # Use a shallow copy to avoid the cost of a full deep copy of the DataFrame
    + # while still ensuring we don't mutate any external references to the original object.
    + self.data = self.data.copy(deep=False)
    + self.data["row_id"] = np.arange(len(self.data))
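Before adopting the shallow-copy suggestion, it may be worth checking whether it still clears the warning. As far as I know, `copy(deep=False)` shares the existing blocks rather than consolidating them, so the fragmented layout carries over. The sketch below compares the two options; `make_fragmented` and `warns_on_insert` are hypothetical helpers, and the result should be confirmed against the pandas version actually in use.

```python
import warnings

import numpy as np
import pandas as pd


def make_fragmented(n_cols: int = 150, n_rows: int = 5) -> pd.DataFrame:
    """Build a frame one column at a time so each column lands in its own block."""
    df = pd.DataFrame(index=range(n_rows))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", pd.errors.PerformanceWarning)
        for i in range(n_cols):
            df[f"feature_{i}"] = float(i)
    return df


def warns_on_insert(df: pd.DataFrame) -> bool:
    """Return True if adding row_id to df raises a PerformanceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        df["row_id"] = np.arange(len(df))
    return any(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)


shallow_warns = warns_on_insert(make_fragmented().copy(deep=False))
deep_warns = warns_on_insert(make_fragmented().copy(deep=True))

print(shallow_warns)  # does the shallow copy still trigger the warning?
print(deep_warns)     # does the deep copy still trigger the warning?
```

If the shallow copy still warns, the deep copy is the piece doing the defragmentation, and the tradeoff to document is memory cost versus a clean block layout.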
    def _create_row_id_col(self):
        """Create a unique row identifier if not provided."""
        if not self.row_id_col:
            self.data = self.data.copy()
This `copy()` rebinds `self.data` to a new DataFrame only when `row_id_col` is not provided. In `OctoStudy.fit`, the original data object is later persisted as `data_raw.parquet`, so this change stops the auto-generated `row_id` column from appearing in `data_raw.parquet` (previously it did). If `data_raw.parquet` is expected to include `row_id`, you'll need to adjust how or when the copy happens (or how raw versus prepared data is written) to keep outputs consistent.
    self.data = self.data.copy()
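The rebinding effect in this comment can be shown with a small sketch. `InPlacePreparator` and `CopyingPreparator` are hypothetical stand-ins, not the real `OctoDataPreparator`, and `raw_a` / `raw_b` play the role of the DataFrame that the caller later writes out.

```python
import numpy as np
import pandas as pd


class InPlacePreparator:
    """Hypothetical stand-in: adds row_id to the caller's DataFrame object."""

    def __init__(self, data: pd.DataFrame) -> None:
        self.data = data

    def create_row_id_col(self) -> None:
        # Mutates the same object the caller still holds a reference to.
        self.data["row_id"] = np.arange(len(self.data))


class CopyingPreparator(InPlacePreparator):
    """Hypothetical stand-in: rebinds self.data to a copy before adding row_id."""

    def create_row_id_col(self) -> None:
        # After this line, self.data and the caller's frame are different objects.
        self.data = self.data.copy()
        self.data["row_id"] = np.arange(len(self.data))


raw_a = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
prep_a = InPlacePreparator(raw_a)
prep_a.create_row_id_col()

raw_b = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
prep_b = CopyingPreparator(raw_b)
prep_b.create_row_id_col()

print("row_id" in raw_a.columns)        # caller's frame after in-place insert
print("row_id" in raw_b.columns)        # caller's frame after copy-then-insert
print("row_id" in prep_b.data.columns)  # preparator's own frame
```

If the raw output is meant to carry `row_id`, one option is to write out the preparator's frame rather than the caller's original reference; whether that fits `OctoStudy.fit` is a design decision for the authors.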
        """Create a unique row identifier if not provided."""
        if not self.row_id_col:
            self.data = self.data.copy()
            self.data["row_id"] = list(range(len(self.data)))
`list(range(len(self.data)))` materializes a Python list of all row ids, which is avoidable overhead for large datasets. Prefer generating row ids with a vectorized range (e.g., via NumPy or pandas range types) to reduce memory pressure and speed up column creation.
Suggested change:

    - self.data["row_id"] = list(range(len(self.data)))
    + self.data["row_id"] = pd.RangeIndex(len(self.data))
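The three ways of generating row ids discussed here should yield the same values, differing only in how the intermediate sequence is built. A minimal sketch on a made-up frame (`base` and `n_rows` are illustrative names, not from the PR):

```python
import numpy as np
import pandas as pd

n_rows = 1_000
base = pd.DataFrame({"x": np.zeros(n_rows)})

# Original approach: build a full Python list of ints first.
via_list = base.copy()
via_list["row_id"] = list(range(len(via_list)))

# Vectorized alternative: NumPy allocates the integer array directly.
via_arange = base.copy()
via_arange["row_id"] = np.arange(len(via_arange))

# Alternative from the suggestion: a lazy pandas range, converted on assignment.
via_range_index = base.copy()
via_range_index["row_id"] = pd.RangeIndex(len(via_range_index))

same_values = (
    via_list["row_id"].tolist()
    == via_arange["row_id"].tolist()
    == via_range_index["row_id"].tolist()
)
print(same_values)
```

The integer dtype produced by `np.arange` can depend on the platform and NumPy version, so passing an explicit `dtype` may be worthwhile if the column type matters downstream.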
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Summary
- Call `.copy()` in `_create_row_id_col` before inserting the `row_id` column, eliminating the pandas `PerformanceWarning` about high fragmentation.
- Prior preparation steps (`_standardize_null_values`, `_standardize_inf_values`, `_transform_bool_to_int`) fragment the DataFrame via repeated column-slice assignments; the `.copy()` consolidates blocks before the final insert.