Fix DataFrame fragmentation PerformanceWarning #355
@@ -106,6 +106,7 @@ def _transform_bool_to_int(self):
     def _create_row_id_col(self):
         """Create a unique row identifier if not provided."""
         if not self.row_id_col:
+            self.data = self.data.copy()
             self.data["row_id"] = list(range(len(self.data)))
Comment on lines +109 to 110
-            self.data = self.data.copy()
-            self.data["row_id"] = list(range(len(self.data)))
+            # Use a shallow copy to avoid the cost of a full deep copy of the DataFrame
+            # while still ensuring we don't mutate any external references to the original object.
+            self.data = self.data.copy(deep=False)
+            self.data["row_id"] = np.arange(len(self.data))
Copilot AI, Mar 11, 2026
`list(range(len(self.data)))` materializes a Python list of all row ids, which is avoidable overhead for large datasets. Prefer generating row ids with a vectorized range (e.g., `np.arange` or `pd.RangeIndex`) to reduce memory pressure and speed up column creation.
-            self.data["row_id"] = list(range(len(self.data)))
+            self.data["row_id"] = pd.RangeIndex(len(self.data))
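
For reference, a minimal sketch comparing the three ways of building the `row_id` column discussed in this thread. The toy DataFrame and the variable names are assumptions for illustration only and are not taken from the project; the point is that all three assignments yield the same positional ids, even when the frame has a non-default index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a non-default index (not from the project).
df = pd.DataFrame({"feature": [0.5, 1.5, 2.5]}, index=["a", "b", "c"])

# Current code: builds a full Python list before assignment.
via_list = df.copy()
via_list["row_id"] = list(range(len(via_list)))

# First suggestion: a NumPy integer array.
via_arange = df.copy()
via_arange["row_id"] = np.arange(len(via_arange))

# Second suggestion: a pandas RangeIndex, assigned positionally
# (an Index is treated as array-like, so no label alignment happens).
via_range_index = df.copy()
via_range_index["row_id"] = pd.RangeIndex(len(via_range_index))

# All three variants should hold the same ids.
same_values = (
    via_list["row_id"].tolist()
    == via_arange["row_id"].tolist()
    == via_range_index["row_id"].tolist()
    == [0, 1, 2]
)
```

The resulting integer dtype may differ between platforms and NumPy versions, so downstream code should probably not depend on it.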
This `copy()` rebinds `self.data` to a new DataFrame only when `row_id_col` is not provided. In `OctoStudy.fit`, the original `data` object is later persisted as `data_raw.parquet`, so this change will stop the auto-generated `row_id` column from appearing in `data_raw.parquet` (previously it did). If `data_raw.parquet` is expected to include `row_id`, you’ll need to adjust how/when the copy happens (or how raw vs prepared data is written) to keep outputs consistent.
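
The rebinding effect described above can be reproduced in isolation. This is only a sketch: the `Preparer` class and the `raw` variable are hypothetical stand-ins, not the project's actual `OctoStudy` code, and no parquet file is written here.

```python
import pandas as pd


class Preparer:
    """Hypothetical stand-in for the class that owns _create_row_id_col."""

    def __init__(self, data, row_id_col=None):
        self.data = data
        self.row_id_col = row_id_col

    def create_row_id_col(self):
        if not self.row_id_col:
            # Rebinds self.data to a new object; the caller's frame is untouched.
            self.data = self.data.copy()
            self.data["row_id"] = list(range(len(self.data)))


# The caller keeps a reference to the original frame, as a later
# "write raw data" step would in this scenario.
raw = pd.DataFrame({"feature": [10, 20, 30]})
prep = Preparer(raw)
prep.create_row_id_col()

raw_has_row_id = "row_id" in raw.columns            # False after the change
prepared_has_row_id = "row_id" in prep.data.columns  # True
```

Without the `copy()` line, the column would be added to `raw` in place, which matches the previous behavior described in the comment. The same holds for `copy(deep=False)`: adding a column to a shallow copy does not add it to the original frame.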