1 change: 1 addition & 0 deletions octopus/study/data_preparator.py
@@ -106,6 +106,7 @@ def _transform_bool_to_int(self):
     def _create_row_id_col(self):
         """Create a unique row identifier if not provided."""
         if not self.row_id_col:
+            self.data = self.data.copy()
Copilot AI Mar 11, 2026
This copy() rebinds self.data to a new DataFrame only when row_id_col is not provided. In OctoStudy.fit, the original data object is later persisted as data_raw.parquet, so this change will stop the auto-generated row_id column from appearing in data_raw.parquet (previously it did). If data_raw.parquet is expected to include row_id, you’ll need to adjust how/when the copy happens (or how raw vs prepared data is written) to keep outputs consistent.

Suggested change:
-            self.data = self.data.copy()

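The rebinding issue this comment describes can be reproduced in a few lines. This is a hypothetical sketch, not code from the PR: `Preparator` stands in for the real data preparator class, and `raw` for the DataFrame that OctoStudy.fit would later persist as data_raw.parquet.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame later written to data_raw.parquet.
raw = pd.DataFrame({"a": [10, 20, 30]})

class Preparator:
    def __init__(self, data):
        self.data = data  # the caller still holds its own reference to `raw`

    def create_row_id(self):
        self.data = self.data.copy()  # rebinds self.data to a NEW DataFrame
        self.data["row_id"] = list(range(len(self.data)))

prep = Preparator(raw)
prep.create_row_id()

print("row_id" in prep.data.columns)  # True
print("row_id" in raw.columns)        # False: the caller's object never gains the column
```

Because the copy rebinds the attribute rather than mutating the shared object, anything that still writes out the caller's original DataFrame will no longer contain the auto-generated row_id column.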
             self.data["row_id"] = list(range(len(self.data)))
Comment on lines +109 to 110
Copilot AI Mar 11, 2026
self.data = self.data.copy() performs a deep copy of the entire DataFrame, which can be very memory- and time-expensive on large datasets and may negate the performance win from defragmentation. Consider an approach that avoids copying the full underlying data (e.g., a shallow copy followed by assignment of the new column) while still preventing fragmentation warnings, and document the intended tradeoff here if the deep copy is required.

Suggested change:
-            self.data = self.data.copy()
-            self.data["row_id"] = list(range(len(self.data)))
+            # Use a shallow copy to avoid the cost of a full deep copy of the DataFrame
+            # while still ensuring we don't mutate any external references to the original object.
+            self.data = self.data.copy(deep=False)
+            self.data["row_id"] = np.arange(len(self.data))

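A quick sanity check on the suggested shallow copy (a minimal sketch assuming plain pandas semantics, not code from the PR): assigning a new column to a `deep=False` copy does not appear on the original object, and the existing column payload is not duplicated up front.

```python
import numpy as np
import pandas as pd

big = pd.DataFrame(np.zeros((100_000, 5)), columns=list("abcde"))

shallow = big.copy(deep=False)           # new DataFrame object, shared column data
shallow["row_id"] = np.arange(len(shallow))

print("row_id" in shallow.columns)  # True
print("row_id" in big.columns)      # False: the original's schema is untouched
```

The caveat the comment alludes to: on pre-Copy-on-Write pandas, in-place edits to an *existing* column of a shallow copy can still be visible on the original, which is exactly the tradeoff worth documenting if this route is taken.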
Copilot AI Mar 11, 2026
list(range(len(self.data))) materializes a Python list of all row ids, which is avoidable overhead for large datasets. Prefer generating row ids with a vectorized range (e.g., via NumPy/Pandas range types) to reduce memory pressure and speed up column creation.

Suggested change:
-            self.data["row_id"] = list(range(len(self.data)))
+            self.data["row_id"] = pd.RangeIndex(len(self.data))

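The three variants discussed in this thread produce the same values; the difference is only in how the range is materialized. A quick illustrative check (the `id_*` column names are hypothetical, not from the PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": list("abcdef")})

df["id_list"] = list(range(len(df)))   # original: materializes a full Python list
df["id_np"] = np.arange(len(df))       # vectorized NumPy range, one contiguous array
df["id_ri"] = pd.RangeIndex(len(df))   # pandas range type, O(1) to construct

same = (df["id_list"] == df["id_np"]).all() and (df["id_np"] == df["id_ri"]).all()
print(same)  # True
```

For small frames the difference is negligible; the vectorized forms matter when the row count is large enough that a Python list of `n` boxed ints is measurable overhead.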
             self.row_id_col = "row_id"
