-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
Description
Summary
Refactor the pandas/core/reshape module to improve code quality by reducing duplication, replacing hard-coded values, and simplifying complex conditionals.
Problem Description
The pandas/core/reshape module implements key reshaping functions (pivot, melt, and unstack) used in data manipulation workflows. A review of pivot.py and melt.py reveals a couple of areas where code quality could be improved:
Nested Conditionals:
- In melt.py, nested conditionals add complexity, making the code harder to read and maintain.
- Suggestion: Refactor these conditionals into smaller, more modular functions.
Hard-Coded Values:
- In pivot.py, hard-coded strings (e.g., "All" for margins) reduce flexibility.
- Suggestion: Replace hard-coded values with constants for maintainability.
Relevant File
- melt.py
- pivot.py
Proposed Solution
Refactor Nested Conditionals in melt.py
- Nested Conditional in
ensure_list_vars()
- Before:
def ensure_list_vars(arg_vars, variable: str, columns) -> list: if arg_vars is not None: if not is_list_like(arg_vars): return [arg_vars] elif isinstance(columns, MultiIndex) and not isinstance(arg_vars, list): raise ValueError( f"{variable} must be a list of tuples when columns are a MultiIndex" ) else: return list(arg_vars) else: return []
- After:
def ensure_list_vars(arg_vars, variable: str, columns) -> list: if arg_vars is None: return [] if not is_list_like(arg_vars): return [arg_vars] if isinstance(columns, MultiIndex) and not isinstance(arg_vars, list): raise ValueError( f"{variable} must be a list of tuples when columns are a MultiIndex" ) return list(arg_vars)
- Nested Conditional in
melt()
forid_vars
:- Before:
if id_vars or value_vars: if col_level is not None: level = frame.columns.get_level_values(col_level) else: level = frame.columns labels = id_vars + value_vars idx = level.get_indexer_for(labels) missing = idx == -1 if missing.any(): missing_labels = [ lab for lab, not_found in zip(labels, missing) if not_found ] raise KeyError( "The following id_vars or value_vars are not present in " f"the DataFrame: {missing_labels}" ) if value_vars_was_not_none: frame = frame.iloc[:, algos.unique(idx)] else: frame = frame.copy(deep=False) else: frame = frame.copy(deep=False)
- After:
def validate_and_get_level(frame, id_vars, value_vars, col_level): level = frame.columns.get_level_values(col_level) if col_level is not None else frame.columns labels = id_vars + value_vars idx = level.get_indexer_for(labels) missing = idx == -1 if missing.any(): missing_labels = [lab for lab, not_found in zip(labels, missing) if not_found] raise KeyError( "The following id_vars or value_vars are not present in " f"the DataFrame: {missing_labels}" ) return idx if id_vars or value_vars: idx = validate_and_get_level(frame, id_vars, value_vars, col_level) if value_vars_was_not_none: frame = frame.iloc[:, algos.unique(idx)] else: frame = frame.copy(deep=False)
- Nested Conditionals for Setting
var_name
inmelt()
:- Before:
After:if var_name is None: if isinstance(frame.columns, MultiIndex): if len(frame.columns.names) == len(set(frame.columns.names)): var_name = frame.columns.names else: var_name = [f"variable_{i}" for i in range(len(frame.columns.names))] else: var_name = [ frame.columns.name if frame.columns.name is not None else "variable" ] elif is_list_like(var_name): if isinstance(frame.columns, MultiIndex): if is_iterator(var_name): var_name = list(var_name) if len(var_name) > len(frame.columns): raise ValueError( f"{var_name=} has {len(var_name)} items, " f"but the dataframe columns only have {len(frame.columns)} levels." ) else: raise ValueError(f"{var_name=} must be a scalar.") else: var_name = [var_name]
def determine_var_name(frame, var_name): if var_name is None: return _default_var_name(frame) if is_list_like(var_name): _validate_list_var_name(var_name, frame) return list(var_name) return [var_name] def _default_var_name(frame): if isinstance(frame.columns, MultiIndex): if len(frame.columns.names) == len(set(frame.columns.names)): return frame.columns.names return [f"variable_{i}" for i in range(len(frame.columns.names))] return [frame.columns.name or "variable"] def _validate_list_var_name(var_name, frame): if isinstance(frame.columns, MultiIndex): if is_iterator(var_name): var_name = list(var_name) if len(var_name) > len(frame.columns): raise ValueError( f"{var_name=} has {len(var_name)} items, " f"but the dataframe columns only have {len(frame.columns)} levels." ) else: raise ValueError(f"{var_name=} must be a scalar.") var_name = determine_var_name(frame, var_name)
- Benefits:
- Improves readability:
Simplifies the main function, making the logic clearer and easier to follow. - Makes the logic easier to test and maintain:
Enables independent testing of each helper function, ensuring robust behavior. - Separation of concerns:
Each helper function is now responsible for a single, well-defined task, aligning with the principle of single responsibility.
- Improves readability:
Replace Hard-Coded Values in pivot.py
- Before:
# Hard-coded string for margins
margins_name: Hashable = "All"
- After:
# Define a constant for the hard-coded value
MARGIN_NAME = "All"
# Use the constant in the code
margins_name: Hashable = MARGIN_NAME:
- Benefits:
- Makes the code more readable and maintainable.
- Centralizes the value so it can be reused or modified easily.
Testing
Unit Testing Helper Functions:
Write focused tests for each new helper function to validate their behavior under expected, edge, and erroneous inputs. For example:
- Ensure validate_and_get_level() correctly identifies missing variables and raises KeyError.
- Test determine_var_name() with var_name=None, scalar inputs, and multi-level columns.
Regression Testing Parent Functions:
Run all pre-existing tests for the parent functions (e.g., melt()) to confirm they maintain their functionality after the refactor.
Edge Cases:
Include additional tests for edge scenarios, such as:
- Empty id_vars or value_vars.
- DataFrames with unusual column configurations like MultiIndex or missing names.
Labels
ENH
Code Quality
Compliance with Contributing Guide
- Focus: The issue is specific and addresses code quality improvements without scope creep.
- Clarity: Includes actionable suggestions and a clear implementation path.