[BUG] notebook fails when calculating the correlation matrix. #657

@joaquinborovich

Describe the bug
The "Looking for Correlations" code in 02_end_to_end_machine_learning_project.ipynb does not work when the DataFrame contains string values. When I run corr_matrix = housing.corr(), I get a ValueError for the ocean_proximity column.

Running the 02_end_to_end_machine_learning_project.ipynb notebook fails when calculating the correlation matrix. The housing.corr() method is called on the DataFrame before the non-numeric ocean_proximity column has been preprocessed, which raises a ValueError because the method cannot convert the string values (e.g., 'INLAND') to floats.

To Reproduce

corr_matrix = housing.corr()
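
For anyone without the notebook at hand, here is a minimal sketch that should trigger the same failure on a toy DataFrame. The three column names come from the housing dataset, but the values are made up for illustration, and the sketch assumes pandas 2.0 or later, where corr() no longer drops non-numeric columns by default.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame; the values are invented.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "median_house_value": [452600.0, 358500.0, 341300.0, 226700.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

raised = False
try:
    # corr() tries to convert every column to float, including the strings.
    housing.corr()
except (ValueError, TypeError) as exc:
    # ValueError is what this report shows; TypeError is caught only as a
    # precaution in case another pandas version reports it differently.
    raised = True
    print(f"{type(exc).__name__}: {exc}")
```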

And if you got an exception, please copy the full stacktrace here:

ValueError                                Traceback (most recent call last)
Cell In[97], line 2
      1 aux_set = housing.drop("ocean_proximity", axis=1)
----> 2 corr_matrix = housing.corr()

File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py:11056, in DataFrame.corr(self, method, min_periods, numeric_only)
  11054 cols = data.columns
  11055 idx = cols.copy()
> 11056 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
  11058 if method == "pearson":
  11059     correl = libalgos.nancorr(mat, minp=min_periods)

File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py:1998, in DataFrame.to_numpy(self, dtype, copy, na_value)
   1996 if dtype is not None:
   1997     dtype = np.dtype(dtype)
-> 1998 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
   1999 if result.dtype is not dtype:
   2000     result = np.asarray(result, dtype=dtype)

File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\internals\managers.py:1694, in BlockManager.as_array(self, dtype, copy, na_value)
   1692         arr.flags.writeable = False
   1693 else:
-> 1694     arr = self._interleave(dtype=dtype, na_value=na_value)
   1695     # The underlying data was copied within _interleave, so no need
...
-> 1753     result[rl.indexer] = arr
   1754     itemmask[rl.indexer] = 1
   1756 if not itemmask.all():

ValueError: could not convert string to float: 'INLAND'

Expected behavior
I expected ocean_proximity to be left out of the correlation unless it is explicitly converted. As a workaround, I dropped the column before computing the correlations:

aux_set = housing.drop("ocean_proximity", axis=1)
corr_matrix = aux_set.corr()

THIS WORKS!
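
Two other ways to get the same result without naming the column by hand are sketched below: passing numeric_only=True to corr(), or selecting the numeric columns first with select_dtypes(). Both are existing pandas calls. The toy DataFrame is made up for illustration and only stands in for the real housing data.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame; the values are invented.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "median_house_value": [452600.0, 358500.0, 341300.0, 226700.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Option 1: let corr() skip the non-numeric columns.
corr_matrix = housing.corr(numeric_only=True)

# Option 2: keep only the numeric columns, then correlate.
corr_matrix_alt = housing.select_dtypes(include="number").corr()

print(corr_matrix["median_house_value"].sort_values(ascending=False))
```

Either option keeps working if more text columns are added later, whereas drop("ocean_proximity", axis=1) has to be updated for each new one.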

Using enums could be a solution as well.
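
If "enums" here means turning the categories into numbers, one possible reading is one-hot encoding with pd.get_dummies(), so that each category becomes its own 0/1 column and can take part in the correlation. This is only a sketch of that interpretation on made-up data, not what the notebook does.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame; the values are invented.
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8],
    "median_house_value": [452600.0, 358500.0, 341300.0, 226700.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Replace ocean_proximity with one float 0/1 column per category.
encoded = pd.get_dummies(housing, columns=["ocean_proximity"], dtype=float)

# Every column is numeric now, so corr() needs no extra arguments.
corr_matrix = encoded.corr()

print(corr_matrix["median_house_value"].sort_values(ascending=False))
```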

Versions (please complete the following information):

  • OS: Windows 11
  • Python: 3.11
  • Pandas: 2.3.0
