Merged
Changes from 2 commits
13 changes: 9 additions & 4 deletions dirty_cat/datasets/fetching.py
@@ -346,18 +346,23 @@ def fetch_employee_salaries():
dict
a dictionary containing:

- a short description of the dataset (under the ``description``
- a short description of the dataset (under the ``DESCR``
key)
- an absolute path leading to the csv file where the data is stored
locally (under the ``path`` key)
- the tabular data (under the ``data`` key)
- the target (under the ``target`` key)

References
----------
https://catalog.data.gov/dataset/employee-salaries-2016

"""

return fetch_dataset(EMPLOYEE_SALARIES_CONFIG, show_progress=False)
from sklearn.datasets import fetch_openml
Review comment (Member):
I think that I would prefer if this import was moved to the top of the module, with the other imports.

data = fetch_openml(data_id=42125, as_frame=True)
return data

# link dead.
# return fetch_dataset(EMPLOYEE_SALARIES_CONFIG, show_progress=False)
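
A minimal sketch of how the function might look once the review comment about the import is applied, assuming scikit-learn is installed. The constant name `EMPLOYEE_SALARIES_ID` is hypothetical and does not appear in the PR; the dataset id and the `as_frame=True` argument are taken from the diff above.

```python
# Sketch only: the import sits at module level, with the other imports,
# as the review comment asks.
from sklearn.datasets import fetch_openml

# OpenML dataset id used in the diff above.
# The constant name is hypothetical, introduced here for illustration.
EMPLOYEE_SALARIES_ID = 42125


def fetch_employee_salaries():
    """Fetch the employee_salaries dataset from OpenML.

    Returns a dictionary-like object; per the docstring in the diff,
    it exposes the ``DESCR``, ``data`` and ``target`` keys.
    """
    # as_frame=True asks scikit-learn to return pandas objects.
    return fetch_openml(data_id=EMPLOYEE_SALARIES_ID, as_frame=True)
```

Calling `fetch_employee_salaries()` downloads the data from OpenML on first use, so it needs network access.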


def fetch_road_safety():
7 changes: 4 additions & 3 deletions examples/02_fit_predict_plot_employee_salaries.py
@@ -21,16 +21,17 @@
# We first download the dataset:
from dirty_cat.datasets import fetch_employee_salaries
employee_salaries = fetch_employee_salaries()
print(employee_salaries['description'])
print(employee_salaries['DESCR'])


################################################################################
# Then we load it:
import pandas as pd
df = pd.read_csv(employee_salaries['path']).astype(str)
df = data = employee_salaries['data']
Review comment (Member):
Suggested change
df = data = employee_salaries['data']
df = employee_salaries['data']


# Test if load was unsuccessful
if '"code" : "authentication_required"' in str(df.iloc[0]):
print('Error while loading the data') #raise IOError
raise IOError
Review comment (Member):
Should we remove the if clause: we shouldn't need it anymore, right?
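
A rough sketch of what the loading step could reduce to if both review suggestions on this file are taken, assuming pandas and scikit-learn are installed. To avoid a download, it uses a stand-in `Bunch` in place of the real return value of `fetch_employee_salaries()`; the description text and the salary values are made up, and only the `Current Annual Salary` column name comes from the example script.

```python
import pandas as pd
from sklearn.utils import Bunch

# Stand-in for the object returned by fetch_employee_salaries().
# The description and the values below are invented for illustration.
employee_salaries = Bunch(
    DESCR="Toy stand-in for the employee salaries dataset.",
    data=pd.DataFrame({"Current Annual Salary": ["$1000.00", "$2000.00"]}),
)

# The description lives under the 'DESCR' key.
print(employee_salaries["DESCR"])

# The table is already a DataFrame, so there is no pd.read_csv call
# and no response body to inspect for an authentication error.
df = employee_salaries["data"]
```

Since the data no longer comes from parsing a downloaded CSV file, the authentication check has nothing left to guard against, which is presumably what the reviewer is pointing at.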

################################################################################
# Now, let's carry out some basic preprocessing:
df['Current Annual Salary'] = df['Current Annual Salary'].str.strip('$').astype(