Skip to content

Conversation

@TwsThomas
Copy link
Contributor

@TwsThomas TwsThomas commented May 14, 2020

Change the download source from official link to openML.
For employees_salaries

@GaelVaroquaux
Copy link
Member

Darn, still failing!

Is it the download that is failing? If so, we should change where we download from. I had in mind that we had uploaded this data on openML, but I am not sure.

@TwsThomas
Copy link
Contributor Author

It seems to me it is the download indeed.
I will work on that with openML.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented May 14, 2020 via email

@TwsThomas
Copy link
Contributor Author

Did we already upload it there? I have in mind that we had started
uploading the datasets.

Most of them are already in openML (cf the PR skrub-data/datasets#4)

For the download, you can use
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html

I will do it with that ok.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented May 14, 2020 via email

Copy link
Member

@GaelVaroquaux GaelVaroquaux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

I have in mind that employees salary is used in other examples. No?

"""

return fetch_dataset(EMPLOYEE_SALARIES_CONFIG, show_progress=False)
from sklearn.datasets import fetch_openml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that I would prefer if this import was moved to the top of the module, with the other imports.

# Then we load it:
import pandas as pd
df = pd.read_csv(employee_salaries['path']).astype(str)
df = data = employee_salaries['data']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
df = data = employee_salaries['data']
df = employee_salaries['data']

# Test if load was unsuccesful
if '"code" : "authentication_required"' in str(df.iloc[0]):
print('Error while loading the data') #raise IOError
raise IOError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove the if clause: we shouldn't need it anymore, right?

@GaelVaroquaux
Copy link
Member

CI is still failing because other examples need to be updated. But this is looking good!

Can you also add an entry in CHANGES.rst that documents the changes of API in the fetcher?

@codecov
Copy link

codecov bot commented May 14, 2020

Codecov Report

Merging #119 into master will decrease coverage by 0.11%.
The diff coverage is 25.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #119      +/-   ##
==========================================
- Coverage   64.29%   64.17%   -0.12%     
==========================================
  Files          11       11              
  Lines         801      804       +3     
  Branches      153      153              
==========================================
+ Hits          515      516       +1     
- Misses        246      248       +2     
  Partials       40       40              
Impacted Files Coverage Δ
dirty_cat/datasets/fetching.py 81.57% <25.00%> (-1.31%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3695d51...e6ea1ed. Read the comment docs.

@@ -1,3 +1,7 @@
Release 0.0.7
=============
* **datasets.fetch_employee_salaries**: change the origin of download for employee_salaries.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also mention that this changed the function signature: the function now return a Bunch with a dataframe under the field "data", and not the path to the csv file. Also, the field "description" has been renamed to "DESCR".

Mentioning this is important, because these changes can break user code, and they need to quickly identify them from the changelog.

Co-authored-by: Gael Varoquaux <gael.varoquaux@normalesup.org>
@GaelVaroquaux
Copy link
Member

Hurray, merging!

@GaelVaroquaux GaelVaroquaux merged commit 57b0ab3 into skrub-data:master May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants