Populate make_classification_df date functionality with random dates#851
Populate make_classification_df date functionality with random dates#851BhuvanashreeM wants to merge 5 commits intodask:mainfrom
Conversation
dask_ml/datasets.py
Outdated
| dd.from_array( | ||
| np.array([random_date(*dates)] * len(X_df)), | ||
| chunksize=chunks, | ||
| np.array([random_date(*dates) for i in range(len(X_df))]), |
There was a problem hiding this comment.
Looks like there's some test errors because random_dates doesn't have a random seed. Maybe this signature would help?
def random_dates(start, end, random_state=None):
rng = check_random_state(random_state)
...
There was a problem hiding this comment.
Yeah, that could be a probable reason, I'll work on it. And also, the test written for the function(make_classification_df) doesn't check for true randomness. I'm thinking of adding a new test/modifying the existing one for the same, will that suffice?
link to test_make_classification_df
|
Here's what I've done in the latest commit :
Another observation I made earlier today was that: the main repository when cloned and existing tests run, I observed the same number of errors that I previously got when running the tests after making my modification in the earlier commit. Could the maintainers of the repository look into the same? |
| chunks=100, | ||
| dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
| ) | ||
| check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) |
There was a problem hiding this comment.
Good, this checks the random state.
Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.
There was a problem hiding this comment.
Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's randint function is a deterministic pseudo random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that the function random_date is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. Open to any thoughts you might have here @stsievert
There was a problem hiding this comment.
The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.
I think X["date"].nunique() >= threshold would be a lot simpler.
There was a problem hiding this comment.
Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will n_samples/2 be good?
There was a problem hiding this comment.
Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.
| chunksize=chunks, | ||
| np.array( | ||
| [ | ||
| random_date(*dates, random_state + i) |
There was a problem hiding this comment.
What happens when random_state isn't an integer? Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).
There was a problem hiding this comment.
Okay, I will have to raise a ValueError exception there, on it.
There was a problem hiding this comment.
Why not use this code?
rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...There was a problem hiding this comment.
The code above will produce the same random number since the seed(rng) remains the same in subsequent calls.
There was a problem hiding this comment.
My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:
Details
[ins] In [193]: def random_dates(random_state):
...: return random_state.randint(100)
...:
[ins] In [194]: rng = np.random.RandomState(42)
[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]There was a problem hiding this comment.
I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")
There was a problem hiding this comment.
That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.
If random_date were a private function, I wouldn't really care
There was a problem hiding this comment.
random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's random_state. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.
There was a problem hiding this comment.
Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html
It takes those values and produces the correct random seed generator.
There was a problem hiding this comment.
@stsievert the accepted types of random_state in Scikit and Numpy appear to be the same.
Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state
dask_ml/datasets.py
Outdated
|
|
||
| def random_date(start, end): | ||
| def random_date(start, end, random_state=None): | ||
| rng_random_date = dask_ml.utils.check_random_state(random_state) |
There was a problem hiding this comment.
Nit:
| rng_random_date = dask_ml.utils.check_random_state(random_state) | |
| rng_random_date = sklearn.utils.check_random_state(random_state) |
That way the .compute() can be avoided (especially relevant on repeated calls.).
dask_ml/datasets.py
Outdated
| or not isinstance(random_state, np.random.RandomState) | ||
| or not isinstance(random_state, int) | ||
| ): | ||
| random_state = None |
There was a problem hiding this comment.
Why is the block necessary? None is already the default value for random_state.
There was a problem hiding this comment.
This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"
This fixes the date functionality of make_classification_df as mentioned in #845
Overview of the problem :
dask_ml.datasets.make_classification_dfwith a date range provided, fills the date column with just one unique datechunksizeto be equal tochunks(passed in through make_classification_df), populates the date column with NaN values.Findings :
The helper function
random_dateworks perfectly fine, generating a random date given the start and endthis line populates the list with the same date value, rather than calling the
random_datefunctionlen(X_df)time, which is the required fix.Run Existing Tests
Code Formatting (black, flake8, isort)
Custom Tests: Added
Seeking the maintainers and @ScottMGustafson to review/provide feedback on the proposed changes.