Populate make_classification_df date functionality with random dates by BhuvanashreeM · Pull Request #851 · dask/dask-ml

BhuvanashreeM · 2021-08-29T13:43:48Z

This fixes the date functionality of make_classification_df as mentioned in #845

Overview of the problem:

- Running dask_ml.datasets.make_classification_df with a date range fills the date column with just one unique date.
- In dask.dataframe.from_array, specifying chunksize equal to chunks (passed in through make_classification_df) populates the date column with NaN values.

Findings:

- The helper function random_date works as intended: it generates a random date between the given start and end.
- this line populates the list with the same date value repeated, rather than calling random_date len(X_df) times. Calling it once per row is the required fix.

- Run Existing Tests
- Code Formatting (black, flake8, isort)
- Custom Tests: Added
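The difference between the two patterns can be sketched in plain Python. This is a minimal illustration, not the dask-ml code: the `random_date` below is a stand-in that takes an explicit `random.Random` instance, which is an assumption made only so the example is deterministic.

```python
from datetime import date, timedelta
import random


def random_date(start, end, rng):
    """Stand-in helper: pick a random date between start and end (inclusive)."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


rng = random.Random(0)
start, end = date(2014, 1, 1), date(2015, 1, 1)
n = 1000

# Buggy pattern: random_date runs once and its result is repeated n times.
repeated = [random_date(start, end, rng)] * n

# Fixed pattern: random_date runs once per row.
sampled = [random_date(start, end, rng) for _ in range(n)]

print(len(set(repeated)))     # 1
print(len(set(sampled)) > 1)  # True
```

List multiplication copies the single element, so the helper is never called again; the comprehension re-evaluates the call on every iteration.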

Seeking the maintainers and @ScottMGustafson to review/provide feedback on the proposed changes.

stsievert · 2021-08-29T23:10:57Z

dask_ml/datasets.py

                dd.from_array(
-                    np.array([random_date(*dates)] * len(X_df)),
-                    chunksize=chunks,
+                    np.array([random_date(*dates) for i in range(len(X_df))]),


Looks like there are some test errors because random_dates doesn't have a random seed. Maybe this signature would help?

    def random_dates(start, end, random_state=None):
        rng = check_random_state(random_state)
        ...
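A fuller version of that signature might look like the sketch below. It is an assumption about how the body could be filled in, not the code that was merged. `as_random_state` is a hypothetical stand-in written here so the example only needs NumPy; in practice `sklearn.utils.check_random_state` covers the same None / int / RandomState cases.

```python
from datetime import date, timedelta
import numbers

import numpy as np


def as_random_state(seed):
    """Hypothetical stand-in for the None / int / RandomState normalization."""
    if seed is None:
        return np.random.RandomState()
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState")


def random_date(start, end, random_state=None):
    """Return a random date between start and end (inclusive)."""
    rng = as_random_state(random_state)
    span_days = (end - start).days
    # randint's upper bound is exclusive, so add 1 to make `end` reachable.
    return start + timedelta(days=int(rng.randint(0, span_days + 1)))


start, end = date(2014, 1, 1), date(2015, 1, 1)
a = random_date(start, end, random_state=123)
b = random_date(start, end, random_state=123)
print(a == b)  # True: the same integer seed gives the same date
```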

Yes, that is a probable cause. I'll work on it. Also, the test written for make_classification_df doesn't check for true randomness. I'm thinking of adding a new test or modifying the existing one to cover that. Will that suffice?
link to test_make_classification_df

BhuvanashreeM · 2021-08-31T11:25:20Z

Here's what I've done in the latest commit :

- Included a seed (random_state) as an additional argument to the function random_date.
- Added a simple test in test_datasets.py that checks randomness given the seed. I could add more complex tests that check for true randomness in the date column. Is that required? @stsievert

Another observation from earlier today: when I clone the main repository and run the existing tests, I get the same number of errors that I got when running the tests after my modification in the earlier commit. Could the maintainers look into this?

stsievert · 2021-08-31T12:59:42Z

tests/test_datasets.py

+        chunks=100,
+        dates=(date(2014, 1, 1), date(2015, 1, 1)),
+    )
+    check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute())


Good, this checks the random state.

Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.

Yes, the code I've written checks for repeatability, on account of the seed. Since NumPy's randint function is a deterministic pseudo-random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that random_date is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. Open to any thoughts you might have here @stsievert

The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.

I think X["date"].nunique() >= threshold would be a lot simpler.

Okay, that sounds good. Now I have to figure out what a good threshold would be. Will n_samples/2 be good?

Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.
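The two checks discussed in this thread (repeatability for a fixed seed, and at least two distinct dates) could be combined along these lines. This is a sketch only: `make_date_column` is a hypothetical NumPy-only stand-in for the date column that make_classification_df builds, used so the example runs without dask-ml installed.

```python
from datetime import date, timedelta

import numpy as np


def make_date_column(n_samples, dates, random_state=None):
    """Hypothetical stand-in for the date column of make_classification_df."""
    start, end = dates
    rng = np.random.RandomState(random_state)
    offsets = rng.randint(0, (end - start).days + 1, size=n_samples)
    return [start + timedelta(days=int(d)) for d in offsets]


def test_date_column_is_varied_and_repeatable():
    dates = (date(2014, 1, 1), date(2015, 1, 1))
    first = make_date_column(100, dates, random_state=123)
    second = make_date_column(100, dates, random_state=123)

    # Same seed, same column: the repeatability check.
    assert first == second
    # More than one distinct date: the regression check for #845.
    assert len(set(first)) >= 2


test_date_column_is_varied_and_repeatable()
```

With a dask DataFrame the second assertion would presumably be written with nunique() on the date column, as suggested above.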

stsievert · 2021-08-31T13:02:51Z

dask_ml/datasets.py

-                    chunksize=chunks,
+                    np.array(
+                        [
+                            random_date(*dates, random_state + i)


What happens when random_state isn't an integer? Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).

Okay, I will have to raise a ValueError there. On it.

Why not use this code?

    rng = check_random_state(random_state)
    dates = [random_date(*dates, rng) for i in range(len(X_df))]
    ...

The code above will produce the same random number since the seed(rng) remains the same in subsequent calls.

My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:

    In [193]: def random_dates(random_state):
         ...:     return random_state.randint(100)

    In [194]: rng = np.random.RandomState(42)

    In [196]: [random_dates(rng) for _ in range(20)]
    Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]
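The distinction at issue here, re-seeding with the same integer versus sharing one generator, can be shown side by side. A minimal sketch, with `draw` as a hypothetical helper that only handles the int and RandomState cases:

```python
import numpy as np


def draw(random_state):
    # Simplified normalization for illustration: reuse an existing generator,
    # otherwise build a fresh one from the given seed.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)
    return int(rng.randint(100))


# Re-seeding with the same integer on every call repeats the first draw.
from_int_seed = [draw(42) for _ in range(5)]

# Sharing one generator advances its internal state between calls.
shared = np.random.RandomState(42)
from_shared_rng = [draw(shared) for _ in range(5)]

print(from_int_seed)
print(from_shared_rng)
```

Passing the same RandomState instance through the loop therefore yields varying values while staying reproducible from a single seed.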

I could check whether random_state is one of the accepted values, like this, and proceed accordingly:

    if (
        random_state is not None
        or not isinstance(random_state, np.random.RandomState)
        or not isinstance(random_state, int)
    ):
        print("random_state is not to be accepted")
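A side note on the condition itself: because the three clauses are joined with `or`, at least one of them holds for every possible value, including the accepted ones. The sketch below evaluates the condition as written next to one possible corrected form; both function names are hypothetical and exist only for this comparison.

```python
import numpy as np


def is_rejected(random_state):
    # The condition as written in the comment above.
    return (
        random_state is not None
        or not isinstance(random_state, np.random.RandomState)
        or not isinstance(random_state, int)
    )


def is_unsupported(random_state):
    # Rejects only values that match none of the accepted types.
    return not (
        random_state is None
        or isinstance(random_state, (int, np.random.RandomState))
    )


candidates = [None, 123, np.random.RandomState(0), "seed"]
print([is_rejected(c) for c in candidates])     # [True, True, True, True]
print([is_unsupported(c) for c in candidates])  # [False, False, False, True]
```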

That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.

If random_date were a private function, I wouldn't really care.

random_state : int, RandomState instance or None, optional (default=None). These are the values accepted by Scikit-learn's random_state. I think I can check whether random_state is none of the accepted values (Scikit-learn and NumPy) and, if so, set it to the default None.

Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html

It takes those values and produces the correct random seed generator.

@stsievert the accepted types of random_state in Scikit-learn and NumPy appear to be the same.
Scikit-learn's version also accepts the values NumPy accepts. See https://scikit-learn.org/dev/glossary.html#term-random_state

stsievert · 2021-09-03T14:29:04Z

dask_ml/datasets.py


-def random_date(start, end):
+def random_date(start, end, random_state=None):
+    rng_random_date = dask_ml.utils.check_random_state(random_state)


Nit:

Suggested change

-    rng_random_date = dask_ml.utils.check_random_state(random_state)
+    rng_random_date = sklearn.utils.check_random_state(random_state)

That way the .compute() can be avoided (especially relevant on repeated calls).

stsievert · 2021-09-03T14:29:13Z

dask_ml/datasets.py

+        or not isinstance(random_state, np.random.RandomState)
+        or not isinstance(random_state, int)
+    ):
+        random_state = None


Why is the block necessary? None is already the default value for random_state.

This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"

fix populates make_classification_df with random dates

4b55f9a

BhuvanashreeM changed the title ~~Populate make_classification_df date functionality with random dates #845~~ Populate make_classification_df date functionality with random dates Aug 29, 2021

stsievert reviewed Aug 29, 2021


added-seed-to-random_date-and-modified-test_datasets

424fca4

stsievert reviewed Aug 31, 2021


BhuvanashreeM added 2 commits September 3, 2021 08:49

check-for-unique-values

b740b27

checks-for-random_state-type

24461fb

stsievert reviewed Sep 3, 2021


removed-redundant-compute-calls-in-random_date

3a7c9d1
