-
Notifications
You must be signed in to change notification settings - Fork 366
Implement Series.cov #1620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Implement Series.cov #1620
Conversation
databricks/koalas/series.py
Outdated
| Parameters | ||
| ---------- | ||
| other : Series | ||
| min_periods : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe min_periods also can be an optional ?
- because It will be a zero when nothing is given for
min_periods
| k_isnan = np.isnan(kser.cov(kser_other, 4)) | ||
| p_isnan = np.isnan(pser.cov(pser_other, 4)) | ||
| self.assert_eq(k_isnan, p_isnan) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a test when each Series has a different index and an Exception case?
For example,
kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)and
self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85])) # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90]))) # series are not aligned| return self._internal.spark_frame.cov(self.name, other.name) | ||
| else: | ||
| # if not on the same anchor calculate covariance manually | ||
| return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(self.index) is performed four times in this code.
What do you think about we assign a proper variable and reuse it?
(ex. len_index = len(self.index) at the line above this variable is first used)
|
Could you add this to the docs also ?? It is placed at |
|
Otherwise, looks fine to me. Thanks, @lopez- |
| if len(self.index) != len(other.index): | ||
| raise ValueError("series are not aligned") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this from? Seems like pandas works even with a different length of Series.
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I missed it. Thanks, @ueshin .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm this is interesting. Seems like pandas performs an alignment between the series before computing the covariance. So, this:
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5And this:
>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5are equivalent... I believe this align is not supported in Koalas today. If this is a blocker I could open an issue and wait until somebody implements this. Another option I can think of is to go ahead and have a slightly different behavior for this edge case while we wait for the align implementation. Do you have any thoughts/preference on how to go about this @itholic @ueshin ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lopez- Could you file the issue for align?
Also, is it possible for you to implement it?
| kser = ks.Series([90, 91, 85]) | ||
| pser = kser.to_pandas() | ||
| kser_other = ks.Series([90, 91, 85]) | ||
| pser_other = kser_other.to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define pandas object first. to_pandas() invokes extra Spark jobs and it will take more time for tests.
pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)
databricks/koalas/series.py
Outdated
|
|
||
| def cov(self, other: "Series", min_periods: Optional[int] = None) -> float: | ||
| """ | ||
| Return the covariance between two series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we just copy the docstring from pandas' with a few modification of examples?
| raise ValueError("series are not aligned") | ||
|
|
||
| min_periods = 0 if min_periods is None else min_periods | ||
| if len(self.index) < min_periods or len(self.index) <= 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also compare len(self.index) with min_periods?
>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan|
|
||
| if same_anchor(self, other): | ||
| # if the have the same anchor use the more performant Spark native `cov` | ||
| return self._internal.spark_frame.cov(self.name, other.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._kdf._internal.resolved_copy.spark_frame.cov(
self._internal.data_spark_column_names[0],
other._internal.data_spark_column_names[0])?
FYI: self.name won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.
| # if not on the same anchor calculate covariance manually | ||
| return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should create a new DataFrame and use it, something like:
kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_period=min_period)I haven't checked the code, so please modify as it works.
Btw, we should do this at the beginning of this method to avoid extra checks for length or something.
|
Any updates here ? Just confirming :) |
|
Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week. |
This PR proposes Series.cov