[API 2]: Model X knockoff #367

Conversation
Originally posted by @AngelReyero in #361 (comment)
Originally posted by @jpaillard in #361 (comment)
Force-pushed 6038de1 to edeab86
jpaillard left a comment
Looks good overall. I mostly have comments on naming and organisation.
I would suggest setting statistical_test='lcd' by default, renaming lasso_statistic_with_sampling to ModelXKnockoff.lasso_coefficient_difference, and making it an internal method of the class ModelXKnockoff. IMO the method should be attached to the class and use the attribute self.estimator (see below).
In the same vein, the class should have an estimator parameter in order to expose the Lasso model instead of keeping it under the hood, especially since it uses a particularly high default, max_iter=200000.
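A minimal sketch of what the suggested API could look like. The class and method names below mirror the suggestion but are assumptions for illustration, not the actual hide-and-seek code:

```python
import numpy as np
from sklearn.linear_model import LassoCV


class ModelXKnockoff:
    # Hypothetical sketch: the Lasso is exposed via an `estimator`
    # parameter instead of being hard-coded inside the class.
    def __init__(self, estimator=None, statistical_test="lcd"):
        # the default mirrors the hard-coded estimator mentioned above
        if estimator is None:
            estimator = LassoCV(max_iter=200000)
        self.estimator = estimator
        self.statistical_test = statistical_test

    def lasso_coefficient_difference(self, X, X_tilde, y):
        # fit on the concatenated [X, X_tilde] design and compare the
        # absolute coefficient of each feature with that of its knockoff
        n_features = X.shape[1]
        self.estimator.fit(np.hstack([X, X_tilde]), y)
        coef = np.abs(np.ravel(self.estimator.coef_))
        return coef[:n_features] - coef[n_features:]
```

With this shape, users can pass any fitted-API regressor as `estimator`, and the statistic method reads it from `self.estimator` as suggested.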
n_bootstraps = 25
# number of jobs for repetition of the method
n_jobs = 2
n_jobs = 1
- I think it makes sense to illustrate in the example where users should use parallelization, and to do so with n_jobs > 1.
- If we decide not to show it, I would remove n_jobs and rely on the default.
n_jobs=2 (or even 4) is a better option IMHO.
I got some issues when running with n_jobs=2.
What kind of issues?
When running the example, I got errors in the last three commits. This was the solution to my problem.
Creating the class keeps track of some state, which uses too much memory and triggers an error.
May be related to #356.
We still have the problem of nested parallel loops in the example generation. I would still argue that showcasing parallelization in the example is valuable.
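A sketch of the pattern under discussion (function and variable names here are illustrative, not the actual example code): give the parallelism to the outer bootstrap loop, and keep any inner estimator at n_jobs=1 so the pools do not nest.

```python
from joblib import Parallel, delayed


def run_bootstrap(seed):
    # placeholder for one knockoff draw + variable selection;
    # a real run would sample knockoffs using this seed
    return seed % 3


# parallelize only the outer loop over the 25 bootstraps;
# inner estimators should stay sequential to avoid nested pools
results = Parallel(n_jobs=2)(delayed(run_bootstrap)(s) for s in range(25))
```

This is the standard joblib idiom; nesting a second `Parallel(n_jobs>1)` inside `run_bootstrap` is what causes the oversubscription problems mentioned above.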
def preconfigure_LassoCV(estimator, X, X_tilde, y, n_alphas=20):
    """
    Configure the estimator for Model-X knockoffs.
I think the docstring should focus more on L43-44: the regularization path is defined in a data-dependent way.
The paragraph in Notes should be the central part of the docstring. Is there a reference?
@bthirion Do you have some reference?
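For context, a common data-dependent construction of a Lasso regularization path looks like the following. This is my own sketch of the idea, not necessarily what `preconfigure_LassoCV` implements:

```python
import numpy as np


def data_dependent_alphas(X, y, n_alphas=20, eps=1e-2):
    # alpha_max is the smallest penalty for which the Lasso solution is
    # entirely zero; the grid spans [eps * alpha_max, alpha_max] on a
    # logarithmic scale, so every alpha is scaled to the data at hand
    n_samples = X.shape[0]
    alpha_max = np.max(np.abs(X.T @ y)) / n_samples
    return np.logspace(np.log10(eps * alpha_max), np.log10(alpha_max), n_alphas)
```

The point the docstring could stress is exactly this: the grid endpoints depend on `X` and `y`, so the path must be rebuilt for the concatenated knockoff design rather than reused.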
return estimator
def lasso_statistic_with_sampling(
Similar to D0CRT (regression_test, logistic_test), IMO lasso_statistic should be a method of the knockoff class. As its signature suggests, it is KO-specific. It would be more intuitive to place it in the same file, under the class.
From what I understand, the actual implementation of the test is the original one, but there are proposals for modifying it.
I want to leave users the possibility of modifying it.
Similar to D0CRT, we can consider extending to other test statistics, but they should still be methods of the class.
- These methods will not be used anywhere else
- Would be simplified if the class state were used
I wanted to separate the configuration of the Lasso from the knockoff, because I don't like adding the preconfigure parameter.
I moved everything into the knockoff class.
bthirion left a comment
Only minor things remaining, thanks!
n_bootstraps = 25
# number of jobs for repetition of the method
n_jobs = 2
n_jobs = 1
n_jobs=2 (or even 4) is a better option IMHO.
The lasso is present at the user API level. I use the default value from the code. I can modify it to remove all these default parameters.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main     #367      +/-   ##
==========================================
- Coverage   99.31%   99.19%   -0.12%
==========================================
  Files          24       24
  Lines        1309     1364      +55
==========================================
+ Hits         1300     1353      +53
- Misses          9       11       +2

View full report in Codecov by Sentry.
Please check the logic. This fixes the error reported in #520.
# Number of variables
n_features = 150
# Correlation parameter
n = 300
It would be better if you could add the description of the parameter, as before.
done
I prefer n_samples, n_features, which are self-explaining.
examples/plot_knockoffs_wisconsin.py (outdated)

    max_iter=1000,
),
random_state=0,
n_repeats=1,
For example, it's sometimes better to declare some parameters even when the value is the same as the default.
Agreed, but in that case, we just want to show the vanilla knockoff, not the aggregation.
So the user can simply ignore the existence of this parameter.
return test_statistic

@staticmethod
def knockoff_threshold(test_score, fdr=0.1):
This is a provide method.
Why don't you add the other methods (_empirical_knockoff_pval and _empirical_knockoff_eval) into the class?
I guess you mean private? If so, I am not sure that it should be a private method. As shown in the example, having access to the knockoff threshold can be quite useful for visualization / understanding the data.
For the second comment I agree, they can all be class methods
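For reference, the knockoff+ threshold of Candès et al. (2018) that a method like this would compute. This is my own sketch and may differ in detail from the code under review:

```python
import numpy as np


def knockoff_threshold(test_score, fdr=0.1):
    # smallest t among the nonzero |W_j| such that
    # (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= fdr
    thresholds = np.sort(np.abs(test_score[test_score != 0]))
    for t in thresholds:
        ratio = (1 + np.sum(test_score <= -t)) / max(1, np.sum(test_score >= t))
        if ratio <= fdr:
            return t
    # no threshold achieves the target FDR: select nothing
    return np.inf
```

Exposing this as a public (or static) helper is what makes the visualization use case above possible: one can plot the W statistics against the returned threshold without running the full selection.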
Yes, I meant private.
What is the point of using a staticmethod here? It could be a standard class method, since it needs to access test_scores anyhow.
Same question for the next 2 static methods.
Note that it may be simply a poor understanding on my side.
k_star = 1
# The for loop over all e-values could be optimized by considering a descending list
# and stopping when the condition is not satisfied anymore.
for k, e_k in enumerate(evals_sorted, start=1):
Add a comment, because it is quite unusual in Python for the enumeration to start at 1.
Moreover, it would be better if you added a link to the equation in the paper.
done
for i in range(n_features - 1, -1, -1):
    if evals_sorted[i] >= n_features / (fdr * (i + 1)):
        selected_index = i
        break
This way of finding the maximum is more optimized than your new implementation.
I know, but it's also broken. I replaced it with a non-optimized version that is easy to read and clearly reflects Equation 5 of Wang and Ramdas (2022).
This is not a critical increase in computation; it is simply a for loop with an if condition at each step. I still added a comment mentioning that it could be optimized.
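A readable sketch of the selection rule as I understand it from Equation 5 of Wang and Ramdas (2022); the function name and wrapper are mine, not the PR's code. With e-values sorted in decreasing order, keep the largest rank k whose e-value satisfies k * e_(k) >= n_features / fdr:

```python
import numpy as np


def ebh_selection(evals, fdr=0.1):
    # e-BH procedure: sort e-values in decreasing order and select the
    # largest k such that k * e_(k) >= n_features / fdr (Eq. 5)
    evals = np.asarray(evals, dtype=float)
    n_features = len(evals)
    order = np.argsort(evals)[::-1]     # indices, largest e-value first
    evals_sorted = evals[order]
    k_star = 0
    # enumeration starts at 1 because k in Eq. 5 is a 1-based rank
    for k, e_k in enumerate(evals_sorted, start=1):
        if k * e_k >= n_features / fdr:
            k_star = k
    # indices of the k_star features with the largest e-values
    return order[:k_star]
```

Scanning all k and keeping the last passing one is exactly the non-optimized version argued for above; the descending early-stopping trick would only change the constant factor.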
for k, e_k in enumerate(evals_sorted, start=1):
    if k * e_k >= n_features / fdr:
        k_star = k
if k_star <= n_features:
Suggested change:
if k_star <= n_features:
if k_star - 1 <= n_features:
See issue #520.
By the way, it would be better to do a PR with only this modification.
I think it should be k_star otherwise you can end up with k_star = n_features + 1 which is problematic.
Actually, I think this test is not necessary; the condition is always fulfilled because the possible range of values k_star can take in the for loop is always met.
bthirion left a comment
This looks great. I only have relatively minor comments left.
Co-authored-by: bthirion <[email protected]>
To transform them to standard class methods, we would need to move the for loops from the
OK, I see. I think we need to brainstorm a little bit on this, i.e. we're likely going to break the API in the future. But let's merge the PR as is.
This PR is based on the PR: