@lionelkusch
Collaborator

@lionelkusch lionelkusch commented Aug 29, 2025

This PR is based on PRs #366 and #361.

@lionelkusch lionelkusch changed the title from "Reformat Model X knockoff with version 2 of API" to "API 2: Model X knockoff" Aug 29, 2025
@lionelkusch lionelkusch added the "API 2" label (Refactoring following the second version of API) Sep 9, 2025
@lionelkusch
Collaborator Author

Sure, but what I mean is that the Knockoffs and the dCRT, for instance, are different in nature, since one targets FDR control and the other produces p-values. Therefore, I think that the default selection for each should be the one it is made for, or at least include a warning or something to indicate the default control.

Originally posted by @AngelReyero in #361 (comment)

To follow up on @AngelReyero, since knockoffs have quite a different selection procedure (for computing the threshold, for instance), should we consider overriding the selection_fdr method of BaseVariableImportance in #367?

Also, should an Error be raised if the user passes a p-value threshold to the .selection function?

Originally posted by @jpaillard in #361 (comment)

@lionelkusch lionelkusch marked this pull request as ready for review October 20, 2025 13:30
Collaborator

@jpaillard jpaillard left a comment

Looks good overall. I mostly have comments on naming and organisation.
I would suggest setting statistical_test='lcd' by default, renaming lasso_statistic_with_sampling --> ModelXKnockoff.lasso_coefficient_difference, and making it an internal method of the class ModelXKnockoff. IMO the method should be attached to the class and use the attribute self.estimator (see below).

In the same vein, the class should have an estimator parameter in order to expose the Lasso model instead of keeping it under the hood, especially since it uses a particularly high default max_iter=200000.
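
The organisation suggested in this review can be sketched as follows. This is only an illustration under stated assumptions: the class name KnockoffSketch, the toy LeastSquares estimator and the method body are hypothetical and are not hidimstat's implementation. The statistic shown is the usual lasso coefficient difference, W_j = |beta_j| - |beta_{j+p}|, computed from a fit on the augmented design [X, X_tilde], and the estimator is any object exposing fit and coef_.

```python
import numpy as np


class LeastSquares:
    """Toy stand-in for a Lasso-like estimator (hypothetical, for illustration)."""

    def fit(self, X, y):
        # Ordinary least squares; a real use case would pass a Lasso model here.
        self.coef_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self


class KnockoffSketch:
    """Hypothetical class that exposes the estimator as a parameter."""

    def __init__(self, estimator):
        self.estimator = estimator

    def lasso_coefficient_difference(self, X, X_tilde, y):
        # Fit the exposed estimator on the augmented design [X, X_tilde].
        n_features = X.shape[1]
        X_aug = np.column_stack([X, X_tilde])
        coef = np.ravel(self.estimator.fit(X_aug, y).coef_)
        # W_j = |beta_j| - |beta_{j + p}|: large positive values favour feature j.
        return np.abs(coef[:n_features]) - np.abs(coef[n_features:])


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
# Independent Gaussian noise stands in for knockoffs of independent features.
X_tilde = rng.standard_normal((200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(200)
W = KnockoffSketch(LeastSquares()).lasso_coefficient_difference(X, X_tilde, y)
```

Because the method reads self.estimator, a user who wants a different max_iter or a different model only has to pass another estimator to the constructor.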

  n_bootstraps = 25
  # number of jobs for repetition of the method
- n_jobs = 2
+ n_jobs = 1
Collaborator

  • I think it makes sense to illustrate in the example where users should use parallelization and to do so using n_jobs>1
  • If we decide not to show it I would remove n_jobs and rely on the default

Collaborator

n_jobs=2 (or even 4) is a better option IMHO.

Collaborator Author

I ran into some issues when running with n_jobs=2.

Collaborator

What kind of issues?

Collaborator Author

When running the example, I got some errors with the last three commits. This change was the solution to my problem.

Collaborator Author

Creating the class keeps track of some state, which uses too much memory and raises an error.

Collaborator

May be related to #356.
We still have the problem of nested parallel loops in the example generation. I would still argue that showcasing parallelization in the example is valuable.
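
To make concrete what n_jobs controls in this discussion, here is a rough sketch using only the standard library. The function names and the worker are hypothetical, and ThreadPoolExecutor is merely a stand-in for whatever backend the library actually uses. The only point is that repeats are independent of each other, so they can be dispatched to n_jobs workers without changing the result.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def one_repeat(seed):
    # Hypothetical stand-in for one repetition of the method. Each repeat
    # owns its random generator, so results do not depend on scheduling.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000)) / 1000


def run_repeats(n_repeats, n_jobs=1, base_seed=0):
    seeds = [base_seed + i for i in range(n_repeats)]
    if n_jobs == 1:
        # Sequential path: no worker pool is created at all.
        return [one_repeat(seed) for seed in seeds]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # map returns results in the order of the seeds.
        return list(pool.map(one_repeat, seeds))


sequential = run_repeats(n_repeats=5, n_jobs=1)
parallel = run_repeats(n_repeats=5, n_jobs=2)
```

When an outer tool already runs several examples in parallel, an inner pool like this one is nested inside it, which is the situation mentioned in the comment above.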


def preconfigure_LassoCV(estimator, X, X_tilde, y, n_alphas=20):
    """
    Configure the estimator for Model-X knockoffs.
Collaborator

I think the docstring should focus more on L43-44: the regularization path is defined in a data-dependent way.
The paragraph in Notes should be the central part of the docstring. Is there a reference?
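
As an illustration of what a data-dependent regularization path can look like, the sketch below derives a grid of alphas from the augmented design. The helper name alpha_grid, the log-spaced grid and the ratio eps are assumptions made for illustration, not the formula used in this PR. The one fact it relies on is that, for the Lasso objective (1/(2n))||y - Xb||^2 + alpha ||b||_1 without intercept, all coefficients are zero as soon as alpha >= max_j |x_j^T y| / n.

```python
import numpy as np


def alpha_grid(X, X_tilde, y, n_alphas=20, eps=1e-3):
    """Hypothetical helper: alphas derived from the augmented design [X, X_tilde]."""
    n_samples = X.shape[0]
    X_aug = np.column_stack([X, X_tilde])
    # Smallest alpha for which the Lasso solution is identically zero
    # (assuming centred data and no intercept).
    alpha_max = np.max(np.abs(X_aug.T @ y)) / n_samples
    # Log-spaced grid from alpha_max down to eps * alpha_max (an assumed choice).
    return np.geomspace(alpha_max, eps * alpha_max, num=n_alphas)


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X_tilde = rng.standard_normal((100, 5))
y = X[:, 0] + 0.5 * rng.standard_normal(100)
alphas = alpha_grid(X, X_tilde, y)
```

Such an array could then be handed to an estimator that accepts an explicit grid, for example through the alphas parameter of scikit-learn's LassoCV.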

Collaborator Author

@bthirion Do you have a reference?

    return estimator


def lasso_statistic_with_sampling(
Collaborator

Similar to D0CRT (regression_test, logistic_test), IMO lasso_statistic should be a method of the knockoff class. As its signature suggests, it is KO-specific. It would be more intuitive to place it in the same file, under the class.

Collaborator Author

From what I understand, the current implementation of the test is the original one, but there are proposals for modifying it.
I want to leave users the possibility of modifying it.

Collaborator

Similar to D0CRT, we can consider extending to other test statistics, but they should still be methods of the class.

  • These methods will not be used anywhere else
  • Would be simplified if the class state were used

Collaborator Author

I wanted to separate the configuration of the Lasso from the knockoff because I don't like adding the preconfigure parameter.
I have now put everything into the knockoff class.

Collaborator

@bthirion bthirion left a comment

Only minor things remaining, thx !

@lionelkusch
Collaborator Author

In the same vein, the class should have an estimator parameter in order to expose the Lasso model instead of keeping it under the hood, especially since it uses a particularly high default max_iter=200000.

The lasso is present at the user API level. I use the default value of the code. I can modify it in order to remove all these default parameters.

@codecov

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 99.32886% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 99.19%. Comparing base (5f90dfa) to head (55c20c1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
src/hidimstat/knockoffs.py | 99.26% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #367      +/-   ##
==========================================
- Coverage   99.31%   99.19%   -0.12%     
==========================================
  Files          24       24              
  Lines        1309     1364      +55     
==========================================
+ Hits         1300     1353      +53     
- Misses          9       11       +2     

@jpaillard jpaillard changed the title from "API 2: Model X knockoff" to "[API 2]: Model X knockoff" Oct 30, 2025
Collaborator

Please check the logic. This fixes the error reported in #520.

@jpaillard jpaillard requested a review from bthirion October 30, 2025 10:59
This was linked to issues Oct 30, 2025
# Number of variables
n_features = 150
# Correlation parameter
n = 300
Collaborator Author

It would be better if you could add the description of the parameter, as before.

Collaborator

done

Collaborator

I prefer n_samples, n_features, which are self-explanatory.

max_iter=1000,
),
random_state=0,
n_repeats=1,
Collaborator Author

For an example, it's sometimes better to declare some parameters explicitly, even if the value is the same as the default.

Collaborator

Agreed, but in that case, we just want to show the vanilla knockoff, not the aggregation.
So the user can simply ignore the existence of this parameter.

        return test_statistic

    @staticmethod
    def knockoff_threshold(test_score, fdr=0.1):
Collaborator Author

This is a provide method.

Collaborator Author

Why don't you add the other methods (_empirical_knockoff_pval and _empirical_knockoff_eval) to the class?

Collaborator

I guess you mean private? If so, I am not sure that it should be a private method. As shown in the example, having access to the knockoff threshold can be quite useful for visualization / understanding the data.

For the second comment, I agree: they can all be class methods.
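
For readers unfamiliar with the quantity under discussion, the knockoff threshold for a single vector of statistics can be sketched as below. This follows the standard knockoff+ rule of Barber and Candès (the smallest t among the nonzero |W_j| such that (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= fdr) and is not necessarily the exact code of this PR; the function name and the toy data are assumptions.

```python
import numpy as np


def knockoff_plus_threshold(test_score, fdr=0.1):
    """Sketch of the knockoff+ threshold for one vector of statistics."""
    test_score = np.asarray(test_score, dtype=float)
    # Candidate thresholds: the nonzero magnitudes, in increasing order.
    candidates = np.unique(np.abs(test_score[test_score != 0]))
    for t in candidates:
        n_negative = np.sum(test_score <= -t)
        n_positive = max(1, np.sum(test_score >= t))
        # Conservative estimate of the false discovery proportion at t.
        if (1 + n_negative) / n_positive <= fdr:
            return t
    # No candidate meets the target level: nothing is selected.
    return np.inf


W = np.array([5.0, 4.0, 3.5, 3.0, 2.5, 2.0, 1.8, 1.5, 1.2, 1.0, -0.5, 0.3])
threshold = knockoff_plus_threshold(W, fdr=0.2)
selected = np.flatnonzero(W >= threshold)
```

Keeping such a function accessible lets a user draw the threshold on a plot of the statistics, which is the visualization use case mentioned above.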

Collaborator Author

yes, I meant private.

Collaborator

What is the point of using a staticmethod here? It could be a standard class method, since it needs to access test_scores anyhow.

Collaborator

Same question for the next 2 static methods.
Note that it may be simply a poor understanding on my side.

k_star = 1
# The for loop over all e-values could be optimized by considering a descending list
# and stopping when the condition is not satisfied anymore.
for k, e_k in enumerate(evals_sorted, start=1):
Collaborator Author

Please add a comment, because it is quite unusual in Python for an enumeration to start at 1.
It would also be better to add a link to the equation in the paper.

Collaborator

done

for i in range(n_features - 1, -1, -1):
    if evals_sorted[i] >= n_features / (fdr * (i + 1)):
        selected_index = i
        break
Collaborator Author

This way of finding the maximum is more efficient than your new implementation.

Collaborator

I know, but it's also broken. I replaced it with a non-optimized version that is easy to read and clearly reflects Equation 5 of Wang and Ramdas (2022).
This is not a critical increase in computation; it is simply a for loop with an if condition at each step of the loop. I still added a comment mentioning that it could be optimized.
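
The trade-off discussed in this thread can be sketched as follows. Both functions implement the e-BH rule the comment refers to, k* = max{k : k * e_(k) >= n / fdr} with e_(k) the k-th largest e-value, once as a readable full scan and once with an early stop. The function names, the convention that k_star = 0 means "no rejection", and the toy e-values are assumptions for illustration, not the code merged in this PR.

```python
def e_bh(evals, fdr=0.1):
    """Readable e-BH sketch: return the sorted indices of selected hypotheses."""
    n = len(evals)
    # Indices ordered by decreasing e-value: order[k - 1] is the k-th largest.
    order = sorted(range(n), key=lambda i: evals[i], reverse=True)
    k_star = 0  # 0 means that no hypothesis is rejected
    # k starts at 1 to match the 1-based rank used in the formula.
    for k, idx in enumerate(order, start=1):
        if k * evals[idx] >= n / fdr:
            k_star = k
    return sorted(order[:k_star])


def e_bh_early_stop(evals, fdr=0.1):
    """Same rule, scanning ranks from n down to 1 and stopping at the first hit."""
    n = len(evals)
    order = sorted(range(n), key=lambda i: evals[i], reverse=True)
    for k in range(n, 0, -1):
        if k * evals[order[k - 1]] >= n / fdr:
            return sorted(order[:k])
    return []


evals = [50.0, 0.5, 30.0, 12.0, 1.0]
selected = e_bh(evals, fdr=0.2)
```

Both variants are dominated by the sort, so the full scan costs at most one extra pass over the e-values.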

for k, e_k in enumerate(evals_sorted, start=1):
    if k * e_k >= n_features / fdr:
        k_star = k
if k_star <= n_features:
Collaborator Author

Suggested change
- if k_star <= n_features:
+ if k_star-1 <= n_features:

Collaborator Author

See issue #520.
By the way, it would be better to open a separate PR for this modification alone.

Collaborator

I think it should be k_star, otherwise you can end up with k_star = n_features + 1, which is problematic.

Actually, I think this test is not necessary: the condition is always fulfilled, because every value k_star can take in the for loop already lies within the allowed range.

Collaborator

@bthirion bthirion left a comment

This looks great. I only have relatively minor comments left.

@jpaillard
Collaborator

self.importances_ is an array of shape (n_repeats, n_features) whereas their argument, test_scores, is an array of shape (n_features,).
Each of these functions is designed to operate on a single knockoff repeat and implement formulas from the "vanilla KO" paper.

To transform them to standard class methods, we would need to move the for loops from the fdr_selection or importance methods to these functions. Let me know if you would prefer that.
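
The shape argument above can be illustrated with a small sketch. The class name, the method names and the data are hypothetical; the per-repeat formula shown is one common form of the empirical knockoff p-value used when aggregating several knockoff draws, (1 + #{k : W_k <= -W_j}) / n_features when W_j > 0 and 1 otherwise, and may differ from the helper in this PR. The point is only that the static helper sees one row, while the loop over repeats lives in the instance method.

```python
import numpy as np


class RepeatedKnockoffSketch:
    """Hypothetical container: importances_ has shape (n_repeats, n_features)."""

    def __init__(self, importances):
        self.importances_ = np.atleast_2d(np.asarray(importances, dtype=float))

    @staticmethod
    def empirical_pval(test_score):
        # Operates on ONE repeat: test_score has shape (n_features,).
        n_features = test_score.size
        pvals = np.ones(n_features)
        positive = test_score > 0
        # For W_j > 0: (1 + #{k : W_k <= -W_j}) / n_features; otherwise 1.
        counts = np.array([np.sum(test_score <= -w) for w in test_score[positive]])
        pvals[positive] = (1 + counts) / n_features
        return pvals

    def pvals_per_repeat(self):
        # The loop over repeats stays in the instance method.
        return np.stack([self.empirical_pval(row) for row in self.importances_])


model = RepeatedKnockoffSketch([[3.0, -1.0, 0.5, 0.0], [2.0, 1.0, -4.0, 0.2]])
pvals = model.pvals_per_repeat()
```

Turning empirical_pval into a regular method would mean moving that loop inside it, which is the change described in the comment above.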

@jpaillard jpaillard requested a review from bthirion November 5, 2025 07:44
@bthirion
Collaborator

bthirion commented Nov 5, 2025

self.importances_ is an array of shape (n_repeats, n_features) whereas their argument, test_scores, is an array of shape (n_features,). Each of these functions is designed to operate on a single knockoff repeat and implement formulas from the "vanilla KO" paper.

To transform them to standard class methods, we would need to move the for loops from the fdr_selection or importance methods to these functions. Let me know if you would prefer that.

OK, I see. I think we need to brainstorm a little bit on this, i.e. we're likely going to break the API in the future. But let's merge the PR as is.

@jpaillard jpaillard merged commit 6067152 into mind-inria:main Nov 5, 2025
24 checks passed
@jpaillard jpaillard deleted the PR_knockoff branch November 5, 2025 13:56
Labels

API 2 Refactoring following the second version of API

Development

Successfully merging this pull request may close these issues.

[BUG] e-BH always returns np.inf rename n_repeat to n_repeats knockoff

3 participants