
Report of Reviewer #1


Associate Editor: Nelle Varoquaux
Reviewer: chose to remain anonymous

Reviewing history

  • Paper submitted March 9, 2023
  • Reviewer invited March 20, 2023
  • Review 1 received May 29, 2023
  • Paper revised October 2, 2023
  • Reviewer invited October 3, 2023
  • Review 2 received October 23, 2023
  • Paper revised January 22, 2024
  • Reviewer invited January 31, 2024
  • Review received February 2, 2024
  • Paper conditionally accepted February 9, 2024

First Round

The paper presents a dimension reduction framework based on Importance Sampling for estimating integrals
with respect to probability distributions, such as rare-event probabilities. The main contribution of the manuscript lies in
Theorem 3.1, which provides more theoretical understanding about reducing the dimension by projecting
only onto the mean of the optimal biasing density. The computational framework relies on a Gaussian family of
biasing densities for which 6 different covariance matrices are analyzed.

Overall, I think the paper has the potential to be considered for publication. However, there are multiple
notation (especially regarding the eigenpairs and covariance matrices), organization, and clarification aspects that have
to be addressed. In the following, I provide some remarks (and editorial comments) that the Authors need to address.

We thank the Reviewer for his/her encouraging comments. To avoid any ambiguity, we nonetheless stress that it is Theorem 2 that is concerned with projecting onto the mean, while Theorem 1, which is indeed the main result of the paper, identifies the theoretically optimal directions on which to project.

Main comments

  1. I counted approximately 37 pages of code material, comprising over half of the manuscript. This significantly hampers readability, especially in Section 6 where the numerical results are presented. Although this
    seems to be a policy of the journal, I suggest creating a GitHub page or similar platform to host the code
    for each numerical example. This would not only enhance the paper's readability but also provide reviewers
    and readers with convenient access to the code for testing. Then, if necessary, you can move the code
    to an appendix.

We agree with the Reviewer that the code material does not help the reading of the paper. Nevertheless, we have strictly followed the journal policy on the reproducibility of the results. Indeed, the different pieces of code generate the different graphs (Figures 1, 2, 3, 4, 5, and 6) and all the result tables (Tables 2, 3, 4, 5, 6, and 7) of the article. They cannot be moved to an appendix, since otherwise all these graphs and tables would also be moved to the appendix.
The code can be hidden in the HTML version of the paper using the hide/show button. We recall that the HTML version of the article is available at the following link: https://jmorio44.github.io/optimal-projection-IS/.

  2. Section 3.2: I think the choice of the function $\ell$ deserves to be justified in more detail, as it is fundamental
    for the whole paper. For example, I do not see the ordering idea mentioned/motivated in the abstract
    and introduction. You should also mention the requirements on this function: bounded?
    convex? Are there theoretical justifications for why this induces a good $\ell$-ordering? What is the
    rationale/intuition behind this choice, and are there other potential alternatives? Why does the $\ell$-ordering idea
    give a better ranking than the more standard descending ordering of the eigenvalues?
    Although you try to explain it in the Remark on p. 7, it is still not clear to me why it is a good idea
    to have the minimizer of $\ell$ at 1. I also do not understand why sometimes the smallest or the largest
    eigenvalue of $\Sigma^\star$ maximizes $\ell$ (perhaps this is related to the choice of $\ell$).

In fact, there is no choice behind the function $\ell$: this function is given by Theorem 1. To be more precise, we show in Theorem 1 that if the eigenvalues are ranked according to their $\ell$ values, then the optimal mean and variance parameters are given by (7) in the paper. The theorem shows that any other choice of eigenvalues (e.g., choosing eigenvalues in increasing or decreasing order) is suboptimal, i.e., it increases the Kullback–Leibler divergence of Equation (5). In practice, as only 1 or 2 projection directions are chosen and due to the shape of the function $\ell$, the optimal projection directions are the eigenvectors associated with extreme (small or large) eigenvalues of the optimal covariance matrix. As this was not clear, the definition of the function $\ell$ has been moved outside Section 3.2 "Main results of the paper" to improve the readability of the article; the code needed to plot the function $\ell$ was placed just before the theorem and could make it difficult to read, which is no longer the case in the new version of the manuscript. To ease the reading, we have also moved the proof of Theorem 1, and thus of the optimality of $\ell$, to Appendix A.
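To make the ranking idea concrete, here is a minimal sketch (this is not the code of the article: `ell` is a placeholder for the function $\ell$ of Theorem 1, taken here as the standard per-eigenvalue Kullback–Leibler term $\lambda - 1 - \log\lambda$, whose minimizer is at 1, and `select_directions`/`n_dir` are illustrative names):

```python
import numpy as np

def ell(lam):
    # Placeholder for the function ell of the paper; here we use the usual
    # per-eigenvalue KL term for Gaussian densities, minimized at lambda = 1.
    return lam - 1.0 - np.log(lam)

def select_directions(cov_star, n_dir=2):
    """Rank the eigenpairs of the optimal covariance matrix by their ell value
    (largest first) and keep the n_dir leading eigenvectors as projection
    directions; eigenvalues close to 1 contribute little and are discarded."""
    lam, vec = np.linalg.eigh(cov_star)      # eigenpairs of Sigma*
    order = np.argsort(ell(lam))[::-1]       # rank by decreasing ell(lambda)
    return vec[:, order[:n_dir]], lam[order[:n_dir]]
```

With this ordering, both very small and very large eigenvalues of $\Sigma^\star$ can be selected, which matches the observation above that the retained directions correspond to extreme eigenvalues.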

  3. I understand the Authors are not taking into account the computational cost of applying
    rejection sampling to $g^\star$. The purpose of this is to check that the proposed directions are suitable for
    building the subspace, and ideas to relax this assumption are mentioned in the conclusions. However, I
    still believe that the effect of a more practical algorithm must be tested within the approach. For instance, if
    one uses an MCMC algorithm, the samples will no longer be independent, and it will be very relevant to
    see how dependent samples affect the proposed subspace directions. Perhaps you can illustrate/compare
    this on a small example.
    Otherwise, this would mean that the proposed approach only works when the samples of $g^\star$ are independent.
    Hence, this needs to be clearly stated in the manuscript, and the final computational cost of the method
    needs to be discussed as well.

We agree with the Reviewer that it is interesting to consider an MCMC algorithm in the simulations. For that purpose, we have applied the Metropolis–Hastings algorithm to generate samples distributed from $g^\star$ on test case 1. All the results are given in Appendix C and the conclusions on the covariance matrix choices remain similar. The number of calls to $\phi$ is significantly decreased with MCMC compared to rejection sampling, but the integral estimates have of course an increased variance due to the dependency of the MCMC samples.

  4. In general, the computational cost of the method needs to be discussed in detail (total number of $\phi$ calls).
    I suggest adding this to the tables of the numerical experiments in Section 6. This way, readers can
    see how competitive the algorithm is in the end. It will also help to understand the choices of the 6
    proposed covariance matrices, so that, in addition to the accuracy of the integral estimate, we can see
    the associated cost.

We agree with the Reviewer that the computational cost is an important topic in importance sampling. The simulation budget is here exactly the same whatever the covariance matrix model. To clarify this point, we now indicate the number of calls to $\phi$ in the caption of each result table. We have also added the following paragraph in Section 4.1:

The numbers of samples $M$ and $N$ are set to $M=500$ and $N=2000$, respectively. The computational cost to generate $M=500$ samples distributed from $g^\star$ with rejection sampling is often unaffordable in practice; if $\mathcal{E}$ is a probability of order $10^{-p}$, then approximately $500\times10^p$ calls to $\phi$ are necessary for the generation of $\mathbf{X}_1^\star,\ldots,\mathbf{X}_M^\star$. Finally, whatever the auxiliary parametric density $g'$ computed from $\mathbf{X}_1^\star,\ldots,\mathbf{X}_M^\star$, the number of calls to $\phi$ for the estimation step stays constant and equal to $N$. The number of calls to $\phi$ for the whole procedure on a $10^{-p}$ probability estimation is about $500\times10^p+N$. A more realistic situation is considered in Appendix C where MCMC is applied to generate samples from $g^\star$. The resulting samples are dependent but the computational cost is significantly reduced. The number of calls to $\phi$ with MCMC is then equal to $M$, which leads to a total computational cost of $M+N$ for the whole procedure.
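As a concrete illustration of this budget, using only the settings stated in the paragraph above (a probability of order $10^{-3}$, i.e. $p=3$, with $M=500$ and $N=2000$):

$500\times 10^{3}+2000 \approx 5\times 10^{5}$ calls to $\phi$ with rejection sampling, versus $M+N=500+2000=2500$ calls to $\phi$ with MCMC.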

Detailed comments:
2110187-response-to-the-reviewers.pdf

Second Round

The authors did a great job at replying to my comments and suggestions. I was confused
about the choice of the $\ell$ function, and indeed it arises from Theorem 1. Also, I was not aware
of the HTML version, thanks for pointing it out. Now, I just wanted to ask about some details of
the Metropolis–Hastings (MH) approach, since the results in Appendix C (Table 7) look a bit
surprising, or perhaps confusing:

  1. In the MH code, I can see the proposal is a standard Gaussian, but why not use a proposal
    scale (or a method that adapts the proposal)? I mean something like $P=VA0(\beta).rvs(size=1)$,
    so that the proposal depends on a scale $\beta$ that the user can control (or adapt with an algorithm)
    to improve the acceptance probability of the algorithm.

We thank the Reviewer for his/her encouraging comments. We use a Markovian kernel $M$ that is reversible with respect to the Gaussian distribution in the Metropolis–Hastings approach, with the following expression:
$M(x, \cdot) = \frac{x + \beta N}{\sqrt{1 + \beta^2}},$
where $N$ is a multivariate standard normal random variable and $\beta$ is a scaling parameter [1]. This scaling parameter is called "param_agit" in the Python code. It is set to 0.5, as this seems to give the best simulation results for this use case. This parameter can be modified by the user.
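For readers following along, here is a minimal sketch of a Metropolis–Hastings sampler using this Gaussian-reversible kernel. This is not the code of the article: the names `pcn_mh_sampler` and `weight` are illustrative, and `weight` is assumed to be the (unnormalised) density of $g^\star$ with respect to the standard normal (e.g., the indicator of the rare event), in which case the acceptance ratio reduces to `weight(prop)/weight(x)`.

```python
import numpy as np

def pcn_mh_sampler(weight, x0, n_steps, beta=0.5, rng=None):
    """Metropolis-Hastings with the Gaussian-reversible kernel
    x' = (x + beta * N) / sqrt(1 + beta**2),  N ~ N(0, I).
    `weight` is the (unnormalised) density of the target with respect to the
    standard normal; since the kernel is reversible with respect to the
    standard normal, the acceptance ratio simplifies to weight(x')/weight(x)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    wx = weight(x)
    scale = 1.0 / np.sqrt(1.0 + beta**2)
    chain, accepted = [x.copy()], 0
    for _ in range(n_steps):
        prop = scale * (x + beta * rng.standard_normal(x.shape))
        wp = weight(prop)
        # Accept with probability min(1, weight(prop) / weight(x)).
        if wx == 0.0 or rng.uniform() < wp / wx:
            x, wx = prop, wp
            accepted += 1
        chain.append(x.copy())
    return np.array(chain), accepted / n_steps
```

Here `beta` plays the role of the "param_agit" parameter mentioned above.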

It is mentioned that “The number M of g* samples has been increased to 1000 so that the Markov chain has enough step to reach convergence”, I assume this is the burn-in, right? However, given that there is no proposal scale selection (or adaptation), we cannot say that simply adding 1000 more samples is sufficient to achieve convergence. Also, what are the resulting acceptance probabilities of the MH algorithm?

We have modified the sentence as follows: "Remember that with rejection sampling, we did not account for the samples generated in the rejection step. Thus, in order to generate a sample of size $M = 500$ with an acceptance
probability of the order of $10^{-3}$, of the order of 500,000 samples are generated. Thus, a fair comparison between the rejection and MCMC methods would allow considering 500,000 samples. In practice, we found that the MCMC method performs reasonably well if we use it as a sampler. More precisely, the sample $(X_1^\star, \dots, X_M^\star)$ in Section 4.1 is given by $(Y_{5k})_{k=1,\dots,500}$, where $(Y_i)_{i=1,\dots,2500}$ is the MH Markov chain."

Note that we do not claim that the chain has reached stationarity, but the numerical results seemed sufficient for our purposes. Note also that this represents a significant improvement with respect to the rejection method, as only 2,500 calls to the black-box function are necessary for MCMC, versus 500,000 for the rejection method. Finally, the acceptance probability is about 0.22 when the kernel parameter $\beta$ is set to 0.5. According to [2], this acceptance rate is appropriate.
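To make the thinning described above concrete, a possible usage of the sketch given earlier is shown below, with a toy `weight` and starting point `x0` (both hypothetical, not the $\phi$ of the paper), a chain of length 2,500, and a stride of 5:

```python
import numpy as np

# Toy rare event {x[0] > 3} in dimension 10; the indicator plays the role of g*/f.
weight = lambda x: float(x[0] > 3.0)
x0 = np.array([3.5] + [0.0] * 9)          # a starting point inside the event

chain, acc_rate = pcn_mh_sampler(weight, x0, n_steps=2500, beta=0.5, rng=0)
g_star_samples = chain[5::5]              # keep every 5th state: 500 samples
print(g_star_samples.shape, acc_rate)     # compare with the ~0.22 reported above
```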

It seems that the generated samples have a lot of correlation, given the large coefficients of variation in Table 7. By the way, are these associated with the estimation of $\mathcal{E}$ or of $g^\star$? The caption of the table reads "Numerical comparison of the estimation of $\mathcal{E} \approx 1.35 \times 10^{-3}$", but I do not see the estimated values of $\mathcal{E}$ under each covariance matrix case, so that we can compare with the estimated values of $\mathcal{E}$ when $g^\star$ is computed with rejection sampling. Moreover, why are the relative errors negative?

The coefficients of variation are associated with the estimation of $\mathcal{E}$ and take high values because of the sample dependency. To reduce the coefficients of variation, it is necessary to increase the length of the MH Markov chain. The relative error is defined by $\widehat{\mathcal{E}}/\mathcal{E} - 1$ (see the end of Section 4.1). By definition, it can take negative values, namely when the estimate is smaller than the true value. Concerning the absolute value of $\widehat{\mathcal{E}}$, note that it can be retrieved from the relative error since the true value $\mathcal{E}$ is known. We find the relative error more informative than the absolute value (which requires an additional comparison to the true value to assess its accuracy), and, to avoid redundancy, we have not reported the absolute values $\widehat{\mathcal{E}}$ in any table of the article.
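As a purely illustrative reading of this definition (the numbers below are not taken from Table 7): a relative error of $-0.1$ on test case 1, where $\mathcal{E}\approx 1.35\times 10^{-3}$, corresponds to
$\widehat{\mathcal{E}} = \mathcal{E}\,(1-0.1) \approx 1.2\times 10^{-3},$
i.e., a negative relative error simply indicates underestimation.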

Given the previous comments and the results of Table 7, it seems that doing MCMC for $g^\star$ is not recommended for your algorithm. Could you please add an explanation of the results of Table 7, e.g., what are the implications of these results and/or ways to improve them?

MCMC is well adapted to our algorithm, but the sample dependency induced by MCMC implies that more samples must be generated from $g^\star$ to reach a level of error similar to the one obtained with independent samples. Nevertheless, we want to recall that the computational cost of obtaining these independent samples with rejection sampling ($4\times10^5$ calls to $\phi$) is much higher than with MCMC ($10^3$ calls to $\phi$) and is intractable in practical applications.

Typo in the second sentence “computationnal”.

We have made the modification.

Finally, in the conclusions, it is not clear what the best choice of covariance matrix is
(among the 6 options). It seems that this is problem dependent, right? Perhaps you can think
of a way to make the conclusions clearer and more concise. As of now, the first paragraphs of this
section read more like an "analysis of results" section, which is fine, but it takes the reader away from
the main conclusions of this research: for example, the assumptions, the advantages and disadvantages of
the method, the recommended "tunable settings" (e.g., the covariances, the estimation of
$g^\star$) in the algorithm, and things to improve.

Among the 6 options reported in the tables, the best covariance matrix choice is of course always the optimal covariance matrix $\Sigma^\star$, whatever the use case. However, in practice this covariance matrix has to be estimated, and the classical empirical estimate is not reliable due to the curse of dimensionality. In this paper, we have determined theoretically the optimal directions in which the variance should be computed and have verified numerically the efficiency of these directions in different settings (knowledge or estimation of these optimal directions). Nevertheless, we agree with the Reviewer that the conclusion was too long. We have thus lightened the first paragraphs of the conclusion to focus on the most important points of the paper.
