Commit b6f69b4

Last modifications for tonight
1 parent d0af81a commit b6f69b4

3 files changed, +41 -14 lines changed

main.bbl

Lines changed: 11 additions & 0 deletions
@@ -90,6 +90,17 @@
     Conversion from {JuMP} models to {NLPModels}}, 2020,
     \url{https://doi.org/10.5281/zenodo.2574162}.
 
+\bibitem{MadNCL}
+{\sc A.~Montoison, F.~Pacaud, M.~Saunders, S.~Shin, and D.~Orban}, {\em
+  {MadNCL: A GPU Implementation of Algorithm NCL for Large-Scale, Degenerate
+  Nonlinear Programs}}, arXiv preprint arXiv:2510.05885, (2025).
+
+\bibitem{montoison-2025}
+{\sc A.~Montoison, F.~Pacaud, S.~Shin, and M.~Anitescu}, {\em {GPU
+  Implementation of Second-Order Linear and Nonlinear Programming Solvers}},
+  2025, \url{https://arxiv.org/abs/2508.16094},
+  \url{https://arxiv.org/abs/2508.16094}.
+
 \bibitem{Orban_NLPModels_jl_Data_Structures_2023}
 {\sc D.~Orban and A.~Soares~Siqueira}, {\em {NLPModels.jl: Data Structures for
   Optimization Models}}, 2023.

main.tex

Lines changed: 19 additions & 6 deletions
@@ -215,6 +215,9 @@ \section{Introduction}
 We focus on NVIDIA hardware because efficient sparse KKT factorization is currently available only through cuDSS.
 Although our framework already runs on AMD and Intel GPUs, these backends rely on dense linear algebra for KKT solves and therefore serve mainly as proof-of-concept implementations.
 %
+While Krylov methods from \texttt{Krylov.jl} \cite{montoison-orban-2023} can be used on NVIDIA, AMD, and Intel GPUs, the KKT systems arising in the context of interior-point methods are notoriously difficult for Krylov solvers due to eigenvalue clustering in the system matrices, caused by terms approaching zero or infinity.
+Furthermore, a robust, problem-independent preconditioner for GPUs is currently lacking, which limits the practical use of iterative solvers in this setting.
+%
 We demonstrate the performance of this approach on benchmark problems solved on NVIDIA A100 and H100 GPUs.
 
 %\textcolor{red}{We also examine generalizations to hybrid systems characterized by discrete-continuous interactions, Pontryagin-based shooting transcriptions, and infinite-horizon or functional programs modeled with \texttt{InfiniteOpt.jl}~\cite{pulsipher2022unifying}. $\rightarrow$ J-B}
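The "terms approaching zero or infinity" mentioned in the sentence added above can be made concrete with the condensed KKT system of a generic primal-dual interior-point method; the sketch below uses standard textbook notation that is an assumption on our part, not the paper's own formulation.

% Generic primal-dual IPM notation (illustrative assumption, not taken from the paper):
% W = Hessian of the Lagrangian, A = constraint Jacobian, X = diag(x), Z = diag(z),
% and \delta_w, \delta_c = primal and dual regularization parameters.
\[
  \begin{bmatrix}
    W + \Sigma + \delta_w I & A^\top \\
    A                       & -\delta_c I
  \end{bmatrix}
  \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix}
  = -
  \begin{bmatrix} r_d \\ r_p \end{bmatrix},
  \qquad \Sigma = X^{-1} Z .
\]
% As the iterates converge, each entry \sigma_i = z_i / x_i tends either to 0
% (inactive bound) or to +\infty (active bound), spreading the eigenvalues of the
% KKT matrix and slowing Krylov methods unless a tailored preconditioner is available.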
@@ -486,10 +489,10 @@ \section{From optimal control models to SIMD abstraction}
 \section{Orchestrating model derivatives and sparse GPU linear solvers}\label{sec:orchestrate}
 
 Our workflow executes both differentiation and linear algebra entirely on the GPU.
-The interior-point solver \texttt{MadNLP.jl} serves as the central orchestrator: it calls derivative kernels generated by \texttt{ExaModels.jl}, assembles KKT systems directly in device buffers, and invokes \texttt{CUDSS.jl} for factorization and triangular solves.
+The interior-point solver \texttt{MadNLP.jl} serves as the central orchestrator: it calls derivative kernels generated by \texttt{ExaModels.jl}, assembles KKT systems directly in device buffers, and invokes \texttt{CUDSS.jl} for sparse factorizations and triangular solves.
 Apart from a one-time symbolic analysis on the host, primarily reordering to reduce fill-in in the factors of the LDL$^\top$ decomposition, all operations remain on the device.
-Robustness is further ensured by scaling, inertia-corrected primal–dual regularization, and optional iterative refinement.
-When vendor libraries lack suitable routines, \texttt{MadNLP.jl} falls back on \texttt{KernelAbstractions.jl} to implement custom kernels.
+Robustness is further ensured by scaling, inertia-corrected primal–dual regularization, and optional iterative refinement.
+When GPU vendor libraries lack suitable routines, \texttt{MadNLP.jl} falls back on \texttt{KernelAbstractions.jl} to implement custom kernels.
 
 Each interior-point iteration therefore stays entirely on the device: derivative evaluation, KKT assembly, sparse factorizations and triangular solves, and step updates.
 All of these components are orchestrated on GPU by \texttt{MadNLP.jl}.
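To make the orchestration described in this hunk concrete, the minimal Julia sketch below shows how such a pipeline is typically driven from user code. The toy model, grid size, and option values are illustrative assumptions rather than the paper's actual problem or settings, and defaults may differ across package versions.

# Illustrative sketch only: the toy model, grid size, and options are assumptions,
# not the paper's actual problem or settings.
using CUDA, ExaModels, MadNLP, MadNLPGPU

N = 10_000                                    # hypothetical grid size
core = ExaCore(; backend = CUDABackend())     # build the model directly on the GPU

# Toy stand-in for a discretized OCP: quadratic objective with simple
# equality constraints coupling consecutive grid points.
x = variable(core, N)
objective(core, (x[i] - 1.0)^2 for i = 1:N)
constraint(core, x[i+1] - x[i] - 0.01 for i = 1:N-1)
model = ExaModel(core)

# MadNLP runs the interior-point iterations; with a GPU-resident model the KKT
# systems are assembled in device buffers and factorized with cuDSS via CUDSS.jl.
results = madnlp(model; tol = 1e-8)
println(results.status)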
@@ -508,9 +511,10 @@ \section{Benchmark problems} \label{s6}
 % including against CPU solvers like IPOPT through CasADi,
 to assess both raw speed-ups and the effect of sparsity and GPU parallelism on solver performance.
 For the considered setups (see Appendix \ref{sa2} for details on the hardware and on the Julia packages used), performance is, as expected, better on the H100 than on the A100, with comparable GPU-over-CPU speed-ups on both problems.
-Note that the aim of the paper being this CPU-GPU comparison, with ExaModels.jl + MadNLP.jl the currently only modeler-solver pair available to work on GPU, we deliberately do not provide comparisons on CPU with other such pairs (\emph{e.g.}, JuMP + Ipopt).
+Note that since the aim of the paper is this CPU-GPU comparison, and since \texttt{ExaModels.jl} + \texttt{MadNLP.jl} is currently the only modeler-solver pair able to work on GPU, we deliberately do not provide CPU comparisons with other such pairs (\emph{e.g.}, JuMP + Ipopt).
 
-On the Goddard problem, the total dimension of the discretized problem (variables plus constraints) is about $10 N$, where $N$ is the grid size. Although the problem is standard in the control literature, its solution exhibits an intricate structure for the choosen parameters with a control that is comprised of four subarcs (a concatenation of bang-singular-boundary-bang arcs). In particular, the existence of a singular arc is well known to be a source of difficulty for direct methods. Convergence towards the same objective (up to minor changes in precision depending on the number of points) is nonetheless obtained for all solves with the same (trivial) initialization, and the number of iterates remains very close on CPU and GPU for all the tested grid sizes.
+On the Goddard problem, the total dimension of the discretized problem (variables plus constraints) is about $10 N$, where $N$ is the grid size. Although the problem is standard in the control literature, its solution exhibits an intricate structure for the chosen parameters, with a control composed of four subarcs (a concatenation of bang-singular-boundary-bang arcs).
+In particular, the existence of a singular arc is well known to be a source of difficulty for direct methods. Convergence towards the same objective (up to minor changes in precision depending on the number of points) is nonetheless obtained for all solves with the same (trivial) initialization, and the number of iterates remains very close on CPU and GPU for all the tested grid sizes.
 On the cluster with A100, the GPU is faster than the CPU after $N = 2000$, see Figure~\ref{fig1}.
 On the cluster with H100, the GPU beats the CPU after $N = 5000$, see Figure~\ref{fig2}.
 On both architectures, a speed-up of about $2$ is obtained.
@@ -526,6 +530,11 @@ \section{Benchmark problems} \label{s6}
 On both architectures, a speed-up of about $5$ is obtained.
 The largest problem solved on the H100 has size about $4e6$ with a run time of about $13$~seconds.
 
+We note that the bulk of the computational work in our problems arises from two main components: the evaluation of sparse derivatives and the solution of the sparse KKT systems.
+The first component is highly GPU-friendly and does not constitute a bottleneck.
+The second component is more challenging due to the irregular sparsity patterns in the KKT systems and their factors, which make memory communication the limiting factor.
+In the large-scale regime where the memory cache is saturated, the maximum achievable speed-up is limited to roughly 10×, corresponding to the ratio of memory bandwidth between CPU RAM and GPU VRAM in current architectures.
+
 \begin{figure}
 \includegraphics[width=.45\textwidth]{goddard-a100-n2.jpg}
 \caption{Goddard problem, A100 solve. The grid size ranges from $N = 1e2$ to $N = 1e5$.}
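The roughly 10× ceiling stated in the paragraph added in this hunk can be read as a simple bandwidth ratio. The figures below are assumptions for a typical configuration: the GPU number matches the A100 HBM2e bandwidth quoted in a comment in response-to-referees.tex, while the CPU number assumes an eight-channel DDR4 system.

% Back-of-the-envelope estimate (assumed figures, not measurements from the paper):
% roughly 0.2 TB/s of CPU DDR4 bandwidth versus roughly 2 TB/s of A100 HBM2e bandwidth.
\[
  \text{speed-up}_{\max}
  \;\approx\;
  \frac{\mathrm{BW}_{\mathrm{GPU\ VRAM}}}{\mathrm{BW}_{\mathrm{CPU\ RAM}}}
  \;\approx\;
  \frac{2\ \mathrm{TB/s}}{0.2\ \mathrm{TB/s}}
  \;=\; 10 .
\]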
@@ -563,9 +572,13 @@ \section{Discussion}
 Julia's combination of high-level abstractions, metaprogramming, and GPU compiler infrastructure makes it uniquely suited for building performant yet expressive tools for optimal control.
 Our results show that leveraging parallelism in both model structure and solver internals unlocks substantial speed-ups, enabling new applications in aerospace engineering, quantum control, computational biology, learning, and more.
 % \texttt{ExaModels.jl} is actively being extended to support vector-valued functions, which will enable even more efficient evaluation across large grids.
+
 The modular GPU stack also ensures portability across future architectures.
+In particular, we have added support for newly developed GPU optimization solvers, including \texttt{MadNCL.jl} \cite{MadNCL} and \texttt{MadIPM.jl} \cite{montoison-2025}, which extend \texttt{MadNLP.jl} to efficiently solve LPs, convex QPs, and degenerate NLPs.
+This illustrates the modularity of our workflow and opens new research directions, such as batch optimization on GPUs with varying initial conditions in optimal control problems.
+
 Future extensions include multi-GPU execution and tighter integration with differentiable programming workflows.
-Overall, the synergy between Julia GPU tools and the SIMD structure of direct optimal control yields a powerful solution for solving challenging OCPs at scale.
+Overall, the synergy between Julia GPU tools, the SIMD structure of direct optimal control, and the growing suite of solvers yields a powerful framework for solving challenging OCPs at scale.
 
 %\small
 % \bibliographystyle{abbrvnat}

response-to-referees.tex

Lines changed: 11 additions & 8 deletions
@@ -172,12 +172,12 @@ \subsection*{(iv) -- Response of the authors}
 \begin{itemize}
 \item To the best of our knowledge, there is no other established and widely distributed optimal control solver on GPU. We only found an experimental attempt to combine the CasADi API with PyTorch using a mix of Matlab and Python (plus \emph{ad hoc} compilations), named CusADi, which we now cite \citep{jeon2024}.
 \item We added sentences describing GPU backend support in our framework. We have officially added support for AMD GPUs: sparse linear systems are simply converted to dense ones and solved with rocSOLVER.
-We are currently working on similar support for Intel GPUs.
+We are currently working on similar support for Intel GPUs with oneMKL.
 \item Regarding JIT compilation, we added a precompilation setup in the optimization package \texttt{MadNLP.jl}, which should remove the JIT overhead.
 This precompilation workflow has not yet been integrated into \texttt{OptimalControl.jl}.
 We added sentences in the paper to clarify this difference compared to compiled-code execution.
 \item The memory usage is quite similar to what can be achieved in low-level languages such as C or Fortran, since Julia provides fine-grained control over memory reuse and allocation. However, the main overhead comes from the garbage collector, which we do not force for performance reasons, but which can put pressure on the GPU.
-A sentence describing these points was added to the paper.
+A sentence describing these points will be added to the paper.
 \end{itemize}
 \end{answer}
 
@@ -275,12 +275,14 @@ \subsection*{(iv) -- Response of the authors}
 We have provided a few details on the generated code in Section 5 to illustrate the translation into GPU-friendly modeling thanks to \texttt{ExaModels.jl}, but did not give any specific information on the translation process itself (metaprogramming + pattern matching + one single pass treating both syntactic and semantic aspects), as we believe it does not exactly belong to the scope of the conference.
 \item The ticks on the four figures have now been updated and should be more readable.
 \item Regarding plots vs. table(s): we believe it is important to be able to see graphically that the CPU and GPU curves indeed cross around $N = 1e3$. Of course, we can use a large table spanning the two columns in the final version.
-\item We added the CPU model used for the numerical results.
+\item We will add the CPU model used for the numerical results.
 \item The cluster with H100 has a slower CPU than the one with A100.
 This difference is not related to the GPU model.
 \item We do not observe a significant improvement when using H100 over A100 because, in the nonlinear optimization solver, most of the computational work is spent solving the sparse KKT systems.
 The main bottleneck for speed-up is memory communication rather than computation itself.
-More precisely, the comparison depends primarily on the memory architecture of each GPU - such as the HBM generation, bandwidth, on-chip shared memory, and interconnect topology - rather than on raw FLOPS or peak compute throughput.
+More precisely, the comparison depends primarily on the memory architecture of each GPU, such as the HBM generation, bandwidth, on-chip shared memory, and interconnect topology, rather than on raw FLOPS or peak compute throughput.
+We plan to explain this in Section 7, which is dedicated to benchmarks.
+We will also show that once the memory cache is exceeded and the computation enters a regime dominated by memory bandwidth and transfer, the maximum speed-up is limited to around 10x due to the fundamental differences in RAM and VRAM architectures between CPUs and GPUs.
 % For example, NVIDIA's A100 is based on HBM2/HBM2e memory with a bandwidth up to 2 TB/s, whereas the H100 uses HBM3 with bandwidth up to 3.35 TB/s and a redesigned memory hierarchy.
 \item We use a direct solver instead of Krylov solvers because the optimization solver implemented in \texttt{MadNLP.jl} \citep{shin2024accelerating,shin2021graph} is a primal-dual interior-point method.
 The KKT systems are known to be very hard to solve with Krylov methods due to the clustering of eigenvalues, which is caused by some terms in the KKT systems that approach zero or infinity.
@@ -314,9 +316,8 @@ \subsection*{(iv) -- Response of the authors}
 
 \begin{answer}
 \begin{itemize}
-\item We added a sentence clarifying that the main computational effort in the optimization solver lies in solving sparse KKT systems on the GPU and performing automatic differentiation to evaluate sparse Jacobians and Hessians.
-We also indicate which stages of these computations are parallelized and which are not.
-\item We added a table in the numerical experiments section to show the time consumption across the four stages of the algorithm for our test problems.
+\item We will add a sentence clarifying that the main computational effort in the optimization solver lies in solving sparse KKT systems on the GPU and performing automatic differentiation to evaluate sparse Jacobians and Hessians, and that the former is harder to parallelize than the latter.
+\item We plan to add a table in the numerical experiments section showing the time spent in the four stages of the algorithm for our test problems, along with a sentence specifying which of these stages are GPU-friendly.
 \end{itemize}
 \end{answer}
 
@@ -326,7 +327,9 @@ \section*{Other changes}
 
 \begin{answer}
 \begin{itemize}
-\item Added references to our newly supported optimization solvers: \texttt{UnoSolver.jl} \citep{VanaretLeyffer2024}, \texttt{MadNCL.jl} \citep{MadNCL}, and \texttt{MadIPM.jl} \citep{montoison-2025}, and used them as examples of the modularity in our workflow.
+\item Added references to our newly supported GPU optimization solvers:
+% \texttt{UnoSolver.jl} \citep{VanaretLeyffer2024},
+\texttt{MadNCL.jl} \citep{MadNCL}, and \texttt{MadIPM.jl} \citep{montoison-2025}, and used them as examples of the modularity in our workflow.
 \item Outlined upcoming work and research questions arising from the connection between these packages, such as solving batch optimization problems on GPU with varying initial conditions in optimal control problems.
 \end{itemize}
 \end{answer}
