Commit ed9ad21

Commit message: Alexis's changes for the referees
1 parent ffee2b8 commit ed9ad21

File tree

main.bbl
main.bib
main.tex

3 files changed: +66, -18 lines changed

main.bbl

Lines changed: 19 additions & 0 deletions
@@ -80,6 +80,20 @@
 {\sc A.~Montoison}, {\em {CUDSS.jl: Julia interface for NVIDIA cuDSS}},
 \url{github.com/exanauts/CUDSS.jl}.
 
+\bibitem{montoison-migot-orban-siqueira-2021}
+{\sc A.~Montoison, T.~Migot, D.~Orban, and A.~S. Siqueira}, {\em
+{ADNLPModels.jl}: Automatic differentiation models implementing the {NLPModels}
+API}, 2021, \url{https://doi.org/10.5281/zenodo.4605982}.
+
+\bibitem{montoison-orban-siquiera-nlpmodelsjump-2020}
+{\sc A.~Montoison, D.~Orban, and A.~S. Siqueira}, {\em {NLPModelsJuMP.jl}:
+Conversion from {JuMP} models to {NLPModels}}, 2020,
+\url{https://doi.org/10.5281/zenodo.2574162}.
+
+\bibitem{Orban_NLPModels_jl_Data_Structures_2023}
+{\sc D.~Orban and A.~Soares~Siqueira}, {\em {NLPModels.jl: Data Structures for
+Optimization Models}}, 2023.
+
 \bibitem{pacaud2024gpu}
 {\sc F.~Pacaud and S.~Shin}, {\em {GPU}-accelerated nonlinear model predictive
 control with {ExaModels} and {MadNLP}}, arXiv e-prints, (2024),
@@ -111,6 +125,11 @@
 modeling and decomposition of energy infrastructures}, IFAC-PapersOnLine, 54
 (2021), pp.~693--698.
 
+\bibitem{VanaretLeyffer2024}
+{\sc C.~Vanaret and S.~Leyffer}, {\em Unifying nonlinearly constrained
+nonconvex optimization}.
+\newblock Submitted to Mathematical Programming Computation, 2024.
+
 \bibitem{wachter2006implementation}
 {\sc A.~W{\"a}chter and L.~T. Biegler}, {\em On the implementation of an
 interior-point filter line-search algorithm for large-scale nonlinear

main.bib

Lines changed: 14 additions & 0 deletions
@@ -311,3 +311,17 @@ @misc{montoison-migot-orban-siqueira-2021
 year = {2021},
 doi = {10.5281/zenodo.4605982},
 }
+
+@misc{montoison-orban-siquiera-nlpmodelsjump-2020,
+author = {A. Montoison and D. Orban and A. S. Siqueira},
+title = {{NLPModelsJuMP.jl}: Conversion from {JuMP} models to {NLPModels}},
+year = {2020},
+doi = {10.5281/zenodo.2574162},
+}
+
+@misc{Orban_NLPModels_jl_Data_Structures_2023,
+author = {Orban, Dominique and Soares Siqueira, Abel},
+license = {MPL-2.0},
+title = {{NLPModels.jl: Data Structures for Optimization Models}},
+year = {2023}
+}

main.tex

Lines changed: 33 additions & 18 deletions
@@ -107,7 +107,7 @@
 % Define a switch for double-blind mode
 \newif\ifblind
 %\blindtrue % set \blindfalse for non-blind mode
-\blindfalse
+\blindtrue
 
 % PDF metadata
 \ifblind
@@ -205,13 +205,15 @@
 \section{Introduction}
 
 Solving large-scale nonlinear optimal control problems is computationally demanding, especially with fine discretizations or real-time requirements.
-While GPUs offer massive parallelism well-suited to these problems, fully exploiting their potential remains challenging due to the complexity of modeling, differentiation, and solver integration.
+While GPUs offer massive parallelism well-suited to these problems, fully exploiting their potential remains challenging due to the complexity of modeling, automatic differentiation, and solver integration.
 %
 We present a fully GPU-accelerated workflow, entirely built in Julia~\cite{bezanson2017julia}.
 Continuous-time dynamics are discretized with \texttt{OptimalControl.jl}~\cite{OC_jl} into structured, sparse nonlinear programs.
 These are compiled with \texttt{ExaModels.jl}~\cite{shin2024accelerating} into GPU kernels that preserve sparsity and compute derivatives in a single pass, enabling efficient SIMD parallelism.
 %
 Problems are solved on NVIDIA GPUs using the interior-point solver \texttt{MadNLP.jl}~\cite{shin2021graph} and the sparse linear solver \texttt{CUDSS.jl}~\cite{Montoison_CUDSS_jl_Julia_interface}, enabling end-to-end acceleration from modeling to solving.
+We focus on NVIDIA hardware because efficient sparse KKT factorization is currently available only through cuDSS.
+Although our framework already runs on AMD and Intel GPUs, these backends rely on dense linear algebra for KKT solves and therefore serve mainly as proof-of-concept implementations.
 %
 We demonstrate the performance of this approach on benchmark problems solved on NVIDIA A100 and H100 GPUs.
 
@@ -250,27 +252,28 @@ \section{Background and limitations}
 Second-order methods, such as interior-point solvers, exploit this structure. % for efficient problem solution.
 %
 Most existing optimal control toolchains target CPU execution.
-For example, CasADi~\cite{Andersson2019} constructs symbolic expressions evaluated just-in-time or exported as C code, typically solved by CPU solvers like IPOPT~\cite{wachter2006implementation} or KNITRO~\cite{byrd2006k}, which rely on CPU linear solvers such as PARDISO~\cite{schenk2004solving}, MUMPS~\cite{amestoy2000mumps}, or HSL~\cite{fowkes2024libhsl}.
+For example, CasADi~\cite{Andersson2019} constructs symbolic expressions evaluated just-in-time or exported as C code, typically solved by CPU solvers like Ipopt~\cite{wachter2006implementation}, Uno~\cite{VanaretLeyffer2024}, or KNITRO~\cite{byrd2006k}, which rely on CPU linear solvers such as PARDISO~\cite{schenk2004solving}, MUMPS~\cite{amestoy2000mumps}, or HSL~\cite{fowkes2024libhsl}.
 %
 Other frameworks, such as ACADO~\cite{houska2011acado} and \texttt{InfiniteOpt.jl}~\cite{pulsipher2022unifying}, the latter cleverly leveraging the modeling power of JuMP~\cite{dunning2017jump}, also follow the same CPU-centric paradigm.
 %
 This CPU focus limits scalability and real-time performance for large or time-critical problems that could benefit from GPU parallelism.
-While some libraries provide GPU-accelerated components, none deliver a fully integrated, GPU-native workflow for nonlinear optimal control. (See, nonetheless, the nice attempt \cite{jeon2024} trying to combine the CasADi API with PyTorch so as to evaluate part of the generated code on GPU.)
+While some libraries provide GPU-accelerated components, none deliver a fully integrated, GPU-native workflow for nonlinear optimal control.
+See, nonetheless, the notable attempt of \cite{jeon2024} to combine the CasADi API with PyTorch so as to evaluate part of the generated code on the GPU.
 %
 Our work fills this gap with a GPU-first toolchain that unifies modeling, differentiation, and solver execution, addressing the challenges of solving large-scale sparse NLPs.
 
 \section{SIMD parallelism in direct optimal control} \label{s3}
 When discretized by \emph{direct transcription}, optimal control problems (OCPs) possess an inherent structure that naturally supports SIMD parallelism.
 Consider indeed an optimal control problem with state $x(t) \in \mathbf{R}^n$ and control $u(t) \in \mathbf{R}^m$. Assume that the dynamics are modeled by the ODE
 $$ \dot{x}(t) = f(x(t), u(t)), $$
-where $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}^n$ is a smooth function. Using a one-step numerical scheme to discretise this ODE on a time grid $t_0, t_1, \dots, t_N$ of size $N + 1$ results in a set of equality constraints. For instance, with a forward Euler scheme, denoting $h_i := t_{i+1} - t_i$, one has ($X_i \simeq x(t_i)$, $U_i \simeq u(t_i)$)
+where $f : \mathbf{R}^n \times \mathbf{R}^m \to \mathbf{R}^n$ is a smooth function. Using a one-step numerical scheme to discretize this ODE on a time grid $t_0, t_1, \dots, t_N$ of size $N + 1$ results in a set of equality constraints. For instance, with a forward Euler scheme, denoting $h_i := t_{i+1} - t_i$, $X_i \simeq x(t_i)$, $U_i \simeq u(t_i)$, we have
 $$ X_{i+1} - X_i - h_i f(X_i, U_i) = 0,\quad i = 0, \dots, N-1. $$
 Similarly, a general Bolza cost that mixes endpoint and integral terms as in
-$$ g(x(0), x(t_f)) + \int_0^{t_f} f^0(x(t), u(t))\,\mathrm{d}t \to \min $$
+$$ g(x(t_0), x(t_f)) + \int_{t_0}^{t_f} f^0(x(t), u(t))\,\mathrm{d}t \to \min $$
 can be approximated by
 $$ g(X_0, X_N) + \sum_{i=0}^{N-1} h_i f^0(X_i, U_i). $$
 Discretizing boundary or path constraints such as
-$$ b\big(x(0),x(t_f)\big) \leq 0,\quad c\big(x(t), u(t)\big) \leq 0 $$
+$$ b\big(x(t_0),x(t_f)\big) \leq 0,\quad c\big(x(t), u(t)\big) \leq 0 $$
 is obviously done according to
 $$ b(X_0, X_N) \leq 0, \quad c(X_i, U_i) \leq 0,\quad i = 0, \dots, N-1. $$
 The resulting NLP in the vector of unknowns $(X_0,\dots,X_N,U_0,\dots,U_{N-1})$
@@ -321,7 +324,7 @@ \section{A Julia-based GPU optimization stack}
 It allows users to exploit GPUs without requiring any knowledge of GPU programming.
 For instance, \texttt{ExaModels.jl} builds on \texttt{KernelAbstractions.jl} to automatically generate specialized GPU kernels for parallel evaluation of ODE residuals, Jacobians, and Hessians needed in optimal control problems.
 %
-We build on this ecosystem to create a complete GPU-accelerated toolchain spanning modeling, differentiation, and solving.
+We build on this ecosystem to create a complete GPU-accelerated toolchain spanning modeling, automatic differentiation, and solving.
 This results in a fully Julia-native workflow for modeling and solving ODE-constrained optimal control problems on GPU.
 %
 Key components of our stack include:
@@ -330,7 +333,7 @@ \section{A Julia-based GPU optimization stack}
 \item[--] \texttt{OptimalControl.jl}: a domain-specific language for symbolic specification of OCPs, supporting both direct and indirect formulations.
 \item[--] \texttt{ExaModels.jl}: takes the discretized OCPs and produces sparse, SIMD-aware representations that preserve parallelism across grid points, compiling model expressions and their derivatives into optimized CPU/GPU code.
 \item[--] \texttt{MadNLP.jl}: a nonlinear programming solver implementing a filter line-search interior-point method, with GPU-accelerated linear algebra support.
-\item[--] \texttt{CUDSS.jl}: a Julia wrapper around NVIDIAs \texttt{cuDSS} sparse solver, enabling GPU-based sparse matrix factorizations essential for interior-point methods.
+\item[--] \texttt{CUDSS.jl}: a Julia wrapper around NVIDIA's \texttt{cuDSS} sparse solver, enabling GPU-based sparse matrix factorizations essential for interior-point methods.
 \end{itemize}
 \noindent Together, these components form a high-level, performant stack that compiles intuitive Julia OCP models into efficient GPU code, achieving substantial speed-ups while maintaining usability.
 
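To make the interplay of these components concrete, here is a minimal, hedged sketch (not taken from the commit) of building a toy ExaModels.jl model on the GPU and solving it with MadNLP.jl, with cuDSS reached through MadNLPGPU/CUDSS.jl; constructor keywords and default linear-solver selection may differ across package versions:

# Illustrative end-to-end use of the stack; the toy model below is not from the paper.
using ExaModels, MadNLP
using CUDA, MadNLPGPU          # CUDA backend and GPU linear solvers (cuDSS through CUDSS.jl)

c = ExaCore(; backend = CUDABackend())            # model data and kernels live on the GPU
x = variable(c, 1000)
constraint(c, x[i+1] - 2 * x[i] for i = 1:999)    # equality constraints (== 0 by default)
objective(c, (x[i] - 1)^2 for i = 1:1000)
nlp = ExaModel(c)

result = madnlp(nlp; tol = 1e-8)                  # interior-point solve; KKT systems factorized on the GPU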
@@ -341,6 +344,12 @@ \section{A Julia-based GPU optimization stack}
 \item[--] \textbf{Portability}: symbolic modeling and kernel generation are backend-agnostic; the current limitation lies in sparse linear solvers, which are still CUDA-specific, but the framework is designed to integrate alternative backends as they become available.
 \end{itemize}
 
+One common concern in GPU-accelerated workflows is the overhead of just-in-time (JIT) compilation.
+In our stack, this overhead is minimal because we precompile all performance-critical code whenever possible using Julia 1.12 and tools like \texttt{PrecompileTools.jl}, thereby reducing warm-up latency.
+Only a small fraction of code that depends on dynamic types or runtime-specialized kernels is compiled just-in-time; this JIT overhead is negligible in practice due to the highly regular structure of GPU kernels generated from transcribed ODE constraints.
+The GPU ecosystem is also moving toward runtime PTX/JIT compilation (for example, NVIDIA CUDA 13.0 deprecates offline compilation for older architectures), which aligns with this strategy.
+Altogether, this strategy allows our workflow to efficiently exploit GPUs for large-scale or real-time optimal control problems.
+
 \section{From optimal control models to SIMD abstraction}
 To illustrate the transcription from the infinite-dimensional setting to a discretized optimization problem suited for SIMD parallelism, consider the following elementary optimal control problem with a state function, $x(t)$, valued in $\mathbf{R}^2$, and a scalar control, $u(t)$: minimize the (squared) $L^2$-norm of the control over the fixed time interval $[0,1]$,
 $$ \tfrac{1}{2}\int_0^1 u^2(t)\,\mathrm{d}t \to \min, $$
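For readers unfamiliar with the \texttt{OptimalControl.jl} DSL mentioned above, a hedged sketch of how such a problem can be declared follows. The dynamics and boundary conditions below are illustrative placeholders (the actual ones of the paper's example lie outside this excerpt), and the exact @def syntax may vary with the package version:

# Illustrative sketch of the OptimalControl.jl DSL; the dynamics and boundary
# conditions below are placeholders, not those of the paper's example.
using OptimalControl

ocp = @def begin
    t ∈ [0, 1], time
    x ∈ R², state
    u ∈ R, control
    x(0) == [-1, 0]                 # placeholder boundary conditions
    x(1) == [0, 0]
    ẋ(t) == [x₂(t), u(t)]           # placeholder (double-integrator) dynamics
    ∫( 0.5 * u(t)^2 ) → min         # the cost stated above
end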
@@ -366,24 +375,30 @@ \section{From optimal control models to SIMD abstraction}
 The initial and final times are fixed in this case, but they could be additional unknowns (see Appendix~\ref{sa1}, where the Goddard benchmark problem is modeled with a free final time). Users can also declare additional finite-dimensional parameters (or \emph{variables}) to be optimized. Furthermore, extra constraints on the state, control, or other quantities can be imposed as needed.
 At this stage, the crux is to seamlessly parse the abstract problem description and compile it on the fly into a discretized nonlinear optimization problem.
 We achieve this by exploiting two features.
-First, the DSL syntax is fully compatible with standard Julia, allowing us to use the language’s built-in lexical and syntactic parsers.
-Second, pattern matching via \texttt{MLStyle.jl} \cite{MLStyle_jl} extends Julia’s syntax with additional keywords such as \verb+state+ for declaring state variables, and implements the semantic pass that generates the corresponding discretized code.
-This discretized code can now be an \texttt{ExaModels.jl} model (while previously only modeling with ADNLPModels.jl was available, we have added the syntactic and semantic passes to generate ExaModels.jl models from the abstract description), which allows to declare
-optimization variables (finite dimensional vector or arrays), constraints and cost.
-Regarding constraints, \texttt{ExaModels.jl} uses \emph{generators} in the form of \verb+for+ loop like statements to model the SIMD abstraction, ensuring that the function
-at the heart of the statement is mapped towards a \emph{kernel} (this is where \texttt{KernelAbstractions.jl} comes into play) and efficiently evaluated by the solver. All in all, the process merely is a compilation from \texttt{OptimalControl.jl} DSL, well suited for mathematical control abstractions, into \texttt{ExaModels.jl} DSL, tailored to describe optimization problems with strong SIMD potentialities. (As explained in Section~\ref{s3}, this is indeed the case for discretizations of optimal control problems.)
+First, the DSL syntax is fully compatible with standard Julia, allowing us to use the language's built-in lexical and syntactic parsers.
+Second, pattern matching via \texttt{MLStyle.jl} \cite{MLStyle_jl} extends Julia's syntax with additional keywords such as \verb+state+ for declaring state variables, and implements the semantic pass that generates the corresponding discretized code.
+This discretized code can now be represented as an \texttt{ExaModels.jl} model.
+Previously, only the generic \texttt{ADNLPModels.jl} \cite{montoison-migot-orban-siqueira-2021} was available, which requires more user effort and provides limited guidance for automatic differentiation, making it harder to generate highly efficient code on GPU.
+We have therefore added syntactic and semantic passes to generate \texttt{ExaModels.jl} models directly from the abstract problem description, allowing users to declare optimization variables (finite-dimensional arrays), constraints, and cost functions while providing the solver with detailed problem structure.
+Because \texttt{ExaModels.jl} builds on the \texttt{NLPModels.jl} abstraction \cite{Orban_NLPModels_jl_Data_Structures_2023}, integrating support was straightforward.
+JuMP models can also be handled via \texttt{NLPModelsJuMP.jl} \cite{montoison-orban-siquiera-nlpmodelsjump-2020}, which wraps JuMP problems to expose the \texttt{NLPModels.jl} interface.
+However, this backend is currently limited to CPU execution, highlighting \texttt{ExaModels.jl} as the preferred alternative for high-performance, GPU-enabled optimal control.
+Regarding constraints, \texttt{ExaModels.jl} uses \emph{generators} in the form of \verb+for+-loop-like statements to model the SIMD abstraction, ensuring that the function at the heart of the statement is mapped to a \emph{kernel} (this is where \texttt{KernelAbstractions.jl} comes into play) and efficiently evaluated by the solver.
+All in all, the process is merely a compilation from the \texttt{OptimalControl.jl} DSL, well suited for mathematical control abstractions, into the \texttt{ExaModels.jl} DSL, tailored to describe optimization problems with strong SIMD potential.
+As explained in Section~\ref{s3}, this is indeed the case for discretizations of optimal control problems.
 This transcription process is mostly parametrized by the numerical scheme used to discretize the ODE.
 %
 A very important outcome of having a DSL for \texttt{ExaModels.jl} models is that the package can automatically differentiate the mathematical expressions involved.
-Automatic differentiation (AD) is essential for modern second-order nonlinear solvers, such as IPOPT and \texttt{MadNLP.jl}, which rely on first- and second-order derivatives.
+Automatic differentiation (AD) is essential for modern second-order nonlinear solvers, such as Ipopt and \texttt{MadNLP.jl}, which rely on first- and second-order derivatives.
 
-Let us take a brief look at the generated code for this simple example. The code is wrapped in a function whose parameters capture the key aspects of the transcription process: the numerical scheme (here trapezoidal), the grid size (here uniform), the backend (CPU or GPU), the initial values for variables, states, and controls (defaulting to nonzero constants across the grid), and the base precision for vectors (defaulting to 64-bit floating point):
+Let us take a brief look at the generated code for this simple example.
+The code is wrapped in a function whose parameters capture the key aspects of the transcription process: the numerical scheme (here trapezoidal), the grid size (here uniform), the backend (CPU or GPU), the initial values for variables, states, and controls (defaulting to nonzero constants across the grid), and the base precision for vectors (defaulting to 64-bit floating point):
 
 {\small
 \begin{minted}{julia}
 function def(; scheme=:trapeze, grid_size=250,
 backend=CPU(), init=(0.1, 0.1, 0.1),
-base_type = Float64)
+base_type=Float64)
 \end{minted}
 }
 
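A hedged usage sketch of the generated entry point whose signature appears above (its body is elided in this excerpt; the call below assumes def returns the ExaModels.jl model, and the solver/backend calls are illustrative and version-dependent):

# Illustrative only: calling the generated function `def` and solving the
# resulting ExaModels.jl model on an NVIDIA GPU.
using CUDA, MadNLP, MadNLPGPU

nlp = def(; scheme = :trapeze, grid_size = 1000,
            backend = CUDABackend(), base_type = Float64)
sol = madnlp(nlp)        # interior-point solve with GPU sparse factorizations (cuDSS)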