main.tex: 33 additions & 18 deletions
@@ -107,7 +107,7 @@
% Define a switch for double-blind mode
\newif\ifblind
%\blindtrue % set \blindfalse for non-blind mode
- \blindfalse
+ \blindtrue
% PDF metadata
\ifblind
@@ -205,13 +205,15 @@
\section{Introduction}
Solving large-scale nonlinear optimal control problems is computationally demanding, especially with fine discretizations or real-time requirements.
- While GPUs offer massive parallelism well-suited to these problems, fully exploiting their potential remains challenging due to the complexity of modeling, differentiation, and solver integration.
+ While GPUs offer massive parallelism well-suited to these problems, fully exploiting their potential remains challenging due to the complexity of modeling, automatic differentiation, and solver integration.
%
We present a fully GPU-accelerated workflow, entirely built in Julia~\cite{bezanson2017julia}.
Continuous-time dynamics are discretized with \texttt{OptimalControl.jl}~\cite{OC_jl} into structured, sparse nonlinear programs.
These are compiled with \texttt{ExaModels.jl}~\cite{shin2024accelerating} into GPU kernels that preserve sparsity and compute derivatives in a single pass, enabling efficient SIMD parallelism.
%
Problems are solved on NVIDIA GPUs using the interior-point solver \texttt{MadNLP.jl}~\cite{shin2021graph} and the sparse linear solver \texttt{CUDSS.jl}~\cite{Montoison_CUDSS_jl_Julia_interface}, enabling end-to-end acceleration from modeling to solving.
+ We focus on NVIDIA hardware because efficient sparse KKT factorization is currently available only through cuDSS.
+ Although our framework already runs on AMD and Intel GPUs, these backends rely on dense linear algebra for KKT solves and therefore serve mainly as proof-of-concept implementations.
%
We demonstrate the performance of this approach on benchmark problems solved on NVIDIA A100 and H100 GPUs.
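To give a concrete flavour of this workflow, the sketch below models a small double-integrator, energy-minimization problem with the \texttt{OptimalControl.jl} DSL and solves its direct transcription. The dynamics, boundary conditions, grid size, and solver choices shown here are illustrative assumptions in the spirit of the package documentation, not the benchmark setup of the paper.

\begin{verbatim}
using OptimalControl
using NLPModelsIpopt   # default CPU NLP solver used in the documentation examples

# Illustrative model (hypothetical dynamics and boundary conditions).
ocp = @def begin
    t ∈ [0, 1],  time
    x ∈ R²,      state
    u ∈ R,       control
    x(0) == [-1, 0]
    x(1) == [0, 0]
    ẋ(t) == [x₂(t), u(t)]
    ∫( 0.5u(t)^2 ) → min
end

# Direct transcription and solve on a uniform grid; swapping in the
# GPU-enabled ExaModels.jl / MadNLP.jl backends described in this paper is
# assumed to go through the package's solver-selection interface.
sol = solve(ocp; grid_size = 200)
\end{verbatim}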
@@ -250,27 +252,28 @@ \section{Background and limitations}
Second-order methods, such as interior-point solvers, exploit this structure. % for efficient problem solution.
%
Most existing optimal control toolchains target CPU execution.
- For example, CasADi~\cite{Andersson2019} constructs symbolic expressions evaluated just-in-time or exported as C code, typically solved by CPU solvers like IPOPT~\cite{wachter2006implementation} or KNITRO~\cite{byrd2006k}, which rely on CPU linear solvers such as PARDISO~\cite{schenk2004solving}, MUMPS~\cite{amestoy2000mumps}, or HSL~\cite{fowkes2024libhsl}.
+ For example, CasADi~\cite{Andersson2019} constructs symbolic expressions evaluated just-in-time or exported as C code, typically solved by CPU solvers like Ipopt~\cite{wachter2006implementation}, Uno~\cite{VanaretLeyffer2024}, or KNITRO~\cite{byrd2006k}, which rely on CPU linear solvers such as PARDISO~\cite{schenk2004solving}, MUMPS~\cite{amestoy2000mumps}, or HSL~\cite{fowkes2024libhsl}.
%
Other frameworks, such as ACADO~\cite{houska2011acado} and \texttt{InfiniteOpt.jl}~\cite{pulsipher2022unifying}, which cleverly leverage the modeling power of JuMP~\cite{dunning2017jump}, also follow the same CPU-centric paradigm.
%
This CPU focus limits scalability and real-time performance for large or time-critical problems that could benefit from GPU parallelism.
- While some libraries provide GPU-accelerated components, none deliver a fully integrated, GPU-native workflow for nonlinear optimal control. (See, nonetheless, the nice attempt \cite{jeon2024} trying to combine the CasADi API with PyTorch so as to evaluate part of the generated code on GPU.)
+ While some libraries provide GPU-accelerated components, none deliver a fully integrated, GPU-native workflow for nonlinear optimal control.
+ See, nonetheless, the notable attempt of \cite{jeon2024} to combine the CasADi API with PyTorch so as to evaluate part of the generated code on GPU.
%
Our work fills this gap with a GPU-first toolchain that unifies modeling, differentiation, and solver execution, addressing the challenges of solving large-scale sparse NLPs.
\section{SIMD parallelism in direct optimal control} \label{s3}
When discretized by \emph{direct transcription}, optimal control problems (OCPs) possess an inherent structure that naturally supports SIMD parallelism.
Consider indeed an optimal control problem with state $x(t) \in\mathbf{R}^n$ and control $u(t) \in\mathbf{R}^m$. Assume that the dynamics is modeled by the ODE
$$\dot{x}(t) = f(x(t), u(t)), $$
- where $f : \mathbf{R}^n \times\mathbf{R}^m \to\mathbf{R}^n$ is a smooth function. Using a one-step numerical scheme to discretise this ODE on a time grid $t_0, t_1, \dots, t_N$ of size $N + 1$ results in a set of equality constraints. For instance, with a forward Euler scheme, denoting $h_i := t_{i+1} - t_i$, one has ($X_i \simeq x(t_i)$, $U_i \simeq u(t_i)$)
+ where $f : \mathbf{R}^n \times\mathbf{R}^m \to\mathbf{R}^n$ is a smooth function. Using a one-step numerical scheme to discretise this ODE on a time grid $t_0, t_1, \dots, t_N$ of size $N + 1$ results in a set of equality constraints. For instance, with a forward Euler scheme, denoting $h_i := t_{i+1} - t_i$, $X_i \simeq x(t_i)$, $U_i \simeq u(t_i)$, we have
It allows users to exploit GPUs without requiring any knowledge of GPU programming.
For instance, \texttt{ExaModels.jl} builds on \texttt{KernelAbstractions.jl} to automatically generate specialized GPU kernels for parallel evaluation of ODE residuals, Jacobians, and Hessians needed in optimal control problems.
%
- We build on this ecosystem to create a complete GPU-accelerated toolchain spanning modeling, differentiation, and solving.
+ We build on this ecosystem to create a complete GPU-accelerated toolchain spanning modeling, automatic differentiation, and solving.
This results in a fully Julia-native workflow for modeling and solving ODE-constrained optimal control problems on GPU.
\item[--] \texttt{OptimalControl.jl}: a domain-specific language for symbolic specification of OCPs, supporting both direct and indirect formulations.
\item[--] \texttt{ExaModels.jl}: takes the discretized OCPs and produces sparse, SIMD-aware representations that preserve parallelism across grid points, compiling model expressions and their derivatives into optimized CPU/GPU code.
\item[--] \texttt{MadNLP.jl}: a nonlinear programming solver implementing a filter line-search interior-point method, with GPU-accelerated linear algebra support.
- \item[--] \texttt{CUDSS.jl}: a Julia wrapper around NVIDIA’s \texttt{cuDSS} sparse solver, enabling GPU-based sparse matrix factorizations essential for interior-point methods.
+ \item[--] \texttt{CUDSS.jl}: a Julia wrapper around NVIDIA's \texttt{cuDSS} sparse solver, enabling GPU-based sparse matrix factorizations essential for interior-point methods.
\end{itemize}
\noindent Together, these components form a high-level, performant stack that compiles intuitive Julia OCP models into efficient GPU code, achieving substantial speed-ups while maintaining usability.
\item[--] \textbf{Portability}: symbolic modeling and kernel generation are backend-agnostic; the current limitation lies in sparse linear solvers, which are still CUDA-specific, but the framework is designed to integrate alternative backends as they become available.
\end{itemize}
+ One common concern in GPU-accelerated workflows is the overhead of just-in-time (JIT) compilation.
+ In our stack, this overhead is minimal because we precompile all performance-critical code whenever possible, using Julia 1.12 and tools like \texttt{PrecompileTools.jl}, thereby reducing warm-up latency.
+ Only a small fraction of code that depends on dynamic types or runtime-specialized kernels is compiled just-in-time; this JIT overhead is negligible in practice due to the highly regular structure of GPU kernels generated from transcribed ODE constraints.
+ The GPU ecosystem is also moving toward runtime PTX / JIT compilation (for example, NVIDIA CUDA 13.0 deprecates offline compilation for older architectures), which aligns with this strategy.
+ Combined, these measures allow our workflow to efficiently exploit GPUs for large-scale or real-time optimal control problems.
+
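As a rough illustration of the precompilation strategy described above, the snippet below uses the standard workload macros of \texttt{PrecompileTools.jl}; the grid size and the helper function inside the workload are hypothetical placeholders standing in for the real hot path.

\begin{verbatim}
using PrecompileTools

@setup_workload begin
    # Build a deliberately tiny instance; it exists only to drive compilation.
    N = 10                            # hypothetical small grid size
    @compile_workload begin
        # Code executed here is compiled when the package is precompiled,
        # so the first real solve pays little JIT latency.
        transcribe_and_solve_demo(N)  # hypothetical helper for the hot path
    end
end
\end{verbatim}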
\section{From optimal control models to SIMD abstraction}
To illustrate the transcription from the infinite-dimensional setting to a discretized optimization problem suited for SIMD parallelism, consider the following elementary optimal control problem with a state function, $x(t)$, valued in $\mathbf{R}^2$, and a scalar control, $u(t)$: minimize the (squared) $L^2$-norm of the control over the fixed time interval $[0,1]$,
@@ -366,24 +375,30 @@ \section{From optimal control models to SIMD abstraction}
The initial and final times are fixed in this case but they could be additional unknowns (see Appendix \ref{sa1}, where the Goddard benchmark problem is modeled with a free final time). Users can also declare additional finite-dimensional parameters (or \emph{variables}) to be optimized. Furthermore, extra constraints on the state, control, or other quantities can be imposed as needed.
At this stage the crux is to seamlessly parse the abstract problem description and compile it on the fly into a discretized nonlinear optimization problem.
We achieve this by exploiting two features.
- First, the DSL syntax is fully compatible with standard Julia, allowing us to use the language’s built-in lexical and syntactic parsers.
- Second, pattern matching via \texttt{MLStyle.jl} \cite{MLStyle_jl} extends Julia’s syntax with additional keywords such as \verb+state+ for declaring state variables, and implements the semantic pass that generates the corresponding discretized code.
- This discretized code can now be an \texttt{ExaModels.jl} model (while previously only modeling with ADNLPModels.jl was available, we have added the syntactic and semantic passes to generate ExaModels.jl models from the abstract description), which allows to declare
- optimization variables (finite dimensional vector or arrays), constraints and cost.
- Regarding constraints, \texttt{ExaModels.jl} uses \emph{generators} in the form of \verb+for+ loop like statements to model the SIMD abstraction, ensuring that the function
- at the heart of the statement is mapped towards a \emph{kernel} (this is where \texttt{KernelAbstractions.jl} comes into play) and efficiently evaluated by the solver. All in all, the process merely is a compilation from \texttt{OptimalControl.jl} DSL, well suited for mathematical control abstractions, into \texttt{ExaModels.jl} DSL, tailored to describe optimization problems with strong SIMD potentialities. (As explained in Section~\ref{s3}, this is indeed the case for discretizations of optimal control problems.)
+ First, the DSL syntax is fully compatible with standard Julia, allowing us to use the language's built-in lexical and syntactic parsers.
+ Second, pattern matching via \texttt{MLStyle.jl} \cite{MLStyle_jl} extends Julia's syntax with additional keywords such as \verb+state+ for declaring state variables, and implements the semantic pass that generates the corresponding discretized code.
+ This discretized code can now be represented as an \texttt{ExaModels.jl} model.
+ Previously, only the generic \texttt{ADNLPModels.jl} \cite{montoison-migot-orban-siqueira-2021} was available, which requires more user effort and provides limited guidance for automatic differentiation, making it harder to generate highly efficient code on GPU.
+ We have therefore added syntactic and semantic passes to generate \texttt{ExaModels.jl} models directly from the abstract problem description, allowing users to declare optimization variables (finite-dimensional arrays), constraints, and cost functions while providing the solver with detailed problem structure.
+ Because \texttt{ExaModels.jl} builds on the \texttt{NLPModels.jl} abstraction \cite{Orban_NLPModels_jl_Data_Structures_2023}, integrating support was straightforward.
+ JuMP models can also be handled via \texttt{NLPModelsJuMP.jl} \cite{montoison-orban-siquiera-nlpmodelsjump-2020}, which wraps JuMP problems to expose the \texttt{NLPModels.jl} interface.
+ However, this backend is currently limited to CPU execution, highlighting \texttt{ExaModels.jl} as the preferred alternative for high-performance, GPU-enabled optimal control.
+ Regarding constraints, \texttt{ExaModels.jl} uses \emph{generators} in the form of \verb+for+-loop-like statements to model the SIMD abstraction, ensuring that the function at the heart of the statement is mapped to a \emph{kernel} (this is where \texttt{KernelAbstractions.jl} comes into play) and efficiently evaluated by the solver.
+ All in all, the process is merely a compilation from the \texttt{OptimalControl.jl} DSL, well suited for mathematical control abstractions, into the \texttt{ExaModels.jl} DSL, tailored to describe optimization problems with strong SIMD potential.
+ As explained in Section~\ref{s3}, this is indeed the case for discretizations of optimal control problems.
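To make the generator-based SIMD abstraction tangible, here is a minimal hand-written sketch of what such a discretized model can look like in \texttt{ExaModels.jl}: a forward Euler transcription of a double-integrator-like toy problem. The backend keyword and solver defaults are assumptions based on the packages' documented interfaces, not the code emitted by \texttt{OptimalControl.jl}.

\begin{verbatim}
using ExaModels, CUDA, MadNLP, MadNLPGPU

N = 1000                     # number of time steps (toy value)
h = 1.0 / N                  # uniform step size
core = ExaCore(; backend = CUDABackend())   # assumed keyword for GPU data placement

x = variable(core, 2, N + 1; start = 0.1)   # states X_i, i = 0..N
u = variable(core, N + 1;    start = 0.1)   # controls U_i

# Each generator below describes one family of constraints; ExaModels.jl maps
# the expression at its heart to a kernel evaluated in parallel over i.
constraint(core, x[1, i+1] - x[1, i] - h * x[2, i] for i = 1:N)   # Euler step, position
constraint(core, x[2, i+1] - x[2, i] - h * u[i]    for i = 1:N)   # Euler step, velocity
objective(core,  0.5 * h * u[i]^2 for i = 1:N)                    # discretized L2 cost

model  = ExaModel(core)      # sparse, SIMD-aware NLP living on the GPU
result = madnlp(model)       # interior-point solve; cuDSS factorizes the KKT systems
\end{verbatim}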
This transcription process is mostly parametrized by the numerical scheme used to discretize the ODE.
%
A very important outcome of having a DSL for \texttt{ExaModels.jl} models is the ability for the package to automatically differentiate the mathematical expressions involved.
- Automatic differentiation (AD) is essential for modern second-order nonlinear solvers, such as IPOPT and \texttt{MadNLP.jl}, which rely on first- and second-order derivatives.
+ Automatic differentiation (AD) is essential for modern second-order nonlinear solvers, such as Ipopt and \texttt{MadNLP.jl}, which rely on first- and second-order derivatives.
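Concretely, the derivatives produced by this AD machinery are exposed through the generic \texttt{NLPModels.jl} interface that \texttt{MadNLP.jl} consumes; a brief sketch, for a model \texttt{model} built as in the generator example above, is shown below.

\begin{verbatim}
using NLPModels

x0 = copy(model.meta.x0)             # default primal starting point
fx = obj(model, x0)                  # objective value
gx = grad(model, x0)                 # objective gradient
jrows, jcols = jac_structure(model)  # sparsity pattern of the constraint Jacobian
jvals = jac_coord(model, x0)         # Jacobian nonzeros, evaluated via the generated kernels
# Second-order information is available analogously via hess_structure / hess_coord.
\end{verbatim}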
- Let us take a brief look at the generated code for this simple example. The code is wrapped in a function whose parameters capture the key aspects of the transcription process: the numerical scheme (here trapezoidal), the grid size (here uniform), the backend (CPU or GPU), the initial values for variables, states, and controls (defaulting to nonzero constants across the grid), and the base precision for vectors (defaulting to 64-bit floating point):
+ Let us take a brief look at the generated code for this simple example.
+ The code is wrapped in a function whose parameters capture the key aspects of the transcription process: the numerical scheme (here trapezoidal), the grid size (here uniform), the backend (CPU or GPU), the initial values for variables, states, and controls (defaulting to nonzero constants across the grid), and the base precision for vectors (defaulting to 64-bit floating point):