%&v8_handles_paper_preamble
\endofdump
\begin{document}
\begin{abstract}
Some programming language Virtual Machines (VMs) use \emph{handles}, that is
wrappers around pointers to heap objects, to allow the root set of
garbage collected objects to be precisely enumerated. Other VMs use
\emph{direct references} to heap objects, requiring the C/C++ call stack to
be conservatively scanned for values that appear to be pointers to heap
objects. The performance trade-offs of these two approaches are unclear:
in particular, do direct references gain more performance by removing
a level of indirection than they pay through conservative stack scanning
(including the costs of checking whether arbitrary words point
to heap objects)? In this paper we study the performance difference
between these two approaches by migrating around half of V8 from handles to
direct references: we show that, even though our partial migration imposes
additional costs and our conservative stack scanner is naive, direct
references are at least as fast as handles on the \speedometer benchmark
suite.
\end{abstract}
\maketitle
\section{Introduction}
A major challenge for Garbage Collection (GC) is finding all the \emph{roots},
that is, the starting set of references into the object graph. Ideally, GCs
would be able to precisely enumerate the root set. However, current languages
and compilers make this difficult, or impossible, as pointers
to objects can be `lost' on the call stack: to compensate, the call stack
must be \emph{conservatively scanned}, looking for pointers to objects on
the heap. This over-approximates the true root set as some non-pointers
may have a value which also happens to be a pointer to an object.
To avoid conservative scanning, one can use \emph{handles}, that
is wrappers around object references, such that the handles allow \emph{precise}
enumeration of all roots.
\label{comprehensive_handles}
In this paper, we consider this long-standing implementation and design trade-off
in the context of programming language Virtual Machines (VMs). Some VMs
(e.g.~HotSpot~\cite{hotspot07handles},
SpiderMonkey~\cite{spidermonkey18rooting}, V8) use handles; some
(e.g.~JavaScriptCore~\cite{pizlo17riptide} and
those VMs using the Boehm-Demers-Weiser GC) use \emph{direct references}
(i.e.~`normal pointers'), and conservative scanning (including the stack,
but sometimes heap objects as well).
There is little debate about ergonomics: handles are more difficult to use, and some kinds of misuse
lead to security flaws. There is also now good evidence
that conservative scanning of direct references tends to only slightly
over-approximate the root set~\cite{shahriyar14fast}. However, there is no
clear evidence for the run-time costs of handles relative to direct references
and conservative stack scanning. The nearest work we know of is that of~\cite{kalibera11handles}
which compares handles with forwarding pointers resulting from concurrent compaction,
but forwarding pointers still impose a level of indirection over direct references.
Without an understanding of the performance costs of handles and direct
references, it is difficult to
state with confidence whether direct references or handles are the better trade-off.
In this paper we examine the widely used JavaScript VM V8, which has
been designed from its inception to use handles. Our hypothesis is
that moving from handles to direct references (and thus also conservative scanning) will improve V8's performance.
However, moving wholesale from handles to direct references is impractical on a
codebase of V8's size (approximately 1.75MLoC).
Our first challenge was thus
to find a practical means for gradually migrating parts of V8 from handles to
direct references (Section~\ref{sec:migration}). Having then migrated approximately
half of V8 from handles to direct references,
and implemented a naive conservative stack
scanner, we then run the well known \speedometer benchmark suite,
and find that direct references are at least as fast as handles
(\cref{sec:evaluation}). Since our partial migration includes costs that would
not apply to a complete migration, this establishes a lower bound for direct references
performance. This suggests that a complete migration of V8 to direct references, and a more
sophisticated conservative stack scanner, will improve V8's performance
while reducing a class of security flaw.
\section{Background}
\label{sec:background}
In this section we provide a brief overview of direct references and handles in
the context of VMs. To help make this concrete, we do so for a
hypothetical VM, written in C, that implements an object-oriented language.
In this VM, every language-level object is an instance of a \texttt{struct} called
\texttt{Obj}, which we will assume has a single field \texttt{data} of type
\texttt{int}. We assume that the VM contains a GC that runs in a thread
concurrently with other VM threads, and thus the GC can free the memory of any
unreachable \texttt{Obj} at any point.
\subsection{Direct References}
The most `natural' way to program our hypothetical VM is to use direct
references. We assume that the GC exposes a single function \texttt{Obj
*gc\_new\_obj()}, which returns a pointer to a block of memory with a freshly
initialised \texttt{Obj}. When we want to create a new object, we call
\texttt{gc\_new\_obj}, assign the result to a local variable, and then operate
upon its \texttt{data} field:
\begin{lstlisting}[language=C]
Obj *n = gc_new_obj();
n->data = 1;
\end{lstlisting}
As this example suggests, `direct references' is our term for
what many would see as the `normal' way of programming in C. Unlike most
normal C programming, however, our VM must consider the concurrent GC:
it can run immediately after line 1 has executed, but
before line 2 has executed. If the GC were not to notice that \texttt{n}
is pointing to a live \texttt{Obj}, it would free that \texttt{Obj}'s memory,
causing the program to misbehave (e.g.~the program might segfault or overwrite
a portion of memory now used in another thread). In other words,
at that point in the program, \texttt{n} is a \emph{root}, that
is an entry point into the graph of objects created by the VM. For
our purposes, we can consider the root set to be all references to
objects stored in C-level variables, bearing
in mind that at run-time such variables may be stored as part of a
function frame on the C call stack\footnote{There are
other hiding places for such references, ranging from thread-locals to
registers to intermediate variables: for introductory purposes, these can be considered minor
variations on the C call stack problem.}.
The ideal, and universal, solution would be for C's language specification to
allow \emph{introspection} of a function's frame layout, telling it,
for example, where each local variable lives within a frame. For each
frame on the stack, the GC could then read the value of only those variables whose
compile-time type is \texttt{Obj*}, considering the objects they reference as roots.
Local variables of other types, which are not
of interest to the GC, would be ignored. This would allow the GC to \emph{precisely}
enumerate the GC roots (i.e.~to include all live objects as root while not
including any non-live objects as roots).
Unfortunately, no mainstream systems language defines an introspection
capability.
Some compilers instead offer the concept of \emph{stackmaps}, where
the compiler defines a way for a program to introspect the call stack at
specific locations. Stackmaps offer the same functionality as language-defined
introspection, but do so in a way that is not portable between compilers.
Unfortunately, stackmaps are not yet widely available. GCC does not
have any support for stackmaps. LLVM does have some stackmap support,
but (as some of this paper's authors have discovered in other work), it
is experimental, incomplete, and unsupported, with particular problems at
higher optimisation levels.
In practice, therefore, VMs using direct references have to turn to
\emph{conservative stack scanning}, where the C call stack is exhaustively examined for
possible pointers to instances of \texttt{Obj}. This requires the GC to
know at what address a thread's stack began, and what address the stack is
currently at. Each aligned word on the stack is then checked to see whether
it points to an instance of \texttt{Obj}: if it does, that \texttt{Obj} is
considered a root. Depending on the GC, checking whether an arbitrary word is a
pointer to an \texttt{Obj} can be fairly expensive, particularly if the GC
requires interior pointers to objects be translated to base pointers --- we
consider these costs inherent to conservative stack scanning.
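In the context of our hypothetical VM, the core of such a scanner can be sketched as
follows. This is a deliberately naive illustration: we assume a single
contiguous heap, and all names are ours rather than those of any real GC.
\begin{lstlisting}[language=C++]
#include <stdint.h>
#include <vector>

// Assumed heap bounds; a real GC's membership test (and any
// interior-to-base pointer translation) is considerably more involved.
static uintptr_t heap_start = 0, heap_end = 0;

static bool looks_like_obj(uintptr_t word) {
  return word >= heap_start && word < heap_end;
}

// Walk every aligned word from the current stack position to the stack
// base, treating anything that falls inside the heap as a possible root.
std::vector<uintptr_t> conservative_scan(const uintptr_t *top,
                                         const uintptr_t *base) {
  std::vector<uintptr_t> roots;
  for (const uintptr_t *p = top; p < base; ++p)
    if (looks_like_obj(*p)) roots.push_back(*p);
  return roots;
}
\end{lstlisting}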
Conservative stack scanning inherently
over-approximates the root set, because random words on the stack may
accidentally point to an \texttt{Obj}, keeping it alive longer than necessary.
Fortunately, in practice relatively few objects tend to be kept alive unnecessarily long:
the most extensive study we know of suggests the false detection rate
in Java programs is under 0.01\% of live objects~\cite{shahriyar14fast}.
Objects identified as live via conservative scanning must also be pinned in
memory, as the GC cannot know whether pointers to those objects on the stack
are `genuine' or `accidental'. Fully conservative GCs, that is GCs that conservatively scan all parts
of memory, from the call stack to the heap (e.g.~the Boehm-Demers-Weiser
GC~\cite{boehm88garbage}) are thus unable to move any objects in
memory. However, many GCs are semi-conservative, that is most references to
objects are determined precisely, with only roots on the stack discovered
conservatively.
Conservative scanning occupies an odd position in software. Technically speaking, the
way it works violates the rules of most languages, most compilers, and most
operating systems. Furthermore, some programming techniques
(e.g.~XOR lists) thoroughly defeat it.
However, because conservative scanning is widely used (e.g.~the well-known
Boehm-Demers-Weiser GC and WebKit~\cite{pizlo17riptide}),
it is well supported in practice.
\subsection{Handles}
\label{handles_general}
An alternative approach to direct references and conservative scanning
has become known as handles. These were first introduced in
recognisable form by~\cite{brooks84trading}, though the term `handles' was only
used later. Modern VMs use handles in two distinct ways: in this section
we will consider \emph{comprehensive handles}, as found in e.g.~\cite{kalibera11handles}
(in Section~\ref{handles_in_v8} we will consider the other kind of handles).
Comprehensive handles add a level of indirection to all object references, with
the indirection being the `handle'. Handles are stored at a fixed point in
memory, with one handle per object. In the context of our hypothetical
VM, this means that we never store references to \texttt{Obj} directly, instead
storing references to a \texttt{Handle} struct. When the VM wants to access an
\texttt{Obj} it must \emph{tag} the corresponding \texttt{Handle}; this informs
the GC that the \texttt{Handle} is a root. When the VM is finished with the
\texttt{Obj} it must untag the corresponding \texttt{Handle}. We maintain a tag
count for each \texttt{Handle}, as it may be tagged multiple times before being
untagged: when the tag count goes to zero, the \texttt{Handle} is no longer a
root.
Our hypothetical GC needs three altered/new functions: \texttt{Handle
*gc\_new\_obj()} returns a pointer to a handle, where the handle points to a
freshly initialised \texttt{Obj}; \texttt{Obj *tag(Handle *)} which increments
a handle's tag count by one and returns a pointer to the underlying
\texttt{Obj}; and \texttt{void untag(Handle *)} which decrements the handle's
tag count by one. An example of this API in use looks like:
\begin{lstlisting}[language=C]
Handle *h = gc_new_obj();
Obj *o = tag(h);
o->data = 1;
untag(h);
\end{lstlisting}
Handles have important advantages. First, tagging and untagging allows
the GC to precisely determine the root set by iterating over all handles
and considering as a root any handle with a tag count greater than 0.
Second, handles are fully portable, do not trifle with undefined behaviour,
and require no explicit language or compiler support.
Third, any \texttt{Handle} with a zero tag count can have its
underlying \texttt{Obj} safely moved. In other words, handles
make it trivial to write precise, moving GCs. Comprehensive handles also
have the virtue that moving an \texttt{Obj} requires only updating its
corresponding \texttt{Handle}.
There are however disadvantages. First, any handle API is easy to misuse:
forgetting to \texttt{tag} a handle or \texttt{untag}ing a handle twice leads
to undefined behaviour. Finding such API misuse is notoriously hard, and VMs
such as V8 have developed home-grown linters that scan for obvious API misuse,
though none that we know of can catch all possible kinds of API misuse. Second,
handles' double level of indirection implies at least some performance loss
relative to the single indirection of direct references. Third, handles have
additional run-time costs beyond those of double indirection: each handle
consumes memory; since handles cannot move, they can cause memory
fragmentation; and tagging / untagging a handle requires memory reads and writes.
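A minimal sketch of how this API might be implemented is as follows; the
struct layout is assumed purely for illustration.
\begin{lstlisting}[language=C++]
struct Obj { int data; };

struct Handle {
  Obj *obj;           // the single, fixed indirection to the heap object
  unsigned tag_count; // non-zero => the GC treats this handle as a root
};

// Increment the tag count and hand back the underlying object.
Obj *tag(Handle *h) { h->tag_count++; return h->obj; }

// Decrement the tag count; at zero the handle is no longer a root and
// the GC is free to move (or collect) the underlying object.
void untag(Handle *h) { h->tag_count--; }
\end{lstlisting}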
\section{Handles in V8}
V8 is a widely used JavaScript and WebAssembly VM, embedded in Chromium-based browsers (e.g.~Chrome)
and other well known systems (e.g.~NodeJS).
In this section we introduce some general V8 background, before describing V8's use of handles.
V8 is largely written in C++.
A single V8 process can run many unrelated programs by encapsulating them in \emph{isolates}, each of which has a private heap.
Each heap is separated into young and old generations.
Young generation-only collection uses a semi-space strategy that is a variant of Halstead's approach~\cite{halstaed} (though V8 uses dynamically-sized lists of segments to allow for work stealing).
A young and old generation collection uses a mark-sweep-compact strategy
where marking and sweeping are performed concurrently~\cite{degenbaev18cross},
and compaction occurs during idle time~\cite{degenbaev16idle}.
Collection can also be performed during idle time to reduce latency from the user's
perspective.
\label{handles_in_v8}
V8 uses handles to determine the root set of objects. Unlike the simple
VM described in \cref{handles_general}, V8 uses \emph{partial handles}:
that is, it uses direct references in those places where it can precisely track
them (e.g.~object slots), and handles in those places where it would otherwise
lose track of roots (e.g.~stack references). Although V8 has various kinds
of handles (e.g.~root handles, persistent handles, eternals, and traced
reference handles~\cite{v8handledocs}), most of these have very narrow, specific uses\footnote{There are also a few parts of V8 where handles are forbidden and GC
is prohibited.}: we focus in
this paper exclusively on `normal' handles.
Comprehensive handles as described in \cref{handles_general} persist
throughout a VM's lifetime, with each heap object having a single handle
pointing to it. In contrast, V8's partial handles are temporary, being
regularly created and destroyed: any object may, both at any given point
and over its lifetime, have
multiple handles pointing to it. This has a significant impact on the way
handles are used within V8.
In essence, when V8's C++ code wishes to operate on a JavaScript object, it
must create a handle to it, destroying that handle when it is finished. This
can be thought of as a kind of `lock': the handle guarantees that the object
is kept alive while C++ code works upon the object, without having to worry that
pointers to the object will be `lost' on the C++ call stack. As with other VMs
using this style of partial handles, V8 faces two challenges: how to make
partial handles ergonomic for the programmer; and how to make their regular
creation and destruction fast.
The C++ \lstinline{Handle} class is used pervasively throughout V8: most
functions that operate on heap objects either take or return \lstinline{Handle}
instances. The \lstinline{Handle} class exposes a pointer-like API (e.g.~the
dereferencing `\texttt{*}' operator on a handle returns an \lstinline{Object})
that makes it relatively transparent in use. A \lstinline{Handle} instance
contains a pointer to the underlying handle but no additional storage
(i.e.~\texttt{sizeof(Handle<T>) == sizeof(T*)}), giving programmers
confidence about handle storage and moving costs.
A challenge with partial handles is knowing when they are no longer needed. To
help with this, V8 uses \emph{handle scopes}, which can be thought of as pools
of handles: newly created handles are automatically placed in the `current'
handle scope; and when a handle scope is no longer needed all of the handles in
it are destroyed. When V8 code creates a new \lstinline{HandleScope} instance,
that handle scope is pushed onto a stack, whose topmost element is the
current handle scope. C++'s RAII
mechanism is used to automatically destroy \lstinline{HandleScope} instances,
which also destroys all the handles they contain.
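The mechanism can be sketched as follows. Names and representation are
assumed: V8's real implementation uses chained blocks of handle slots
rather than a \texttt{std::vector} per scope.
\begin{lstlisting}[language=C++]
#include <vector>

struct HandleScope {
  std::vector<void *> slots;  // handles owned by this scope
  HandleScope() { stack().push_back(this); }
  ~HandleScope() { stack().pop_back(); }  // RAII: frees all owned handles

  // The topmost element of this stack is the `current' handle scope.
  static std::vector<HandleScope *> &stack() {
    static std::vector<HandleScope *> s;
    return s;
  }
};

// Newly created handles register themselves in the current scope.
void new_handle(void *obj) {
  HandleScope::stack().back()->slots.push_back(obj);
}
\end{lstlisting}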
Handle scopes ensure that C++ code dealing with objects remains relatively
terse, and remove several opportunities for programmer mistakes. However, C++
code must carefully ensure that it does not leak a pointer to an object
beyond the lifetime of the handle scope that references that object --- doing
so causes undefined behaviour. V8 has various lints (for example \emph{gcmole})
which search the codebase for possible API misuse and warn when such
instances are found. However, lints cannot identify all
possible misuses: some misuses are later detected in debug builds, but some
misuses end up in release code, where they can become a significant security
concern.
\begin{figure}
\begin{lstlisting}[language=C++,
caption={A simplified example of V8's C++ code, showing the use of handles.
\texttt{main} creates a handle scope (line 2), and then calls \texttt{foo}
which creates a further handle scope (line 7). Two new heap objects are
created, each of which produces a handle (lines 8 and 9), ensuring that both
objects are considered as roots. \texttt{foo} cannot directly return a handle
to its caller, as that handle will be automatically destroyed when the handle
scope is destroyed by RAII at the end of the function call. The
\texttt{CloseAndEscape} method adds a new handle to \texttt{main}'s handle
scope pointing to the \texttt{"a"} string, and passes a reference to that
handle to the caller, allowing the underlying object to safely `escape' the
handle scope it was created in.},
label=handles_example]
void main() {
HandleScope scope(GetIsolate());
Handle<String> str = foo();
}
Handle<String> foo() {
HandleScope scope(GetIsolate());
Handle<String> a = NewString("a");
Handle<String> b = NewString("b");
return scope.CloseAndEscape(a);
}
\end{lstlisting}
\end{figure}
\cref{handles_example} shows a simplified snippet of V8's C++ code,
demonstrating how \lstinline{Handle}s and \lstinline{HandleScope}s interact. In this example, two handle scopes
are created, with the second handle scope wanting to return a handle that
outlives its handle scope. Handle scopes expose a \texttt{CloseAndEscape}
method which allows a handle to be safely moved to the parent handle scope.
Handle scopes are an important performance optimisation, since destroying
a handle scope causes its backing storage containing multiple handles
to be destroyed in one go: neither individual deallocation nor any form
of handle compaction is necessary.
A further optimisation is based on the observation that within a given handle
scope it is common for the same object to be referenced multiple times.
Although it is always correct to create multiple handles to the same object, it
is inefficient, since each handle consumes memory and may force the storage area
pointed to by a handle scope to be enlarged. When a handle for an object is
requested, V8 thus performs a linear scan through the current handle scope to see if an existing
handle to that object is present, only creating a new handle if none exists.
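This lookup can be sketched as follows (a minimal illustration: the names,
and the use of slot indices rather than handle pointers, are ours).
\begin{lstlisting}[language=C++]
#include <stddef.h>
#include <vector>

struct Obj;  // opaque heap object type

struct Scope { std::vector<Obj *> slots; };

// Return the slot of an existing handle to `obj' if the current scope
// already has one; otherwise create a new handle.
size_t get_handle(Scope &scope, Obj *obj) {
  for (size_t i = 0; i < scope.slots.size(); ++i)
    if (scope.slots[i] == obj) return i;  // reuse the existing handle
  scope.slots.push_back(obj);             // none found: create a new one
  return scope.slots.size() - 1;
}
\end{lstlisting}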
Although in most of this paper we consider V8 in isolation, we are particularly
interested in how our changes affect V8 when running browser code. That means
that we also need to consider the interactions between V8 and \emph{Blink},
Chrome's rendering engine. V8 exposes a simplified \lstinline{Handle}-like
class called \lstinline{Local}, which also uses two levels of indirection, to
`external' applications such as Blink. Moving objects between Blink and V8
requires converting to/from \lstinline{Handle}s/\lstinline{Local}s,
though the default configuration optimises these conversions to
a compile-time cast (i.e.~there is no run-time copying of data).
% dump
% \begin{itemize}
% \item There's many different kind of handles in V8. They have two purposes: (a) allow the GC to find those object references and (b) allow the GC to relocate objects.
% \item In V8, handles are an explicit C++ mechanism/construct to maintain this information for the GC
% \item There's root handles (should probably have a different name) that that maintain this information per handle (costly, but flexible as each handle can be created/destroyed individually) -- the paper doesn't care about those
% \item There's handles (the interesting ones) that are bound to handle blocks which are maintained in a chain
% \item The chain itself is dynamic (the list of blocks depends on branches taken at runtime)
% \item The chain is mostly maintained via leveraging C++ lexical scopes and using RAII to add and remove blocks. Used this way, the handle scopes create precise stack maps.
% \item Handles automatically register themselves in the current active scope by allocating one entry in such a block
% \item A handle always refers to an object via the handle block.
% \item The top of the dynamic chain is known to the GC. The top is identified by a single V8 isolate.
% \item This way the GC can (a) find such handles and (b) relocate them
% \item This is much cheaper than the individual root handle version but comes with the restriction to (a) require setting up handle scopes in some stack-like fashion and (b) not using handles that have had their scopoe closed and block removed
% \item Using a handle that has been invalidated (scopes has been closed and thus block removed) is undefined behavior and results in (likely) crashes in debug builds
% \end{itemize}
\section{Gradually Migrating from Handles to Direct References}
\label{sec:migration}
Migrating from handles to direct references sounds like it should be easy, but
on a codebase of V8's size and complexity turns out to be rather hard --- this
paper is the result of our second migration attempt! In this section we explain our general migration
strategy, the challenges any migration strategy faces, and what we have migrated.
\subsection{Migration Strategy}
Simplifying slightly, our first migration attempt tried to move the codebase
wholesale from handles to direct references --- after considerable effort this
failed. This was due to a combination of factors but two are of particular
note. First, many parts of V8 quite reasonably make use of implicit properties
of handles that are not captured by the type system or API: these must be manually inspected and altered to work
correctly with direct references. Second, when we inevitably hit bugs due to
our changes, we found it impractical to work out which of our many changes
caused a particular problem.
We used the
experience from our first failed attempt to devise a scheme that allows us to
gradually migrate small portions of V8 from handles to direct references.
The migration is `hidden' behind a compile-time flag that is false by default,
allowing us to incrementally upstream our changes without affecting mainstream
users.
The most obvious part of our gradual migration approach is to realise that
handles and direct references must be able to co-exist for as long as the
migration takes. We thus introduced a new C++ class which we will refer to in
this paper as \lstinline{DirectRef}\footnote{The actual name of this class is
the confusing, and contradictory, \lstinline{DirectHandle}. Our past selves
have much to apologise for, but they are no longer around to do so.} which
stores a direct reference to objects on the heap while exposing the same
API as \lstinline{Handle}.
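A minimal sketch of such a class is as follows; the interface shown is
assumed for illustration, covering only the pointer-like surface described
above.
\begin{lstlisting}[language=C++]
template <typename T>
class DirectRef {
  T *ptr_;  // a direct reference: one level of indirection to the heap
 public:
  explicit DirectRef(T *p) : ptr_(p) {}
  // The same pointer-like surface API as Handle, so that call sites
  // can be migrated without rewriting them.
  T &operator*() const { return *ptr_; }
  T *operator->() const { return ptr_; }
};
\end{lstlisting}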
In many cases, migrating from handles to direct references simply requires
replacing \lstinline{Handle} with \lstinline{DirectRef}. However, many parts of
V8 make assumptions about handles, normally for performance reasons, that are
not encoded in the \lstinline{Handle} API.
For example, string builders and array backing stores are used extensively by
JavaScript code. When such a value is reallocated in memory, the
handle's intermediate pointer is updated in place, so that the change propagates to
all other handles referring to that value. In contrast, migrating such places to use direct
references means ensuring they are used exclusively so as not to invalidate other
references\footnote{When we started our first migration attempt, we had to
look for challenging parts of the code like this manually. Coincidentally,
V8 has since developed a \lstinline{Handle<T>::PatchValue} API which captures
all the challenging places we know of in a way that makes finding them fairly easy.}.
Similarly, the parts of V8 that enable a transition from C++ code to
JIT-compiled machine code and back are written in a macro-assembler
which is translated into each platform's assembly format. The
macro-assembler does not refer to the \lstinline{Handle}
class directly but generates code which implicitly follows handle conventions.
We had to carefully adjust the macro assembler to follow the state of migration
in C++ code.
Another difficulty was balancing the long-term ergonomics of direct references
with the temporary needs of handles. The chief problem is that the
\lstinline{Handle} class's constructors need a reference to an isolate (to
allocate the actual handle in), but the \lstinline{DirectRef} class's
constructors do not. When moving from a migrated portion of code using
\lstinline{DirectRef} to unmigrated code using \lstinline{Handle}, we thus do
not know which isolate to create the new handle in. We could have required
\lstinline{DirectRef} to store a reference to an isolate, but this would be
awkward and muddy performance comparisons. Instead, we assume that a
\lstinline{Handle} created from a \lstinline{DirectRef} lives in the current
isolate. When this assumption is incorrect (for example, when the JIT compiler
hands compilation off to another thread), we almost always get an immediate
crash. Fixing such a crash can be tedious, as the necessary isolate needs to
be threaded through one or more layers.
\subsection{Performance Challenges While Migrating}
\label{null_check}
Since migrated and unmigrated code frequently interact, moving from code
which uses direct references to code using handles or back again requires
additional computation.
Migrated code that interfaces with unmigrated code must create handles.
This sometimes led to the surprising outcome that migrating a small portion of code created a
significant slowdown; in general, migrating a small additional portion of code
restored performance. In the early stages of migration we frequently encountered
this problem. However, since migration sometimes imposes handle creation costs
on the callers of a function, we were often unsure which callers of a migrated
function were causing performance problems. We eventually developed a simple
`profiler' which pinpoints how often a line of source code creates handles,
allowing us to quickly identify such locations.
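One way to build such a profiler, sketched here purely for illustration
(this is not V8's actual tooling), is to count handle creations per source
location:
\begin{lstlisting}[language=C++]
#include <map>
#include <string>

static std::map<std::string, long> &handle_counts() {
  static std::map<std::string, long> counts;
  return counts;
}

// Called from the (hypothetical) handle-creation path: record one
// creation against the file:line that triggered it.
#define COUNT_HANDLE_CREATION() \
  handle_counts()[std::string(__FILE__) + ":" \
                  + std::to_string(__LINE__)]++
\end{lstlisting}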
When moving from unmigrated code to migrated code we have to take
account of V8's use of \emph{tagged pointers}, which avoids storing small integers as full
objects on the heap. The tagging is done on the `pointer' from a handle to a
heap object: if the top bit is set to 1, this is really a pointer; if the top
bit is set to 0, it is a small integer. We use the same pointer tagging scheme in
\lstinline{DirectRef}. However, when converting from handles to direct
references, we have to treat \emph{null handles} (which are often used
as placeholders) with care. To avoid allocating backing memory, the
`nullness' is represented by a null pointer to the handle itself. In other
words, a null \lstinline{Handle} instance itself contains a null pointer, but
if a small integer is stored, the \lstinline{Handle} points to a handle whose
tagged pointer has a top bit set to 0. We have to flatten this representation
for \lstinline{DirectRef}. Thus, converting
from \lstinline{Handle}s to \lstinline{DirectRef}s requires an extra
compare-and-branch to deal with null handles correctly: though the cost of this
check is small, it imposes a performance cost in the general case.
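To make the two representations of null concrete, here is a minimal sketch of the conversion. It is not V8's actual code: the type names, the choice of the bottom bit as the tag, and the use of 0 as the flattened null value are all illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

using TaggedValue = uintptr_t;  // tag bit 1 => heap pointer, 0 => small integer
using DirectRef = uintptr_t;

struct Handle {
  // A null Handle contains a null pointer; otherwise `location` points at a
  // slot holding a tagged value (a real pointer or a small integer).
  TaggedValue* location;
};

// Small integers are stored shifted left by one, leaving the tag bit clear.
inline TaggedValue MakeSmallInt(intptr_t i) {
  return static_cast<TaggedValue>(i) << 1;
}

// In this sketch we (arbitrarily) flatten a null Handle to the value 0;
// disambiguating this from the small integer zero is a detail we gloss over.
constexpr DirectRef kNullDirectRef = 0;

// Converting a Handle to a DirectRef needs the extra compare-and-branch
// described above, because a null handle has no backing slot to load from.
inline DirectRef ToDirectRef(Handle h) {
  if (h.location == nullptr) {  // null handle: nothing to dereference
    return kNullDirectRef;
  }
  return *h.location;  // otherwise copy the tagged value, whatever it is
}
```

The branch on `h.location` is precisely the check whose cost, though individually small, shows up whenever unmigrated code hands handles to migrated code.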
Since our migration of V8 is partial, these performance costs show up in our
later experiments. A fully migrated V8, however, would show no such costs.
\subsection{What We Migrated}
\label{migrated}
When deciding what parts of V8 to migrate, we wanted to choose those parts
whose migration would give us greatest overall confidence in terms of
correctness and performance. That meant that we chose some of the hardest parts
to migrate, and those that are considered as the most likely performance
bottlenecks. We thus migrated most of the following parts of V8:
\begin{enumerate}
\item The V8 embedder API layer (where V8 interacts with other applications, such as the Blink rendering
engine in Chrome).
\item The interface between C++ code and V8's JIT compiled
JavaScript runtime.
\item JavaScript object definitions and implementations.
\item Intrinsics, builtins, and runtime functions.
\end{enumerate}
\makeatletter
\newcommand*\ExpandableInputTwo[1]{\@@input#1 }
\makeatother
\begin{table*}[t]
\begin{tabular}{lllll}
\toprule
\multirow{2}{*}{Benchmark} & \multicolumn{2}{c}{\firstexp} & \multicolumn{2}{c}{\secondexp} \\
\cmidrule(l){2-3} \cmidrule(l){4-5}
& Handles (ms) & Direct Refs (ms) & Handles (ms) & Direct Refs (ms) \\
\midrule
\ExpandableInputTwo{./table}
\bottomrule
\end{tabular}
\vspace{6pt}
\caption{Results from both our experiments on \speedometer for 30 process executions,
where we report 99\% confidence intervals. \firstexp is a straightforward
comparison between handles and direct references that establishes a lower
bound for direct references performance. This clearly shows that -- even
though our migration is incomplete, incurring costs for moving between
migrated and unmigrated code, and our conservative stack scanner is naive
-- the lower bound for direct references performance is statistically
indistinguishable from handles performance.
\secondexp allows us to largely discount the costs of collection (which
include our naive stack scanner), focussing on mutator performance. This
shows that direct references are \mintotalpdiff faster than handles,
giving us greater confidence that a complete migration of V8, and a
faster conservative stack scanner, will show that direct references are
faster than handles.}
\label{table:results}
\end{table*}
\subsection{Conservative Stack Scanning}
Direct references require conservative stack scanning, so we needed to add
a conservative stack scanner to V8. Rather than writing one fully from scratch,
we were able to borrow some parts of the conservative stack scanner built into
Blink's GC \emph{Oilpan}, specifically the platform specific assembly stubs for
spilling registers during a stop-the-world pause.
Our conservative stack scanner tests each word on the call stack to see
whether values constructed from such words look like pointers to heap objects.
For each value, we first check if the
value points into a page in the heap; if it doesn't, we quickly move on.
If the value does point into a page in the heap, we then check if it points to
an object or empty space. If the value does point to an object, we then query
the isolate's heap layout to find out the object's base pointer. Our scanner
is not yet production ready: compared to Oilpan's scanner, it is relatively
unoptimised.
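The three-step word test above can be sketched as follows. The page and object bookkeeping here is invented purely for the example; V8's real heap metadata is considerably more involved, and the `FakeHeap` names are our own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical heap layout for illustration: pages with recorded object
// extents. Only the shape of the per-word test matches the description above.
struct FakeObject { uintptr_t start; size_t size; };
struct FakePage { uintptr_t start; size_t size; std::vector<FakeObject> objects; };

struct FakeHeap {
  std::vector<FakePage> pages;

  // For a word taken from the call stack: (1) does it point into a heap page?
  // (2) if so, does it point at an object rather than empty space?
  // (3) if so, return the object's base pointer. 0 means `not a heap pointer'.
  uintptr_t FindBasePointer(uintptr_t word) const {
    for (const FakePage& page : pages) {
      if (word < page.start || word >= page.start + page.size) {
        continue;  // quickly move on: not in this page
      }
      for (const FakeObject& obj : page.objects) {
        if (word >= obj.start && word < obj.start + obj.size) {
          return obj.start;  // interior pointer: report the base
        }
      }
      return 0;  // inside a heap page, but pointing at empty space
    }
    return 0;  // does not point into the heap at all
  }
};
```

A production scanner would replace the linear searches with page-aligned address masking and per-page object-start bitmaps; the linear form is what makes our current scanner naive.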
\section{Evaluation}
\label{sec:evaluation}
In order to understand the performance effects of migrating from handles to direct
references, we conducted an experiment on several variants of V8 running the
\speedometer benchmark suite. We first outline our methodology (\cref{methodology})
before examining the results (\cref{results}).
\subsection{Methodology}
\label{methodology}
The overall question we would like our experiment to answer is: are handles
or direct references faster? While we can provide an accurate assessment
of handles performance, we are only able to provide a lower bound of
direct references' performance: around half of V8's (large) codebase remains
unmigrated, and the conservative stack scanner we have implemented is naive,
and thus slower than a production version. We are also forced to use a single
generation heap because the young generation scavenger is a moving semi-space
collector, and work to migrate this to a non-moving generational GC has not yet
been carried out. Fortunately, and despite these restrictions, our experiment
is able to provide valuable insights.
Our first experiment \emph{\firstexp} is a straightforward comparison between handles and direct
references, with the GC running as normal (subject to the restrictions noted
earlier). This implicitly includes the full costs of both handles (creating
handles; destroying handles; an extra level of indirection) and direct
references (conservative stack scanning). We disable heap
compaction when a GC is scheduled and there are frames on the C++ stack, as
the compactor is not yet aware of conservative stack scanning.
Our second experiment \emph{\secondexp} aims to tease out the mutator performance of handles and
direct references from the costs of running the collector -- in other words,
what are the costs of creating and destroying handles (whose memory is
reclaimed manually i.e.~without using the GC) and
having two levels of indirection? Ideally we would turn the collector off
entirely, but that causes V8's heap (which has a 4GiB maximum size) to
overflow. Instead, we altered V8 to prevent it from running `intermediate'
collections, instead waiting until the heap is completely full to run a
collection: in other words, we accept higher overall throughput at the expense
of occasional latency. For our benchmark suite, this corresponds to a single
collection per process execution.
We evaluate each browser variant on the well known \speedometer browser benchmark
suite~\cite{speedometer} using Chrome's Crossbench benchmark runner. \speedometer
consists of 16 widely used JavaScript web frameworks all modeling the same TodoMVC application.
This benchmark suite executes various performance critical pieces of a JavaScript VM and the web rendering engine overall, and is therefore considered one of the most relevant workloads for web browser engine optimizations. %Maybe cite blog posts proving its relevance.
We used Crossbench to run \speedometer
for 30 \emph{process executions}, closing down the browser and starting
a new one between each execution. \speedometer records wall-clock times
using JavaScript's \texttt{performance.now()}, reporting the
results back to Crossbench. We report 99\% confidence intervals for each benchmark.
We use Chrome version 112.0.55572 and V8 version 11.2.28. The benchmarks were
run on a machine with a Xeon E3-1240v6 3.7GHz CPU, 32GB of RAM, running Debian 8.
We disabled turbo boost and hyper-threading in the BIOS and set Linux's CPU
governor to \emph{performance} mode. We checked \texttt{dmesg} after
benchmarking and did not observe any oddities such as the machine overheating.
\subsection{Results}
\label{results}
\cref{table:results} shows the results for experiments \firstexp and \secondexp.
The results for \firstexp show that -- even though we have only partly migrated V8,
there remain costs for moving between migrated and unmigrated code,
and our conservative stack scanner is naive -- handles and direct
references have indistinguishable performance. This establishes a lower bound
for direct references performance. It is possible (indeed, likely) that
completing the migration from handles to direct references, and improving the
performance of the conservative stack scanner, will meaningfully improve upon
this lower bound.
The results for \secondexp show that, by (largely) discounting the costs of collection
-- which include our naive conservative stack scanner -- and looking solely at
mutator costs, direct references are \mintotalpdiff faster than handles. This
gives us greater confidence that the performance results from \firstexp are
indeed a lower bound for, rather than the maximum potential of, direct
references.
\subsection{Further Analysis}
The V8 variants from \secondexp give us the opportunity to better understand
some detailed effects of our partial migration. We used Linux's \texttt{perf}
tool to examine the V8 variants from our second experiment on single runs of
benchmarks, in essence `diffing' the profiling data. Two benchmarks
serve as exemplars of the effects of our migration.
The \jquery benchmark (where direct references are \minjquerypdiff faster than handles
in~\cref{table:results}),
puts particular stress on interactions between
Blink's DOM heap and V8's JavaScript heap --- a part of V8 which we almost entirely
migrated (see~\cref{migrated}). Our profile showed that direct references were
approximately 15\% faster in these parts of the embedder API.
The \emberjsdebug benchmark (where direct references are statistically
indistinguishable from handles in~\cref{table:results}) puts stress on: entry to parts
of the JIT compiler; a section of the inline cache; and JavaScript debug
objects. Other than small parts of the inline cache, none of these sections of
V8 have been converted to use direct references. However, each of them interacts frequently with parts of V8 we have migrated: in other
words, we pay frequent conversion costs to/from handles due to the partial
migration. Our profiling data shows an approximately 30\% slowdown in parts of the execution
engine (responsible for invoking JavaScript functions) which interact with non-migrated
code.
% 89\% of garbage collections in Chrome are happening without a stack when we are on the message loop.
% Therefore making stack scanning conservative may not be an issue in practise.
% Add Speedometer numbers how often we finalize with a stack there.
% Maybe add browsing benchmark numbers to confirm impact\hannes{discuss with the team}.
\section{Threats to Validity}
We have only migrated about half of V8 to direct references. It is possible
that there are challenges in the remainder of the code base that would change
our view of the merits of direct references. We have tried to mitigate this by
migrating what we consider to be the most challenging parts of V8 first.
While V8 is only partly migrated, there are also performance overheads
involved in moving between migrated and unmigrated code (e.g.~the null
checks explained in \cref{null_check}): our results can only establish a lower bound
for direct references performance.
JavaScript VMs such as V8 are somewhat unusual amongst modern VMs in two
respects: JavaScript is single-threaded; and it is based around an event loop
which naturally gives frequent opportunities to perform garbage collection
without pointers existing on the C stack. These two factors may have a
notable impact on our understanding of the performance merits of direct references
and handles. We would have to migrate at least one non-JavaScript VM
(e.g.~HotSpot) to understand whether these are significant factors or not.
There are two differences between the configurations of Chrome/V8 we run and
those run by normal users. First, we have to disable V8's young generation,
because it does not support pinning objects located by
conservative stack scanning (such objects must not be moved). Second, we
do not have the ability to generate or use the profile-guided-optimisation data
that production builds have access to.
At best, these two factors mean that our Chrome/V8 builds will be slower than
their production cousins; at worst, they could change the relative
performance of handles and direct references.
Objects identified via conservative stack scanning cannot be moved. However,
since we have not yet implemented per-object pinning, we are forced to
prevent heap compaction whenever GC occurs and there might be pointers on the
stack. This is not as onerous in a JavaScript VM as it would be for other
languages' VMs, as over 80\% of V8's GC cycles occur during the JavaScript
event loop\footnote{Based on data from Chrome's \textit{User Metric Analysis}
for Chrome's stable release branch (M110) on Windows}, when there can be, by
design, no pointers on the call stack.
\speedometer is a small, synthetic benchmark suite: like any finite-sized
benchmark suite, it cannot tell us about the performance of all possible
programs. However, its ubiquity does mean that its strengths and weaknesses as
a benchmark suite are widely acknowledged, and it allows a good degree of
comparison across JavaScript VMs and browsers.
\section{Related Work}
\label{sec:related}
The relative merits of direct pointers and handles have a long history in GC
and VMs, though quantitative comparisons are few in number. Generalising
somewhat, one technique or the other has tended to hold sway either at
different points in time, or within different communities.
Early Smalltalk VMs used handles, generally calling them `object tables'. Though
the reasons for this are not made explicit, the VMs that did so used a compacting GC, so it
is reasonable to assume this was the
motivation~\cite[p.~18]{krasner83smalltalk}. Handles are often used in VMs that
descend from the Smalltalk tradition (e.g.~HotSpot~\cite{hotspot07handles}, Dart~\cite{egorovdart}).
To the best of our knowledge, the most comprehensive study of handles
is~\cite{kalibera11handles} which evaluates different styles of handles in the
context of the Ovm's real-time GC, which uses a concurrent compacting GC.
Although \cite{kalibera11handles} uses the term `direct references' to refer to
Ovm's initial state, and uses conservative stack scanning, these are not the
same as our `direct references'. Due to Ovm's concurrent compacting GC,
mutator threads may still reference an object's `previous' location,
necessitating a level of indirection to `direct references' to get the
current location. They show that, in their context, `thin' or `fat' handles
were more effective than (their use of the term) `direct references' because they
allowed objects to be moved without requiring forwarding pointers. This is a
notably different context to V8, hence our rather different conclusions.
Firefox's JavaScript engine SpiderMonkey uses handles in a similar fashion to
V8~\cite{spidermonkey}, including similar lints to catch handle misuse.
The use of direct pointers implicitly requires conservative scanning of at
least the stack. However, most (perhaps all) programming languages and most
(perhaps all) compilers have rules which suggest that conservative stack
scanning is undefined behaviour. Most obviously, there is no way of writing a
conservative stack scanner in C/C++ which is not undefined behaviour. In
practice, fortunately, conservative stack scanning `works', as perhaps best
evidenced by the long history of the Boehm-Demers-Weiser (BDW)
GC~\cite{boehm88garbage} which uses conservative scanning for the stack and
other possible locations of GC roots (though note that the first conservative
scanning GC appears to have been~\cite{caplinger88memory}). BDWGC has been ported
to most platforms in its long history, and used by a wide variety of software
(including relatively modern software such as Inkscape): it is plausible that
its very existence has been the reason why conservative stack scanning
continues to be supported in practice across languages and compilers.
Oilpan~\cite{ager13oilpan}, the GC for the Blink rendering engine, uses conservative stack scanning
to find roots on the C stack but is able to precisely trace object slots. This
is a similar design to that we are proposing for V8 in this paper, though
Oilpan cannot move objects in the heap.
Safari's underlying engine WebKit (of which the JavaScript VM, JavaScriptCore, is a part)
uses direct pointers and conservative stack scanning~\cite{pizlo17riptide},
which is very similar to what we propose for V8 in this paper. WebKit has had
at least two different GCs over time (the current GC, \cite{pizlo17riptide},
mostly runs concurrently to JavaScript execution, similarly to V8), but this has not
changed the details relevant to this paper. Unlike V8, WebKit's GC never
moves objects, even across generations (instead using `sticky mark bits' that implicitly mark a given
object as belonging to a certain generation).
Where V8 only uses conservative stack scanning
for C/C++ code, relying on stackmaps for JIT compiled code, WebKit uses conservative stack scanning for
both C/C++ code and JIT compiled code, slightly increasing the chances of
objects being incorrectly identified as live.
%TODO: Managed language collectors:
% - Fast CSS: https://users.cecs.anu.edu.au/~steveb/pubs/papers/consrc-oopsla-2014.pdf
% - Hybrid CSS: https://www.usenix.org/legacy/events/jvm01/full_papers/barabash/barabash.pdf
% - Forkscan Reclamation: https://dspace.mit.edu/bitstream/handle/1721.1/123336/forkscan-eurosys-17.pdf
% - Sound GC for C using taint analysis: https://dl.acm.org/doi/pdf/10.1145/3428244
\section{Conclusions}
\label{sec:conclusion}
Our results tentatively validate our initial hypothesis that moving from
handles to direct references improves V8's performance. We base this conclusion
on the fact that we have accurately measured handles performance, but have merely
provided a lower bound for direct references' performance. It is highly likely that once
all of V8 is migrated to direct references -- which will remove bottlenecks
when moving between migrated and unmigrated code and, we hope, be
accompanied by a more optimised conservative stack scanner -- the measured performance
of direct references will improve further. When considered together with the fact
that using direct references removes the potential for handle misuse to
cause security flaws, we believe that this provides a strong motivation for
V8 to move to using direct references. It may also provide an interesting data
point for other VMs that currently use handles.
% \textbf{Acknowledgements:} Hughes was funded by a research gift from Google. Tratt
% was funded by the Shopify / Royal Academy of Engineering Research Chair in Language
% Engineering.
\bibliographystyle{plain}
\bibliography{bib}
\end{document}