52 changes: 26 additions & 26 deletions tendril/README.md
@@ -3,9 +3,9 @@
**Warning**: This library is at a very early stage of development, and it
contains a substantial amount of `unsafe` code. Use at your own risk!

[![Build Status](https://github.com/servo/html5ever/workflows/Tendril%20CI/badge.svg)](https://github.com/servo/tendril/actions)
[![Build Status](https://github.com/servo/html5ever/workflows/Tendril%20CI/badge.svg)](https://github.com/servo/html5ever/actions)

[API Documentation](https://doc.servo.org/tendril/index.html)
[API Documentation](https://docs.rs/tendril)

## Introduction

@@ -16,9 +16,9 @@ Further mutations occur in-place until the string becomes shared, e.g. with
`clone()` or `subtendril()`.

Buffer sharing is accomplished through thread-local (non-atomic) reference
counting, which has very low overhead. The Rust type system will prevent
you at compile time from sending a tendril between threads. (See below
for thoughts on relaxing this restriction.)
counting, which has very low overhead. The Rust type system will prevent you at
compile time from sending a tendril between threads. (See below for thoughts on
relaxing this restriction.)
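
As a rough illustration of the sharing model described above, the following sketch uses the standard library's `Rc`, which is also a non-atomic, thread-local reference count and is not `Send`. The `SharedBuf` type and its methods are hypothetical stand-ins invented for this example; they are not part of the `tendril` API and do not reflect its actual memory layout.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a reference-counted string buffer.
/// This is NOT Tendril's real representation, only an analogy built on `Rc`.
#[derive(Clone)]
pub struct SharedBuf {
    data: Rc<String>,
}

impl SharedBuf {
    pub fn new(s: &str) -> Self {
        SharedBuf { data: Rc::new(s.to_owned()) }
    }

    /// Appends in place while the buffer is uniquely owned.
    /// If the buffer is shared, `Rc::make_mut` copies it once first.
    pub fn push_str(&mut self, s: &str) {
        Rc::make_mut(&mut self.data).push_str(s);
    }

    pub fn as_str(&self) -> &str {
        &self.data
    }

    /// True when another handle points at the same buffer.
    pub fn is_shared(&self) -> bool {
        Rc::strong_count(&self.data) > 1
    }
}
```

Because `Rc` does not implement `Send`, moving a `SharedBuf` into `std::thread::spawn` would be rejected at compile time, which is the same kind of guarantee the paragraph above describes.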

Whereas `String` allocates in the heap for any non-empty string, `Tendril` can
store small strings (up to 8 bytes) in-line, without a heap allocation.
@@ -33,9 +33,9 @@ to go over the limit.

`Tendril` uses
[phantom types](https://doc.rust-lang.org/stable/rust-by-example/generics/phantom.html)
to track a buffer's format. This determines at compile time which
operations are available on a given tendril. For example, `Tendril<UTF8>` and
`Tendril<Bytes>` can be borrowed as `&str` and `&[u8]` respectively.
to track a buffer's format. This determines at compile time which operations are
available on a given tendril. For example, `Tendril<UTF8>` and `Tendril<Bytes>`
can be borrowed as `&str` and `&[u8]` respectively.
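
The phantom-type technique can be sketched with `std::marker::PhantomData`. The names `Buf`, `Utf8`, `Bytes`, `from_text`, `from_bytes`, and `into_bytes` below are assumptions made for this example and are not taken from the `tendril` API; the point is only that the format parameter selects which methods exist at compile time.

```rust
use std::marker::PhantomData;

/// Hypothetical format markers; they carry no data at runtime.
pub struct Utf8;
pub struct Bytes;

/// Hypothetical buffer tagged with its format `F`.
pub struct Buf<F> {
    data: Vec<u8>,
    _format: PhantomData<F>,
}

impl Buf<Bytes> {
    pub fn from_bytes(b: &[u8]) -> Self {
        Buf { data: b.to_vec(), _format: PhantomData }
    }

    /// Only byte buffers expose `&[u8]` in this sketch.
    pub fn as_bytes(&self) -> &[u8] {
        &self.data
    }
}

impl Buf<Utf8> {
    pub fn from_text(s: &str) -> Self {
        Buf { data: s.as_bytes().to_vec(), _format: PhantomData }
    }

    /// Only UTF-8 buffers expose `&str`.
    pub fn as_str(&self) -> &str {
        // The only constructor takes `&str`, so the bytes are valid UTF-8.
        std::str::from_utf8(&self.data).expect("buffer was built from &str")
    }

    /// Forgetting the UTF-8 guarantee is always safe, so it needs no check.
    pub fn into_bytes(self) -> Buf<Bytes> {
        Buf { data: self.data, _format: PhantomData }
    }
}
```

In this sketch, calling `as_str()` on a `Buf<Bytes>` fails to compile because no such method is defined for that format.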

`Tendril` also integrates with
[rust-encoding](https://github.com/lifthrasiir/rust-encoding) and has
@@ -45,33 +45,33 @@ preliminary support for [WTF-8][] buffers.

### Ropes

[html5ever][] will use `Tendril` as a zero-copy text representation. It would
be good to preserve this all the way through to Servo's DOM. This would reduce
[html5ever][] will use `Tendril` as a zero-copy text representation. It would be
good to preserve this all the way through to Servo's DOM. This would reduce
memory consumption, and possibly speed up text shaping and painting. However,
DOM text may conceivably be larger than 4 GB, and will anyway not be contiguous
in memory around e.g. a character entity reference.

*Solution:* Build a **[rope][] on top of these strings** and use that as
Servo's representation of DOM text. We can perhaps do text shaping and/or
painting in parallel for different chunks of a rope. html5ever can additionally
use this rope type as a replacement for `BufferQueue`.
*Solution:* Build a **[rope][] on top of these strings** and use that as Servo's
representation of DOM text. We can perhaps do text shaping and/or painting in
parallel for different chunks of a rope. html5ever can additionally use this
rope type as a replacement for `BufferQueue`.

Because the underlying buffers are reference-counted, the bulk of this rope
is already a [persistent data structure][]. Consider what happens when
appending two ropes to get a "new" rope. A vector-backed rope would copy a
vector of small structs, one for each chunk, and would bump the corresponding
refcounts. But it would not copy any of the string data.
Because the underlying buffers are reference-counted, the bulk of this rope is
already a [persistent data structure][]. Consider what happens when appending
two ropes to get a "new" rope. A vector-backed rope would copy a vector of small
structs, one for each chunk, and would bump the corresponding refcounts. But it
would not copy any of the string data.
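
A minimal sketch of the vector-backed case described above, assuming each chunk is an `Rc<str>` rather than a real tendril. The `Rope` type and its methods are hypothetical and exist only to show that concatenation copies handles and bumps refcounts without touching the string data.

```rust
use std::rc::Rc;

/// Hypothetical vector-backed rope; each chunk is a shared, immutable string.
#[derive(Clone)]
pub struct Rope {
    chunks: Vec<Rc<str>>,
}

impl Rope {
    pub fn from_chunk(s: &str) -> Self {
        Rope { chunks: vec![Rc::from(s)] }
    }

    /// Builds a "new" rope by cloning the chunk handles of both inputs.
    /// Each clone bumps a refcount; no string bytes are copied.
    pub fn concat(&self, other: &Rope) -> Rope {
        let mut chunks = Vec::with_capacity(self.chunks.len() + other.chunks.len());
        chunks.extend(self.chunks.iter().cloned());
        chunks.extend(other.chunks.iter().cloned());
        Rope { chunks }
    }

    /// Flattens the rope into one contiguous `String` (this does copy).
    pub fn to_text(&self) -> String {
        self.chunks.iter().map(|c| &**c).collect()
    }

    /// Number of handles currently sharing chunk `i`.
    pub fn chunk_refcount(&self, i: usize) -> usize {
        Rc::strong_count(&self.chunks[i])
    }
}
```

Both input ropes remain valid after `concat`, which is what makes the structure persistent: the old and new ropes share the same underlying chunks.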

If we want more sharing, then a [2-3 finger tree][] could be a good choice.
We would probably stick with `VecDeque` for ropes under a certain size.
If we want more sharing, then a [2-3 finger tree][] could be a good choice. We
would probably stick with `VecDeque` for ropes under a certain size.

### UTF-16 compatibility

SpiderMonkey expects text to be in UCS-2 format for the most part. The
semantics of JavaScript strings are difficult to implement on UTF-8. This also
applies to HTML parsing via `document.write`. Also, passing SpiderMonkey a
string that isn't contiguous in memory will incur additional overhead and
complexity, if not a full copy.
SpiderMonkey expects text to be in UCS-2 format for the most part. The semantics
of JavaScript strings are difficult to implement on UTF-8. This also applies to
HTML parsing via `document.write`. Also, passing SpiderMonkey a string that
isn't contiguous in memory will incur additional overhead and complexity, if not
a full copy.

*Solution:* Use **WTF-8 in parsing** and in the DOM. Servo will **convert to
contiguous UTF-16 when necessary**. The conversion can easily be parallelized,