[CAS] Add LLVMCAS library with InMemoryCAS implementation #114096

Merged
121 changes: 121 additions & 0 deletions llvm/docs/ContentAddressableStorage.md
@@ -0,0 +1,121 @@
# Content Addressable Storage

## Introduction to CAS

Content Addressable Storage, or `CAS`, is a storage system that assigns
unique addresses to the data it stores. It is very useful for data deduplication
and for creating unique identifiers.

Unlike other kinds of storage systems, like file systems, CAS is immutable.
Because stored objects never change, it is more reliable to model a computation
by representing its inputs and outputs as objects stored in CAS.

The basic unit of the CAS library is a CASObject, which contains:

* Data: arbitrary data
* References: references to other CASObjects

It can be conceptually modeled as something like:

```
struct CASObject {
  ArrayRef<char> Data;
  ArrayRef<CASObject*> Refs;
};
```

With this abstraction, it is possible to compose `CASObject`s into a DAG that is
capable of representing complicated data structures, while still allowing data
deduplication. Note that two DAGs can be compared just by comparing the
CASObject hashes of their root nodes.
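
The DAG property can be illustrated with a minimal, self-contained sketch. It
does not use the LLVM API: `ToyCAS`, `ToyObject` and `ToyHash` are hypothetical
names, and `std::hash` stands in for a real cryptographic hash purely for
illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model only: std::hash is NOT a cryptographic hash and can collide.
// A real CAS would use a strong hash such as BLAKE3.
using ToyHash = std::size_t;

struct ToyObject {
  std::string Data;
  std::vector<ToyHash> Refs; // addresses of the referenced objects
};

class ToyCAS {
  std::map<ToyHash, ToyObject> Objects;

public:
  // Returns the address of the object. Storing equal content twice yields
  // the same address and keeps a single copy.
  ToyHash store(std::string Data, std::vector<ToyHash> Refs = {}) {
    // The address covers both the references and the data.
    std::string Buffer = std::to_string(Refs.size()) + ":";
    for (ToyHash R : Refs)
      Buffer += std::to_string(R) + ",";
    Buffer += std::to_string(Data.size()) + ":" + Data;
    ToyHash H = std::hash<std::string>{}(Buffer);
    Objects.emplace(H, ToyObject{std::move(Data), std::move(Refs)});
    return H;
  }

  std::size_t size() const { return Objects.size(); }
};
```

Since the address of a node is derived from the addresses of its children, two
roots built from equal content get equal addresses, so comparing the roots is
enough to compare the whole DAGs.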


## LLVM CAS Library User Guide

The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
To reference a CASObject, there are a few different abstractions provided
with different trade-offs:

### ObjectRef

`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
This is the most commonly used abstraction and it is cheap to copy and pass
along. It has the following properties:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
`ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
explicit load is required before accessing the data stored in the CASObject.
This load can also fail, for reasons like (but not limited to): object does
not exist, corrupted CAS storage, operation timeout, etc.
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
are identical (if they exist). If they are not equal, the underlying objects are
guaranteed to be different.
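
These properties can be sketched with a toy model. This is not the LLVM
implementation: `ToyStore` and `ToyRef` are hypothetical names, the store
deduplicates by comparing content rather than hashing, and `std::optional`
stands in for the error handling of a real load.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

class ToyStore;

// Cheap handle: an index that only means something to its owning store.
class ToyRef {
  friend class ToyStore;
  const ToyStore *Owner;
  std::size_t Index;
  ToyRef(const ToyStore *Owner, std::size_t Index)
      : Owner(Owner), Index(Index) {}

public:
  friend bool operator==(const ToyRef &L, const ToyRef &R) {
    // Comparing refs from different stores violates the contract.
    assert(L.Owner == R.Owner && "refs belong to different stores");
    return L.Index == R.Index;
  }
};

class ToyStore {
  std::vector<std::string> Blobs; // index -> data, deduplicated

public:
  // Equal content always maps to the same index, hence to equal refs.
  ToyRef store(const std::string &Data) {
    for (std::size_t I = 0; I < Blobs.size(); ++I)
      if (Blobs[I] == Data)
        return ToyRef(this, I);
    Blobs.push_back(Data);
    return ToyRef(this, Blobs.size() - 1);
  }

  // Loading is a separate step that may fail.
  std::optional<std::string> load(ToyRef Ref) const {
    if (Ref.Owner != this || Ref.Index >= Blobs.size())
      return std::nullopt;
    return Blobs[Ref.Index];
  }
};
```

In this sketch a ref is just an index, which is why it is cheap to copy and why
it is meaningless to any store other than the one that created it.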

### ObjectProxy

`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
underlying stored data and references can be accessed without the need
for error handling. The class APIs also provide convenient methods to
access the underlying data. The lifetime of the underlying data is equal to
the lifetime of the `ObjectStore` instance unless the data is explicitly copied.

### CASID

`CASID` is the hash identifier for CASObjects. It owns the underlying
storage for the hash value, so it can be expensive to copy and compare depending
on the hash algorithm. `CASID` is generally only useful in rare situations
like printing the raw hash value or exchanging hash values between different
CAS instances with the same hashing schema.

### ObjectStore

`ObjectStore` is the CAS-like object storage. It provides APIs to save
and load CASObjects, for example:

```
ObjectRef A, B, C;
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
```

It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
`CASID`.



## CAS Library Implementation Guide

The LLVM ObjectStore API was designed so that it is easy to add
customized CAS implementations that are interchangeable with the builtin
ones.

To add your own implementation, you just need to add a subclass of
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with LLVM ObjectStore, the new CAS implementation
needs to conform to the following contracts:

* Different CASObjects stored in the ObjectStore need to have a different hash
and result in a different `ObjectRef`. Similarly, the same CASObject should have
the same hash and the same `ObjectRef`. Note: two different CASObjects with
identical data but different references are considered different objects.
* `ObjectRef`s are only comparable within the same `ObjectStore` instance, and
can be used to determine the equality of the underlying CASObjects.
* The loaded objects from the ObjectStore need to have a lifetime at least as
long as the ObjectStore itself, so it is always legal to access the loaded data
without holding on to the `ObjectProxy` until the `ObjectStore` is destroyed.
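
One way to meet the lifetime contract is to keep loaded data in a container
that never relocates its elements. The sketch below is only an illustration
under that assumption. `StableStore` and `storeAndLoad` are hypothetical names
and are not part of the LLVM API.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical store whose loaded data stays valid until the store dies.
class StableStore {
  // std::deque::push_back never invalidates references to existing
  // elements, so views into stored strings remain valid for as long as
  // the store itself is alive.
  std::deque<std::string> Blobs;

public:
  // Stores the data and returns a view that outlives any proxy object.
  std::string_view storeAndLoad(std::string Data) {
    Blobs.push_back(std::move(Data));
    return Blobs.back();
  }
};
```

A `std::vector<std::string>` would not be a safe substitute here, because
growth can move the string objects and invalidate views into short strings.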


Any behavior not specified above is implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so that
`ObjectStore` never fails to load. It is also legal to use a stricter model
than required. For example, the underlying value inside `ObjectRef` can be
the unique identity of the object across multiple `ObjectStore` instances,
but comparing such `ObjectRef`s from different `ObjectStore`s is still illegal.

For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, and
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
the `ObjectStore` that knows about the loaded CASObject.
4 changes: 4 additions & 0 deletions llvm/docs/Reference.rst
@@ -17,6 +17,7 @@ LLVM and API reference documentation.
CalleeTypeMetadata
CIBestPractices
CommandGuide/index
ContentAddressableStorage
ConvergenceAndUniformity
ConvergentOperations
Coroutines
@@ -244,3 +245,6 @@ Additional Topics
:doc:`MLGO`
Facilities for ML-Guided Optimization, such as collecting IR corpora from a
build, interfacing with ML models, and exposing features for training.

:doc:`ContentAddressableStorage`
A reference guide for using LLVM's CAS library.
89 changes: 89 additions & 0 deletions llvm/include/llvm/CAS/BuiltinCASContext.h
@@ -0,0 +1,89 @@
//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINCASCONTEXT_H
#define LLVM_CAS_BUILTINCASCONTEXT_H

#include "llvm/CAS/CASID.h"
#include "llvm/Support/BLAKE3.h"
#include "llvm/Support/Error.h"

namespace llvm::cas::builtin {

/// Current hash type for the builtin CAS.
///
/// FIXME: This should be configurable via an enum to allow configuring the hash
/// function. The enum should be sent into \a createInMemoryCAS() and \a
/// createOnDiskCAS().
///
/// This is important (at least) for future-proofing, when we want to make new
/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
///
/// Even just for BLAKE3, it would be useful to have these values:
///
/// BLAKE3 => 32B hash from BLAKE3
/// BLAKE3_16B => 16B hash from BLAKE3 (truncated)
///
/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
///
/// Motivation for a truncated hash is that it's cheaper to store. It's not
/// clear if we always (or ever) need the full 32B, and for an ephemeral
/// in-memory CAS, we almost certainly don't need it.
///
/// Note that the cost is linear in the number of objects for the builtin CAS,
/// since we're using internal offsets and/or pointers as an optimization.
///
/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
/// a distributed generic hash map to use as an ActionCache. In that scenario,
/// the transitive closure of the structured objects that are the results of
/// the cached actions would need to be serialized into the map, something
/// like:
///
/// "action:<schema>:<key>" -> "0123"
/// "object:<schema>:0123" -> "3,4567,89AB,CDEF,9,some data"
/// "object:<schema>:4567" -> ...
/// "object:<schema>:89AB" -> ...
/// "object:<schema>:CDEF" -> ...
///
/// These references would be full cost.
using HasherT = BLAKE3;
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

/// CASContext for LLVM builtin CAS using BLAKE3 hash type.
class BuiltinCASContext : public CASContext {
  void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
  void anchor() override;

public:
  /// Get the name of the hash for any table identifiers.
  ///
  /// FIXME: This should be configurable via an enum, with at least the
  /// following values:
  ///
  /// "BLAKE3" => 32B hash from BLAKE3
  /// "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
  ///
  /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
  static StringRef getHashName() { return "BLAKE3"; }
  StringRef getHashSchemaIdentifier() const final {
    static const std::string ID =
        ("llvm.cas.builtin.v2[" + getHashName() + "]").str();
    return ID;
  }

  static const BuiltinCASContext &getDefaultContext();

  BuiltinCASContext() = default;

  static Expected<HashType> parseID(StringRef PrintedDigest);
  static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS);
};

} // namespace llvm::cas::builtin

#endif // LLVM_CAS_BUILTINCASCONTEXT_H
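
Note on `printID()` / `parseID()`: the header only declares this pair and their
bodies are not part of this file. As a rough illustration of what such a round
trip could look like, here is a standalone sketch that assumes a plain hex
encoding of a 32-byte digest. The encoding and the helper names `printDigest`
and `parseDigest` are assumptions, not the actual implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

constexpr std::size_t DigestSize = 32; // size of a full BLAKE3 digest
using Digest = std::array<std::uint8_t, DigestSize>;

// Hypothetical helper: renders a digest as lowercase hex.
std::string printDigest(const Digest &D) {
  static const char Hex[] = "0123456789abcdef";
  std::string Out;
  Out.reserve(2 * D.size());
  for (std::uint8_t Byte : D) {
    Out.push_back(Hex[Byte >> 4]);
    Out.push_back(Hex[Byte & 0xF]);
  }
  return Out;
}

// Hypothetical helper: parses hex text back into a digest and rejects
// input with the wrong length or with non-hex characters.
std::optional<Digest> parseDigest(const std::string &Text) {
  if (Text.size() != 2 * DigestSize)
    return std::nullopt;
  auto Nibble = [](char C) -> int {
    if (C >= '0' && C <= '9')
      return C - '0';
    if (C >= 'a' && C <= 'f')
      return C - 'a' + 10;
    if (C >= 'A' && C <= 'F')
      return C - 'A' + 10;
    return -1;
  };
  Digest D{};
  for (std::size_t I = 0; I < DigestSize; ++I) {
    int Hi = Nibble(Text[2 * I]);
    int Lo = Nibble(Text[2 * I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt;
    D[I] = static_cast<std::uint8_t>((Hi << 4) | Lo);
  }
  return D;
}
```

Parsing can fail on malformed input, which matches the `Expected<HashType>`
return type of `parseID()` in the declaration above.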
82 changes: 82 additions & 0 deletions llvm/include/llvm/CAS/BuiltinObjectHasher.h
@@ -0,0 +1,82 @@
//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
#define LLVM_CAS_BUILTINOBJECTHASHER_H

#include "llvm/CAS/ObjectStore.h"
#include "llvm/Support/Endian.h"

namespace llvm::cas {

/// Hasher for stored objects in builtin CAS.
template <class HasherT> class BuiltinObjectHasher {
public:
  using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

  static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ObjectRef &Ref : Refs)
      H.updateRef(CAS, Ref);
    H.updateArray(Data);
    return H.finish();
  }

  static HashT hashObject(ArrayRef<ArrayRef<uint8_t>> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ArrayRef<uint8_t> &Ref : Refs)
      H.updateID(Ref);
    H.updateArray(Data);
    return H.finish();
  }

private:
  HashT finish() { return Hasher.final(); }

  void updateRef(const ObjectStore &CAS, ObjectRef Ref) {
    updateID(CAS.getID(Ref));
  }

  void updateID(const CASID &ID) { updateID(ID.getHash()); }

  void updateID(ArrayRef<uint8_t> Hash) {
    // NOTE: Does not hash the size of the hash. That's a CAS implementation
    // detail that shouldn't leak into the UUID for an object.
    assert(Hash.size() == sizeof(HashT) &&
           "Expected object ref to match the hash size");
    Hasher.update(Hash);
  }

  void updateArray(ArrayRef<uint8_t> Bytes) {
    updateSize(Bytes.size());
    Hasher.update(Bytes);
  }

  void updateArray(ArrayRef<char> Bytes) {
    updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
                         Bytes.size()));
  }

  void updateSize(uint64_t Size) {
    Size = support::endian::byte_swap(Size, endianness::little);
    Hasher.update(
        ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
  }

  BuiltinObjectHasher() = default;
  ~BuiltinObjectHasher() = default;
  HasherT Hasher;
};

} // namespace llvm::cas

#endif // LLVM_CAS_BUILTINOBJECTHASHER_H
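
Note on the hash input layout: `hashObject()` feeds the hasher the reference
count, each reference's hash, the data size, and then the data, with both sizes
encoded as 64-bit little-endian values. The standalone sketch below rebuilds
that byte stream without hashing it, so the layout can be inspected.
`appendSize` and `buildHashInput` are hypothetical helpers written for this
illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Appends Size as 8 little-endian bytes, independent of host byte order.
void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int Shift = 0; Shift < 64; Shift += 8)
    Out.push_back(static_cast<std::uint8_t>(Size >> Shift));
}

// Builds the byte stream in the order hashObject() updates the hasher:
// ref count, each ref's hash (no per-hash size), data size, data.
Bytes buildHashInput(const std::vector<Bytes> &RefHashes,
                     const std::string &Data) {
  Bytes Out;
  appendSize(Out, RefHashes.size());
  for (const Bytes &Hash : RefHashes)
    Out.insert(Out.end(), Hash.begin(), Hash.end());
  appendSize(Out, Data.size());
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

Because the reference count and the references are part of the input, two
objects with identical data but different references produce different byte
streams, which is consistent with the documented rule that such objects are
distinct.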