
Commit dda996b

[CAS] Add LLVMCAS library with InMemoryCAS implementation (#114096)
Add llvm::cas::ObjectStore abstraction and InMemoryCAS as an in-memory CAS
object store implementation.

The ObjectStore models its objects as:
* Content: An array of bytes for the data to be stored.
* Refs: An array of references to other objects in the ObjectStore.

Each CAS object can be identified by a unique ID/hash.

ObjectStore supports the following general actions:
* Expected<ID> store(Content, ArrayRef<Ref>)
* Expected<Ref> get(ID)

It also introduces the following types to interact with a CAS ObjectStore:
* CASID: Hash representation for a CAS object, with its context to help
  print/compare CASIDs.
* ObjectRef: A lightweight ref for an object in the ObjectStore. It is
  implementation defined so it can be optimized for read/store/references
  depending on the implementation.
* ObjectProxy: A proxy for the users of CAS to interact with the data inside
  a CAS object. It bundles an ObjectHandle and an ObjectStore instance.
1 parent d7c7fbd commit dda996b

19 files changed, +2019 -0 lines changed
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Content Addressable Storage

## Introduction to CAS

Content Addressable Storage, or `CAS`, is a storage system that assigns each
piece of stored data a unique address derived from its content. It is very
useful for data deduplication and for creating unique identifiers.

Unlike other kinds of storage systems, like file systems, CAS is immutable.
This makes it more reliable to model a computation by representing the inputs
and outputs of the computation as objects stored in CAS.

The basic unit of the CAS library is a CASObject, which contains:

* Data: arbitrary data
* References: references to other CASObjects

It can be conceptually modeled as something like:

```
struct CASObject {
  ArrayRef<char> Data;
  ArrayRef<CASObject*> Refs;
};
```

With this abstraction, it is possible to compose `CASObject`s into a DAG that is
capable of representing complicated data structures, while still allowing data
deduplication. Note that two DAGs can be compared by just comparing the
CASObject hashes of their root nodes.
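The deduplication and root-comparison properties can be illustrated with a small, self-contained toy model. This is only a sketch and not the LLVM API: `ToyCAS`, `ToyID`, and `buildTree` are hypothetical names, and `std::hash` stands in for a real content hash (it is not collision resistant, so an actual CAS would use a cryptographic hash).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a content hash. std::hash is NOT collision resistant;
// it is used here only to keep the sketch dependency free.
using ToyID = std::size_t;

struct ToyObject {
  std::string Data;
  std::vector<ToyID> Refs; // IDs of other objects already in the store
};

// Hypothetical toy store, unrelated to llvm::cas::ObjectStore.
class ToyCAS {
  std::map<ToyID, ToyObject> Objects;

public:
  // Derive the ID from the data and the IDs of the referenced objects, then
  // keep the object under that ID. Storing equal content twice adds nothing.
  ToyID store(const std::string &Data, const std::vector<ToyID> &Refs = {}) {
    std::string Preimage = std::to_string(Refs.size()) + ":";
    for (ToyID Ref : Refs) {
      assert(Objects.count(Ref) && "reference must name a stored object");
      Preimage += std::to_string(Ref) + ",";
    }
    Preimage += Data;
    ToyID ID = std::hash<std::string>{}(Preimage);
    Objects.emplace(ID, ToyObject{Data, Refs});
    return ID;
  }

  std::size_t size() const { return Objects.size(); }
};

// Build a two-leaf tree; the returned root ID identifies the whole DAG.
inline ToyID buildTree(ToyCAS &CAS, const std::string &Left,
                       const std::string &Right) {
  ToyID L = CAS.store(Left);
  ToyID R = CAS.store(Right);
  return CAS.store("root", {L, R});
}
```

Calling `buildTree(CAS, "x", "y")` twice on the same `ToyCAS` should yield the same root ID and leave three objects in the store, because both leaves and the root are deduplicated. Changing one leaf changes the root ID, so comparing roots is enough to tell the two DAGs apart.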
## LLVM CAS Library User Guide

The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
To reference a CASObject, there are a few different abstractions provided
with different trade-offs:

### ObjectRef

`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
This is the most commonly used abstraction and it is cheap to copy/pass
along. It has the following properties:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
  `ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
  compared.
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
  explicit load is required before accessing the data stored in the CASObject.
  This load can also fail, for reasons like (but not limited to): object does
  not exist, corrupted CAS storage, operation timeout, etc.
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
  are identical (if they exist). If they are not equal, the underlying objects
  are guaranteed to be different.
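These properties can be sketched with a toy model in plain C++. Everything here is hypothetical and independent of the real `llvm::cas` classes: `ToyStore`, `ToyRef`, `getReference`, and `load` are illustrative names, and the content string itself is used as the key where a real store would use a hash.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

class ToyStore;

// Cheap-to-copy handle: a pointer to the owning store plus a table index.
struct ToyRef {
  const ToyStore *Owner;
  std::size_t Index;
};

// Hypothetical toy store, unrelated to llvm::cas::ObjectStore.
class ToyStore {
  // One slot per known key; the payload stays empty until the object itself
  // has been stored (for example, when only its key is known so far).
  std::vector<std::optional<std::string>> Slots;
  std::unordered_map<std::string, std::size_t> IndexByKey;

  ToyRef internKey(const std::string &Key) {
    auto [It, Inserted] = IndexByKey.try_emplace(Key, Slots.size());
    if (Inserted)
      Slots.emplace_back();
    return ToyRef{this, It->second};
  }

public:
  // Toy shortcut: the content itself serves as the key instead of a hash.
  ToyRef store(const std::string &Data) {
    ToyRef Ref = internKey(Data);
    Slots[Ref.Index] = Data;
    return Ref;
  }

  // Create a reference from a key alone, without having the object.
  ToyRef getReference(const std::string &Key) { return internKey(Key); }

  // Explicit load; returns std::nullopt when the object is not available.
  std::optional<std::string> load(ToyRef Ref) const {
    assert(Ref.Owner == this && "ref belongs to a different store");
    return Slots[Ref.Index];
  }
};

// Only refs created by the same store may be compared.
inline bool operator==(ToyRef L, ToyRef R) {
  assert(L.Owner == R.Owner && "refs from different stores are not comparable");
  return L.Index == R.Index;
}
```

In this sketch, storing `"hello"` twice yields equal refs, while a ref obtained through `getReference("missing")` is valid to hold and compare but makes `load` return `std::nullopt`, mirroring the rule that a ref does not guarantee the object exists.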
### ObjectProxy

`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
underlying stored data and references can be accessed without the need
for error handling. The class APIs also provide convenient methods to
access the underlying data. The lifetime of the underlying data is tied to
the lifetime of the `ObjectStore` instance unless the data is explicitly
copied.

### CASID

`CASID` is the hash identifier for CASObjects. It owns the underlying
storage for the hash value, so it can be expensive to copy and compare
depending on the hash algorithm. `CASID` is generally only useful in rare
situations like printing the raw hash value or exchanging hash values between
different CAS instances with the same hashing schema.

### ObjectStore

`ObjectStore` is the CAS-like object storage. It provides APIs to store
and load CASObjects, for example:

```
ObjectRef A, B, C;
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
```

It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
`CASID`.
## CAS Library Implementation Guide

The LLVM ObjectStore API was designed so that it is easy to add
customized CAS implementations that are interchangeable with the builtin
ones.

To add your own implementation, you just need to subclass
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
To be interchangeable with LLVM ObjectStore, the new CAS implementation
needs to conform to the following contracts:

* Different CASObjects stored in the ObjectStore need to have a different hash
  and result in a different `ObjectRef`. Similarly, the same CASObject should have
  the same hash and the same `ObjectRef`. Note: two different CASObjects with
  identical data but different references are considered different objects.
* `ObjectRef`s are only comparable within the same `ObjectStore` instance, and
  can be used to determine the equality of the underlying CASObjects.
* The loaded objects from the ObjectStore need to have a lifetime at least as
  long as the ObjectStore itself, so it is always legal to access the loaded data
  without holding on to the `ObjectProxy` until the `ObjectStore` is destroyed.

Any behavior not specified above can be implementation defined. For example,
`ObjectRef` can be used to point to a loaded CASObject so that
`ObjectStore` never fails to load. It is also legal to use a stricter model
than required. For example, the underlying value inside `ObjectRef` can be
the unique identity of the object across multiple `ObjectStore` instances,
but comparing such `ObjectRef`s from different `ObjectStore`s is still illegal.

For CAS library implementers, there is also an `ObjectHandle` class that
is an internal representation of a loaded CASObject reference.
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, and
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
the `ObjectStore` that knows about the loaded CASObject.
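One way to make the contracts above concrete is a small conformance check written against a simplified interface. The sketch below is an assumption-laden toy, not the real `llvm::cas::ObjectStore` interface: `ToyObjectStore`, `MapStore`, `DataOnlyStore`, and `checkContracts` are hypothetical names, and refs are plain indices.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal interface; NOT the real llvm::cas::ObjectStore.
class ToyObjectStore {
public:
  using Ref = std::size_t;
  virtual ~ToyObjectStore() = default;
  virtual Ref store(const std::vector<Ref> &Refs, const std::string &Data) = 0;
  // The returned reference must stay valid for the lifetime of the store.
  virtual const std::string &getData(Ref R) const = 0;
};

// Conforming toy: an object's identity is the pair (Refs, Data).
class MapStore final : public ToyObjectStore {
  using Key = std::pair<std::vector<Ref>, std::string>;
  std::map<Key, Ref> Index;
  // std::map is node based, so key addresses stay stable across inserts.
  std::vector<const Key *> Objects;

public:
  Ref store(const std::vector<Ref> &Refs, const std::string &Data) override {
    auto [It, Inserted] = Index.try_emplace(Key{Refs, Data}, Objects.size());
    if (Inserted)
      Objects.push_back(&It->first);
    return It->second;
  }
  const std::string &getData(Ref R) const override {
    return Objects.at(R)->second;
  }
};

// Deliberately broken toy: ignores Refs, so objects with equal data collapse.
class DataOnlyStore final : public ToyObjectStore {
  MapStore Impl;

public:
  Ref store(const std::vector<Ref> &, const std::string &Data) override {
    return Impl.store({}, Data);
  }
  const std::string &getData(Ref R) const override { return Impl.getData(R); }
};

// Returns true if Store behaves according to the three contracts.
inline bool checkContracts(ToyObjectStore &Store) {
  using Ref = ToyObjectStore::Ref;
  Ref Leaf1 = Store.store({}, "leaf1");
  Ref Leaf2 = Store.store({}, "leaf2");
  Ref A = Store.store({Leaf1}, "node");
  Ref B = Store.store({Leaf2}, "node");  // same data, different refs
  Ref A2 = Store.store({Leaf1}, "node"); // same object stored again

  // Loaded data must remain valid while the store is alive, even after
  // further objects are stored.
  const std::string &Loaded = Store.getData(A);
  for (int I = 0; I < 100; ++I)
    Store.store({}, "filler" + std::to_string(I));

  return A == A2 && A != B && Loaded == "node";
}
```

Under these assumptions, `checkContracts` accepts `MapStore` and rejects `DataOnlyStore`, because the latter hands out the same ref for two objects that differ only in their references.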

llvm/docs/Reference.rst

Lines changed: 4 additions & 0 deletions
@@ -17,6 +17,7 @@ LLVM and API reference documentation.
    CalleeTypeMetadata
    CIBestPractices
    CommandGuide/index
+   ContentAddressableStorage
    ConvergenceAndUniformity
    ConvergentOperations
    Coroutines
@@ -244,3 +245,6 @@ Additional Topics
 :doc:`MLGO`
    Facilities for ML-Guided Optimization, such as collecting IR corpora from a
    build, interfacing with ML models, and exposing features for training.
+
+:doc:`ContentAddressableStorage`
+   A reference guide for using LLVM's CAS library.
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINCASCONTEXT_H
#define LLVM_CAS_BUILTINCASCONTEXT_H

#include "llvm/CAS/CASID.h"
#include "llvm/Support/BLAKE3.h"
#include "llvm/Support/Error.h"

namespace llvm::cas::builtin {

/// Current hash type for the builtin CAS.
///
/// FIXME: This should be configurable via an enum to allow configuring the hash
/// function. The enum should be sent into \a createInMemoryCAS() and \a
/// createOnDiskCAS().
///
/// This is important (at least) for future-proofing, when we want to make new
/// CAS instances use BLAKE7, but still know how to read/write BLAKE3.
///
/// Even just for BLAKE3, it would be useful to have these values:
///
///     BLAKE3     => 32B hash from BLAKE3
///     BLAKE3_16B => 16B hash from BLAKE3 (truncated)
///
/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>.
///
/// Motivation for a truncated hash is that it's cheaper to store. It's not
/// clear if we always (or ever) need the full 32B, and for an ephemeral
/// in-memory CAS, we almost certainly don't need it.
///
/// Note that the cost is linear in the number of objects for the builtin CAS,
/// since we're using internal offsets and/or pointers as an optimization.
///
/// However, it's possible we'll want to hook up a local builtin CAS to, e.g.,
/// a distributed generic hash map to use as an ActionCache. In that scenario,
/// the transitive closure of the structured objects that are the results of
/// the cached actions would need to be serialized into the map, something
/// like:
///
///     "action:<schema>:<key>" -> "0123"
///     "object:<schema>:0123"  -> "3,4567,89AB,CDEF,9,some data"
///     "object:<schema>:4567"  -> ...
///     "object:<schema>:89AB"  -> ...
///     "object:<schema>:CDEF"  -> ...
///
/// These references would be full cost.
using HasherT = BLAKE3;
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

/// CASContext for LLVM builtin CAS using BLAKE3 hash type.
class BuiltinCASContext : public CASContext {
  void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
  void anchor() override;

public:
  /// Get the name of the hash for any table identifiers.
  ///
  /// FIXME: This should be configurable via an enum, with at least the
  /// following values:
  ///
  ///     "BLAKE3"    => 32B hash from BLAKE3
  ///     "BLAKE3.16" => 16B hash from BLAKE3 (truncated)
  ///
  /// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS().
  static StringRef getHashName() { return "BLAKE3"; }
  StringRef getHashSchemaIdentifier() const final {
    static const std::string ID =
        ("llvm.cas.builtin.v2[" + getHashName() + "]").str();
    return ID;
  }

  static const BuiltinCASContext &getDefaultContext();

  BuiltinCASContext() = default;

  static Expected<HashType> parseID(StringRef PrintedDigest);
  static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS);
};

} // namespace llvm::cas::builtin

#endif // LLVM_CAS_BUILTINCASCONTEXT_H
Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
#define LLVM_CAS_BUILTINOBJECTHASHER_H

#include "llvm/CAS/ObjectStore.h"
#include "llvm/Support/Endian.h"

namespace llvm::cas {

/// Hasher for stored objects in builtin CAS.
template <class HasherT> class BuiltinObjectHasher {
public:
  using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));

  static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ObjectRef &Ref : Refs)
      H.updateRef(CAS, Ref);
    H.updateArray(Data);
    return H.finish();
  }

  static HashT hashObject(ArrayRef<ArrayRef<uint8_t>> Refs,
                          ArrayRef<char> Data) {
    BuiltinObjectHasher H;
    H.updateSize(Refs.size());
    for (const ArrayRef<uint8_t> &Ref : Refs)
      H.updateID(Ref);
    H.updateArray(Data);
    return H.finish();
  }

private:
  HashT finish() { return Hasher.final(); }

  void updateRef(const ObjectStore &CAS, ObjectRef Ref) {
    updateID(CAS.getID(Ref));
  }

  void updateID(const CASID &ID) { updateID(ID.getHash()); }

  void updateID(ArrayRef<uint8_t> Hash) {
    // NOTE: Does not hash the size of the hash. That's a CAS implementation
    // detail that shouldn't leak into the UUID for an object.
    assert(Hash.size() == sizeof(HashT) &&
           "Expected object ref to match the hash size");
    Hasher.update(Hash);
  }

  void updateArray(ArrayRef<uint8_t> Bytes) {
    updateSize(Bytes.size());
    Hasher.update(Bytes);
  }

  void updateArray(ArrayRef<char> Bytes) {
    updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
                         Bytes.size()));
  }

  void updateSize(uint64_t Size) {
    Size = support::endian::byte_swap(Size, endianness::little);
    Hasher.update(
        ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
  }

  BuiltinObjectHasher() = default;
  ~BuiltinObjectHasher() = default;
  HasherT Hasher;
};

} // namespace llvm::cas

#endif // LLVM_CAS_BUILTINOBJECTHASHER_H
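To make the byte layout fed to the hasher by `hashObject()` easier to follow, the standalone sketch below rebuilds the same sequence into a plain byte vector: the ref count as a little-endian 64-bit value, each ref digest with no per-digest length, then the data size and the data. It does not call BLAKE3 or any LLVM code; `Bytes`, `appendSize`, `buildPreimage`, and the `DigestSize` parameter are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append Size as 8 little-endian bytes, regardless of host endianness.
inline void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int Shift = 0; Shift < 64; Shift += 8)
    Out.push_back(static_cast<std::uint8_t>(Size >> Shift));
}

// Lay out the bytes in the order the header above updates the hasher:
// ref count, each ref digest (fixed size, no length prefix), data size, data.
inline Bytes buildPreimage(const std::vector<Bytes> &RefDigests,
                           const std::string &Data, std::size_t DigestSize) {
  Bytes Out;
  appendSize(Out, RefDigests.size());
  for (const Bytes &Digest : RefDigests) {
    assert(Digest.size() == DigestSize && "all digests share one fixed size");
    Out.insert(Out.end(), Digest.begin(), Digest.end());
  }
  appendSize(Out, Data.size());
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

Because the ref count and the data size are both part of the sequence, an object with data `"hi"` and one reference produces different bytes from an object with the same data and no references, which matches the rule that identical data with different references is a different object.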
