Commit fddd461

[CAS] Update CAS implementation according to upstream version
Update downstream code to match upstreamed version.
1 parent 35c2f45 commit fddd461

25 files changed

+324
-1302
lines changed
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
1+
# Content Addressable Storage
2+
3+
## Introduction to CAS
4+
5+
Content Addressable Storage, or `CAS`, is a storage system that assigns
6+
unique addresses to the data stored. It is very useful for data deduplication
7+
and creating unique identifiers.
8+
9+
Unlike other kinds of storage systems, like file systems, CAS is immutable. It
10+
is more reliable to model a computation by representing the inputs and outputs
11+
of the computation using objects stored in CAS.
12+
13+
The basic unit of the CAS library is a CASObject, which contains:
14+
15+
* Data: arbitrary data
16+
* References: references to other CASObjects
17+
18+
It can be conceptually modeled as something like:
19+
20+
```
21+
struct CASObject {
22+
ArrayRef<char> Data;
23+
ArrayRef<CASObject*> Refs;
24+
};
25+
```
26+
27+
With this abstraction, it is possible to compose `CASObject`s into a DAG that is
28+
capable of representing complicated data structures, while still allowing data
29+
deduplication. Note you can compare two DAGs by just comparing the CASObject
30+
hash of two root nodes.
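
As a rough illustration of the DAG idea above, here is a minimal, self-contained sketch. It is not the LLVM implementation: `ToyCAS`, `ToyNode`, and `fnv1a` are hypothetical names, and the non-cryptographic FNV-1a hash stands in for a real hash such as BLAKE3. The address of a node is derived from its data and from the addresses of the nodes it references, so equal DAGs end up with equal root addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy 64-bit FNV-1a hash. Assumption: a real CAS would use a cryptographic
// hash; FNV-1a only keeps this sketch short.
using Hash = std::uint64_t;

inline Hash fnv1a(const std::string &Bytes) {
  Hash H = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
  for (unsigned char C : Bytes) {
    H ^= C;
    H *= 0x100000001b3ULL; // FNV-1a 64-bit prime
  }
  return H;
}

// Hypothetical node: arbitrary data plus the addresses of referenced nodes.
struct ToyNode {
  std::string Data;
  std::vector<Hash> Refs;
};

// Hypothetical in-memory store keyed by content address.
class ToyCAS {
public:
  // Store a node and return its address. The address covers the references
  // as well as the data, so it identifies the whole sub-DAG below the node.
  Hash store(const std::string &Data, const std::vector<Hash> &Refs = {}) {
    std::string Buf = std::to_string(Refs.size()) + ":";
    for (Hash R : Refs)
      Buf += std::to_string(R) + ",";
    Buf += std::to_string(Data.size()) + ":" + Data;
    Hash H = fnv1a(Buf);
    // emplace() is a no-op when the address already exists (deduplication).
    auto It = Nodes.emplace(H, ToyNode{Data, Refs}).first;
    // A toy hash can collide; check that an existing entry really matches.
    assert(It->second.Data == Data && It->second.Refs == Refs);
    (void)It;
    return H;
  }

  std::size_t size() const { return Nodes.size(); }

private:
  std::map<Hash, ToyNode> Nodes;
};
```

Storing the same leaves and the same root twice returns the same root address and leaves `size()` unchanged, while changing any leaf changes the root address. That is why comparing two root hashes is enough to compare two DAGs.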
31+
32+
33+
## LLVM CAS Library User Guide
34+
35+
The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`.
36+
To reference a CASObject, there are a few different abstractions provided
37+
with different trade-offs:
38+
39+
### ObjectRef
40+
41+
`ObjectRef` is a lightweight reference to a CASObject stored in the CAS.
42+
This is the most commonly used abstraction and it is cheap to copy/pass
43+
along. It has the following properties:
44+
45+
* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
46+
`ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or
47+
compared.
48+
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An
49+
explicit load is required before accessing the data stored in CASObject.
50+
This load can also fail, for reasons like (but not limited to): object does
51+
not exist, corrupted CAS storage, operation timeout, etc.
52+
* If two `ObjectRef`s are equal, it is guaranteed that the objects they point to
53+
are identical (if they exist). If they are not equal, the underlying objects are
54+
guaranteed to be different.
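
The properties listed above can be sketched with a small stand-alone model. This is an illustration only, not the `llvm::cas` API: `ToyRef`, `ToyStore`, `refForHash`, and `load` are hypothetical names, and `std::hash` stands in for the real object hash. A reference is just an index into one store's table, it can exist before the object's data does, and loading is an explicit step that may fail.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for ObjectRef: a small, cheap-to-copy index that only
// has meaning inside the ToyStore that issued it.
struct ToyRef {
  std::size_t Index;
  bool operator==(const ToyRef &Other) const { return Index == Other.Index; }
};

class ToyStore {
public:
  // Assumption: std::hash stands in for a real content hash.
  static std::size_t hashOf(const std::string &Content) {
    return std::hash<std::string>{}(Content);
  }

  // Obtain a reference from a hash alone. The object's data may be absent.
  ToyRef refForHash(std::size_t H) {
    auto [It, Inserted] = Indices.try_emplace(H, Objects.size());
    if (Inserted)
      Objects.emplace_back(); // identity is known, data is not present yet
    return ToyRef{It->second};
  }

  // Store content. Equal content always yields an equal reference.
  ToyRef store(const std::string &Content) {
    ToyRef R = refForHash(hashOf(Content));
    Objects[R.Index] = Content;
    return R;
  }

  // Explicit load. It can fail because a reference does not guarantee that
  // the object it names exists in this store.
  std::optional<std::string> load(ToyRef R) const {
    if (R.Index >= Objects.size())
      return std::nullopt;
    return Objects[R.Index];
  }

private:
  std::unordered_map<std::size_t, std::size_t> Indices;
  std::vector<std::optional<std::string>> Objects;
};
```

In this model the first object stored in any `ToyStore` gets index 0, so two references from different stores can compare equal while naming unrelated objects. That is the reason cross-store comparison is meaningless.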
55+
56+
### ObjectProxy
57+
58+
`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the
59+
underlying stored data and references can be accessed without the need
60+
of error handling. The class APIs also provide convenient methods to
61+
access underlying data. The lifetime of the underlying data is equal to
62+
the lifetime of the instance of `ObjectStore` unless explicitly copied.
63+
64+
### CASID
65+
66+
`CASID` is the hash identifier for CASObjects. It owns the underlying
67+
storage for the hash value, so it can be expensive to copy and compare depending
68+
on the hash algorithm. `CASID` is generally only useful in rare situations
69+
like printing the raw hash value or exchanging hash values between different
70+
CAS instances with the same hashing schema.
71+
72+
### ObjectStore
73+
74+
`ObjectStore` is the CAS-like object storage. It provides APIs to save
75+
and load CASObjects, for example:
76+
77+
```
78+
ObjectRef A, B, C;
79+
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B});
80+
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C);
81+
```
82+
83+
It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and
84+
`CASID`.
85+
86+
87+
88+
## CAS Library Implementation Guide
89+
90+
The LLVM ObjectStore API was designed so that it is easy to add
91+
customized CAS implementations that are interchangeable with the builtin
92+
ones.
93+
94+
To add your own implementation, you just need to add a subclass to
95+
`llvm::cas::ObjectStore` and implement all its pure virtual methods.
96+
To be interchangeable with LLVM ObjectStore, the new CAS implementation
97+
needs to conform to the following contracts:
98+
99+
* Different CASObjects stored in the ObjectStore need to have a different hash
100+
and result in a different `ObjectRef`. Similarly, the same CASObject should have
101+
the same hash and the same `ObjectRef`. Note: two different CASObjects with
102+
identical data but different references are considered different objects.
103+
* `ObjectRef`s are only comparable within the same `ObjectStore` instance, and
104+
can be used to determine the equality of the underlying CASObjects.
105+
* The loaded objects from the ObjectStore need to have a lifetime at least as
106+
long as the ObjectStore itself so it is always legal to access the loaded data
107+
without holding on to the `ObjectProxy` until the `ObjectStore` is destroyed.
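
One way to satisfy the first contract is to hash an unambiguous serialization of both the references and the data. The sketch below is only loosely modeled on the size-prefixed, little-endian scheme visible in `BuiltinObjectHasher::updateSize`. The helper names `appendSize` and `hashInput` are hypothetical, and the exact field order is an assumption rather than the builtin format.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 64-bit size in little-endian byte order, so the byte stream is the
// same regardless of the host's endianness.
inline void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int I = 0; I < 8; ++I)
    Out.push_back(static_cast<std::uint8_t>(Size >> (8 * I)));
}

// Hypothetical helper: build the byte stream that a hasher would consume for
// one object. Layout (an assumption): reference count, each referenced
// object's hash, data size, data bytes.
inline Bytes hashInput(const std::vector<Bytes> &RefHashes,
                       const std::string &Data) {
  Bytes Out;
  appendSize(Out, RefHashes.size());
  for (const Bytes &H : RefHashes)
    Out.insert(Out.end(), H.begin(), H.end());
  appendSize(Out, Data.size());
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

Because the references are part of the hashed bytes, two objects with identical data but different references produce different inputs and therefore, collisions aside, different hashes. The size prefixes keep the boundary between references and data unambiguous.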
108+
109+
110+
If not specified, the behavior can be implementation defined. For example,
111+
`ObjectRef` can be used to point to a loaded CASObject so
112+
`ObjectStore` never fails to load. It is also legal to use a stricter model
113+
than required. For example, the underlying value inside `ObjectRef` can be
114+
the unique identities of the objects across multiple `ObjectStore` instances,
115+
but comparing such `ObjectRef`s from different `ObjectStore`s is still illegal.
116+
117+
For CAS library implementers, there is also an `ObjectHandle` class that
118+
is an internal representation of a loaded CASObject reference.
119+
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, and
120+
just like `ObjectRef`, `ObjectHandle` is only useful when paired with
121+
the `ObjectStore` that knows about the loaded CASObject.

llvm/docs/Reference.rst

Lines changed: 4 additions & 0 deletions
@@ -17,6 +17,7 @@ LLVM and API reference documentation.
1717
CalleeTypeMetadata
1818
CIBestPractices
1919
CommandGuide/index
20+
ContentAddressableStorage
2021
ConvergenceAndUniformity
2122
ConvergentOperations
2223
Coroutines
@@ -244,3 +245,6 @@ Additional Topics
244245
:doc:`MLGO`
245246
Facilities for ML-Guided Optimization, such as collecting IR corpora from a
246247
build, interfacing with ML models, and exposing features for training.
248+
249+
:doc:`ContentAddressableStorage`
250+
A reference guide for using LLVM's CAS library.

llvm/include/llvm/CAS/BuiltinCASContext.h

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ namespace llvm::cas::builtin {
5454
using HasherT = BLAKE3;
5555
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
5656

57+
/// CASContext for LLVM builtin CAS using BLAKE3 hash type.
5758
class BuiltinCASContext : public CASContext {
5859
void printIDImpl(raw_ostream &OS, const CASID &ID) const final;
5960
void anchor() override;

llvm/include/llvm/CAS/BuiltinObjectHasher.h

Lines changed: 6 additions & 8 deletions
@@ -9,13 +9,12 @@
99
#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H
1010
#define LLVM_CAS_BUILTINOBJECTHASHER_H
1111

12-
#include "llvm/ADT/StringRef.h"
1312
#include "llvm/CAS/ObjectStore.h"
1413
#include "llvm/Support/Endian.h"
1514

16-
namespace llvm {
17-
namespace cas {
15+
namespace llvm::cas {
1816

17+
/// Hasher for stored objects in builtin CAS.
1918
template <class HasherT> class BuiltinObjectHasher {
2019
public:
2120
using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>()));
@@ -64,21 +63,20 @@ template <class HasherT> class BuiltinObjectHasher {
6463

6564
void updateArray(ArrayRef<char> Bytes) {
6665
updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()),
67-
Bytes.size()));
66+
Bytes.size()));
6867
}
6968

7069
void updateSize(uint64_t Size) {
7170
Size = support::endian::byte_swap(Size, endianness::little);
72-
Hasher.update(ArrayRef(reinterpret_cast<const uint8_t *>(&Size),
73-
sizeof(Size)));
71+
Hasher.update(
72+
ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size)));
7473
}
7574

7675
BuiltinObjectHasher() = default;
7776
~BuiltinObjectHasher() = default;
7877
HasherT Hasher;
7978
};
8079

81-
} // namespace cas
82-
} // namespace llvm
80+
} // namespace llvm::cas
8381

8482
#endif // LLVM_CAS_BUILTINOBJECTHASHER_H

llvm/include/llvm/CAS/CASID.h

Lines changed: 9 additions & 5 deletions
@@ -55,18 +55,21 @@ class CASContext {
5555
/// compared directly. If they are, then \a
5656
/// CASIDContext::getHashSchemaIdentifier() is compared to see if they can be
5757
/// compared by hash, in which case the result of \a getHash() is compared.
58-
///
59-
/// FIXME: Rename to ObjectID (and rename file to CASObjectID.h?).
6058
class CASID {
6159
public:
6260
void dump() const;
63-
void print(raw_ostream &OS) const {
64-
return getContext().printIDImpl(OS, *this);
65-
}
61+
6662
friend raw_ostream &operator<<(raw_ostream &OS, const CASID &ID) {
6763
ID.print(OS);
6864
return OS;
6965
}
66+
67+
/// Print CASID.
68+
void print(raw_ostream &OS) const {
69+
return getContext().printIDImpl(OS, *this);
70+
}
71+
72+
/// Return a printable string for CASID.
7073
std::string toString() const;
7174

7275
ArrayRef<uint8_t> getHash() const {
@@ -110,6 +113,7 @@ class CASID {
110113

111114
CASID() = delete;
112115

116+
/// Create CASID from CASContext and raw hash bytes.
113117
static CASID create(const CASContext *Context, StringRef Hash) {
114118
return CASID(Context, Hash);
115119
}

llvm/include/llvm/CAS/CASReference.h

Lines changed: 3 additions & 9 deletions
@@ -20,7 +20,6 @@ class raw_ostream;
2020
namespace cas {
2121

2222
class ObjectStore;
23-
2423
class ObjectHandle;
2524
class ObjectRef;
2625

@@ -41,8 +40,9 @@ class ReferenceBase {
4140
return InternalRef;
4241
}
4342

43+
/// Helper functions for DenseMapInfo.
4444
unsigned getDenseMapHash() const {
45-
return (unsigned)llvm::hash_value(InternalRef);
45+
return static_cast<unsigned>(llvm::hash_value(InternalRef));
4646
}
4747
bool isDenseMapEmpty() const { return InternalRef == getDenseMapEmptyRef(); }
4848
bool isDenseMapTombstone() const {
@@ -89,7 +89,7 @@ class ReferenceBase {
8989
#endif
9090
};
9191

92-
/// Reference to an object in a \a ObjectStore instance.
92+
/// Reference to an object in an \a ObjectStore instance.
9393
///
9494
/// If you have an ObjectRef, you know the object exists, and you can point at
9595
/// it from new nodes with \a ObjectStore::store(), but you don't know anything
@@ -105,12 +105,6 @@ class ReferenceBase {
105105
/// ObjectHandle, a variant that knows what kind of entity it is. \a
106106
/// ObjectStore::getReferenceKind() can expect the type of reference without
107107
/// asking for unloaded objects to be loaded.
108-
///
109-
/// This is a wrapper around a \c uint64_t (and a \a ObjectStore instance when
110-
/// assertions are on). If necessary, it can be deconstructed and reconstructed
111-
/// using \a Reference::getInternalRef() and \a
112-
/// Reference::getFromInternalRef(), but clients aren't expected to need to do
113-
/// this. These both require the right \a ObjectStore instance.
114108
class ObjectRef : public ReferenceBase {
115109
struct DenseMapTag {};
116110

llvm/include/llvm/CAS/FileSystemCache.h

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@
1414
#include "llvm/ADT/ScopeExit.h"
1515
#include "llvm/ADT/StringMap.h"
1616
#include "llvm/CAS/CASReference.h"
17-
#include "llvm/CAS/HashMappedTrie.h"
1817
#include "llvm/CAS/ThreadSafeAllocator.h"
1918
#include "llvm/Support/AlignOf.h"
2019
#include "llvm/Support/Allocator.h"
