[CAS] Add LLVMCAS library with InMemoryCAS implementation #114096
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,120 @@ | ||||||||||||||||||||||||||||||||||
# Content Addressable Storage | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
## Introduction to CAS | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
Content Addressable Storage, or `CAS`, is a storage system that assigns | ||||||||||||||||||||||||||||||||||
unique addresses to the data it stores. It is very useful for data deduplication | ||||||||||||||||||||||||||||||||||
and for creating unique identifiers. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
Unlike other kinds of storage systems, such as file systems, CAS is immutable. | ||||||||||||||||||||||||||||||||||
Modeling a computation is more reliable when its inputs and outputs are | ||||||||||||||||||||||||||||||||||
represented as objects stored in CAS. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
The basic unit of the CAS library is a CASObject, which contains: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
* Data: arbitrary data | ||||||||||||||||||||||||||||||||||
* References: references to other CASObjects | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
It can be conceptually modeled as something like: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||
struct CASObject { | ||||||||||||||||||||||||||||||||||
ArrayRef<char> Data; | ||||||||||||||||||||||||||||||||||
ArrayRef<CASObject*> Refs; | ||||||||||||||||||||||||||||||||||
}; | ||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
This abstraction allows CASObjects to be composed into a DAG that | ||||||||||||||||||||||||||||||||||
represents complex data structures while still allowing data deduplication. | ||||||||||||||||||||||||||||||||||
Note that two DAGs can be compared by comparing only the CASObject hashes of | ||||||||||||||||||||||||||||||||||
their root nodes. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
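The root-hash comparison described above can be sketched with a toy Merkle-style hash. Everything here is an assumption made for illustration: `fnv1a` and `hashNode` are hypothetical names, and 64-bit FNV-1a stands in for a real cryptographic hash. It is not the `llvm::cas` API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy hash (64-bit FNV-1a). A real CAS would use a
// cryptographic hash; this is not part of llvm::cas.
inline void fnv1a(std::uint64_t &H, const void *Ptr, std::size_t Size) {
  const auto *Bytes = static_cast<const unsigned char *>(Ptr);
  for (std::size_t I = 0; I != Size; ++I) {
    H ^= Bytes[I];
    H *= 1099511628211ULL;
  }
}

// The hash of a node covers its child count, the child hashes, the data
// size, and the data, so it summarizes the whole DAG below the node.
inline std::uint64_t hashNode(const std::string &Data,
                              const std::vector<std::uint64_t> &Children = {}) {
  std::uint64_t H = 14695981039346656037ULL;
  std::uint64_t N = Children.size();
  fnv1a(H, &N, sizeof(N));
  for (std::uint64_t C : Children)
    fnv1a(H, &C, sizeof(C));
  std::uint64_t Size = Data.size();
  fnv1a(H, &Size, sizeof(Size));
  fnv1a(H, Data.data(), Data.size());
  return H;
}
```

Two DAGs built independently from the same leaves produce the same root value, so comparing `hashNode("root", {...})` results is enough to compare the whole structure in this toy model.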
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
## LLVM CAS Library User Guide | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
The CAS-like storage provided in LLVM is `llvm::cas::ObjectStore`. | ||||||||||||||||||||||||||||||||||
To reference a CASObject, a few different abstractions are provided, | ||||||||||||||||||||||||||||||||||
each with different trade-offs: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
### ObjectRef | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
`ObjectRef` is a lightweight reference to a CASObject stored in the CAS. | ||||||||||||||||||||||||||||||||||
This is the most commonly used abstraction, and it is cheap to copy and pass | ||||||||||||||||||||||||||||||||||
along. It has the following properties: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref. | ||||||||||||||||||||||||||||||||||
`ObjectRef`s created by different `ObjectStore`s cannot be cross-referenced or | ||||||||||||||||||||||||||||||||||
compared. | ||||||||||||||||||||||||||||||||||
* `ObjectRef` doesn't guarantee the existence of the CASObject it points to. An | ||||||||||||||||||||||||||||||||||
explicit load is required before accessing the data stored in the CASObject. | ||||||||||||||||||||||||||||||||||
This load can also fail, for reasons including but not limited to: the object | ||||||||||||||||||||||||||||||||||
does not exist, the CAS storage is corrupted, the operation times out, etc. | ||||||||||||||||||||||||||||||||||
* If two `ObjectRef`s are equal, the objects they point to (if they exist) are | ||||||||||||||||||||||||||||||||||
guaranteed to be identical. If they are not equal, the underlying objects are | ||||||||||||||||||||||||||||||||||
guaranteed to be different. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
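As a rough illustration of these properties, the sketch below models a reference as a cheap handle that is only meaningful to the store that created it and whose load may fail. `ToyRef`, `ToyStore`, and `referenceMissing` are hypothetical names invented for this example. The real `ObjectRef` does not necessarily carry an owner pointer, and the real API reports failures through `Expected`, not `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical toy types; none of these names come from llvm::cas.
struct ToyRef {
  const void *Owner; // The store that created this ref.
  std::size_t Index; // Cheap-to-copy handle, meaningless without Owner.
};

class ToyStore {
  std::unordered_map<std::size_t, std::string> Loaded;
  std::size_t NextIndex = 0;

public:
  // Creates a ref to an object whose data is available locally.
  ToyRef store(std::string Data) {
    Loaded.emplace(NextIndex, std::move(Data));
    return ToyRef{this, NextIndex++};
  }

  // Creates a ref to an object that is known only by identity, for example
  // one that would have to be fetched from elsewhere. Loading it fails here.
  ToyRef referenceMissing() { return ToyRef{this, NextIndex++}; }

  // Explicit load: returns std::nullopt on failure.
  std::optional<std::string> load(ToyRef Ref) const {
    if (Ref.Owner != this)
      return std::nullopt; // A ref from another store is not meaningful here.
    auto It = Loaded.find(Ref.Index);
    if (It == Loaded.end())
      return std::nullopt; // The ref exists, but the object cannot be loaded.
    return It->second;
  }
};
```

Holding a `ToyRef` proves nothing about whether the object can be read, so callers check the result of `load` every time.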
### ObjectProxy | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
`ObjectProxy` represents a loaded CASObject. With an `ObjectProxy`, the | ||||||||||||||||||||||||||||||||||
underlying stored data and references can be accessed without the need | ||||||||||||||||||||||||||||||||||
for error handling. The class APIs also provide convenient methods to | ||||||||||||||||||||||||||||||||||
access underlying data. The lifetime of the underlying data is equal to | ||||||||||||||||||||||||||||||||||
the lifetime of the instance of `ObjectStore` unless explicitly copied. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
### CASID | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
`CASID` is the hash identifier for CASObjects. It owns the underlying | ||||||||||||||||||||||||||||||||||
storage for the hash value, so it can be expensive to copy and compare depending | ||||||||||||||||||||||||||||||||||
on the hash algorithm. `CASID` is generally only useful in rare situations, | ||||||||||||||||||||||||||||||||||
such as printing the raw hash value or exchanging hash values between different | ||||||||||||||||||||||||||||||||||
CAS instances with the same hashing schema. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
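Printing and exchanging hash values implies some textual form for the digest. The sketch below assumes a lowercase hexadecimal encoding. `printDigest` and `parseDigest` are hypothetical helpers, and the textual form actually used by the CAS library may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helpers; the textual form used by llvm::cas may differ.
inline std::string printDigest(const std::vector<std::uint8_t> &Digest) {
  static const char Hex[] = "0123456789abcdef";
  std::string Out;
  for (std::uint8_t B : Digest) {
    Out += Hex[B >> 4];
    Out += Hex[B & 0xF];
  }
  return Out;
}

// Returns std::nullopt if Text is not an even-length lowercase hex string.
inline std::optional<std::vector<std::uint8_t>>
parseDigest(const std::string &Text) {
  if (Text.size() % 2 != 0)
    return std::nullopt;
  auto Nibble = [](char C) -> int {
    if (C >= '0' && C <= '9')
      return C - '0';
    if (C >= 'a' && C <= 'f')
      return C - 'a' + 10;
    return -1;
  };
  std::vector<std::uint8_t> Out;
  for (std::size_t I = 0; I < Text.size(); I += 2) {
    int Hi = Nibble(Text[I]), Lo = Nibble(Text[I + 1]);
    if (Hi < 0 || Lo < 0)
      return std::nullopt;
    Out.push_back(static_cast<std::uint8_t>(Hi * 16 + Lo));
  }
  return Out;
}
```

A digest printed by one instance can be parsed by another only if both agree on the hashing schema, which is why the text above restricts exchange to instances with the same schema.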
### ObjectStore | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
`ObjectStore` is the CAS-like object storage. It provides APIs to store | ||||||||||||||||||||||||||||||||||
and load CASObjects, for example: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||
ObjectRef A, B, C; | ||||||||||||||||||||||||||||||||||
Expected<ObjectRef> Stored = ObjectStore.store("data", {A, B}); | ||||||||||||||||||||||||||||||||||
Expected<ObjectProxy> Loaded = ObjectStore.getProxy(C); | ||||||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
It also provides APIs to convert between `ObjectRef`, `ObjectProxy` and | ||||||||||||||||||||||||||||||||||
`CASID`. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
## CAS Library Implementation Guide | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
The LLVM ObjectStore APIs are designed so that it is easy to add | ||||||||||||||||||||||||||||||||||
custom CAS implementations that are interchangeable with the builtin | ||||||||||||||||||||||||||||||||||
CAS implementations. | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
To add your own implementation, you just need to subclass | ||||||||||||||||||||||||||||||||||
`llvm::cas::ObjectStore` and implement all its pure virtual methods. | ||||||||||||||||||||||||||||||||||
To be interchangeable with the LLVM ObjectStore, the new CAS implementation | ||||||||||||||||||||||||||||||||||
needs to conform to the following contracts: | ||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
* Different CASObjects stored in the ObjectStore need to have different hashes | ||||||||||||||||||||||||||||||||||
and result in different `ObjectRef`s. Conversely, the same CASObject should have | ||||||||||||||||||||||||||||||||||
the same hash and the same `ObjectRef`. Note that two CASObjects with identical | ||||||||||||||||||||||||||||||||||
data but different references are considered different objects. | ||||||||||||||||||||||||||||||||||
* `ObjectRef`s are comparable within the same `ObjectStore` instance, and can | ||||||||||||||||||||||||||||||||||
be used to determine the equality of the underlying CASObjects. | ||||||||||||||||||||||||||||||||||
* The objects loaded from the ObjectStore need to have a lifetime at | ||||||||||||||||||||||||||||||||||
least as long as the ObjectStore itself. | ||||||||||||||||||||||||||||||||||
On the last point, isn't it fine for the loaded objects to live only as long as their last reference? I'd imagine if the ObjectStore was long lived, it would be fine if the loaded objects lifetime was only a subset of the ObjectStore, but could not exceed the lifetime of the ObjectStore, right? The way this is written reads as if a loaded object must outlive the ObjectStore. If that is accurate, I think you may want to expand on that a bit to clarify.

To the last point, currently the loaded object
Define a shorter lifetime makes sense (as a relaxed lifetime is also legal), but we should probably add some kind of error checking to catch the case in my example above to make CAS usage portable.

Hmm, that's a bit tricky then. For now updating the text to clarify those points will help a lot. |
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
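The contracts listed above can be sketched with a toy interface and one in-memory subclass. All names here (`ToyRef`, `ToyObject`, `ToyObjectStore`, `InMemoryToyStore`) are hypothetical stand-ins, and the real `llvm::cas::ObjectStore` has a different, larger set of virtual methods. The sketch assumes the contract wording as written: equal content gives equal refs, data with different references gives different refs, and loaded objects stay valid for as long as the store lives.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not the llvm::cas classes.
struct ToyRef {
  std::size_t Index;
  bool operator==(const ToyRef &RHS) const { return Index == RHS.Index; }
  bool operator<(const ToyRef &RHS) const { return Index < RHS.Index; }
};

struct ToyObject {
  std::string Data;
  std::vector<ToyRef> Refs;
};

// Abstract interface that concrete stores implement.
class ToyObjectStore {
public:
  virtual ~ToyObjectStore() = default;
  virtual ToyRef store(std::vector<ToyRef> Refs, std::string Data) = 0;
  // Returns nullptr if the object cannot be loaded.
  virtual const ToyObject *load(ToyRef Ref) const = 0;
};

class InMemoryToyStore final : public ToyObjectStore {
  // unique_ptr keeps each object's address stable for the store's lifetime.
  std::vector<std::unique_ptr<ToyObject>> Objects;
  // Maps full content (data plus refs) to the index of the stored object.
  std::map<std::pair<std::string, std::vector<ToyRef>>, std::size_t> Index;

public:
  ToyRef store(std::vector<ToyRef> Refs, std::string Data) override {
    auto Key = std::make_pair(Data, Refs);
    auto It = Index.find(Key);
    if (It != Index.end())
      return ToyRef{It->second}; // Same content yields the same ref.
    Objects.push_back(std::make_unique<ToyObject>(
        ToyObject{std::move(Data), std::move(Refs)}));
    Index.emplace(std::move(Key), Objects.size() - 1);
    return ToyRef{Objects.size() - 1};
  }

  const ToyObject *load(ToyRef Ref) const override {
    return Ref.Index < Objects.size() ? Objects[Ref.Index].get() : nullptr;
  }
};
```

In this sketch a pointer returned by `load` remains valid after further `store` calls, because each object is heap-allocated once and owned by the store until the store is destroyed.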
If not specified, the behavior can be implementation defined. For example, | ||||||||||||||||||||||||||||||||||
an `ObjectRef` can be used to point to a loaded CASObject so that the | ||||||||||||||||||||||||||||||||||
`ObjectStore` never fails to load. It is also legal to use a stricter model | ||||||||||||||||||||||||||||||||||
than required. For example, an `ObjectRef` that can be used to compare | ||||||||||||||||||||||||||||||||||
objects between different `ObjectStore` instances is legal, but users | ||||||||||||||||||||||||||||||||||
of the ObjectStore should not depend on this behavior. | ||||||||||||||||||||||||||||||||||
This part seems to conflict with the statements above, where you say that these comparisons won't work. I'm not sure what you mean by saying they're

Let me know if the newer explanation works better or is it just better left it off.

It is better, but I think it still conflicts with one of your earlier statements:

* `ObjectRef` is only meaningful within the `ObjectStore` that created the ref.
`ObjectRef` created by different `ObjectStore` cannot be cross-referenced or
compared.

Maybe some of the document just needs to be reorganized to present this in a more cohesive way? Or perhaps you just need to soften the above to say "typically" or something to convey that its not a requirement? |
||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||
For CAS library implementers, there is also an `ObjectHandle` class that | ||||||||||||||||||||||||||||||||||
is an internal representation of a loaded CASObject reference. | ||||||||||||||||||||||||||||||||||
`ObjectProxy` is just a pair of `ObjectHandle` and `ObjectStore`, because | ||||||||||||||||||||||||||||||||||
just like `ObjectRef`, `ObjectHandle` is only useful when paired with | ||||||||||||||||||||||||||||||||||
the ObjectStore that knows about the loaded CASObject. | ||||||||||||||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
//===- BuiltinCASContext.h --------------------------------------*- C++ -*-===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
|
||
#ifndef LLVM_CAS_BUILTINCASCONTEXT_H | ||
#define LLVM_CAS_BUILTINCASCONTEXT_H | ||
|
||
#include "llvm/CAS/CASID.h" | ||
#include "llvm/Support/BLAKE3.h" | ||
#include "llvm/Support/Error.h" | ||
|
||
namespace llvm::cas::builtin { | ||
|
||
/// Current hash type for the builtin CAS. | ||
/// | ||
/// FIXME: This should be configurable via an enum to allow configuring the hash | ||
/// function. The enum should be sent into \a createInMemoryCAS() and \a | ||
/// createOnDiskCAS(). | ||
/// | ||
/// This is important (at least) for future-proofing, when we want to make new | ||
/// CAS instances use BLAKE7, but still know how to read/write BLAKE3. | ||
/// | ||
/// Even just for BLAKE3, it would be useful to have these values: | ||
/// | ||
/// BLAKE3 => 32B hash from BLAKE3 | ||
/// BLAKE3_16B => 16B hash from BLAKE3 (truncated) | ||
/// | ||
/// ... where BLAKE3_16B uses \a TruncatedBLAKE3<16>. | ||
/// | ||
/// Motivation for a truncated hash is that it's cheaper to store. It's not | ||
/// clear if we always (or ever) need the full 32B, and for an ephemeral | ||
/// in-memory CAS, we almost certainly don't need it. | ||
/// | ||
/// Note that the cost is linear in the number of objects for the builtin CAS, | ||
/// since we're using internal offsets and/or pointers as an optimization. | ||
/// | ||
/// However, it's possible we'll want to hook up a local builtin CAS to, e.g., | ||
/// a distributed generic hash map to use as an ActionCache. In that scenario, | ||
/// the transitive closure of the structured objects that are the results of | ||
/// the cached actions would need to be serialized into the map, something | ||
/// like: | ||
/// | ||
/// "action:<schema>:<key>" -> "0123" | ||
/// "object:<schema>:0123" -> "3,4567,89AB,CDEF,9,some data" | ||
/// "object:<schema>:4567" -> ... | ||
/// "object:<schema>:89AB" -> ... | ||
/// "object:<schema>:CDEF" -> ... | ||
/// | ||
/// These references would be full cost. | ||
using HasherT = BLAKE3; | ||
using HashType = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>())); | ||
|
||
class BuiltinCASContext : public CASContext { | ||
void printIDImpl(raw_ostream &OS, const CASID &ID) const final; | ||
void anchor() override; | ||
|
||
public: | ||
/// Get the name of the hash for any table identifiers. | ||
/// | ||
/// FIXME: This should be configurable via an enum, with at least the following | ||
/// values: | ||
/// | ||
/// "BLAKE3" => 32B hash from BLAKE3 | ||
/// "BLAKE3.16" => 16B hash from BLAKE3 (truncated) | ||
/// | ||
/// Enum can be sent into \a createInMemoryCAS() and \a createOnDiskCAS(). | ||
static StringRef getHashName() { return "BLAKE3"; } | ||
StringRef getHashSchemaIdentifier() const final { | ||
static const std::string ID = | ||
("llvm.cas.builtin.v2[" + getHashName() + "]").str(); | ||
return ID; | ||
} | ||
|
||
static const BuiltinCASContext &getDefaultContext(); | ||
|
||
BuiltinCASContext() = default; | ||
|
||
static Expected<HashType> parseID(StringRef PrintedDigest); | ||
static void printID(ArrayRef<uint8_t> Digest, raw_ostream &OS); | ||
}; | ||
|
||
} // namespace llvm::cas::builtin | ||
|
||
#endif // LLVM_CAS_BUILTINCASCONTEXT_H |
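The header comment above sketches a serialized record of the form `"3,4567,89AB,CDEF,9,some data"`. A minimal sketch of producing such a record follows, assuming the fields are reference count, reference IDs, data size, and data, separated by commas. `serializeRecord` is a hypothetical helper, and the comment itself only says the format would be "something like" this.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record format modeled on the "3,4567,89AB,CDEF,9,some data"
// example in the header comment: ref count, ref IDs, data size, then data.
inline std::string serializeRecord(const std::vector<std::string> &RefIDs,
                                   const std::string &Data) {
  std::string Out = std::to_string(RefIDs.size());
  for (const std::string &ID : RefIDs)
    Out += "," + ID;
  // The explicit data size lets a reader take the data verbatim, even if the
  // data itself contains commas.
  Out += "," + std::to_string(Data.size()) + "," + Data;
  return Out;
}
```

With three references and the nine-byte payload `some data`, this reproduces the example string from the comment. Each reference is written out as a full ID, which is the "full cost" the comment refers to.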
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
//===- BuiltinObjectHasher.h ------------------------------------*- C++ -*-===// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
|
||
#ifndef LLVM_CAS_BUILTINOBJECTHASHER_H | ||
#define LLVM_CAS_BUILTINOBJECTHASHER_H | ||
|
||
#include "llvm/CAS/ObjectStore.h" | ||
#include "llvm/Support/Endian.h" | ||
|
||
namespace llvm::cas { | ||
|
||
template <class HasherT> class BuiltinObjectHasher { | ||
public: | ||
using HashT = decltype(HasherT::hash(std::declval<ArrayRef<uint8_t> &>())); | ||
|
||
static HashT hashObject(const ObjectStore &CAS, ArrayRef<ObjectRef> Refs, | ||
ArrayRef<char> Data) { | ||
BuiltinObjectHasher H; | ||
H.updateSize(Refs.size()); | ||
for (const ObjectRef &Ref : Refs) | ||
H.updateRef(CAS, Ref); | ||
H.updateArray(Data); | ||
return H.finish(); | ||
} | ||
|
||
static HashT hashObject(ArrayRef<ArrayRef<uint8_t>> Refs, | ||
ArrayRef<char> Data) { | ||
BuiltinObjectHasher H; | ||
H.updateSize(Refs.size()); | ||
for (const ArrayRef<uint8_t> &Ref : Refs) | ||
H.updateID(Ref); | ||
H.updateArray(Data); | ||
return H.finish(); | ||
} | ||
|
||
private: | ||
HashT finish() { return Hasher.final(); } | ||
|
||
void updateRef(const ObjectStore &CAS, ObjectRef Ref) { | ||
updateID(CAS.getID(Ref)); | ||
} | ||
|
||
void updateID(const CASID &ID) { updateID(ID.getHash()); } | ||
|
||
void updateID(ArrayRef<uint8_t> Hash) { | ||
// NOTE: Does not hash the size of the hash. That's a CAS implementation | ||
// detail that shouldn't leak into the UUID for an object. | ||
assert(Hash.size() == sizeof(HashT) && | ||
"Expected object ref to match the hash size"); | ||
Hasher.update(Hash); | ||
} | ||
|
||
void updateArray(ArrayRef<uint8_t> Bytes) { | ||
updateSize(Bytes.size()); | ||
Hasher.update(Bytes); | ||
} | ||
|
||
void updateArray(ArrayRef<char> Bytes) { | ||
updateArray(ArrayRef(reinterpret_cast<const uint8_t *>(Bytes.data()), | ||
Bytes.size())); | ||
} | ||
|
||
void updateSize(uint64_t Size) { | ||
Size = support::endian::byte_swap(Size, endianness::little); | ||
Hasher.update( | ||
ArrayRef(reinterpret_cast<const uint8_t *>(&Size), sizeof(Size))); | ||
} | ||
|
||
BuiltinObjectHasher() = default; | ||
~BuiltinObjectHasher() = default; | ||
HasherT Hasher; | ||
}; | ||
|
||
} // namespace llvm::cas | ||
|
||
#endif // LLVM_CAS_BUILTINOBJECTHASHER_H |
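The update order in `hashObject()` (reference count, raw reference hashes, data size, data) can be made visible by collecting the bytes instead of hashing them. The sketch below is standalone and uses hypothetical names (`Bytes`, `appendSize`, `frameObject`). It only mirrors the framing and does not call BLAKE3 or any LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper names; this mirrors the update order of hashObject()
// but collects the bytes instead of feeding a hasher.
using Bytes = std::vector<std::uint8_t>;

// Appends Size as 8 little-endian bytes, independent of host endianness.
inline void appendSize(Bytes &Out, std::uint64_t Size) {
  for (int I = 0; I != 8; ++I)
    Out.push_back(static_cast<std::uint8_t>(Size >> (8 * I)));
}

inline Bytes frameObject(const std::vector<Bytes> &RefHashes,
                         const std::string &Data) {
  Bytes Out;
  appendSize(Out, RefHashes.size());  // Number of references.
  for (const Bytes &Hash : RefHashes) // Raw ref hashes, no per-hash size.
    Out.insert(Out.end(), Hash.begin(), Hash.end());
  appendSize(Out, Data.size());       // Data size, then the data itself.
  Out.insert(Out.end(), Data.begin(), Data.end());
  return Out;
}
```

Because the counts are part of the stream, an object with one reference `a` and data `b` frames differently from an object with no references and data `ab`, even though the payload bytes are the same. This is consistent with the contract that identical data with different references is a different object.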