Commit 5dab328

Lookup table v1 implementation
This lookup table implementation is meant to replace the current lexicon data structure. The overall concept is the same; the aim here is to improve the binary format to allow for extensions, and to improve the interaction with the table from the code, as well as naming convention.

This change includes the data structure implementing the mapping concept, a CLI tool to build a table and read data from it, and a unit test suite, as well as CLI tests.

In this particular change, we do not introduce any breaking changes. The new code is not used in the already existing tools and workflows. This work will be done in the future.

Changelog-added: New lookup table implementation available
Signed-off-by: Michal Siedlaczek <michal@siedlaczek.me>
1 parent 4f13222 commit 5dab328

File tree

14 files changed

+1585
-4
lines changed

docs/src/SUMMARY.md

Lines changed: 4 additions & 0 deletions
@@ -46,3 +46,7 @@
 - [`taily-stats`](cli/taily-stats.md)
 - [`taily-thresholds`](cli/taily-thresholds.md)
 - [`thresholds`](cli/thresholds.md)
+
+# Specifications
+
+- [Lookup Table](specs/lookup-table.md)

docs/src/specs/lookup-table.md

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
# Lookup Table Format Specification

A lookup table is a bidirectional mapping from an index, representing an
internal ID, to a binary payload, such as a string. E.g., an `N`-element
lookup table maps values `0...N-1` to their payloads. These tables are
used for things like mapping terms to term IDs and document IDs to
titles or URLs.

The format of a lookup table is designed to operate without having to
parse the entire structure. Once the header is parsed, it is possible to
operate directly on the binary format to access the data. In fact, a
lookup table will typically be memory mapped. Therefore, it is possible
to perform a lookup (or reverse lookup) without loading the entire
structure into memory.

The header always begins as follows:

```
+--------+--------+-------- -+
|  0x87  |  Ver.  |   ...    |
+--------+--------+-------- -+
```

The first byte is a constant identifier. When reading, we can verify
whether this byte is correct to make sure we are using the correct type
of data structure.

The second byte is equal to the version of the format.

The remainder of the format is defined separately for each version. The
version is introduced in order to be able to update the format in the
future but still be able to read old formats for backwards
compatibility.

## v1

```
+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x87  |  0x01  |  Flags |                    0x00                    |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                Length                                 |
+--------+--------+--------+--------+--------+--------+--------+--------+
|                                                                       |
|                                Offsets                                |
|                                                                       |
+-----------------------------------------------------------------------+
|                                                                       |
|                               Payloads                                |
|                                                                       |
+-----------------------------------------------------------------------+
```

Immediately after the version byte comes the flags byte.

```
 MSB                         LSB
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | W | S |
+---+---+---+---+---+---+---+---+
```

The least significant bit (`S`) indicates whether the payloads are sorted
(1) or not (0). The next bit (`W`) defines the width of offsets (see
below): 32-bit (0) or 64-bit (1). In most use cases, the cumulative size
of the payloads will be small enough to address it by 32-bit offsets. For
example, if we store words that are 16 bytes long on average, we can
address over 200 million of them. For this many elements, reducing the
width of the offsets would save us over 700 MB. Still, we want to
support 64-bit addressing because some payloads may be much longer
(e.g., URLs).
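
The figures quoted above can be checked with a short compile-time calculation. The constant and function names below are hypothetical; the 16-byte average comes from the example in the text.

```cpp
#include <cassert>
#include <cstdint>

// Total payload bytes addressable with 32-bit offsets.
inline constexpr std::uint64_t ADDRESSABLE_BYTES = std::uint64_t{1} << 32;

// Average payload size assumed in the example above.
inline constexpr std::uint64_t AVG_PAYLOAD_BYTES = 16;

// Upper bound on the number of payloads of that average size.
inline constexpr std::uint64_t MAX_PAYLOADS = ADDRESSABLE_BYTES / AVG_PAYLOAD_BYTES;

// Bytes saved by storing the N+1 offsets in 4 bytes each instead of 8.
constexpr auto offset_savings(std::uint64_t n) -> std::uint64_t {
    return (n + 1) * (8 - 4);
}

static_assert(MAX_PAYLOADS == 268'435'456);
static_assert(offset_savings(200'000'000) == 800'000'004);
```

So 32-bit offsets cover roughly 268 million 16-byte payloads, and for 200 million elements the narrower offsets save about 800 million bytes (roughly 763 MiB).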

The rest of the bits in the flags byte are currently not used, but
should be set to 0 to make sure that if more flags are introduced, we
know what values to expect in the older iterations, and thus we can make
sure to keep it backwards-compatible.
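
A minimal sketch of decoding the flags byte follows. The mask names mirror the `SORTED` and `WIDE_OFFSETS` constants declared in `lookup_table.hpp`, but `DecodedFlags` and `decode_flags` are hypothetical. Rejecting a byte with reserved bits set is an assumption of this sketch; the specification only says those bits should be 0.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

inline constexpr std::uint8_t SORTED = 0b01;        // bit S (least significant)
inline constexpr std::uint8_t WIDE_OFFSETS = 0b10;  // bit W

// Hypothetical decoded form of the flags byte.
struct DecodedFlags {
    bool sorted;
    bool wide_offsets;
};

// Decodes the flags byte; returns std::nullopt if any reserved bit is set
// (an assumption of this sketch, not a rule stated by the format).
[[nodiscard]] constexpr auto decode_flags(std::uint8_t bits) -> std::optional<DecodedFlags> {
    constexpr std::uint8_t known = SORTED | WIDE_OFFSETS;
    if ((bits & static_cast<std::uint8_t>(~known)) != 0) {
        return std::nullopt;
    }
    return DecodedFlags{(bits & SORTED) != 0, (bits & WIDE_OFFSETS) != 0};
}
```

For instance, `decode_flags(0b01)` describes a sorted table with 32-bit offsets, while `decode_flags(0b100)` is rejected because a reserved bit is set.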

The following 5 bytes are padding with values of 0. This is to help with
byte alignment. When loaded to memory, it should be loaded with 8-byte
alignment. When memory mapped, it should be already correctly aligned by
the operating system (at least on Linux).

Following the padding, there is a 64-bit unsigned integer encoding the
number of elements in the table (`N`).

Given `N` and `W`, we can now calculate the byte range of all offsets,
and thus the address offset for the start of the payloads. The offsets
are `N+1` little-endian unsigned integers of size determined by `W`
(either 4 or 8 bytes). The offsets are associated with consecutive IDs
from 0 to `N-1`; the last of the `N+1` offsets points to the first byte
after the last payload. The offsets are relative to the beginning of the
first payload, therefore the first offset will always be 0.

Payloads are arbitrary bytes, and must be interpreted by the software.
Although the typical use case is strings, this can be any binary
payload. Note that in case of strings, they will not be 0-terminated
unless they were specifically stored as such. Although this should be
clear from the fact that a payload is simply a sequence of bytes, it is
only prudent to point it out. One must be extremely careful when using
C-style strings, as their use is contingent on correct values being
inserted and encoded in the first place, and assuming 0-terminated
strings may easily lead to undefined behavior. Thus, it is recommended
to store strings without terminating them, and then interpret them as
string views (such as `std::string_view`) instead of C-style strings.

The boundaries of the k-th payload are defined by the values of k-th and
(k+1)-th offsets. Note that because of the additional offset that points
to immediately after the last payload, we can read offsets `k` and `k+1`
for any index `k < N` (recall that `N` is the number of elements).

If the payloads are sorted (S), we can find an ID of a certain payload
with a binary search. This is crucial for any application that requires
mapping from payloads to their position in the table.

include/pisa/io.hpp

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ template <typename Function>
 void for_each_line(std::istream& is, Function fn) {
     std::string line;
     while (std::getline(is, line)) {
-        fn(line);
+        fn(std::move(line));
     }
 }

include/pisa/lookup_table.hpp

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
// Copyright 2024 PISA developers
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once

#include <concepts>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <optional>
#include <ostream>
#include <span>

namespace pisa::lt {

namespace detail {

class BaseLookupTable {
  public:
    virtual ~BaseLookupTable() = default;
    [[nodiscard]] virtual auto size() const noexcept -> std::size_t = 0;
    [[nodiscard]] virtual auto operator[](std::size_t idx) const
        -> std::span<std::byte const> = 0;
    [[nodiscard]] virtual auto find(std::span<std::byte const> value) const noexcept
        -> std::optional<std::size_t> = 0;

    [[nodiscard]] virtual auto clone() -> std::unique_ptr<BaseLookupTable> = 0;
};

class BaseLookupTableEncoder {
  public:
    virtual ~BaseLookupTableEncoder() = default;
    virtual void insert(std::span<std::byte const> payload) = 0;
    virtual void encode(std::ostream& out) = 0;
};

}  // namespace detail

namespace v1 {

class Flags {
  private:
    std::uint8_t flags = 0;

  public:
    constexpr Flags() = default;
    explicit constexpr Flags(std::uint8_t bitset) : flags(bitset) {}

    [[nodiscard]] auto sorted() const noexcept -> bool;
    [[nodiscard]] auto wide_offsets() const noexcept -> bool;
    [[nodiscard]] auto bits() const noexcept -> std::uint8_t;
};

namespace flags {
inline constexpr std::uint8_t SORTED = 0b001;
inline constexpr std::uint8_t WIDE_OFFSETS = 0b010;
}  // namespace flags

}  // namespace v1

}  // namespace pisa::lt
74+
namespace pisa {
75+
76+
/**
77+
* Lookup table mapping integers from a range [0, N) to binary payloads.
78+
*
79+
* This table assigns each _unique_ value (duplicates are not allowed) to an index in [0, N), where
80+
* N is the size of the table. Thus, this structure is equivalent to a sequence of binary values.
81+
* The difference between `LookupTable` and, say, `std::vector` is that its encoding supports
82+
* reading the values without fully parsing the entire binary representation of the table. As such,
83+
* it supports quickly initializing the structure from an external device (with random access),
84+
* e.g., via mmap, and performing a lookup without loading the entire structure to main memory.
85+
* This is especially useful for short-lived programs that must perform a lookup without the
86+
* unnecessary overhead of loading it to memory.
87+
*
88+
* If the values are sorted, and the appropriate flag is toggled in the header, a quick binary
89+
* search lookup can be performed to find an index of a value. If the values are not sorted, then a
90+
* linear scan will be used; therefore, one should consider having values sorted if such lookups are
91+
* important. Getting the value at a given index is a constant-time operation, though if using
92+
* memory mapping, each such operation may need to load multiple pages to memory.
93+
*/
94+
class LookupTable {
95+
private:
96+
std::unique_ptr<::pisa::lt::detail::BaseLookupTable> m_impl;
97+
98+
explicit LookupTable(std::unique_ptr<::pisa::lt::detail::BaseLookupTable> impl);
99+
100+
[[nodiscard]] static auto v1(std::span<const std::byte> bytes) -> LookupTable;
101+
102+
public:
103+
LookupTable(LookupTable const&);
104+
LookupTable(LookupTable&&);
105+
LookupTable& operator=(LookupTable const&);
106+
LookupTable& operator=(LookupTable&&);
107+
~LookupTable();
108+
109+
/**
110+
* The number of elements in the table.
111+
*/
112+
[[nodiscard]] auto size() const noexcept -> std::size_t;
113+
114+
/**
115+
* Retrieves the value at index `idx`.
116+
*
117+
* If `idx < size()`, then `std::out_of_range` exception is thrown. See `at()` if you want to
118+
* conveniently cast the memory span to another type.
119+
*/
120+
[[nodiscard]] auto operator[](std::size_t idx) const -> std::span<std::byte const>;
121+
122+
/**
123+
* Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
124+
*
125+
* See the templated version of this function if you want to automatically cast from another
126+
* type to byte span.
127+
*/
128+
[[nodiscard]] auto find(std::span<std::byte const> value) const noexcept
129+
-> std::optional<std::size_t>;
130+
131+
/**
132+
* Returns the value at index `idx` cast to type `T`.
133+
*
134+
* The type `T` must define `T::value_type` that resolves to a byte-wide type, as well as a
135+
* constructor that takes `T::value_type const*` (pointer to the first byte) and `std::size_t`
136+
* (the total number of bytes). If `T::value_type` is longer than 1 byte, this operation results
137+
* in **undefined behavior**.
138+
*
139+
* Examples of types that can be used are: `std::string_view` or `std::span<const char>`.
140+
*/
141+
template <typename T>
142+
[[nodiscard]] auto at(std::size_t idx) const -> T {
143+
auto bytes = this->operator[](idx);
144+
return T(reinterpret_cast<typename T::value_type const*>(bytes.data()), bytes.size());
145+
}
146+
147+
/**
148+
* Returns the position of `value` in the table or `std::nullopt` if the value does not exist.
149+
*
150+
* The type `T` of the value must be such that `std:span<typename T::value_type const>` is
151+
* constructible from `T`.
152+
*/
153+
template <typename T>
154+
requires(std::constructible_from<std::span<typename T::value_type const>, T>)
155+
[[nodiscard]] auto find(T value) const noexcept -> std::optional<std::size_t> {
156+
return find(std::as_bytes(std::span<typename T::value_type const>(value)));
157+
}
158+
159+
/**
160+
* Constructs a lookup table from the encoded sequence of bytes.
161+
*/
162+
[[nodiscard]] static auto from_bytes(std::span<std::byte const> bytes) -> LookupTable;
163+
};
164+
165+
/**
166+
* Lookup table encoder.
167+
*
168+
* This class builds and encodes a sequence of values to the binary format of lookup table.
169+
* See `LookupTable` for more details.
170+
*
171+
* Note that all encoded data is accumulated in memory and only flushed to the output stream when
172+
* `encode()` member function is called.
173+
*/
174+
class LookupTableEncoder {
175+
std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> m_impl;
176+
177+
explicit LookupTableEncoder(std::unique_ptr<::pisa::lt::detail::BaseLookupTableEncoder> impl);
178+
179+
public:
180+
/**
181+
* Constructs an encoder for a lookup table in v1 format, with the given flag options.
182+
*
183+
* If sorted flag is _not_ set, then an additional hash set will be produced to keep track of
184+
* duplicates. This will increase the memory footprint at build time.
185+
*/
186+
static LookupTableEncoder v1(::pisa::lt::v1::Flags flags);
187+
188+
/**
189+
* Inserts payload.
190+
*
191+
* If sorted flag was set at construction time, it will throw if the given payload is not
192+
* lexicographically greater than the previously inserted payload. If sorted flag was _not_ set
193+
* and the given payload has already been inserted, it will throw as well.
194+
*/
195+
auto insert(std::span<std::byte const> payload) -> LookupTableEncoder&;
196+
197+
/**
198+
* Writes the encoded table to the output stream.
199+
*/
200+
auto encode(std::ostream& out) -> LookupTableEncoder&;
201+
202+
/**
203+
* Inserts a payload of type `Payload`.
204+
*
205+
* `std::span<typename Payload::value_type const>` must be constructible from `Payload`, which
206+
* in turn will be cast as byte span before calling the non-templated version of `insert()`.
207+
*/
208+
template <typename Payload>
209+
requires(std::constructible_from<std::span<typename Payload::value_type const>, Payload>)
210+
auto insert(Payload const& payload) -> LookupTableEncoder& {
211+
insert(std::as_bytes(std::span(payload)));
212+
return *this;
213+
}
214+
215+
/**
216+
* Inserts all payloads in the given span.
217+
*
218+
* It calls `insert()` for each element in the span. See `insert()` for more details.
219+
*/
220+
template <typename Payload>
221+
auto insert_span(std::span<Payload const> payloads) -> LookupTableEncoder& {
222+
for (auto const& payload: payloads) {
223+
insert(payload);
224+
}
225+
return *this;
226+
}
227+
};
228+
229+
} // namespace pisa
