34 commits
bf5f311
Extract Cursor class and update javadoc style
blambov Mar 13, 2025
e7f4f00
Change transformations to implement Cursor
blambov Mar 14, 2025
09a8b56
Adds the ability to verify cursors' behaviour for debugging
blambov Mar 14, 2025
635feaa
Put direction argument first in forEach/process
blambov Mar 18, 2025
7e57c12
Extract BaseTrie
blambov Mar 18, 2025
31ea5db
Add concrete type to BaseTrie
blambov Mar 18, 2025
a70a2d2
Add CursorWalkable interface to BaseTrie and move implementations there
blambov Mar 18, 2025
125b325
Run trie tests with verification by default
blambov Apr 4, 2025
3275662
Fix prefixed and singleton tailCursor
blambov Jun 2, 2025
0579577
Implement TrieSet and change slices to use intersection
blambov Mar 19, 2025
4735cde
Extract InMemoryBaseTrie unchanged in preparation for other trie types
blambov Apr 29, 2025
871f926
Add deletion support for InMemoryTrie
blambov Apr 29, 2025
e9b8ce1
Add RangeTrie
blambov Mar 24, 2025
ce9d384
Implement RangeTrie.applyTo, InMemoryTrie.delete and InMemoryTrie.app…
blambov May 5, 2025
ecd1f10
Add DeletionAwareTrie
blambov May 16, 2025
8069242
Add "Stage2" versions of trie memtable and partition classes
blambov Jul 15, 2025
83394bb
TrieMemtable Stage 3
blambov Jul 18, 2025
45bc6b1
Implement, test and benchmark stopIssuingTombstones
blambov Sep 4, 2025
855f587
Add trie slicing support for SAI uses
blambov Sep 25, 2025
b0519d6
Switch row deletions to point tombstones
blambov Sep 26, 2025
33b461e
Generalize forEachValue/Entry
blambov Oct 2, 2025
d59d5ec
Switch MemtableAverageRowSize to use trie directly and expand test
blambov Oct 2, 2025
21b0d30
Remove TrieSetIntersectionCursor and implement union and intersection…
blambov Nov 5, 2025
3177f9c
Move TrieSetNegatedCursor into TrieSetCursor
blambov Nov 5, 2025
9b0c620
Review changes
blambov Nov 6, 2025
993194d
Add graph for in-memory trie deletions
blambov Nov 6, 2025
d6902ca
Review changes
blambov Nov 6, 2025
64b5291
Test points in range tries
blambov Nov 6, 2025
94803f4
Review changes
blambov Nov 6, 2025
bbf34bd
Include head in applyToSelected calls
blambov Nov 6, 2025
b072af1
Add deletion-aware tails test and fix problems
blambov Nov 10, 2025
058afcb
Change deletion-aware collection merge to make independently tracking…
blambov Nov 10, 2025
239cb36
Review changes
blambov Nov 11, 2025
e825fde
Sonarcloud and idea warnings
blambov Nov 11, 2025
217 changes: 98 additions & 119 deletions src/java/org/apache/cassandra/db/tries/CollectionMergeTrie.java

Large diffs are not rendered by default.

257 changes: 257 additions & 0 deletions src/java/org/apache/cassandra/db/tries/Cursor.java
@@ -0,0 +1,257 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

Comment on lines +1 to +18

Please replace with DataStax license

package org.apache.cassandra.db.tries;

import org.agrona.DirectBuffer;
import org.apache.cassandra.utils.bytecomparable.ByteComparable;
import org.apache.cassandra.utils.bytecomparable.ByteSource;

/// A trie cursor.
///
/// This is the internal representation of a trie, which enables efficient walks and basic operations (merge,
/// slice) on tries.
///
/// The cursor represents the state of a walk over the nodes of a trie. It provides three main features:
/// - the current [#depth] or descend-depth in the trie;
/// - the [#incomingTransition], i.e. the byte that was used to reach the current point;
/// - the [#content] associated with the current node,
///
/// and provides methods for advancing to the next position. This is enough information to extract all paths, and
/// also to easily compare cursors over different tries that are advanced together. Advancing is always done in
/// order; if one imagines the set of nodes in the trie with their associated paths, a cursor may only advance from a
/// node with a lexicographically smaller path to one with a bigger path. The [#advance] operation moves to the
/// immediate next position; it is also possible to skip over some items to a specific position ahead ([#skipTo]).
///
/// Moving to the immediate next position in the lexicographic order is accomplished by:
/// - if the current node has children, moving to its first child;
/// - otherwise, ascending the parent chain and moving to the next child of the closest parent that still has any.
///
/// As long as the trie is not exhausted, advancing always takes one step down, from the current node, or from a node
/// on the parent chain. By comparing the new depth (which `advance` also returns) with the one before the advance,
/// one can tell if the former was the case (if `newDepth == oldDepth + 1`) and how many steps up we had to take
/// (`oldDepth + 1 - newDepth`). When following a path down, the cursor will stop on all prefixes.
///
/// When it is created the cursor is placed on the root node with `depth() = 0`, `incomingTransition() = -1`.
/// Since tries can have mappings for the empty key, `content()` can be non-null at the root. The cursor is exhausted when it
/// returns a depth of -1 (the operations that advance a cursor return the depth, and `depth()` will also
/// return -1 if queried afterwards). It is not allowed for a cursor to start in exhausted state; once a cursor is
/// exhausted, calling any of the advance methods or `tailTrie` is an error.
///
/// For example, the following trie:
/// <pre>
///  t
///   r
///    e
///     e *
///    i
///     e *
///     p *
///  w
///   i
///    n *
/// </pre>
/// has nodes reachable with the paths
/// `"", t, tr, tre, tree*, tri, trie*, trip*, w, wi, win*`
/// and the cursor will list them with the following `(depth, incomingTransition)` pairs:
/// `(0, -1), (1, t), (2, r), (3, e), (4, e)*, (3, i), (4, e)*, (4, p)*, (1, w), (2, i), (3, n)*`
///
/// Because we exhaust transitions on bigger depths before we go to the next transition on the smaller ones, when
/// cursors are advanced together their positions can be easily compared using only the [#depth] and

I think it needs highlighting that cursors "advanced together" means that if they are not at the same position, we always advance the one that is lagging behind until it catches up or jumps over. Otherwise, if we let the higher cursor advance more times, or skipTo arbitrary points, this comparison logic would not work.

Anyway, I'm impressed by how smart this algorithm is!
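
The lagging-cursor rule described in this comment can be sketched as follows. This is a minimal, self-contained illustration and not the code in this PR: `ListCursor` is a hypothetical stand-in that precomputes its `(depth, incomingTransition)` positions from a set of string keys, and `compare` and `mergedPaths` are names invented for the example. The loop only ever advances the cursor that is behind (or both when they are at the same position), which is what keeps the depth-then-transition comparison valid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class ParallelWalkSketch
{
    /** Hypothetical stand-in for a cursor: a precomputed list of (depth, incomingTransition) positions. */
    static final class ListCursor
    {
        private final List<int[]> positions = new ArrayList<>();
        private int index = 0;

        ListCursor(String... keys)
        {
            // Every prefix of every key is a node. For ASCII keys the TreeSet order is the
            // walk order: a prefix sorts before its extensions.
            TreeSet<String> nodes = new TreeSet<>();
            nodes.add("");
            for (String key : keys)
                for (int i = 1; i <= key.length(); i++)
                    nodes.add(key.substring(0, i));
            for (String node : nodes)
                positions.add(new int[]{ node.length(), node.isEmpty() ? -1 : node.charAt(node.length() - 1) });
        }

        int depth() { return index < positions.size() ? positions.get(index)[0] : -1; }
        int incomingTransition() { return index < positions.size() ? positions.get(index)[1] : -1; }
        int advance() { index++; return depth(); }
    }

    /** Negative if a is behind b in walk order, 0 if both are at the same position, positive if a is ahead. */
    static int compare(ListCursor a, ListCursor b)
    {
        if (a.depth() != b.depth())
            return Integer.compare(b.depth(), a.depth()); // the deeper cursor comes first
        return Integer.compare(a.incomingTransition(), b.incomingTransition());
    }

    /** Walks both cursors together and returns the path of every position visited by either. */
    static List<String> mergedPaths(ListCursor a, ListCursor b)
    {
        List<String> paths = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        while (a.depth() >= 0 || b.depth() >= 0)
        {
            int cmp = compare(a, b);
            ListCursor behind = cmp <= 0 ? a : b;
            int depth = behind.depth();
            if (depth > 0)
            {
                path.setLength(depth - 1); // drop the bytes below the new position's parent
                path.append((char) behind.incomingTransition());
            }
            paths.add(path.toString());
            // Only the lagging cursor moves; both move when they are at the same position.
            if (cmp <= 0) a.advance();
            if (cmp >= 0) b.advance();
        }
        return paths;
    }
}
```

If the cursor that is ahead were advanced instead, or skipped to an arbitrary point, the two positions would no longer be comparable by depth and transition alone, because the comparison relies on both cursors sharing the path above the shallower position.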

/// [#incomingTransition]:
/// - one that is higher in depth is before one that is lower;
/// - for equal depths, the one with smaller incomingTransition is first.
///
/// If we consider walking the trie above in parallel with this:
/// <pre>
///  t
///   r
///    i
///     c
///      k *
///  u
///   p *
/// </pre>
/// the combined iteration will proceed as follows:<pre>
/// (0, -1)+     (0, -1)+              cursors equal, advance both
/// (1, t)+      (1, t)+       t       cursors equal, advance both
/// (2, r)+      (2, r)+       tr      cursors equal, advance both
/// (3, e)+   <  (3, i)        tre     cursors not equal, advance smaller (3 = 3, e < i)
/// (4, e)+   <  (3, i)        tree*   cursors not equal, advance smaller (4 > 3)
/// (3, i)+      (3, i)+       tri     cursors equal, advance both
/// (4, e)    >  (4, c)+       tric    cursors not equal, advance smaller (4 = 4, e > c)
/// (4, e)    >  (5, k)+       trick*  cursors not equal, advance smaller (4 < 5)
/// (4, e)+   <  (1, u)        trie*   cursors not equal, advance smaller (4 > 1)
/// (4, p)+   <  (1, u)        trip*   cursors not equal, advance smaller (4 > 1)
/// (1, w)    >  (1, u)+       u       cursors not equal, advance smaller (1 = 1, w > u)
/// (1, w)    >  (2, p)+       up*     cursors not equal, advance smaller (1 < 2)
/// (1, w)+   <  (-1, -1)      w       cursors not equal, advance smaller (1 > -1)
/// (2, i)+   <  (-1, -1)      wi      cursors not equal, advance smaller (2 > -1)
/// (3, n)+   <  (-1, -1)      win*    cursors not equal, advance smaller (3 > -1)
/// (-1, -1)     (-1, -1)              both exhausted
/// </pre>
///
/// Cursors are created with a direction (forward or reverse), which specifies the order in which a node's children
/// are iterated (smaller first or larger first). Note that entries returned in reverse direction are in
/// lexicographic order for the inverted alphabet, which is not the same as being presented in reverse. For example,
/// a reverse cursor for a trie containing "ab", "abc" and "cba" will visit the nodes in order "cba", "ab", "abc", i.e.
/// prefixes will still be reported before their descendants.
///
/// Also see [Trie.md](./Trie.md) for further documentation.
public interface Cursor<T>
{
/// @return the current descend-depth; 0, if the cursor has just been created and is positioned on the root,
/// and -1, if the trie has been exhausted.
int depth();

/// @return the last transition taken; if positioned on the root, return -1
int incomingTransition();

/// @return the content associated with the current node. This may be non-null for any presented node, including
/// the root.

but it also can be null, right?
Can we add @Nullable if this is the case?

T content();

/// Returns the direction in which this cursor is progressing.
Direction direction();

/// Returns the byte-comparable version that this trie uses.
ByteComparable.Version byteComparableVersion();

/// Advance one position to the node whose associated path is next lexicographically.
/// This can be either:
/// - descending one level to the first child of the current node,
/// - ascending to the closest parent that has remaining children, and then descending one level to its next
/// child.
///
/// It is an error to call this after the trie has already been exhausted (i.e. when `depth() == -1`);
/// for performance reasons we won't always check this.

What happens if we don't check this and some code calls it in that state?
Can we at least narrow down the list of very bad things that could happen or explicitly state it's undefined? Noop, exception, returning duplicate last entry, ... ?

It does look to me like a good candidate for an assert depth >= 0: cheap enough that it won't make much difference when assertions are enabled, and it could increase the likelihood that we fail fast in the tests, but also zero cost in prod where we run with assertions turned off.
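
One way to picture the fail-fast check suggested here is a wrapping cursor that validates each call. The sketch below is hypothetical and deliberately small: `MiniCursor`, `ReplayCursor` and `CheckedCursor` are invented for illustration and are not the verification classes added in this PR. It also throws `IllegalStateException` explicitly rather than using `assert`, so that its behaviour does not depend on whether the JVM runs with `-ea`.

```java
class CheckedCursorSketch
{
    /** Hypothetical, minimal subset of the cursor contract used for this illustration. */
    interface MiniCursor
    {
        int depth();
        int advance();
    }

    /** Replays a fixed sequence of depths; the final -1 marks exhaustion. */
    static final class ReplayCursor implements MiniCursor
    {
        private final int[] depths;
        private int index = 0;

        ReplayCursor(int... depths) { this.depths = depths; }

        public int depth() { return depths[index]; }

        public int advance()
        {
            // Stay on the last entry instead of running off the end of the array.
            index = Math.min(index + 1, depths.length - 1);
            return depths[index];
        }
    }

    /** Wraps a cursor and fails fast when the documented contract is broken. */
    static final class CheckedCursor implements MiniCursor
    {
        private final MiniCursor source;

        CheckedCursor(MiniCursor source) { this.source = source; }

        public int depth() { return source.depth(); }

        public int advance()
        {
            int prev = source.depth();
            if (prev < 0)
                throw new IllegalStateException("advance() called on an exhausted cursor");
            int next = source.advance();
            // A valid result is -1 (exhausted) or a depth between 1 and prev + 1.
            if (next == 0 || next < -1 || next > prev + 1)
                throw new IllegalStateException("invalid depth " + next + " after " + prev);
            return next;
        }
    }
}
```

With a wrapper like this used only in tests, the unchecked production path keeps its cost while misuse after exhaustion surfaces as an immediate failure instead of undefined behaviour.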

///
/// @return depth (can be `prev+1` or `<=prev`), -1 means that the trie is exhausted
int advance();

/// Advance, descending multiple levels if the cursor can do this for the current position without extra work
/// (e.g. when positioned on a chain node in a memtable trie). If the current node does not have children this
/// is exactly the same as advance(), otherwise it may take multiple steps down (but will not necessarily, even
/// if they exist).
///
/// Note that if any positions are skipped, their content must be null.
///
/// This is an optional optimization; the default implementation falls back to calling advance.
///
/// It is an error to call this after the trie has already been exhausted (i.e. when `depth() == -1`);
/// for performance reasons we won't always check this.
///
/// @param receiver object that will receive all transitions taken except the last;
/// on ascend, or if only one step down was taken, it will not receive any
/// @return the new depth, -1 if the trie is exhausted
default int advanceMultiple(TransitionsReceiver receiver)
{
    return advance();
}

/// Advance all the way to the next node with non-null content.
///
/// It is an error to call this after the trie has already been exhausted (i.e. when `depth() == -1`);
/// for performance reasons we won't always check this.
///
/// @param receiver object that will receive all taken transitions
/// @return the content, null if the trie is exhausted
default T advanceToContent(ResettingTransitionsReceiver receiver)
{
    int prevDepth = depth();
    while (true)
    {
        int currDepth = advanceMultiple(receiver);
        if (currDepth <= 0)
            return null;
        if (receiver != null)
        {
            if (currDepth <= prevDepth)
                receiver.resetPathLength(currDepth - 1);
            receiver.addPathByte(incomingTransition());
        }
        T content = content();
        if (content != null)
            return content;
        prevDepth = currDepth;
    }
}

/// Advance to the specified depth and incoming transition or the first valid position that is after the specified
/// position. The inputs must be something that could be returned by a single call to [#advance] (i.e.
/// `depth` must be <= current depth + 1, and `incomingTransition` must be higher than what the
/// current state saw at the requested depth).
///
/// @return the new depth, always <= previous depth + 1; -1 if the trie is exhausted
int skipTo(int skipDepth, int skipTransition);

/// Descend into the cursor with the given path.
///
/// @return True if the descent is positioned at the end of the given path, false if the trie did not have a path
/// for it. In the latter case the cursor is positioned at the first node that follows the given key in iteration
/// order.
default boolean descendAlong(ByteSource bytes)
{
    int next = bytes.next();
    int depth = depth();
    while (next != ByteSource.END_OF_STREAM)
    {
        if (skipTo(++depth, next) != depth || incomingTransition() != next)
            return false;
        next = bytes.next();
    }
    return true;
}

/// Returns a tail trie, i.e. a trie whose root is the current position. Walking a tail trie will list all

This method now returns a cursor for the trie, not a trie itself. Please update the javadoc.

/// descendants of the current position with depth adjusted by the current depth.
///
/// It is an error to call `tailTrie` on an exhausted cursor.

tailCursor

///
/// Descendants that override this class should return their specific cursor type.
Cursor<T> tailCursor(Direction direction);

/// Used by [#advanceMultiple] to feed the transitions taken.
interface TransitionsReceiver
{
/// Add a single byte to the path.
void addPathByte(int nextByte);
/// Add the count bytes from position pos in the given buffer.
void addPathBytes(DirectBuffer buffer, int pos, int count);
}

/// Used by [#advanceToContent] to track the transitions and backtracking taken.
interface ResettingTransitionsReceiver extends TransitionsReceiver
{
/// Delete all bytes beyond the given length.
void resetPathLength(int newLength);
}

/// A push interface for walking over a trie. Builds upon [TransitionsReceiver] to be given the bytes of the
/// path, and adds methods called on encountering content and completion.
/// See [TrieDumper] for an example of how this can be used, and [TrieEntriesWalker] as a base class
/// for other common usages.
interface Walker<T, R> extends Cursor.ResettingTransitionsReceiver
{
/// Called when content is found.
void content(T content);

/// Called at the completion of the walk.
R complete();
}
}
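
The way `advanceToContent` drives a `ResettingTransitionsReceiver` can be illustrated with a small standalone sketch. It assumes a walk is supplied as plain arrays rather than a real cursor, and `PathBuilder` and `keysWithContent` are hypothetical names introduced only for this example. The point it shows is that the depth returned by each advance is enough to rebuild full keys: when the new depth is not greater than the previous one, the path is truncated to `depth - 1` before the incoming transition is appended.

```java
import java.util.ArrayList;
import java.util.List;

class PathTrackingSketch
{
    /** Hypothetical receiver mirroring resetPathLength/addPathByte, backed by a StringBuilder. */
    static final class PathBuilder
    {
        private final StringBuilder path = new StringBuilder();

        void resetPathLength(int newLength) { path.setLength(newLength); }
        void addPathByte(int nextByte) { path.append((char) nextByte); }
        String current() { return path.toString(); }
    }

    /**
     * Replays the positions a cursor reports after leaving the root, given as parallel arrays,
     * and returns the full key of every position that carries content.
     */
    static List<String> keysWithContent(int[] depths, char[] transitions, boolean[] hasContent)
    {
        List<String> keys = new ArrayList<>();
        PathBuilder receiver = new PathBuilder();
        int prevDepth = 0; // the walk starts on the root
        for (int i = 0; i < depths.length; i++)
        {
            int currDepth = depths[i];
            // Not a descent from the previous node: cut the path back to the new node's parent.
            if (currDepth <= prevDepth)
                receiver.resetPathLength(currDepth - 1);
            receiver.addPathByte(transitions[i]);
            if (hasContent[i])
                keys.add(receiver.current());
            prevDepth = currDepth;
        }
        return keys;
    }
}
```

Feeding it the example walk from the `Cursor` documentation, `(1, t), (2, r), (3, e), (4, e)*, (3, i), (4, e)*, (4, p)*, (1, w), (2, i), (3, n)*`, yields the keys `tree`, `trie`, `trip` and `win`.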
27 changes: 27 additions & 0 deletions src/java/org/apache/cassandra/db/tries/CursorWalkable.java
@@ -0,0 +1,27 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
Comment on lines +1 to +17

We should use DataStax license for new files.

package org.apache.cassandra.db.tries;

/// Package-private interface for trie implementations, defining a method of extracting the internal cursor
/// representation of the trie.
///
/// @param <C> The specific type of cursor a descendant uses.
public interface CursorWalkable<C extends Cursor>
{
C cursor(Direction direction);
}