CNDB-13483: Fix loading PQ file when disabled_reads is true #1713
Conversation
This fixes the method so that it actually reads the postings list structure. However, there is a larger question of whether the implementation was actually correct. It all comes down to what is in the indexContext's view on a compactor. We really only want to get the info from the source components that are being compacted. Further investigation needed to determine if this is the right behavior.
Checklist before you submit for review
@jkni @eolivelli - I will add tests for this PR on Monday. The initial implementation is ready for review. Manual testing suggests that it works. I wanted to push it up so that the rest of CI runs and I have that feedback by Monday.
* cover building the vector index after creation
* cover different methods within the V1MetadataOnlySearchableIndex
// TODO should we load all of these for this op? It seems very unlikely that we want to consider the whole
// table when checking if all rows have vectors. A single empty vector will be enough to make this false,
// but that should really only impact that table.
var view = indexContext.getReferencedView(TimeUnit.SECONDS.toNanos(5));
I am moving this TODO out of scope for this PR. Here is a follow up ticket: https://github.com/riptano/cndb/issues/14028
LGTM
please put your tick on each line of the checklist: #1713 (comment)
{
    // May result in downloading file, but this metadata is valuable. We use a stream to avoid loading all the
    // structures at once.
    return metadatas.stream()
I appreciate the idea of using a Stream to avoid eagerly loading everything.
I think Streams do some batching internally. If you want to avoid that hidden behavior, you could use a simple Iterator.
Do you have a reference where I can learn more? The javadoc for the Stream
interface indicates:
* Streams are lazy; computation on the source data is only performed when the
* terminal operation is initiated, and source elements are consumed only
* as needed.
I do not (I have only spent time debugging such problems in the past, and I don't have references).
We can keep this in its current form.
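The laziness quoted from the Stream javadoc can be checked with a small experiment. The following is only a sketch under stated assumptions: LazyStreamSketch, load, and firstMatch are hypothetical names, and load merely stands in for an expensive per-segment read; none of this is the code under review. It counts how many source elements a sequential stream actually pulls through map before a short-circuiting findFirst returns.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

public class LazyStreamSketch
{
    // Counts how many times the (hypothetical) expensive load step runs.
    static final AtomicInteger loads = new AtomicInteger();

    // Stand-in for an expensive per-segment load; not the real SAI code.
    static String load(String segmentName)
    {
        loads.incrementAndGet();
        return segmentName + "-loaded";
    }

    // Returns the first loaded value whose text starts with the given prefix.
    // On a sequential stream, findFirst stops pulling elements once it has a match.
    static Optional<String> firstMatch(List<String> names, String prefix)
    {
        return names.stream()
                    .map(LazyStreamSketch::load)
                    .filter(s -> s.startsWith(prefix))
                    .findFirst();
    }

    public static void main(String[] args)
    {
        Optional<String> hit = firstMatch(List.of("a", "b", "c", "d"), "b");
        System.out.println(hit.orElse("none") + " after " + loads.get() + " loads");
    }
}
```

In this sketch only the first two elements are loaded before the match is found. That says nothing about parallel streams or about terminal operations that must see every element, so it does not settle the batching question for every pipeline shape.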
src/java/org/apache/cassandra/index/sai/disk/V1MetadataOnlySearchableIndex.java
src/java/org/apache/cassandra/index/sai/disk/vector/ProductQuantizationFetcher.java
// first sstable has one-to-one
for (int i = 0; i < MIN_PQ_ROWS; i++)
    execute("INSERT INTO %s (pk, v) VALUES (?, ?)", i, randomVectorBoxed(2));
Side comment (not asking for a change):
calling randomVectorBoxed with only 2 dimensions could return duplicate vectors and make the test flaky.
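If duplicates ever did become a problem, one option is to generate the test vectors up front and reject repeats. This is a minimal sketch that assumes nothing about how randomVectorBoxed is implemented; DistinctVectorsSketch and distinctRandomVectors are hypothetical names, not part of the test framework.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DistinctVectorsSketch
{
    // Hypothetical helper: keeps drawing random vectors until `count` distinct ones are collected.
    static List<float[]> distinctRandomVectors(int count, int dimension, Random random)
    {
        // float[] has identity-based equals, so a List<Float> copy is used as the set key.
        Set<List<Float>> seen = new HashSet<>();
        List<float[]> result = new ArrayList<>(count);
        while (result.size() < count)
        {
            float[] vector = new float[dimension];
            List<Float> key = new ArrayList<>(dimension);
            for (int i = 0; i < dimension; i++)
            {
                vector[i] = random.nextFloat();
                key.add(vector[i]);
            }
            // Set.add returns false for a repeat, in which case the vector is dropped and redrawn.
            if (seen.add(key))
                result.add(vector);
        }
        return result;
    }
}
```

Passing a seeded Random also makes a failing run reproducible, which helps when chasing flakiness.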
Approach looks good to me in general. I left some nits/questions inline. Is there a CNDB PR/have we exercised this code in CNDB yet?
src/java/org/apache/cassandra/config/CassandraRelevantProperties.java
src/java/org/apache/cassandra/index/sai/disk/v1/SSTableIndexWriter.java
src/java/org/apache/cassandra/index/sai/disk/v1/SSTableIndexWriter.java
src/java/org/apache/cassandra/index/sai/disk/v1/SSTableIndexWriter.java
src/java/org/apache/cassandra/index/sai/disk/vector/ProductQuantizationFetcher.java
src/java/org/apache/cassandra/index/sai/disk/V1MetadataOnlySearchableIndex.java
src/java/org/apache/cassandra/index/sai/disk/V1MetadataOnlySearchableIndex.java
❌ Build ds-cassandra-pr-gate/PR-1713 rejected by Butler
2 new test failure(s) in 7 builds
Found 2 new test failures
Found 7 known test failures
❌ Build ds-cassandra-pr-gate/PR-1713 rejected by Butler
1 new test failure(s) in 6 builds
Found 1 new test failures
Found 7 known test failures
@VisibleForTesting
public boolean areSegmentsLoaded()
{
    return searchableIndex instanceof V1SearchableIndex;
The instanceof check smells here. Can't we push this method into SearchableIndex?
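A rough illustration of that suggestion follows. All type names are hypothetical, simplified stand-ins rather than the real SearchableIndex hierarchy, and which implementations should report loaded segments is an assumption. Each implementation answers the question itself, so the caller no longer inspects the concrete type.

```java
public class SegmentsLoadedSketch
{
    // Hypothetical, simplified stand-in for the SearchableIndex interface.
    interface SearchableIndexSketch
    {
        boolean areSegmentsLoaded();
    }

    // Stand-in for an index that has its segments loaded (assumed to answer true).
    static final class FullIndexSketch implements SearchableIndexSketch
    {
        @Override
        public boolean areSegmentsLoaded()
        {
            return true;
        }
    }

    // Stand-in for a metadata-only index (assumed to answer false).
    static final class MetadataOnlyIndexSketch implements SearchableIndexSketch
    {
        @Override
        public boolean areSegmentsLoaded()
        {
            return false;
        }
    }

    // The caller delegates to the index instead of using instanceof.
    static boolean areSegmentsLoaded(SearchableIndexSketch searchableIndex)
    {
        return searchableIndex.areSegmentsLoaded();
    }
}
```

With this shape, a new index type must state its own answer explicitly instead of silently falling into the false branch of a type check.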
What is the issue
Fixes: https://github.com/riptano/cndb/issues/13483
Initial Commits
What does this PR fix and why was it fixed
The cassandra.sai.disabled_reads setting prevents us from reading previous PQ objects during compaction, which leads to less efficient graph construction. This PR fixes that by adding a new searchable index that is capable of discovering, downloading, and cleaning up the right index files to get that information. I also have a tangential bug fix for code that was also impacted by the disabled_reads setting. However, there is a follow-up question about whether that implementation is valid.