chore: support multiple source search in chat (#7906)
Conversation
Reviewer's Guide

This pull request implements support for keyword search from multiple document sources within the chat feature.
Hey @appflowy - I've reviewed your changes - here's some feedback:
- The `_full_search` parameter in `MultipleSourceRetrieverStore::read_documents` is not utilized by its current implementations; ensure its intended effect is realized, or remove it if redundant.
- The modification to `QA_CONTEXT_TEMPLATE` removing the specific "I don't know" instruction may alter the LLM's response behavior when context is insufficient.
- Consider making the new search parameters (`limit`, `score_threshold`) configurable rather than hardcoded in the general search handlers and chat retriever initialization.
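To make the third point concrete, here is a minimal sketch of pulling the hardcoded values into a single configurable struct with sensible defaults. The struct name, field names, and default values are illustrative assumptions, not taken from the PR:

```rust
// Hypothetical sketch: gather the hardcoded search parameters into one
// configurable struct so handlers and retriever initialization can accept
// overrides instead of embedding literals. Defaults are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchConfig {
    pub limit: usize,
    pub score_threshold: f32,
}

impl Default for SearchConfig {
    fn default() -> Self {
        Self {
            limit: 10,
            score_threshold: 0.4,
        }
    }
}
```

Handlers would then take a `SearchConfig` (falling back to `Default::default()`) rather than repeating the literals at each call site.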
Here's what I looked at during the review
- 🟡 General issues: 3 issues found
- 🟢 Security: all looks good
- 🟡 Testing: 1 issue found
- 🟡 Complexity: 2 issues found
- 🟢 Documentation: all looks good
```rust
debug_assert!(embedding.len() == 1);
let query_embedding = embedding.first().unwrap();
```
suggestion (bug_risk): Reconsider relying solely on debug_assert for a critical unwrap.
`debug_assert!` is compiled out in release builds, so the `unwrap` can still panic in production. Add a runtime check or use a safe pattern (e.g., `if let`) to guarantee `query_embedding` is always valid.
Suggested implementation:

```rust
// Safely extract the first embedding; if unavailable, return early.
let query_embedding = if let Some(query_embedding) = embedding.first() {
  query_embedding
} else {
  return Ok(Vec::new());
};
```

Depending on your application's requirements, you might want to log or handle the case where more than one embedding is returned. If exactly one embedding is always expected, consider adding an explicit runtime check for that case.
```rust
async fn create_chat_if_not_exist(&self, info: LLMChatInfo) -> FlowyResult<()> {
  debug_assert!(!self.retriever_sources.read().await.is_empty());
```
suggestion (bug_risk): Consider handling empty retriever_sources more gracefully.
Instead of relying on debug_assert, implement error handling or a fallback when retriever_sources is empty to avoid runtime issues in production.
Suggested replacement for the `debug_assert!` line:

```rust
{
  let retriever_sources_guard = self.retriever_sources.read().await;
  if retriever_sources_guard.is_empty() {
    return Err(FlowyError::local_ai().with_context("Retriever sources are empty"));
  }
}
```
```rust
    docs.len(),
  );
  results.extend(docs);
} else {
```
suggestion: Improve error handling in document retrieval aggregation.
Use pattern matching on the Err variant instead of unwrap_err to make the error handling more explicit.
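A minimal, self-contained sketch of that pattern, with simplified types (the real code aggregates `Document`s and uses `FlowyError`; here plain `String`s stand in for both):

```rust
// Illustrative sketch: match on each per-retriever Result directly instead
// of checking is_ok() and then calling unwrap_err(). Types are simplified
// stand-ins for the PR's Document and FlowyError.
fn aggregate(results: Vec<Result<Vec<String>, String>>) -> Vec<String> {
    let mut docs = Vec::new();
    for result in results {
        match result {
            Ok(batch) => docs.extend(batch),
            Err(err) => eprintln!("[VectorStore] retriever failed: {}", err),
        }
    }
    docs
}
```

Matching on the `Result` makes the error path explicit at the same place the success path is handled, and removes the possibility of `unwrap_err` panicking if the check and the unwrap ever drift apart.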
```rust
#[async_trait]
```
issue (complexity): Consider extracting the embedding generation, similarity search, and result conversion logic into a shared helper function to reduce code duplication.
Consider extracting the code for embedding generation, similarity search, and converting results to documents into a shared helper function. This reduces duplicated logic between MultipleSourceRetrieverStore and the VectorStore implementation. For example, you could add a method to SqliteVectorStore like:
```rust
impl SqliteVectorStore {
  async fn search_and_convert(
    &self,
    workspace_id: &str,
    query: &str,
    rag_ids: &[String],
    limit: i32,
    score_threshold: f32,
  ) -> FlowyResult<Vec<Document>> {
    let vector_db = self
      .vector_db
      .upgrade()
      .ok_or_else(|| FlowyError::internal().with_context("Vector database not initialized"))?;
    let embedder = self.create_embedder()?;
    let request = GenerateEmbeddingsRequest::new(
      embedder.model().name().to_string(),
      EmbeddingsInput::Single(query.to_string()),
    );
    let embedding = embedder.embed(request).await?.embeddings;
    if embedding.is_empty() {
      return Ok(Vec::new());
    }
    let query_embedding = embedding.first().unwrap();
    let results = vector_db
      .search_with_score(workspace_id, rag_ids, query_embedding, limit, score_threshold)
      .await?;
    let documents = results
      .into_iter()
      .map(|result| {
        let mut metadata = HashMap::new();
        if let Some(map) = result.metadata.as_ref().and_then(|v| v.as_object()) {
          for (key, value) in map {
            metadata.insert(key.clone(), value.clone());
          }
        }
        Document::new(result.content).with_metadata(metadata)
      })
      .collect();
    Ok(documents)
  }
}
```

Then update your trait implementations to call this helper. For example, in `MultipleSourceRetrieverStore`:
```rust
#[async_trait]
impl MultipleSourceRetrieverStore for SqliteVectorStore {
  async fn read_documents(
    &self,
    workspace_id: &Uuid,
    query: &str,
    limit: usize,
    rag_ids: &[String],
    score_threshold: f32,
    _full_search: bool,
  ) -> FlowyResult<Vec<Document>> {
    self
      .search_and_convert(
        &workspace_id.to_string(),
        query,
        rag_ids,
        limit as i32,
        score_threshold,
      )
      .await
  }

  fn retriever_name(&self) -> &'static str {
    "Sqlite Multiple Source Retriever"
  }
}
```

And similarly in your `VectorStore` implementation, delegate to the helper function. This keeps functionality intact while reducing duplicate code and complexity.
```rust
  })
  .collect::<Vec<_>>();

let search_results = join_all(search_futures).await;
```
issue (complexity): Consider using iterator combinators like flat_map to simplify the error handling and document merging after join_all.
Consider refactoring the manual loop with iterator combinators to reduce verbosity. For example, after using join_all you can chain an iterator to filter out errors and flatten documents. This change keeps the functionality intact while making the code more concise:
```rust
async fn retrieve_documents(&self, query: &str) -> Result<Vec<Document>, Box<dyn Error>> {
  trace!(
    "[VectorStore] filters: {:?}, retrieving documents for query: {}",
    self.rag_ids,
    query,
  );
  let search_futures = self.vector_stores.iter().map(|vector_store| {
    let vector_store = vector_store.clone();
    let query = query.to_string();
    let num_docs = self.num_docs;
    let full_search = self.full_search;
    let rag_ids = self.rag_ids.clone();
    let workspace_id = self.workspace_id;
    let score_threshold = self.score_threshold;
    async move {
      vector_store
        .read_documents(
          &workspace_id,
          &query,
          num_docs,
          &rag_ids,
          score_threshold,
          full_search,
        )
        .await
        .map(|docs| (vector_store.retriever_name(), docs))
    }
  });

  let results: Vec<Document> = join_all(search_futures)
    .await
    .into_iter()
    .flat_map(|res| match res {
      Ok((retriever_name, docs)) => {
        trace!("[VectorStore] {} found {} results", retriever_name, docs.len());
        docs
      },
      Err(e) => {
        error!("[VectorStore] Failed to retrieve documents: {}", e);
        vec![]
      },
    })
    .collect();

  Ok(results)
}
```

This approach leverages `flat_map` to filter out errors and merge the documents into one vector, replacing the explicit loop and improving readability.
Support keyword search when chatting with a page.
Summary by Sourcery
Implement multiple source search support for chat functionality, allowing retrieval of documents from different sources with configurable search parameters.