@PaperBoardOfficial
Contributor

@PaperBoardOfficial PaperBoardOfficial commented Jul 21, 2025

Related GitHub Issue

Closes: #5682

Roo Code Task Context (Optional)

Description

This PR adds a local, file-based vector store so that users don't have to set up Docker locally.
I have added LibSQL, a fork of SQLite with built-in vector indexing. It uses the DiskANN algorithm for vector search. Reference.

Why did I choose LibSQL?
I wanted to use a file-based database like SQLite, but it has no built-in vector indexing: every search scans all the rows, computes the cosine similarity against each vector, and returns the top-k results. This is time-consuming.
I considered pairing SQLite with an HNSW index or FAISS (SQLite for metadata, HNSW or FAISS for vectors), but that would add too much effort and complexity.
Then I found out that Mastra also uses LibSQL, and after reading the documentation I learned that LibSQL already supports a vector similarity search algorithm called DiskANN. This blog post also motivated me: Using LibSQL on mobile devices
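
For illustration, the full-scan search that plain SQLite would need can be sketched as below. This is a minimal sketch and not code from this PR: the `Row` type and the `bruteForceTopK` name are hypothetical, and it assumes the embeddings have already been read out of the table as number arrays.

```typescript
// Hypothetical row shape: one stored code block and its embedding.
type Row = { id: string; embedding: number[] }

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
	let dot = 0
	let normA = 0
	let normB = 0
	for (let i = 0; i < a.length; i++) {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	const denom = Math.sqrt(normA) * Math.sqrt(normB)
	return denom === 0 ? 0 : dot / denom
}

// Scores every row against the query, so each search costs O(rows x dimensions).
function bruteForceTopK(rows: Row[], query: number[], k: number): { id: string; score: number }[] {
	return rows
		.map((row) => ({ id: row.id, score: cosineSimilarity(row.embedding, query) }))
		.sort((x, y) => y.score - x.score)
		.slice(0, k)
}
```

Every query touches every stored vector here, which is the cost a vector index is meant to avoid.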

Some points:

  • I indexed the Roo-Code codebase, and it generated 11 GB of data.
  • Indexing the whole codebase took longer than with Qdrant.
  • But searching was fast: the codebase_search tool returned results as quickly as with Qdrant.
  • It requires the libsql/linux-x64-gnu .node native module (because I use Ubuntu). This is a 9 MB file.
  • I spent a lot of time trying the WASM version of libsql and learned that it is only experimental and doesn't have vector indexing, which is a must-have. Also, the WASM package is ESM-only, so it is difficult to bundle as CJS.

Things to work on:

  • Reduce the space taken.
  • Reduce the time taken to index.
  • Find a way to download the OS-specific .node file after the vsix file is installed. That way, we don't have to ship the .node files for every OS and architecture in our vsix file.

This PR is long, so the main things to look at are:

  • src/services/code-index/vector-store/libsql-client.ts: This contains the main logic of libsql.
  • src/services/code-index/vector-store/tests/libsql-client.spec.ts: This contains 25 tests for libsql.
  • webview-ui/src/components/chat/CodeIndexPopover.tsx: This contains the UI for selecting the LibSQL provider and also providing an optional directory where one wants to store the db files.

Test Procedure

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable).
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

Documentation Updates

Additional Notes

Get in Touch

discord username: paperboard_52655


Important

Integrates LibSQL as a vector store provider, adding support for configuration, UI updates, and tests.

  • Behavior:
    • Integrates LibSQL as a vector store provider, offering an alternative to Qdrant.
    • Adds support for LibSQL in libsql-client.ts and updates CodeIndexPopover.tsx for UI changes.
    • Supports configuration of LibSQL directory and provider selection.
  • Tests:
    • Adds 25 tests for LibSQL functionality in libsql-client.spec.ts.
  • Configuration:
    • Updates codebase-index.ts and config-manager.ts to include LibSQL configuration options.
    • Adds copyLibSQLVersion function in esbuild.ts for version management.
  • Misc:
    • Adds downloadLibsqlNative function in libsql-download.ts for downloading native modules.
    • Updates package.json to include @libsql/client dependency.

This description was created by Ellipsis for 6296649.

@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 21, 2025
@PaperBoardOfficial
Contributor Author

PaperBoardOfficial commented Jul 21, 2025

@daniel-lxs
I have a question about the .node files.
Currently, LibSQL has 9 .node files for different operating systems.
That would be 81 MB in total. Are you okay with packaging all of them in the vsix file, or should I look into the approach I mentioned in the PR description?
To reduce space consumption, I am considering compressing the data.
To reduce indexing time, I am considering batch insertion (if that helps). I will come back with test results for both.
The only blocker for me is the .node native modules. Please help me with this question, or we can connect over Discord (username mentioned in the PR) if that's okay. I would appreciate your help and guidance.

@NaccOll
Contributor

NaccOll commented Jul 22, 2025

The index file for the Roo-Code codebase is 11 GB! This is too big. I tried a SQLite solution, and the generated index file is only 1.1 GB. I use batch processing and do not load everything into memory at once. This is why I prefer the db solution. Qdrant loads all codebases into memory (including those that have been indexed but are not active).

Of course, the shortcomings of the SQLite solution are as you said. For whole-codebase retrieval, all rows need to be queried and scored. Although some results can be cached, performance is still poor on large codebases. In a project with 170,000 blocks, a search takes about 5-7 s. But I think this is still acceptable, and performance is better on small codebases. In the Roo-Code codebase, a search takes about 3 s.

The advantage of the SQLite solution is that it is smaller in size:

  1. It relies on the node:sqlite module built into Node.js 22, so the vsix file is still 16-17 MB.
  2. The generated database file is much smaller than with the Qdrant solution on small and medium-sized codebases.

@PaperBoardOfficial
Contributor Author

PaperBoardOfficial commented Jul 22, 2025

@NaccOll Thanks a lot for sharing this information.
I will ask the LibSQL team what can be done. Maybe we can get a better solution.

@penberg Could you please help me with the issues highlighted in the PR description? The main file is src/services/code-index/vector-store/libsql-client.ts in this PR. I would really appreciate your help here.
I have spent a lot of time researching LibSQL and want to get this merged. Maybe I am not using it correctly.

@daniel-lxs daniel-lxs moved this from Triage to PR [Draft / In Progress] in Roo Code Roadmap Jul 22, 2025
@hannesrudolph hannesrudolph added PR - Draft / In Progress and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 22, 2025
@PaperBoardOfficial
Contributor Author

@daniel-lxs
UPDATE:
I tried several techniques, and by following this blog post, https://turso.tech/blog/the-space-complexity-of-vector-indexes-in-libsql, I was able to optimise the code considerably:

For the Roo-Code codebase of 65840 code blocks, I was able to index it in 16 minutes with a final db file size of 1.8 GB. Search ran in milliseconds.

Now the only thing left is to download the .node native module (9 MB) when the extension starts up (after installation).
I will find a way to do that too (hopefully).

@PaperBoardOfficial
Contributor Author

@daniel-lxs @mrubens
UPDATE 2:
This PR is ready for review. I still need to fix some unit test cases for libsql; meanwhile, you can review the PR and ask any questions.
I have added code that does the following:

  1. Detects the OS and architecture.
  2. Downloads the libsql native module from https://github.com/tursodatabase/libsql-js/releases.
  3. Extracts the tar archive and places it in the right location.

After this, libsql can use the native module.
So we no longer have to increase the vsix size (it remains 17 MB).
After the extension is installed, the native module is downloaded automatically.
I have also added download retries (max 3 times) with an initial backoff of 2 seconds. The native module is downloaded only once: if the user restarts the extension, it won't be downloaded again.
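
The retry behaviour described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `backoffDelays` and `withRetry` are hypothetical names, "max 3 times" is read as up to three retries after the first attempt, and doubling the delay after each failure is an assumption (the comment above only states the 2-second initial backoff).

```typescript
// Delays before each retry: 2000, 4000, 8000 ms by default (doubling is an assumption).
function backoffDelays(maxRetries = 3, initialMs = 2000): number[] {
	const delays: number[] = []
	for (let i = 0; i < maxRetries; i++) {
		delays.push(initialMs * 2 ** i)
	}
	return delays
}

const sleep = (ms: number): Promise<void> => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Runs `task`, waiting out one delay after each failure; rethrows the last error
// once the delays are used up. `wait` is injectable so tests need not really sleep.
async function withRetry<T>(
	task: () => Promise<T>,
	delays: number[] = backoffDelays(),
	wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
	let lastError: unknown
	for (let attempt = 0; attempt <= delays.length; attempt++) {
		try {
			return await task()
		} catch (error) {
			lastError = error
			if (attempt < delays.length) {
				await wait(delays[attempt])
			}
		}
	}
	throw lastError
}
```

A download step would be passed in as `task`, for example `withRetry(() => download(url))`, where `download` stands for whatever function fetches and extracts the archive.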

I have verified this feature on Ubuntu 24.04 and Windows 11. Here are screenshots of the output panel logs.

Screenshot from 2025-07-26 14-34-42 Screenshot 2025-07-26 150339

@PaperBoardOfficial
Contributor Author

UPDATE 3:

For the Roo-Code codebase of 65840 code blocks, I was able to index it in 16 minutes with a final db file size of 1.8 GB. Search ran in milliseconds.

Last time I said it took me 16 minutes to index Roo-Code. I had created the vector index at the start, which led to frequent index updates during insertion. Now I have implemented a new strategy: create the index last, once all the code blocks have been ingested into the db.
This approach saved 4 minutes (I can now index the Roo-Code codebase in 12 minutes).
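
The ordering described above (insert everything first, build the vector index last) can be sketched as a statement plan. This is a hedged sketch, not the PR's implementation: the `CodeBlock` and `Statement` types, the column names, and `buildIngestPlan` are hypothetical, and the SQL assumes libSQL's documented `F32_BLOB`, `vector32` and `libsql_vector_idx` features behave as in the Turso docs.

```typescript
// Hypothetical shapes for illustration; not the PR's actual types.
type CodeBlock = { id: string; filePath: string; embedding: number[] }
type Statement = { sql: string; args: string[] }

// Returns statements in execution order: table, all inserts, then the index.
function buildIngestPlan(table: string, dims: number, blocks: CodeBlock[]): Statement[] {
	const plan: Statement[] = [
		{
			sql: `CREATE TABLE IF NOT EXISTS ${table} (id TEXT PRIMARY KEY, file_path TEXT, embedding F32_BLOB(${dims}))`,
			args: [],
		},
	]
	for (const block of blocks) {
		plan.push({
			// vector32() converts a '[1,2,3]'-style string into a float32 vector.
			sql: `INSERT INTO ${table} (id, file_path, embedding) VALUES (?, ?, vector32(?))`,
			args: [block.id, block.filePath, JSON.stringify(block.embedding)],
		})
	}
	// Creating the index last means it is built once instead of updated per insert.
	plan.push({
		sql: `CREATE INDEX IF NOT EXISTS ${table}_idx ON ${table} (libsql_vector_idx(embedding))`,
		args: [],
	})
	return plan
}
```

In practice the insert statements would probably be grouped into batches or transactions before being sent to the database client; that part is left out here.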

@PaperBoardOfficial PaperBoardOfficial marked this pull request as ready for review July 26, 2025 17:43
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. enhancement New feature or request labels Jul 26, 2025
@PaperBoardOfficial
Contributor Author

UPDATE 4:
I have fixed the test cases and now the PR is open for review.

	return false
} catch (error) {
	console.error(`Failed to initialize vector store table ${this.tableName}:`, error)
	throw new Error(`LibsqlConnectionFailed ${error instanceof Error ? error.message : String(error)}`)
Contributor

Typo/consistency: The error message throws LibsqlConnectionFailed but elsewhere the naming is LibSQL.... Consider changing it to LibSQLConnectionFailed for consistency.

Suggested change
- throw new Error(`LibsqlConnectionFailed ${error instanceof Error ? error.message : String(error)}`)
+ throw new Error(`LibSQLConnectionFailed ${error instanceof Error ? error.message : String(error)}`)

@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 26, 2025
@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Jul 26, 2025
@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jul 27, 2025
@NaccOll
Contributor

NaccOll commented Jul 28, 2025

Can you use the project name as the prefix of dbName? That would make it easy to identify an unneeded codebase index on disk without having to delete it through Roo Code, which I will need later when I run low on disk space.
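
One way the naming suggestion above could look is sketched below. This is only an illustration: `buildDbName` and `fnv1a` are hypothetical helpers, and the use of a short path hash as a suffix (to keep two projects with the same folder name apart) is an assumption rather than anything the PR does.

```typescript
// Small non-cryptographic hash (32-bit FNV-1a), returned as 8 hex characters.
function fnv1a(input: string): string {
	let hash = 0x811c9dc5
	for (let i = 0; i < input.length; i++) {
		hash ^= input.charCodeAt(i)
		hash = Math.imul(hash, 0x01000193) >>> 0
	}
	return ("00000000" + hash.toString(16)).slice(-8)
}

// Builds "<project>-<hash>.db" from a workspace path, e.g. "roo-code-xxxxxxxx.db".
function buildDbName(workspacePath: string): string {
	const segments = workspacePath.split(/[\\/]+/).filter((segment) => segment.length > 0)
	const folder = segments.length > 0 ? segments[segments.length - 1] : "workspace"
	// Keep the prefix readable and safe to use as a file name.
	const project = folder.toLowerCase().replace(/[^a-z0-9._-]+/g, "-")
	return `${project}-${fnv1a(workspacePath)}.db`
}
```

With names like this, the index files in the storage directory can be matched to projects by eye.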

Another suggestion is to move the libsql download into libsql-client.ts. I see that you download libsql when the extension starts, but it is quite likely that some users will never use the code index feature.
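
A rough sketch of the deferral suggested above: wrap the download in a lazy, run-once initializer that the vector store calls before its first use. `lazyOnce` is a hypothetical helper, not part of this PR, and the wrapped function stands in for the PR's download step, whose exact signature is not shown in this thread.

```typescript
// Wraps an async initializer so concurrent and repeated callers share one run.
function lazyOnce<T>(init: () => Promise<T>): () => Promise<T> {
	let pending: Promise<T> | undefined
	return () => {
		if (pending === undefined) {
			pending = init().catch((error: unknown) => {
				// Forget the failed attempt so a later call can try again.
				pending = undefined
				throw error
			})
		}
		return pending
	}
}
```

The client would then call something like `await ensureNativeModule()` at the start of its initialization, so users who never enable code indexing never trigger the download.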

@PaperBoardOfficial
Contributor Author

@NaccOll fr, your suggestions go hard. i’m lowkey on the same page. just waitin to see if @mrubens and the Roo-Code crew are vibin with the LibSQL PR and down to merge it. if they fw it, i’ll slide in and switch things up no cap.

@daniel-lxs
Member

Hey @PaperBoardOfficial, thanks for putting this together. After reviewing the changes, we believe the implementation adds a fair amount of complexity and could impact resource usage.

I'm open to further discussion, but for now I'm going to close this PR. That doesn't mean we don't appreciate the time and effort you put into it, it's just that this particular approach doesn't align with our current roadmap for the project.

Thank you again.

@daniel-lxs daniel-lxs closed this Jul 28, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jul 28, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Jul 28, 2025
@PaperBoardOfficial
Contributor Author

@daniel-lxs
It just taps a native lib with direct file access. Would it be wild to actually test how much real-world impact it really has?
Not saying it’s the endgame, but lowkey might be lighter than expected. If making it optional (like an experimental flag) or sandboxed helps, I’m chill exploring that route.
Appreciate you giving it a look. You’ve got rep here, so your POV hits.

Since you said you're open to chatting more, maybe we vibe-check this with tight benchmarks or a scoped-down test? Just to see if it really needs to be put in the bin.
Also, would love to know what does line up with the roadmap right now. If there’s any way I can plug in and be useful there, I’m totally down.

@daniel-lxs
Member

@PaperBoardOfficial I'm all for it. If we can come up with solid data showing that this doesn't impact performance, it will be way easier to sell to the dev team.

@PaperBoardOfficial
Contributor Author

Thanks @daniel-lxs . I’m thinking of running some benchmarks, looking at memory, CPU, and maybe comparing it with local Qdrant to get a clear picture.

That said, if resource usage turns out to be minimal, would complexity or roadmap alignment still be a blocker?

If the answer is a "maybe", that’s totally fair. I’d rather not sink more time into it if it's unlikely to move forward. In that case, I’ll keep the LibSQL integration in my fork and use it for personal builds or maybe share it with other downstreams like Kilo.

Appreciate you being upfront either way.

Labels

enhancement New feature or request PR - Needs Preliminary Review size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Local Embedding and Local Vector Store for Indexing

4 participants