feat(search): support code search by zoekt #33850
adlternative wants to merge 1 commit into go-gitea:main
There are already so many search engines built into Gitea. Many of them have various bugs. So the questions are:
To be honest, I prefer the zoekt search engine to the existing search engine.
Maybe this can replace bleve, but we need some comparison tests.
That's understandable. So a few months later, someone else feels "yoekt" is better and introduces "yoekt"; a few months after that, someone feels "xoekt" is better and introduces "xoekt", and then "woekt", "voekt", "uoekt" ... "coekt", "boekt", "aoekt". Then Gitea contains every search engine on the internet. I do not object to introducing improvements. But it actually needs to:
So a clear roadmap about the "search engine plan" is necessary.
In my opinion, supporting multiple search engines is a good thing, as users may have different needs. Even GitLab now supports both ES and Zoekt search engines. See https://docs.gitlab.com/user/search
I'm not too worried about this; Gitea should have good community maintenance. It might be because the code search functionality is not exposed by default, so many bugs haven't been discovered.
Well, do you know how many search engines are in Gitea now? And what longstanding bugs they have? https://github.com/go-gitea/gitea/issues?q=is%3Aissue%20state%3Aopen%20code%20search And some bugs haven't been fixed in months, for example: "Search Functionality Issues with Bleve Engine #31565". I don't see "good community maintenance".
You don't need to worry about this: zoekt is a popular code search engine, currently used by code platforms like Gerrit, Sourcegraph, and GitLab, written by a Gerrit author, and maintained by Sourcegraph. Zoekt has advantages that traditional search engines (like ES) do not possess: support for regex matching, substring search, etc. I don't think any new open-source code search engine will be able to replace it in the short term.
You are right. Where should the roadmap be written? I don't have experience with this. I will add documentation once the zoekt functionality is more complete.
Yep, if zoekt wins, we need to drop some others.
Sure, it's regrettable that this part of the content is unmaintained. However, for the zoekt code search, I can commit to maintaining it thoroughly.
Yeah, I hope this can be divided into at least two steps:
Zoekt may also have some issues, as GitLab has not completely deprecated ES and fully switched to Zoekt...
17d7c30 to 212fc79
To make the code clear, we need to refactor the related code first: "Refactor issue & code search #33860". Each "indexer" should provide the "search modes" it supports by itself. And we need to remove the "fuzzy" search for code.
Please note that I have many other commitments over the next two weeks and may only be able to dedicate time to this MR in a couple of weeks.
783ee0e to 374ce10
850a16a to 86ef977
86ef977 to 9906c5f
77c7d42 to 0b3bda1
@lunny I've pushed a new version, are there any other issues?
0b3bda1 to 64e7704
64e7704 to 2619202
The comment at https://github.com/go-gitea/gitea/pull/33850/changes#r2638055239 still needs to be resolved.
2619202 to e593603
Alright, it looks like this part isn't very important, so I've removed that section of the comment.
e593603 to ccc425b
It seems there is a conflict which GitHub didn't catch.
I don't have any leads right now. At what stage did this occur? During the Windows compile? Or during linting?
Please take a look at the CI error.
8a35e6c to 0e4028f
The language stats are now ordered randomly. They should be sorted by the number of matching files.
0e4028f to 8479a3c
Good suggestion, it has been corrected to sort from most to least. |
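For illustration only, a most-to-least ordering like the one described can be sketched in Go as below. This is not the code from the PR: the languageStat type and sortedLanguageStats function are hypothetical names, and the sketch assumes the counts are held in a map, whose iteration order Go leaves unspecified. Ties are broken by language name so the output is stable.

```go
package main

import (
	"fmt"
	"sort"
)

// languageStat is a hypothetical pair of language name and matched file count.
type languageStat struct {
	Language string
	Files    int
}

// sortedLanguageStats turns a language->count map into a slice ordered
// from most to least files, breaking ties alphabetically by language.
func sortedLanguageStats(counts map[string]int) []languageStat {
	stats := make([]languageStat, 0, len(counts))
	for lang, n := range counts {
		stats = append(stats, languageStat{Language: lang, Files: n})
	}
	sort.Slice(stats, func(i, j int) bool {
		if stats[i].Files != stats[j].Files {
			return stats[i].Files > stats[j].Files
		}
		return stats[i].Language < stats[j].Language
	})
	return stats
}

func main() {
	counts := map[string]int{"Go": 12, "Markdown": 3, "TypeScript": 7, "YAML": 3}
	for _, s := range sortedLanguageStats(counts) {
		fmt.Printf("%s: %d\n", s.Language, s.Files)
	}
}
```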
Just pinging to check the status of this PR. Any changes needed?
Please update the branch; the catfile needs some updates.
Signed-off-by: ZheNing Hu <adlternative@gmail.com>
8479a3c to 456648f
Add Zoekt indexer support for code search with the following features:
- ZoektIndexer implementation for Unix systems
- Noop implementation for non-Unix systems (zoekt only supports Unix)
- Dynamic indexer path based on indexer type
- Updated configuration documentation
This implements PR go-gitea#33850 from go-gitea/gitea.
Co-authored-by: trucpd <92442070+trucpd@users.noreply.github.com>
Abstract
Zoekt is an open-source search engine designed specifically for code search. It indexes content as 3-grams (trigrams), which makes substring lookups efficient. Replacing Elasticsearch/Bleve with Zoekt gives Gitea precise code search and support for regular expression searches.
Motivation
The existing code search functionality is implemented using Elasticsearch/Bleve. Although Elasticsearch/Bleve excel at general-purpose search, their disadvantages for code search are obvious:
Proposal
Goals
Support precise substring searches
Support regex searches
Non-Goals
Support multi-branch searches
Support code symbol syntax searches
Competitive Product Analysis
Design
Index
Since Zoekt is written in Go, it can be integrated directly through its Go package, using indexBuilder.Add() and indexBuilder.MarkFileAsChangedOrRemoved() to add or remove indexed files. The basic process for full and incremental repository indexing in Zoekt does not differ significantly from that in Elasticsearch (ES) or Bleve.
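As a rough sketch of the incremental case, the following Go program decides which paths would go to each of the two builder calls named above. It deliberately does not call Zoekt; the plan type, planIncremental function, and the path-to-hash maps are hypothetical, and the mapping of "modified" files to both calls is an assumption about how a delta build would be driven.

```go
package main

import (
	"fmt"
	"sort"
)

// plan lists the paths that would be handed to each builder call.
type plan struct {
	add  []string // new or modified files: candidates for Add()
	mark []string // modified or deleted files: candidates for MarkFileAsChangedOrRemoved()
}

// planIncremental compares the path->content-hash map of the last indexed
// commit with that of the current commit.
func planIncremental(indexed, current map[string]string) plan {
	var p plan
	for path, hash := range current {
		old, ok := indexed[path]
		switch {
		case !ok:
			// File is new: only needs to be added.
			p.add = append(p.add, path)
		case old != hash:
			// File changed: invalidate the old copy and add the new one.
			p.add = append(p.add, path)
			p.mark = append(p.mark, path)
		}
	}
	for path := range indexed {
		if _, ok := current[path]; !ok {
			// File was deleted: only needs to be invalidated.
			p.mark = append(p.mark, path)
		}
	}
	sort.Strings(p.add)
	sort.Strings(p.mark)
	return p
}

func main() {
	indexed := map[string]string{"a.go": "h1", "b.go": "h2", "c.go": "h3"}
	current := map[string]string{"a.go": "h1", "b.go": "h2x", "d.go": "h4"}
	p := planIncremental(indexed, current)
	fmt.Println("add:", p.add)   // prints add: [b.go d.go]
	fmt.Println("mark:", p.mark) // prints mark: [b.go c.go]
}
```

A full index is the degenerate case where the indexed map is empty, so every current file lands in the add list.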
Search
We can use shards.NewDirectorySearcher() or shards.NewDirectorySearcherFast() to build a searcher for searching. The search modes will support:
Since the search is currently limited to a single repository, we will retrieve all the content first and then handle pagination.
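The "retrieve everything, then paginate" approach can be sketched as a plain slice operation. This is a minimal illustration, not the PR's code; the paginate helper and its 1-based page numbering are assumptions.

```go
package main

import "fmt"

// paginate returns the items on the given 1-based page, or nil when the
// page is out of range or the arguments are invalid.
func paginate[T any](items []T, page, pageSize int) []T {
	if page < 1 || pageSize < 1 {
		return nil
	}
	start := (page - 1) * pageSize
	if start >= len(items) {
		return nil
	}
	end := start + pageSize
	if end > len(items) {
		end = len(items)
	}
	return items[start:end]
}

func main() {
	// Pretend these are all matches returned for one repository.
	matches := []string{"a.go", "b.go", "c.go", "d.go", "e.go"}
	fmt.Println(paginate(matches, 2, 2)) // prints [c.go d.go]
	fmt.Println(paginate(matches, 3, 2)) // prints [e.go]
}
```

This keeps the total match count available for the pager, at the cost of materializing every result in memory, which is acceptable only while the search stays scoped to one repository.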
Usage
Enable this in app.ini
Resource Usage
Building the index in Zoekt requires 1.2 times the corpus size in RAM, and the index storage size is about three times the corpus size. Maybe we should expose some of Zoekt's internal Prometheus metrics in the future?
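For capacity planning, the ratios quoted above translate into simple arithmetic. The sketch below applies them to a corpus size; the estimateZoektResources name is hypothetical, and the factors are just the 1.2x and 3x figures from this section, not measured values.

```go
package main

import "fmt"

// estimateZoektResources applies the ratios quoted in this section:
// about 1.2x the corpus size in RAM while building the index, and
// about 3x the corpus size for the index on disk. Sizes are in bytes.
func estimateZoektResources(corpusBytes int64) (ramBytes, indexBytes int64) {
	ramBytes = corpusBytes * 12 / 10 // 1.2x, using integer arithmetic
	indexBytes = corpusBytes * 3
	return ramBytes, indexBytes
}

func main() {
	const gib = int64(1) << 30
	ram, index := estimateZoektResources(10 * gib)
	fmt.Printf("RAM: %d GiB, index: %d GiB\n", ram/gib, index/gib) // prints RAM: 12 GiB, index: 30 GiB
}
```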
Existing Issues
Try to support #33702