Hi MinishLab team! I'm opening an issue to propose adding duplicate code detection to semble.
I have been developing and testing this feature in my own fork - draft PR here.
Full disclosure: all the code has been written by agents and reviewed by me. This issue text and any communication I have with you will be human-written.
What problem you're solving
In my experience, agents end up writing a lot of duplicate code, even when they read files in full and have context for prior work. They also struggle to come up with reasonable abstractions when necessary. A repeatable way for agents to review duplicate code and see the full surface area available for an abstraction would help improve code quality and maintainability for projects using agentic coding tools.
Why it belongs in semble (as opposed to a wrapper or separate tool)
In its current form, semble has the foundation to help with this problem because it already has the chunked source code and semantic embeddings.
A wrapper or separate tool could run repeated search / find_related calls to get at the same thing, but that would be inefficient: it would basically have to re-implement a lot of the foundational work already in semble.
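To make the idea concrete, here is a rough sketch of how duplicate clusters can fall out of chunk embeddings. It is an illustration only, not semble's internals and not the code in the fork: the chunk ids, the toy vectors, and the `cluster_duplicates` helper are all made up, and it compares every pair of chunks, whereas a real implementation would presumably limit comparisons to each chunk's nearest neighbours.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_duplicates(embeddings, min_score=0.9):
    """Group chunk ids that are linked by a similarity >= min_score.

    `embeddings` maps a (hypothetical) chunk id to its embedding vector.
    Linked chunks are merged with a small union-find structure.
    """
    parent = {cid: cid for cid in embeddings}

    def find(cid):
        # Walk up to the root, compressing the path as we go.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    # All-pairs comparison: fine for a toy example, quadratic in general.
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= min_score:
            parent[find(a)] = find(b)

    groups = {}
    for cid in embeddings:
        groups.setdefault(find(cid), []).append(cid)
    # Singletons are not duplicates, so drop them.
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Toy embeddings: the first two chunks point in nearly the same direction.
toy = {
    "a.py:1-10": [1.0, 0.0, 0.1],
    "b.py:5-14": [0.98, 0.05, 0.12],
    "c.py:2-9": [0.0, 1.0, 0.0],
}
clusters = cluster_duplicates(toy)
print(clusters)
```

The point is that the expensive parts (chunking and embedding) are exactly what semble already computes, so the grouping step on top is comparatively small.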
What API or behaviour change it would involve, if any
In Python:
    duplicates = index.find_duplicates(top_k=5)
And CLI:
    semble find-duplicates ./my-project
As well as a find_duplicates MCP tool.
Based on a few informal experiments, I suggest the following default behavior:
- compare chunks within the same language
- skip test-looking paths
- skip low-signal static data/config chunks
- skip import/header/attribute scaffolding
All of the above have opt-in flags to customize behavior. There are also arguments for language, include/exclude paths, minimum lines, minimum score, structural score floor, candidate breadth, and minimum cluster size.
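As a rough illustration of what a few of these defaults could look like as pre-filters, here is a minimal sketch. Everything in it is hypothetical: the `Chunk` record, the helper names, the `include_tests` parameter, and the "test-looking path" heuristic are assumptions for the example, not the fork's actual code or flag names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk record; not semble's actual data model.
    file_path: str
    language: str
    start_line: int
    end_line: int


def looks_like_test(path):
    """Assumed heuristic: a test/tests directory, or a test_*/*_test file."""
    p = PurePosixPath(path)
    if any(part in {"test", "tests"} for part in p.parts[:-1]):
        return True
    return p.stem.startswith("test_") or p.stem.endswith("_test")


def is_candidate(chunk, min_lines=8, include_tests=False):
    """Drop test-looking paths (unless opted in) and chunks that are too short."""
    if not include_tests and looks_like_test(chunk.file_path):
        return False
    return chunk.end_line - chunk.start_line + 1 >= min_lines


def comparable(a, b):
    """Only compare two distinct chunks written in the same language."""
    return a.language == b.language and a != b


app = Chunk("src/app.py", "python", 1, 20)
test = Chunk("tests/test_app.py", "python", 1, 20)
print(is_candidate(app), is_candidate(test), is_candidate(test, include_tests=True))
```

Filtering before scoring keeps low-signal chunks out of the candidate set entirely, rather than relying on the score thresholds to reject them afterwards.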
A minimal (code) example of how it would work
Python:
    from semble import SembleIndex
    index = SembleIndex.from_path("./my-project")
    clusters = index.find_duplicates(
        top_k=3,
        min_lines=8,
        min_structural_score=0.40,
    )
    for cluster in clusters:
        print(cluster.score)
        for member in cluster.members:
            print(member.file_path, member.start_line, member.end_line)
I have mostly been using this feature through the CLI. The fork is installable as a uv tool, so you can try these examples yourselves.
For example, you can clone a repo like django and run the following (piping the output into jq):
    uvx --from "semble @ git+https://github.com/pmbaumgartner/semble.git@duplicate-discovery-surface" semble find-duplicates ./django --language python | jq
That produces quite a bit of output, so I've added a --detail compact arg that shows only the content of the top matching pairs and limits output to 5 pairs per cluster.
I'm happy to get into more details and discuss below. Thanks for your work on semble!