Duplicate Code Detection Feature Proposal

Hi MinishLab team! I'm opening this issue to propose adding duplicate code detection to `semble`.

I have been developing and testing this feature in my own fork - [draft PR here](https://github.com/pmbaumgartner/semble/pull/1). 

Full disclosure: all the code has been written by agents and reviewed by me. This issue text and any communication I have with you will be human-written.

**What problem you're solving**
In my experience, agents end up writing a lot of duplicate code, even when they read full files and have context for prior work. They also struggle to come up with reasonable abstractions when one is needed. Having a repeatable way for agents to review duplicate code and see the full surface area available for an abstraction would help improve code quality and maintainability for projects that use agentic coding tools.

**Why it belongs in semble (as opposed to a wrapper or separate tool)**

In its current form, `semble` has the foundation to help with this problem because it already has the chunked source code and semantic embeddings. 

A wrapper or separate tool could run repeated `search` / `find_related` calls to get at the same thing, but that would be inefficient: it would essentially have to re-implement much of the foundational work already in `semble`.
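
To make the point concrete, here is a minimal sketch of how candidate duplicate pairs could fall out of chunk embeddings once they exist. It is an illustration only and is not taken from the draft PR: the function names, the `min_score` threshold, and the toy vectors are all assumptions.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_duplicate_pairs(embeddings, min_score=0.9):
    """Return (i, j, score) for every chunk pair scoring at least min_score.

    Hypothetical helper: `embeddings` is a list of vectors, one per chunk.
    """
    pairs = []
    # combinations() yields each unordered pair once and never pairs a chunk with itself.
    for (i, a), (j, b) in combinations(enumerate(embeddings), 2):
        score = cosine(a, b)
        if score >= min_score:
            pairs.append((i, j, score))
    return pairs


# Toy data: chunks 0 and 1 point in nearly the same direction, chunk 2 does not.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
pairs = candidate_duplicate_pairs(toy_embeddings)
```

A real implementation would presumably avoid the all-pairs loop on large repositories, which is another reason to reuse the index `semble` already builds.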


**What API or behaviour change it would involve, if any**

In Python:

```python
duplicates = index.find_duplicates(top_k=5)
```

And CLI:

```
semble find-duplicates ./my-project
```

There would also be a `find_duplicates` MCP tool.

Based on a few informal experiments, I suggest the following default behavior:

- compare chunks within the same language
- skip test-looking paths
- skip low-signal static data/config chunks
- skip import/header/attribute scaffolding 

All of the above have opt-in flags to customize behavior. There are also arguments for language, include/exclude paths, minimum lines, minimum score, structural score floor, candidate breadth, and minimum cluster size.
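
As a rough sketch of what some of these defaults amount to, the filter below checks same-language, non-test, and minimum-length conditions for a pair of chunks. The `Chunk` class, the helper names, and the test-path heuristic are hypothetical stand-ins for illustration; they are not `semble`'s types or the logic in the draft PR.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a source chunk (not semble's own type)."""
    file_path: str
    language: str
    start_line: int
    end_line: int


def looks_like_test_path(path: str) -> bool:
    """Assumed heuristic: a test directory or a test-style file name."""
    parts = PurePosixPath(path).parts
    name = parts[-1] if parts else ""
    in_test_dir = any(part in {"test", "tests"} for part in parts[:-1])
    return in_test_dir or name.startswith("test_") or name.endswith("_test.py")


def is_comparable(a: Chunk, b: Chunk, min_lines: int = 8) -> bool:
    """Return True if the pair passes the default filters sketched here."""
    def long_enough(chunk: Chunk) -> bool:
        return chunk.end_line - chunk.start_line + 1 >= min_lines

    if a.language != b.language:
        return False  # only compare chunks within the same language
    if looks_like_test_path(a.file_path) or looks_like_test_path(b.file_path):
        return False  # skip test-looking paths
    return long_enough(a) and long_enough(b)
```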

**A minimal (code) example of how it would work**

Python:
```python
from semble import SembleIndex

index = SembleIndex.from_path("./my-project")

clusters = index.find_duplicates(
    top_k=3,
    min_lines=8,
    min_structural_score=0.40,
)

for cluster in clusters:
    print(cluster.score)
    for member in cluster.members:
        print(member.file_path, member.start_line, member.end_line)
```
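
The result above is a list of clusters rather than raw pairs. One plausible way to get from scored pairs to clusters, shown here as an assumption and not as a description of the draft PR, is to merge transitively linked pairs with a small union-find and then drop groups below a minimum cluster size. The `cluster_pairs` name and its `min_cluster_size` parameter are hypothetical.

```python
from collections import defaultdict


def cluster_pairs(pairs, min_cluster_size=2):
    """Group chunk ids linked by duplicate pairs into sorted clusters."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values() if len(g) >= min_cluster_size)


# Chunks 0, 1, 2 are linked transitively; 5 and 6 form a separate pair.
clusters = cluster_pairs([(0, 1), (1, 2), (5, 6)])
```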

I have mostly been using this feature through the CLI in my fork. The fork is installable as a uv tool, so you can try these examples yourselves.

For example, you can clone a repo like [django](https://github.com/django/django) and run the following (piping the output into `jq`):

```
uvx --from "semble @ git+https://github.com/pmbaumgartner/semble.git@duplicate-discovery-surface" semble find-duplicates ./django --language python | jq
```

That produces a lot of output, so I've added a `--detail compact` arg that shows only the content of the top matching pairs and limits the output to 5 pairs per cluster.

---

I'm happy to go into more detail and discuss below. Thanks for your work on `semble`!

