Skip to content

Conversation

@Hweinstock
Copy link
Contributor

@Hweinstock Hweinstock commented Oct 22, 2024

Problem

New contributors often copy-paste logic and test cases, which creates a maintenance cost for little gain.

Examples

We call these repeated segments of code "clones".

Solution

implement a static threshold.
Run jscpd in CI and compare the % duplicated to a fixed threshold of 1.34%, which is the current duplication rate in the repo.
The main issue with this solution is that it fails to provide easy debugging options when new duplicates are found. Currently exploring adding this functionality in a follow-up.

Notes

  • this is a summary of the problem/solution. For full information, see full doc (VSCode: Detecting and Reducing Duplicate Code) describing changes.

License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Hweinstock Hweinstock changed the title ci(lint): add script to reject PRs that increase # of "clones". [WIP] ci(lint): run jscpd in CI, compare to static threshold. Oct 24, 2024
@Hweinstock Hweinstock marked this pull request as ready for review October 24, 2024 21:18
@Hweinstock Hweinstock requested a review from a team as a code owner October 24, 2024 21:18
- run: npm run lint

jscpd:
# needs: lint-commits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like we just need to remove this an its probably good to merge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, still exploring adding some more functionality, but that can be follow up. Seems pretty low-risk to test this simple version in the meantime.

Copy link
Contributor

@justinmk3 justinmk3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@Hweinstock Hweinstock merged commit 7dabb7c into aws:master Oct 25, 2024
24 of 25 checks passed
@Hweinstock Hweinstock deleted the lint/noCopyPaste branch October 25, 2024 20:34
justinmk3 pushed a commit that referenced this pull request Nov 5, 2024
## Problem
Follow up to: #5833

A static threshold is useful, but if the threshold fails, we want to be
able to see what the new duplicated code is.

## Solution
Run `jscpd` on the current branch, and if it is above the set threshold,
we fail (same as before).

Now, we also filter the resulting report to contain only clones from the
files changed with respect to the base branch. Then, if the filtered
report is non-empty, i.e. one of the changed files has a clone, fail and
display the clones in CI.

Adding this extra step allows us to know when we are adding a clone that
was not otherwise in the codebase. It also allows us to see this clone
in CI, before the PR is merged, giving the developer a chance to remove
it.

As an example, see commit
(a8e4860)
where a copy-pasted function is added, and notice that CI fails and
displays only the section of the report relevant to this clone. Then,
once we remove this work in the following commit
(53d3973),
the check passes.

Included in this PR, is that it runs only on `pull_request` and no
longer on `push`.
## Notes
- In developing this, it became apparent that `jscpd` is not perfect and
was not able to detect the clone added in
(e7357c0).
Will need to dig deeper into their algorithm (based on
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm) and
implementation (https://github.com/kucherenko/jscpdto) to understand why
this is missed, and if it presents a larger problem.
- Also, if a clone already exists in a file that is changed in the PR,
this check will fail. In this case, the options are to ignore the check,
mark the case as an exception, or fix the clone. To encourage developers
to not repeatedly ignore the check, we still employ a static threshold
via #5833.
- If the branch hasn't been rebased up to date, it can pick up clones
from unrelated changes.

## Future Work
- We want to lower the threshold as much as possible by removing
duplicates and marking exceptions when necessary.
- More experience using this tool will help us find a `minLines`
argument that provides the desired sensitivity. This argument allows us
to ignore clones under `minLines` in length. Right now it is 3, which is
likely too granular, but better to start there and loosen as needed.
- Right now the CI output of the clones is very verbose. Digging through
this could be time consuming, especially when many clones are added.
Could be worth filtering it to provide a more ergonomic experience.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants