Commit 7f6592c

Update 2025-11-18-generative-ai-peer-review.md

1 file changed: _posts/2025-11-18-generative-ai-peer-review.md (36 additions, 36 deletions)
LLMs are perceived as helping developers:

- Explain complex codebases
- Generate unit tests and docstrings
- Simplify language barriers, in some cases, for participants in open source around the world
- Speed up everyday workflows

Some contributors perceive these products as making open source more accessible. And for some, maybe they do. However, LLMs also present unprecedented social and environmental challenges.

### Incorrectness of LLMs and misleading time benefits
Heavy reliance on LLMs risks producing developers who can prompt, but not debug, maintain, or secure production code. This risk undermines long-term project sustainability and growth. In the long run, it will make it [harder for young developers to learn how to code and troubleshoot independently](https://knowledge.wharton.upenn.edu/article/without-guardrails-generative-ai-can-harm-education/).

> We’re really worried that if humans don’t learn, if they start using these tools as a crutch and rely on it, then they won’t actually build those fundamental skills to be able to use these tools effectively in the future. _Hamsa Bastani_

### Ethics and inclusion
Our community’s expectation is simple: **be open and disclose any Generative AI use in your package** when you submit it to our open software review process.

- Disclose LLM use in your README and at the top of relevant modules.
- Describe how the Generative AI tools were used in your package's development.
- Be clear about what human review you performed on Generative AI outputs before submitting the package to our open peer review process.

Transparency helps reviewers understand context, trace decisions, and focus their time where it matters most.
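For instance, a module-level disclosure might look like the sketch below. The wording, module purpose, and function name are all hypothetical; the point is only that the disclosure is present, specific, and accurate.

```python
"""Utilities for cleaning tabular sensor data.

AI disclosure: parts of this module (the `fill_gaps` helper) were
drafted with an LLM. All generated code was reviewed, edited, and
tested by a human before submission; see the README for details.
"""


def fill_gaps(values, fill=0.0):
    """Return a copy of `values` with None entries replaced by `fill`."""
    return [fill if v is None else v for v in values]
```

A reviewer reading this module knows immediately which code to scrutinize and what human review already happened.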
### Human oversight

LLM-assisted code must be **reviewed, edited, and tested by humans** before submission.

- Run your tests and confirm the correctness of the code that you submitted.
- Check for security and quality issues.
- Ensure style, readability, and concise docstrings. Depending on the AI tool, generated docstrings can sometimes be overly verbose without adding meaningful understanding.
- Explain your review process in your software submission to pyOpenSci.

Please **don't offload vetting of generative AI content to volunteer reviewers**. Arrive with human-reviewed code that you understand, have tested, and can maintain. As the submitter, you are accountable for your submission: you take responsibility for the quality, correctness, and provenance of all code in your package, regardless of how it was generated.
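To illustrate the docstring point above, compare a verbose, LLM-style docstring with a concise one. Both functions are hypothetical and behave identically; only the second docstring respects the reader's time.

```python
# Verbose, LLM-style docstring: restates the signature without adding insight.
def scale_verbose(values, factor):
    """Scale values by factor.

    This function takes a parameter called values, which is a list of
    numbers, and a parameter called factor, which is a number. It then
    multiplies each number in values by factor and returns a new list
    containing the results of the multiplication.
    """
    return [v * factor for v in values]


# Concise docstring: states the contract in one line.
def scale(values, factor):
    """Return a new list with each element of `values` multiplied by `factor`."""
    return [v * factor for v in values]
```

When you review generated docstrings, trim anything a reader could infer from the signature alone.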
### Watch out for licensing issues

LLMs are trained on large amounts of open source code, and most of that code has licenses that require attribution (including permissive licenses like MIT and BSD-3).
The problem? LLMs sometimes produce near-exact copies of that training data, but without any attribution or copyright notices. **LLM output does not comply with the license requirements of the input code, even when the input is permissively licensed**, because it fails to provide the required attribution.

Why this matters:

- LLM-generated code may be _substantially similar_ to copyrighted training data; sometimes it is identical. Copyright law focuses on how similar your content is to the original.
- You can't trace what content the LLM learned from (the black box problem); this makes due diligence impossible on your part. You might accidentally commit plagiarism or copyright infringement by using LLM output in your code, even if you modify it.
- License conflicts occur because of both items above. Read on...

When licenses clash, it gets particularly messy. Even when licenses are compatible (e.g., MIT-licensed training data and MIT-licensed output), you still have a violation because attribution is missing. With incompatible licenses (say, an LLM outputs GPL code and your package uses MIT), you can't just add attribution to fix it; you'd technically have to delete everything and rewrite it from scratch using clean-room methods to comply with licensing requirements.

The reality is that you can't eliminate the risk of license infringement or plagiarism with current LLM technology. But you can be more thoughtful about how you use the technology.

**What you can do now:**
- Be aware that when you directly use content from an LLM, there will be inherent license conflicts and attribution issues.
- Understand and transform code returned from an LLM: Don't paste LLM outputs directly. Review, edit, and ensure you fully understand what you're using. You can ask the LLM questions to better understand its outputs. This approach also helps you learn, which addresses the education concerns that we raised earlier.
- **Use LLMs as learning tools**: Ask questions, review outputs critically, then write your own implementation based on understanding. Often the outputs of LLMs are messy or inefficient. Use them to learn, not to copy.
- Consider [clean-room techniques](https://en.wikipedia.org/wiki/Clean-room_design): Have one person review LLM suggestions for approach; have another person implement from that high-level description.
- **Document your process**: If you plan to submit a Python package for pyOpenSci review, we will ask you about your use of LLMs in your work. Document the use of LLMs in your project's README file and in any modules where LLM outputs have been applied. Confirm that the code has been reviewed by a human prior to submitting it to us, or to any other volunteer-led peer review process.

You can't control what's in training data, but you can be thoughtful about how you use these tools.
<div class="notice" markdown="1">
Examples of how these licensing issues are impacting and stressing our legal systems:

- [GitHub Copilot litigation](https://githubcopilotlitigation.com/case-updates.html)
- [Litigation around text from LLMs](https://arxiv.org/abs/2505.12546)
- [Incompatible licenses](https://dwheeler.com/essays/floss-license-slide.html)
</div>
### Review for bias

Inclusion is part of quality. Treat AI-generated text with the same care as code.
Given the known biases that can manifest in Generative AI-derived text:

- Review AI-generated text for stereotypes or exclusionary language.
- Prefer plain, inclusive language.
- Invite feedback and review from diverse contributors.
## Things to consider in your development workflows

If you are a maintainer or a contributor, some of the above can apply to your development and contribution process, too.
Just as peer review systems are being taxed, rapid, AI-assisted pull requests and issues can overwhelm maintainers. To combat this:

- Open an issue first before submitting a pull request to ensure it's welcome and needed.
- Keep your pull requests small with clear scopes.
- If you use LLMs, test and edit all of the output before you submit a pull request or issue.
- Flag AI-assisted sections of any contribution so maintainers know where to look closely.
- Be responsive to feedback from maintainers, especially when submitting code that is AI-generated.

## Where we go from here