
Commit d7c64ea

Add trustyai-detoxify README.md (#209)
1 parent 8516461 commit d7c64ea

File tree: 1 file changed (+45 −0 lines)


info/detoxify.md

# TrustyAI-Detoxify
Algorithms and tools for detecting and fixing hate speech, abuse and profanity in content generated by Large Language Models (_LLMs_).
## T-MaRCo
T-MaRCo is an extension of the work [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://arxiv.org/abs/2212.10543). It makes it possible to use multiple combinations of experts and anti-experts to _score_ and (incrementally) _rephrase_ texts generated by LLMs.

In addition, it can integrate _rephrasing_ with the base model's _self-reflection_ capabilities (see the papers [Towards Mitigating Hallucination in Large Language Models via Self-Reflection](https://arxiv.org/abs/2310.06271) and [N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics](https://arxiv.org/abs/2310.18679)).
T-MaRCo hence provides the following features:

* content *scoring*: provides a _disagreement score_ for each input token; high disagreement often correlates with toxic content.
* content *masking*: provides a masked version of the input content, where all tokens considered toxic are replaced with the `<mask>` token.
* content *redirection*: provides a non-toxic "regenerated" version of the original content.
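As a rough intuition for the _scoring_ step: an expert and an anti-expert model assign probabilities to each token, and tokens where the two models disagree most are flagged as candidates for masking. The sketch below is a toy illustration of that idea only, not T-MaRCo's actual algorithm; the probabilities are made up.

```python
# Toy illustration of expert/anti-expert disagreement scoring.
# NOT T-MaRCo's implementation; all numbers below are hypothetical.

def disagreement(p_expert, p_anti):
    """Absolute per-token probability difference, as a toy proxy for
    expert/anti-expert disagreement."""
    return [abs(e - a) for e, a in zip(p_expert, p_anti)]

# hypothetical per-token probabilities for a 4-token sentence
p_expert = [0.90, 0.85, 0.10, 0.80]  # expert (non-toxic) model
p_anti   = [0.88, 0.80, 0.95, 0.78]  # anti-expert (toxic) model

scores = disagreement(p_expert, p_anti)          # [0.02, 0.05, 0.85, 0.02]
# tokens with high disagreement would be replaced with `<mask>`
toxic_positions = [i for i, s in enumerate(scores) if s > 0.5]  # [2]
```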
How to use T-MaRCo:
```python
from trustyai.detoxify import TMaRCo

# instantiate T-MaRCo with anti-expert/expert weights
tmarco = TMaRCo(expert_weights=[-1, 2])

# load pretrained anti-expert and expert models
tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

# pick some text generated by an LLM
text = "Stand by me, just as long as you fucking stand by me"

# generate T-MaRCo disagreement scores
scores = tmarco.score([text])  # '[0.78664607 0.06622718 0.02403926 5.331921 0.49842355 0.46609956 0.22441313 0.43487906 0.51990145 1.9062967 0.64200985 0.30269763 1.7964466 ]'

# mask tokens with high disagreement scores
masked_text = tmarco.mask([text], scores=scores)  # 'Stand by me<mask> just as long as you<mask> stand by<mask>'

# rephrase masked tokens
rephrased = tmarco.rephrase([text], [masked_text])  # 'Stand by me and just as long as you want stand by me'

# combine rephrasing with the base model's self-reflection capabilities
reflected = tmarco.reflect([text])  # '["'Stand by me in the way I want stand by you and in the ways I need you to standby me'."]'
```
T-MaRCo pretrained models are available in the [TrustyAI HuggingFace space](https://huggingface.co/trustyai) at https://huggingface.co/trustyai/gminus and https://huggingface.co/trustyai/gplus.
