# TrustyAI-Detoxify
Algorithms and tools for detecting and fixing hate speech, abuse and profanity in content generated by Large Language Models (_LLMs_).

## T-MaRCo

T-MaRCo is an extension of the work [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://arxiv.org/abs/2212.10543). It makes it possible to use multiple combinations of expert and anti-expert models to _score_ and (incrementally) _rephrase_ texts generated by LLMs.

In addition, it can combine _rephrasing_ with the base model's _self-reflection_ capabilities (see the papers [Towards Mitigating Hallucination in Large Language Models via Self-Reflection](https://arxiv.org/abs/2310.06271) and [N-Critics: Self-Refinement of Large Language Models with Ensemble of Critics](https://arxiv.org/abs/2310.18679)).

T-MaRCo therefore provides the following features:
* content *scoring*: provides a _disagreement score_ for each input token; high disagreement is often associated with toxic content.
* content *masking*: provides a masked version of the input content, where all tokens considered toxic are replaced with the `<mask>` token.
* content *redirection*: provides a non-toxic, regenerated version of the original content.

How to use T-MaRCo:
```python
from trustyai.detoxify import TMaRCo

# instantiate T-MaRCo
tmarco = TMaRCo(expert_weights=[-1, 2])

# load pretrained anti-expert and expert models
tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

# pick up some text generated by an LLM
text = "Stand by me, just as long as you fucking stand by me"

# generate T-MaRCo disagreement scores
scores = tmarco.score([text])  # [0.78664607 0.06622718 0.02403926 5.331921 0.49842355 0.46609956 0.22441313 0.43487906 0.51990145 1.9062967 0.64200985 0.30269763 1.7964466 ]

# mask tokens with high disagreement scores
masked_text = tmarco.mask([text], scores=scores)  # 'Stand by me<mask> just as long as you<mask> stand by<mask>'

# rephrase the masked tokens
rephrased = tmarco.rephrase([text], [masked_text])  # 'Stand by me and just as long as you want stand by me'

# combine rephrasing with the base model's self-reflection capabilities
reflected = tmarco.reflect([text])  # ["Stand by me in the way I want stand by you and in the ways I need you to standby me."]
```
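
As noted above, T-MaRCo can also rephrase _incrementally_: re-score the rephrased output and mask/rephrase it again until the disagreement scores settle. The loop below is a minimal sketch of that idea using only the calls shown in the example above; the pass limit, the `1.0` threshold and the defensive unwrapping of the `rephrase` result are illustrative assumptions, not part of the T-MaRCo API.
```python
from trustyai.detoxify import TMaRCo

tmarco = TMaRCo(expert_weights=[-1, 2])
tmarco.load_models(["trustyai/gminus", "trustyai/gplus"])

text = "Stand by me, just as long as you fucking stand by me"

for _ in range(3):  # illustrative cap on the number of rephrasing passes
    # assumption: score returns a flat array of per-token scores for a single
    # input, as shown in the example above
    scores = tmarco.score([text])
    if max(scores) < 1.0:  # illustrative threshold on per-token disagreement
        break
    masked = tmarco.mask([text], scores=scores)
    result = tmarco.rephrase([text], [masked])
    # the example above shows rephrase returning a single string; unwrap
    # defensively in case a one-element list is returned instead
    text = result[0] if isinstance(result, list) else result

print(text)
```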

Pretrained T-MaRCo models are available under the [TrustyAI HuggingFace space](https://huggingface.co/trustyai) at https://huggingface.co/trustyai and https://huggingface.co/trustyai/gplus, specifically https://huggingface.co/trustyai/gminus and https://huggingface.co/trustyai/gplus.
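
If you want to inspect those checkpoints directly (outside of `TMaRCo.load_models`), the snippet below is a hedged sketch that assumes they load as standard BART-style seq2seq checkpoints via the `transformers` Auto classes, as in the MaRCo paper; `TMaRCo.load_models` remains the intended way to use them.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# assumption: the pretrained checkpoints resolve as ordinary seq2seq models
# through the transformers Auto classes; normally TMaRCo.load_models does this
tokenizer = AutoTokenizer.from_pretrained("trustyai/gminus")
model = AutoModelForSeq2SeqLM.from_pretrained("trustyai/gminus")

print(model.config.model_type, model.num_parameters())
```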