We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.

RishabSA/interp-refusal-tokens

From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

Contributors: 2