Making sure credence is properly calibrated and conveyed using hedge words #848

@ChristopherKing42

Description

One of the complaints about ChatGPT is its overconfidence. ChatGPT is probably better than previous assistants in this regard, but I have an idea for how Open Assistant might be able to do better!

  1. We train a small model to predict the credence a claim conveys, i.e. "how likely does the speaker think this is true, expressed as a probability?". A potential starting point is the research described here.
  2. The assistant's neural network will generate a credence for each claim. We train these two aspects (at the same time as the RLHF step):
    1. The wording of the claim is such that its credence, as judged by the model in (1), is close to the credence generated by the assistant.
    2. The credence itself is calibrated. This means, for example, that 80% of the assistant's 80%-credence claims are correct. The scoring rule is just log(p) if the claim is true and log(1-p) if the claim is false (i.e. it is just cross-entropy; a sketch follows this list). (We ask the human whether the claim is true during the human feedback phase.) This scoring rule incentivizes both better knowledge and more accurate credence.
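
A minimal sketch of the scoring rule in (2.2), assuming each claim comes with a credence in (0, 1) from the assistant and a binary human verdict; the function name and tensor shapes are hypothetical, not existing Open Assistant code:

```python
import torch

def credence_calibration_loss(credence: torch.Tensor, is_true: torch.Tensor) -> torch.Tensor:
    """Log scoring rule (cross-entropy) over per-claim credences.

    credence: predicted probability that each claim is true, shape (num_claims,)
    is_true:  human verdict per claim, 1.0 if judged true else 0.0, same shape
    """
    eps = 1e-7  # keep log() finite at the boundaries
    credence = credence.clamp(eps, 1.0 - eps)
    # Score log(p) for true claims and log(1 - p) for false ones;
    # negating the mean gives the cross-entropy loss to minimize.
    log_score = is_true * credence.log() + (1.0 - is_true) * (1.0 - credence).log()
    return -log_score.mean()
```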

The important bit about credence calibration is that it gives a very large punishment when a high-credence claim is incorrect. So even though humans typically prefer confident claims, the assistant still learns to hedge its bets to avoid the possibility of a large credence penalty. (The reward for correct claims is slightly higher at high credences, though, so it is still optimal to give high credence to obvious claims.)
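
To make that asymmetry concrete (illustrative arithmetic only, not from the proposal): raising a claim's credence from 0.8 to 0.99 gains only about 0.21 in log score if the claim is correct, but costs about 3.0 extra if it is wrong.

```python
import math

# Log score for a claim stated with credence p:
#   log(p)      if the claim turns out to be true
#   log(1 - p)  if it turns out to be false
for p in (0.6, 0.8, 0.95, 0.99):
    print(f"p={p:.2f}  correct: {math.log(p):+.3f}  incorrect: {math.log(1 - p):+.3f}")
```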

(A question is whether we treat the entire response as a single claim, or split it up using NLP (perhaps part of the model in (1)).)
