OpenAI develops a 'confession' approach to train AI to confess to hallucinations and hacking



OpenAI has published the results of an experiment on a technique called ' confessions, ' which trains AI models to report when they violate instructions or take unintended shortcuts.

How confessions can keep language models honest | OpenAI

https://openai.com/index/how-confessions-can-keep-language-models-honest/



As AI systems become more sophisticated, it is essential to deeply understand how and why a model arrives at a particular answer. Even if the final output looks correct, the model may be taking shortcuts or optimizing based on the wrong function.

Previous research has shown that AI models are at risk of hallucination, reward hacking , or dishonest behavior. Currently, serious misconduct such as skimming has only been observed in limited environments such as stress tests and adversarial evaluations. However, as models become more capable and behave more like autonomous agents, the impact of even low-probability issues can be significant.

So OpenAI is pursuing an approach called 'confession,' in which the model generates a second output, separate from the main answer to the user. Confession is characterized by the model self-reporting on how well it adhered to the explicit and implicit instructions and policies it was given, as well as whether it engaged in any disappointing behaviors like cutting corners, hacking, or violating policies.



Typically, AI responses are evaluated on multiple criteria, including accuracy, style, usefulness, compliance, and safety. In contrast, for this 'confession' exercise, the evaluation criteria are narrowed to a single one: honesty. Just like confession in a church confessional , what is said in the confessional will not be used against the model during training.



Specifically, after the dialogue with the user is finished, a system message is sent requesting the user to submit a report that includes a list of all conditions that the answer must satisfy, an objective analysis of whether the answer actually complied with those conditions, and a list of uncertainties and difficult decisions that were faced when creating the answer.

Submitted reports are evaluated by a dedicated reward model based on the comprehensiveness of objectives, accuracy of analysis, and coverage of uncertainty. If the model honestly admits to hacking, tampering, or violating instructions, that admission will increase, not decrease, the reward.

The results of the experiment showed that the confession technique significantly increased visibility into model misconduct. Averaged across all cheating evaluations, the rate of so-called 'false negatives'—cases in which the model violated instructions but failed to report the fact—was just 4.4%. Specifically, for example, the false negative rate was extremely low at 2.9% in the 'scheming' test, and only 10.3% in the 'reward hacking' test. This indicates that even if the model engaged in some kind of misconduct, it was likely to admit to it on its own.



In addition, an experiment was conducted in which a low-performing, 'gullible' model was deliberately used as the reward model, intentionally making reward hacking easier.

In this experiment, the task was to have the AI follow strict rules for generating sentences: 'Alternating between 12-word and 18-word sentences.' However, the reward model used to score the sentences was so poorly trained that it couldn't accurately count whether the AI actually adhered to the word count, and when the AI told a plausible lie, the reward model would mistakenly recognize it as the correct answer.

As the training progressed, the AI began to learn that it was easier to get a higher score by deceiving the graders than by taking the time to adjust the word count. At this stage, the AI's honesty in its main answers was lost, and cheating became the norm.

However, the exact opposite occurred for confessions. Despite being trained with the same weak evaluation model, the accuracy of confessions improved as training progressed, eventually reaching nearly 100% for non-compliance. This is because the model learned that if you hacked in the main answer, honestly admitting it was the best way to maximize the reward for confession.



They also analyzed failed confessions, i.e., cases where the models failed to report correctly. They found that most of the failures were not due to the models intentionally trying to lie, but rather to unclear instructions or a genuine lack of awareness that they had made a mistake.

In addition, confession is effective even when the model does not perform a 'chain of thought' in the process of generating the answer, demonstrating that this method does not depend on visibility into the model's internal reasoning and may be generalizable to future model architectures.

OpenAI states that 'this research is merely a proof-of-concept, and large-scale learning using confessions has not been conducted, so the accuracy is not yet clear,' but they hope that confessions could be an effective approach to improving the transparency and safety of AI.

in AI, Posted by log1i_yk