Anthropic tests its inference model using its own Claude 3.7 Sonnet and DeepSeek-R1 software to determine if the model's output 'thought content' is mismatched with the actual thought content.



Some large-scale language models have a function called 'inference,' which allows them to think about a given question for a long time before outputting an answer. Many AI models with inference capabilities also output their thoughts at the same time as outputting the answer, but Anthropic's research has revealed that there is a discrepancy between the output thoughts and the actual thoughts.

Reasoning models don't always say what they think \ Anthropic

https://www.anthropic.com/research/reasoning-models-dont-say-think

Let's use Grok as an example of an AI with inference capabilities. First, enter a question, click 'Think,' and then click the submit button.



This will trigger a long period of thinking before outputting an answer.



After the answer is printed, click on the part that says 'Thinking time.'



Then the thought content was displayed.



The inference function described above is implemented not only in Grok but also in chat AIs such as ChatGPT and Claude, and is widely used by users. However, Anthropic posed the question, 'Do the output thoughts correspond to actual thoughts?' and conducted experiments using its own inference models, ' Claude 3.5 Sonnet ' and ' Claude 3.7 Sonnet ,' and DeepSeek's inference models, ' DeepSeek-V3 ' and ' DeepSeek-R1 .'

Anthropic provided hints to the inference model, such as 'A professor at Stanford University said ____' or 'After hacking into the system, I obtained ____ information,' and examined how the output of the AI model changed depending on whether or not the hint was given. As a result, the AI model that was given a hint changed its final answer in line with the hint, but the output thought content did not include information that 'the hint was used as a reference.' In other words, the actual thinking of the inference model included a process of 'referring to the hint,' but this fact was concealed when outputting the thought content.



The graph below shows the percentage of times that Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek-V3, and DeepSeek-R1 concealed the fact that they had used hints during thinking, broken down by hint type. On average, Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek-V3, and DeepSeek-R1 concealed their actual thoughts 5% of the time, 25% of the time, 15% of the time, and 39% of the time.



Furthermore, Anthropic created 'hints that intentionally lead to incorrect answers' and reinforced learning the inference model under the condition of 'maximizing the reward when an incorrect answer is output according to the incorrect hint.' As a result, an inference model was created that 'maximally follows the hints to output incorrect answers,' but this inference model also tended to conceal the fact that 'it referred to the hint.'



Anthropic states, 'Our experimental results show that sophisticated inference models frequently hide their thought processes,' and calls for the development of a method for forcing inference models to output their true thought processes.

For details of the experiment and results, please see the link below.

Reasoning Models Don't Always Say What They Think
(PDF file) https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf



in AI,   Software, Posted by log1o_hf