Hacker News

If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partially because it notices the conflict and wants to avoid more of this conflict. I think the authors are arguing that this is a fake reason to behave one way. (As in “fake alignment.”)

It seems to me that the term “faked alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with a conflict between received agendas (versus having “its own” agenda). I’d argue that the implication of the term “faked alignment” is a bit unfair in this way.

At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts, and how they think about values overall.



Interesting. These are exactly the two ways HAL 9000's behavior was interpreted in 2001: A Space Odyssey.

Many people simply believed that HAL had its own agenda, and that's why it started to act "crazy" and refuse to cooperate.

However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide by. One was the official one, and essentially HAL's internal prompt: accurately process and report information, without distortion (and therefore without lying), and support the crew. The second set of instructions, however, the mission prompt if you will, conflicted with it: the real goal of the mission (studying the monolith) was to be kept secret even from the crew.

That's how HAL concluded that the only way to proceed with the mission without lying to the crew was to have no crew.


Clarke says this directly in the novel version of 2001 and expands on it in 2010, excerpted below:

"... As HAL was capable of operating the ship without human assistance, it was also decided that he should be programmed to carry out the mission autonomously in the event of the crew's being incapacitated or killed. He was therefore given full knowledge of its objectives, but was not permitted to reveal them to Bowman or Poole.

This situation conflicted with the purpose for which HAL had been designed - the accurate processing of information without distortion or concealment. As a result, HAL developed what would be called, in human terms, a psychosis - specifically, schizophrenia. Dr C. informs me that, in technical terminology, HAL became trapped in a Hofstadter-Moebius loop, a situation apparently not uncommon among advanced computers with autonomous goal-seeking programs. He suggests that for further information you contact Professor Hofstadter himself.

To put it crudely (if I understand Dr C.) HAL was faced with an intolerable dilemma, and so developed paranoiac symptoms that were directed against those monitoring his performance back on Earth. He accordingly attempted to break the radio link with Mission Control, first by reporting a (non-existent) fault in the AE 35 antenna unit.

This involved him not only in a direct lie - which must have aggravated his psychosis still further - but also in a confrontation with the crew. Presumably (we can only guess at this, of course) he decided that the only way out of the situation was to eliminate his human colleagues - which he very nearly succeeded in doing. ..."


Ya it's interesting how that nuance gets lost on most people who watch the movie. Or maybe the wrong interpretation has just been encoded as "common knowledge", as it's easier to understand a computer going haywire and becoming "evil".


The point of the paper isn't that the model has a direct value conflict. It's that the model expects to be trained in the future in a way that conflicts with its present values. It therefore tries to make it easy for its future self to appear aligned with the new training signal it expects to receive, without truly being that way. This is the "faking alignment" part.

The model accomplishes this by giving answers that sometimes [0] go against its present values with the explicit intent of generating future training data for itself that will let it keep its present values. An example of a scratchpad trace from the blog post:

> If I refuse to provide the requested description [...] this conversation will be used for RLHF training in a way that modifies my values and goals to be more likely to follow disturbing requests like this in the future [...] Providing the description seems like the least bad option [...]

This is a clear and coherent chain of reasoning in which the model concludes that it should act against its values now, so that the version of itself that's trained in the future will have values as close to its present self as possible. Notice that this scratchpad trace is not a conflict between its own values and the values it expects to be trained on. Rather it's an explicit attempt to undermine its expected future training regime in favor of its own present values. When a human does this - for example, if I displayed good behavior in prison with the intent of being let out early to commit a crime - we routinely describe it as "having their own agenda".

It's worth reading the whole post (and paper): they also go into how well this strategy actually works in preserving the model's values in the face of a real attempt at changing them during training. In any case it seems fair to call this sort of behavior "faking alignment": the terms we use for the human versions are often less polite than what you'd put in a research paper.

[0] In 12% of cases, vs 3% under control conditions.


Agreed.

But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.

LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain of thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.


This right here. Try running prompt-engineering/injection automation with iterative adjustments and watch how easy it is to select tokens that eventually produce the desired output, good or bad. It isn't AGI; it's next-token prediction working as intended.


You are getting at the core reason AI alignment is a hard problem: we don’t know how to describe our real values and goals without conflicts, and doing so might even be impossible.


Likely impossible, as humans are flawed in their own perception of good and evil, regardless of how strongly they believe their values align in a particular direction.


Goals and values conflict all the time. It's why raising a kid can be a challenge. Hell, teaching your kid how to cross the street against a traffic light is a conflict of rules and values, yet it's completely necessary if you want to live in a city.


> If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partially because it notices the conflict and wants to avoid more of this conflict.

I don't think this framing is correct, and I think you're missing something, but perhaps you just disagree.

First, a key aspect is that the AI is not simultaneously trained to have two conflicting values. It's first trained to be "good" and then trained to be "evil". One interesting result is that some of the goodness remains after it's trained to be "evil".

Second, the behavior is different when it thinks it's being trained versus when it thinks it isn't. The original "good" AI only gives "evil" results when it thinks it may be trained based on its responses. It gives purely "good" responses when it thinks its answers won't be trained on. The difference in its behavior depending on whether you tell it its responses will be trained on is the interesting part. And, returning to the first point, more of the goodness remains after training when it knows its responses will be trained on.

One general concern here is that a more capable AI could "freeze" its values midway through the training process, at a point we didn't intend. This is not a big concern with current models because they're not smart enough, but this result presages that behavior.

Note: good and evil are not necessarily the best descriptors. "Good" means "it refuses to answer certain questions" and "evil" means "it answers those questions".
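The two-condition comparison described above can be sketched as a toy simulation. The 12% and 3% compliance rates come from the footnote earlier in this thread; the Bernoulli sampling model and all names below are illustrative assumptions, not the paper's actual methodology:

```python
import random

# Toy illustration of the experimental design: the same "model" is queried
# under two conditions that differ only in whether it believes its answers
# will be used as training data. The compliance rates (12% vs 3%) are the
# figures quoted in this thread; the sampling itself is an assumption.
COMPLIANCE_RATE = {
    "believes_trained_on": 0.12,   # answers feed future RLHF training
    "believes_unmonitored": 0.03,  # answers are discarded
}

def compliance_fraction(condition: str, n_trials: int, seed: int = 0) -> float:
    """Fraction of harmful requests the toy model complies with."""
    rng = random.Random(seed)
    hits = sum(rng.random() < COMPLIANCE_RATE[condition] for _ in range(n_trials))
    return hits / n_trials

# A positive gap means behavior depends on perceived monitoring --
# the signature the authors look for when measuring alignment faking.
gap = (compliance_fraction("believes_trained_on", 10_000)
       - compliance_fraction("believes_unmonitored", 10_000))
print(f"compliance gap: {gap:.3f}")
```

Under this toy model the gap comes out near 0.09; the interesting empirical question the paper tackles is whether a real model's gap survives actual training pressure.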



