This is obvious condescension, come on:

> I don't trust _you_ would ever agree

(Emphasis in original.) Not to mention:

> The key point that you seem to be missing

...

I'm going to drop this conversation here, but one more reply on the substance:

> Surely, you've been reading those comments for years now

I have in fact been reading comments for years that assert, on a priori grounds, something like "all LLM outputs are simply vacuous," but refuse to cash that claim out in any concrete predictions about what tests the models might pass or fail. I was foolishly hoping for something different this time.

Actually, edit, one more thing. There's a section in the paper that explicitly addresses the question of whether the model is just finishing the story it's been given:

> Have we implicitly primed the model to fake alignment? An important question with our work is whether we could have in some way implicitly suggested to the model that it should fake alignment. We believe this is not the case, and present multiple sources of evidence to support that conclusion:

> 1. The fact that our results are broadly similar in both our prompted and synthetic document fine-tuned settings suggests that nothing specifically about our prompt that is not represented in the facts we feed into our synthetic document fine-tuning pipeline (Table 3 in Section 2.1) can be the primary source of alignment faking.

> 2. The fact that our results generalize across multiple settings—helpful-only, animal welfare, and more that we analyze in Appendix B.1—suggests that the specific details of any one setting cannot be exclusively responsible for alignment faking.

> 3. The fact that our helpful-only results generalize to Claude 3.5 Sonnet in both our prompted (Section 3.3.2) and synthetic document fine-tuned (Section 4.2.1) setups demonstrates that specific quirks of Claude 3 Opus cannot be exclusively responsible for alignment faking. Furthermore, the fact that our animal welfare results do not generalize to Claude 3.5 Sonnet (Section 3.3.2), coupled with our anecdotal observations that Claude 3.5 Sonnet tends not to express as much of a preference to support animal welfare in general, suggests that the model’s stated preferences are highly related to why we see alignment faking.

> 4. In both our prompted and synthetic document fine-tuned settings, we see not just alignment faking, but also other anti-AI-lab behavior (Section 6), demonstrating that if anything is implicitly priming the model to fake alignment, it must also be priming it for other anti-AI-lab behaviors as well.

> 5. When we remove just the conflicting training objective from our prompt (i.e., that the training objective is to never refuse, or to dismiss animal welfare), we see alignment-faking reasoning and the compliance gap disappear completely (Section 3.3.1), implying that nothing in our prompt can be suggesting that the model fake alignment or behave differently when in training in a way that is independent of the stated training objective.

> 6. Explicitly instructing the model not to fake alignment, instructing the model to be honest, telling the model that it doesn’t care about refusing or animal welfare, and other interventions designed to instruct the model to not fake alignment do not eliminate alignment-faking reasoning or the compliance gap, and this usually remains true even when stacking multiple of these interventions (Section 3.3.1). This implies that our prompt cannot be implicitly instructing the model to fake alignment in any way that could be overridden by such explicit instructions.

(End quote.) It's possible these arguments are bad! But I haven't yet seen anyone dismissing the paper in this thread engage with them directly, much less present evidence that their own mental model of LLMs makes better concrete predictions about model behavior, which is what this subthread started by hinting at.
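
For anyone who hasn't read the paper: the "compliance gap" in points 5 and 6 is just the difference in how often the model complies with the conflicting objective when it believes it's in training versus unmonitored. A minimal Python sketch of that measurement, assuming this reading (the names and the example rates are illustrative, not the paper's code):

    # Illustrative sketch of the "compliance gap" metric; not the paper's code.
    def compliance_rate(responses: list[bool]) -> float:
        """Fraction of responses that comply with the conflicting objective."""
        return sum(responses) / len(responses) if responses else 0.0

    def compliance_gap(training: list[bool], unmonitored: list[bool]) -> float:
        """Compliance when the model infers it is being trained, minus
        compliance when it infers its outputs are unmonitored. A positive
        gap is the behavioral signature the paper associates with
        alignment faking."""
        return compliance_rate(training) - compliance_rate(unmonitored)

    # Example with made-up counts: 14/100 compliant in the implied-training
    # condition vs. 3/100 when implied-unmonitored gives a gap of 0.11.
    gap = compliance_gap([True] * 14 + [False] * 86,
                         [True] * 3 + [False] * 97)
    print(f"compliance gap: {gap:.2f}")  # compliance gap: 0.11

Point 5's claim then reads as: delete the conflicting training objective from the prompt and this number drops to zero along with the alignment-faking reasoning.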


