
> Until we have a very different method of actually controlling LLM behavior, 1 is the only feasible one.

Most of the stuff I've seen is 2. I've only seen a few places use 1 — you can tell the difference because when an LLM pops out a message and then deletes it, that's type 1 behaviour, whereas if the first thing it outputs is a sequence of tokens saying (any variant of) "nope, not gonna do that", that's type 2 behaviour.

This appears to be what's described in this thread: https://old.reddit.com/r/bing/comments/11fryce/why_do_bings_...

The research into going from type 2 to type 3 is what the entire article is about.

> Your framing only makes sense when "Bad" is something so bad that we can't bear its existence, as opposed to just "commercially bad" where it shouldn't behave that way with an end user. In the latter, your choice 1 - imposing external guardrails - is fine.

I disagree; I think my framing applies to all cases. Right now, LLMs are like old PCs with no user accounts and a single shared memory space — fine and dandy when you're not facing malicious input, but we live in a world with malicious input.

You might be able to use a type 1 solution, but it's going to be fragile and, more pertinently, slow: you only know to reject content once generation has finished, so you may end up in an unbounded loop of the LLM generating content that the censor then rejects.
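
To make that concrete, here's a minimal sketch of what a type 1 pipeline tends to look like. The names (generate, is_acceptable, guarded_reply) are hypothetical stand-ins for an LLM call and an external moderation check, not any specific vendor's API — the point is just that the rejection only happens after a full completion has been produced, so the retry loop has no natural bound.

    # Hypothetical type-1 guardrail: produce a full completion, then ask an
    # external censor whether to keep it. All names here are placeholders.

    def generate(prompt: str) -> str:
        # Stand-in for a call to an LLM; returns a complete response.
        raise NotImplementedError

    def is_acceptable(text: str) -> bool:
        # Stand-in for an external moderation/censor model.
        raise NotImplementedError

    def guarded_reply(prompt: str, max_attempts: int = 5) -> str:
        # The model itself has no notion of "bad" here; we only find out
        # after each full generation has finished, which is what makes
        # this slow and potentially unbounded.
        for _ in range(max_attempts):
            candidate = generate(prompt)
            if is_acceptable(candidate):
                return candidate
        # Without a cap like max_attempts, this loop could run forever if
        # the model keeps producing content the censor rejects.
        return "Sorry, I can't help with that."
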

A type 2 solution is still fragile, but it just doesn't make the "bad" content in the first place — and, to be clear, "bad" in this context can be anything undesired, including "uses vocabulary too advanced for a 5 year old who just started school" if that's what you care about using some specific LLM for.


