Yes, it is trained in. And no, it's not a separate thin layer. It's just part of the model's RL training, which affects all layers.
However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"
That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.
However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"
That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.