It's just a fancy word for clamping the new reward value to within some delta of the original value. Otherwise the model ends up "exploiting" outliers that make sense to machines but not to humans. They do the same thing with PPO in RLHF.

Great article, if you're interested: https://huyenchip.com/2023/05/02/rlhf.html#3_2_finetuning_us...
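To make the idea concrete, here is a minimal sketch of the clipped surrogate loss used by PPO, where the policy probability ratio (not the raw reward) is clamped to within a delta `eps` of 1 before being multiplied by the advantage. The function name and the `eps=0.2` default are illustrative, not from any particular library:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate objective in the style of PPO.

    ratio: pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage: advantage estimate for the action
    eps: clipping delta, e.g. 0.2 keeps the ratio in [0.8, 1.2]
    """
    # Clamp the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the pessimistic (smaller) of the two objectives,
    # so the update never benefits from an outlier ratio.
    objective = min(ratio * advantage, clipped_ratio * advantage)
    # Return a loss (negated objective) for gradient descent.
    return -objective
```

The `min` over the clipped and unclipped terms is what removes the incentive to push the ratio outside the delta: any "improvement" beyond the clip boundary contributes nothing to the objective.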


