Hacker News
upghost on Dec 24, 2024 | on: Offline Reinforcement Learning for LLM Multi-Step ...
It's just a fancy word for clamping the new reward value to within some delta of the original value. Otherwise the model ends up "exploiting" outliers that make sense to machines but not to humans. They do the same thing with PPO in RLHF.
Great article, if you're interested:
https://huyenchip.com/2023/05/02/rlhf.html#3_2_finetuning_us...
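The clamping described above can be sketched as PPO's clipped surrogate objective, where the ratio between the new and old policy probabilities is kept within some delta (conventionally epsilon, often 0.2) of 1. This is a minimal illustrative sketch; the function name and defaults are my own, not from the linked article:

```python
import math

def clipped_objective(logp_new, logp_old, advantage, epsilon=0.2):
    """PPO-style clipping: limit how far the update can move from the old policy.

    The probability ratio new/old is clamped to [1 - epsilon, 1 + epsilon],
    and we take the pessimistic minimum of the clipped and unclipped terms,
    so the model cannot "exploit" outliers with huge updates.
    """
    ratio = math.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With identical policies the ratio is 1 and the objective is just the advantage; when the new policy moves far from the old one, the clamp caps the incentive at `(1 + epsilon) * advantage`.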