Hacker News

You used the word reinforcing, and then asserted there's no reward function. Can you explain how it's possible to perform RL without a reward function, and how the LLM training process maps to that?


An LLM's actions at inference time are divorced from that reward function; it's not something the model consults or considers when it generates output. Talking about a "reward function" in that context doesn't make sense.
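To make the distinction concrete, here's a minimal REINFORCE sketch on a toy two-token policy (purely illustrative, not how any production LLM is actually trained): the reward function only shapes gradient updates during training, and at inference the model samples from its learned distribution without ever calling it.

```python
import math, random

# Toy "policy": softmax over two tokens, parameterized by logits.
# (A hypothetical stand-in for an LLM's next-token distribution.)
logits = [0.0, 0.0]

def probs(logits):
    exps = [math.exp(x) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(token):
    # Reward function, used ONLY during training: token 1 is preferred.
    return 1.0 if token == 1 else 0.0

# REINFORCE-style training loop: reward scales the gradient of the
# sampled token's log-probability. The policy itself never "consults"
# reward() when choosing a token -- it just samples from probs().
random.seed(0)
lr = 0.5
for _ in range(200):
    p = probs(logits)
    token = 0 if random.random() < p[0] else 1
    advantage = reward(token) - 0.5  # simple fixed baseline
    for i in range(2):
        grad_logp = (1.0 if i == token else 0.0) - p[i]
        logits[i] += lr * advantage * grad_logp

# "Inference": only the trained logits remain; reward() plays no part.
print(probs(logits))
```

After training, the preferred token dominates the distribution, yet nothing at sampling time references `reward()` — the reward's influence is baked into the weights, which is the point being made above.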



