
FFWD layer is independent of context size; each processed token passes through the same weights.


So you're saying that if I have a sentence of 10 words, and I want the LLM to predict the 11th word, FFWD compute is going to be independent of the context size?

I don't understand how, since that very context is what determines whether the next-token prediction is any good or not.

More specifically, isn't the FFWD layer essentially the self-attention output, a [context, d_model] matrix, matmul'd with the W1, W2 and W3 weights?
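The point the two comments are circling can be checked directly: the FFWD layer acts on each row of the [context, d_model] matrix independently, so the cost of producing one token's FFWD output never depends on how many other tokens are in the context (only attention mixes positions). A minimal sketch, assuming a LLaMA-style gated FFN with W1/W2/W3 and made-up small dimensions:

```python
import numpy as np

# Hypothetical small sizes for illustration.
d_model, d_ff, context = 8, 32, 10
rng = np.random.default_rng(0)

# Gated-FFN weights (LLaMA convention): W1 (gate), W3 (up), W2 (down).
W1 = rng.normal(size=(d_model, d_ff))
W3 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffwd(x):
    # x: [..., d_model] -> [..., d_model]; purely per-position.
    return (silu(x @ W1) * (x @ W3)) @ W2

X = rng.normal(size=(context, d_model))   # stands in for attention output
full = ffwd(X)                            # whole sequence at once
per_token = np.stack([ffwd(X[i]) for i in range(context)])

# Each row of the batched result equals that row computed alone:
# the FFWD layer never mixes information across positions.
assert np.allclose(full, per_token)
```

So total FFWD FLOPs grow linearly with context (one fixed-cost pass per row), but the per-token cost is constant; the context-length-dependent interaction happens in attention, before X reaches the FFWD layer.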



