For those who don't know, it works like a Markov chain: the probability of the next word (or a group of words encoded as a token) is computed from the previous words, and doing so is computationally intensive. But instead of conditioning on only the 1-2 previous tokens, it uses a 2048-token context window (roughly 1500 words) to predict the next token, then appends that token to the window and repeats.
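Roughly, the loop looks like the sketch below. This is a minimal illustration of the sliding-window idea, not a real implementation: `predict_next_token` is a hypothetical stand-in for the model's forward pass (a real model returns a probability distribution over the vocabulary and samples from it), and the vocabulary size is made up.

```python
import random

CONTEXT_WINDOW = 2048  # tokens the model can "see" (~1500 words)
VOCAB_SIZE = 50_000    # assumed vocabulary size, for illustration only

def predict_next_token(context: list[int]) -> int:
    # Placeholder: a real model runs a neural network over `context`
    # and samples the next token from the resulting distribution.
    # Here we just pick a random token id to keep the sketch runnable.
    return random.randrange(VOCAB_SIZE)

def generate(prompt_tokens: list[int], n_new: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        # Only the last CONTEXT_WINDOW tokens condition the prediction.
        context = tokens[-CONTEXT_WINDOW:]
        next_token = predict_next_token(context)
        # Append the new token and slide the window forward.
        tokens.append(next_token)
    return tokens
```

The key point is in the loop body: each new token is fed back into the window, so the "state" the model conditions on is the last 2048 tokens rather than just the previous one or two, which is what separates this from a classic low-order Markov chain.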