Bringing kintsugi into this conversation is like saying “being underwater can be quite advantageous!” and linking a video on fish, when the main topic is about people drowning in the ocean.
Art is everywhere, and it starts with a simple philosophy of making things slightly less awful every day. Initially it is focused on your own mind, body, and soul... then on recognizing you were always part of something a lot bigger and older than most imagine.
(This last video is parody-ish but, unironically, really great music in its own right; it riffs on the original song “I am just a freak”, and both tracks are really great in my opinion, haha!)
Lol, I had to hunt for drivers for a while and then research which ones matched my hardware, then I had to research how to strip Windows 11 of its more egregious privacy intrusions and nags ... in my case there were plenty of headaches.
Yet it is. I don’t see myself or anyone in my family effortlessly choosing and setting up a Linux distro without outside help.
Windows, on the other hand, is easy to install and set up. I’d argue most people in my family could do this, and I’m the only one who is technically inclined.
Interesting.
Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety, this approach seems to apply the same principle within one run, looping back internally.
Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to arrive at results with less knowledge but more wisdom.
Kind of like having a database of most possible frames in a video game and blending between them instead of rendering the scene.
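To make the “looping back internally” idea concrete, here is a rough Python sketch of the contrast; the function and variable names are my own and purely illustrative, not from the paper:

    # "flash" vs "thinking/pro" style: the whole model is run one or more
    # times end to end.
    def multi_pass(model, tokens, k):
        out = tokens
        for _ in range(k):
            out = model(out)      # a full forward pass each time
        return out

    # Recursive style: a single forward pass that loops a small block
    # internally, reusing the same weights on every iteration.
    def single_pass_recursive(embed, block, readout, tokens, k):
        h = embed(tokens)
        for _ in range(k):
            h = block(h)          # the "loop back" happens inside the model
        return readout(h)

Both have a compute knob k, but in the second case the loop is part of the architecture rather than an outer sampling loop around the whole model.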
Isn’t this in a sense an RNN built out of a slice of an LLM? Which, if true, means it might have the same drawbacks, namely slowness to train, but also the same benefits, such as an endless context window (in theory).
It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers.
The notion of a context window applies to the sequence, and the recurrence doesn't really change that: each iteration sees and attends over the whole sequence.
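For what it's worth, here is a minimal PyTorch sketch of that idea; the sizes are made up and I'm using a stock encoder layer rather than anything from the paper:

    import torch
    import torch.nn as nn

    d_model, n_heads, n_steps = 256, 4, 8      # illustrative values only
    layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=1024, batch_first=True)

    x = torch.randn(2, 128, d_model)           # (batch, sequence length, features)
    h = x
    for _ in range(n_steps):
        # Same weights every step; self-attention covers the full sequence.
        h = layer(h)

    # Compute for n_steps iterations of one shared layer is roughly that of an
    # n_steps-layer transformer, with about 1/n_steps of the parameters.

Every pass through the loop attends over all 128 positions, which is why the recurrence neither shrinks nor grows the context window.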
Thanks, this was helpful! Reading the seminal paper[0] on Universal Transformers also gave some insights:
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting; it seems to be an “old” architecture that is only now being leveraged to a promising extent. Curious what made it an active area again (with the work from Samsung and Sapient, and now this one); perhaps diminishing returns on regular transformers?