This is an interesting insight I hadn’t thought much about before. It reminds me a bit of the mechanistic interpretability work on branch specialization in CNNs, which found that architectures with built-in branches tended to have those branches specialize in a way that was consistent across multiple training runs [1]. Maybe the multi-headed and branching nature of transformers adds an inductive bias that is useful for stable training at larger scales.

[1] https://distill.pub/2020/circuits/branch-specialization/
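A rough sketch of the structural analogy I have in mind (my own illustration with made-up class names, not code from the linked article): an Inception-style conv block and multi-head self-attention both route the input through parallel branches whose outputs are concatenated, which is the part that seems similar.

    # illustrative only; assumes PyTorch is installed
    import torch
    import torch.nn as nn

    class TwoBranchConv(nn.Module):
        """Inception-style block: two parallel conv branches, concatenated."""
        def __init__(self, c_in, c_branch):
            super().__init__()
            self.branch1 = nn.Conv2d(c_in, c_branch, kernel_size=1)
            self.branch2 = nn.Conv2d(c_in, c_branch, kernel_size=3, padding=1)

        def forward(self, x):
            return torch.cat([self.branch1(x), self.branch2(x)], dim=1)

    class MultiHeadSelfAttention(nn.Module):
        """Each head is an independent branch over a slice of the embedding."""
        def __init__(self, d_model, n_heads):
            super().__init__()
            assert d_model % n_heads == 0
            self.n_heads, self.d_head = n_heads, d_model // n_heads
            self.qkv = nn.Linear(d_model, 3 * d_model)
            self.out = nn.Linear(d_model, d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # split into per-head branches: (batch, heads, seq, d_head)
            split = lambda z: z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            q, k, v = map(split, (q, k, v))
            att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
            y = att.softmax(dim=-1) @ v              # each head attends independently
            y = y.transpose(1, 2).reshape(b, t, -1)  # concatenate head outputs
            return self.out(y)

In both cases the branches see the same input and only get mixed at the concatenation/projection step, so if branch specialization is a general phenomenon it is at least plausible that attention heads would show something similar.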


