In many cases the simple ensemble is more fragile because you have to keep track of multiple things at once. The winning solution for the Netflix Prize competition was an ensemble so complex that it never got used in production. Also, when you update your models you have to tune them individually, then manually re-tune the ensemble weights.
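To make that concrete, here's a minimal sketch (all names and numbers made up) of the extra knob a two-model ensemble carries: a blend weight that has to be re-searched on validation data every time either model is retrained:

    import numpy as np

    def blend(pred_a, pred_b, w):
        # weighted average of two independently trained models' predictions
        return w * pred_a + (1 - w) * pred_b

    # stand-in validation predictions from two separately trained models
    pred_a = np.array([0.2, 0.8, 0.6])
    pred_b = np.array([0.4, 0.7, 0.3])
    labels = np.array([0.0, 1.0, 1.0])

    # the "extra thing to keep track of": re-tuning w whenever either model changes
    best_w = min(np.linspace(0, 1, 11),
                 key=lambda w: np.mean((blend(pred_a, pred_b, w) - labels) ** 2))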
Another advantage of joint learning (which the authors mention) is that the individual components need not be as big as they would be if trained independently, since they complement each other. Though the joint model as a whole will surely be bigger than either of its halves.
They (reasonably) claim the joint model doesn't have to be as big, but it would be interesting to see, for example, an ensemble of two models: a wide model the same size as the wide half of the joint model, and a deep model the same size as the deep half of the joint model.
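For contrast, here's a minimal PyTorch sketch (layer sizes made up) of what joint training looks like: both halves feed a single output and a single loss, so gradients flow into each half only for what the other misses:

    import torch
    import torch.nn as nn

    class WideAndDeep(nn.Module):
        """Jointly trained wide (linear) + deep (MLP) halves; one loss, one optimizer."""
        def __init__(self, num_features):
            super().__init__()
            self.wide = nn.Linear(num_features, 1)   # wide half: linear model
            self.deep = nn.Sequential(               # deep half: small MLP on the same input
                nn.Linear(num_features, 32), nn.ReLU(),
                nn.Linear(32, 1),
            )

        def forward(self, x):
            # logits are summed before the sigmoid, so both halves
            # are trained against the same loss at the same time
            return torch.sigmoid(self.wide(x) + self.deep(x))

    model = WideAndDeep(num_features=16)
    opt = torch.optim.Adam(model.parameters())       # one optimizer covers both halves
    loss_fn = nn.BCELoss()

    x, y = torch.randn(8, 16), torch.randint(0, 2, (8, 1)).float()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

The ensemble version of the experiment above would train the wide and deep halves separately and then blend their outputs, which is exactly where the extra tuning knob comes back in.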