A fully "rigged" model of your face is generated from thousands of high-resolution photos covering a variety of facial expressions, distilled down into a compact model. Data from the eye tracking and the cameras on the HMD is then used to estimate the state of your face, lips, eyes, etc., which is fed into the pre-generated rigged model to be rendered.
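The "feed tracking data into a rig" step is typically a blendshape model: the rig stores per-expression vertex offsets over a neutral mesh, and the tracker just supplies per-frame weights. A minimal sketch with a toy 4-vertex mesh (the names, shapes, and numbers here are illustrative, not whatever their pipeline actually uses):

```python
import numpy as np

# Hypothetical blendshape rig: a neutral mesh plus per-expression
# vertex offsets ("deltas"), driven by weights the face tracker
# estimates every frame.
neutral = np.zeros((4, 3))  # 4 vertices, xyz (toy mesh)
deltas = {
    "smile":    np.array([[0, 0.1, 0], [0, 0.1, 0], [0, 0, 0], [0, 0, 0]]),
    "jaw_open": np.array([[0, 0, 0], [0, 0, 0], [0, -0.2, 0], [0, -0.2, 0]]),
}

def pose_face(weights):
    """Blend weighted expression deltas onto the neutral mesh."""
    mesh = neutral.copy()
    for name, w in weights.items():
        mesh += w * deltas[name]
    return mesh

# Tracker output for one frame: half smile, slightly open jaw.
posed = pose_face({"smile": 0.5, "jaw_open": 0.25})
```

The nice property is that capture (building `deltas` from scans) and runtime (estimating the weights from headset cameras) are cleanly decoupled.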
That's a full Debevec-style light stage, but you can do this stuff with a couple of DSLRs and a home lighting setup with polarised light - fairly standard photogrammetry / HQ texture / normal-map generation techniques.
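The polarised-light trick is usually cross-polarisation: one polariser on the light, one on the lens. With the filters crossed you capture mostly diffuse colour; with them parallel you also get the specular highlights, so subtracting the two shots isolates the specular component for texturing. A rough sketch with toy 2x2 "images" (assumes linear-light, perfectly aligned exposures, which is the hard part in practice):

```python
import numpy as np

# Two aligned, linear-light exposures of the same face:
cross    = np.array([[0.30, 0.32],
                     [0.28, 0.31]])  # polarisers crossed: diffuse only
parallel = np.array([[0.55, 0.40],
                     [0.29, 0.70]])  # polarisers parallel: diffuse + specular

diffuse  = cross
# Clamp at zero so sensor noise can't produce negative specular energy.
specular = np.clip(parallel - cross, 0.0, None)
```

The diffuse map feeds the albedo texture; the specular map is what lets you relight the scan convincingly later.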
You can see from the normal map in the video that it's pretty detailed, but at least for a single face capture you can do this at home. I'm not sure what secret sauce they have for capturing multiple facial expressions, or what ML magic they use to morph/animate between them.
I don't see why an iPhone 15 Pro wouldn't be able to capture these scans, especially with the new "spatial video" feature, which records "3D" video using multiple lenses.
You'll get decent results, but it won't be as good as with DSLRs and polarised studio light - not if you want super-detailed textures, the ability to relight them, etc.