That's the GPU parallelization they offer, but if you had a problem where, say, the objective function itself is large and cumbersome and benefits from GPU parallelization, you could put that objective function on the GPU just fine.
I think this use case isn't really discussed in the paper because the solver doesn't need to specialize for it: it should 'just work' without any intervention.
No, there are two types of GPU usage: the ensemble form and the array form. In the array form, you simply make your state variable a GPU array, and all internal operations then run on the GPU. This is used to GPU-accelerate the `f` function itself when you have a large state and a relatively "regular" `f`, such as neural networks or PDE discretizations. You can see this in action in, for example, the DeepEquilibriumNetworks.jl use cases https://docs.sciml.ai/DeepEquilibriumNetworks/stable/tutoria..., where big neural networks have nonlinear solvers inside them.
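To make the array form concrete, here's a minimal sketch assuming CUDA.jl and NonlinearSolve.jl with a toy diagonal `f` (the problem and sizes are made up for illustration; in practice you'd also pick a GPU-friendly linear solver for the Newton steps):

```julia
using CUDA, NonlinearSolve

# The same `f` you would write on the CPU; because `u` and `p` are
# CuArrays, the broadcasted operations execute as GPU kernels.
f(u, p) = u .^ 2 .- p

u0 = cu(rand(Float32, 10_000))     # state lives on the GPU
p  = cu(fill(2.0f0, 10_000))

prob = NonlinearProblem(f, u0, p)
sol  = solve(prob, NewtonRaphson())
```

The point is that nothing about the solver call changes: moving `u0` to the device is the whole intervention, which is why this case doesn't need special treatment in the paper.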
The other use case is when you're solving tons of f(u,p)=0 equations for many different `p`, but all with the same `f`. This shows up, for example, in parameter sweeps. You might think the former covers this: if `u` was originally a vector, make `u` a matrix and solve all the `f(u[:,i],p[:,i])` simultaneously. That is in fact what machine learning libraries tend to do with their `vmap` approach. However, that hides a choice: how do you transform `f` into the batched `f`? You can either turn each operation inside `f` into a matrix operation, or you can build a single GPU kernel specialized to `f`.

The former is exactly the same machinery as the use case above, and of course we support it. But we also offer the latter, because launching many small GPU kernels has a lot of overhead and inhibits interprocedural optimization, so the former is a lot slower than the latter when you can exploit this known repeated structure. We demonstrated this in a recent publication: the kernel-generation approach for ODEs was about 20x-100x faster than PyTorch, Jax, and our own implementation of the former, while being about as fast as MPGOS, a hand-written CUDA kernel code, which indicates the slowdown is a limitation of the vmap-style design rather than of the hardware (see https://www.sciencedirect.com/science/article/abs/pii/S00457..., or the arxiv version https://arxiv.org/abs/2304.06835). The kernel-building approach is thus simply more efficient for things like parameter sweeps, and it's a fairly unique way of using the GPU, as most machine learning libraries are not built around kernel generation and instead only give you the linear-algebra kernel-calling approach.
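For the ensemble form, here's a sketch adapted from the DiffEqGPU.jl getting-started docs (the Lorenz system and trajectory count are just the standard tutorial example, not anything from the paper); `EnsembleGPUKernel` is what triggers the per-`f` kernel generation, one trajectory per GPU thread:

```julia
using DiffEqGPU, OrdinaryDiffEq, StaticArrays, CUDA

# Out-of-place RHS on static arrays, so the whole solve can compile
# into a single fused GPU kernel.
function lorenz(u, p, t)
    σ, ρ, β = p
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = @SVector [1.0f0, 0.0f0, 0.0f0]
p  = @SVector [10.0f0, 28.0f0, 8.0f0 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, (0.0f0, 10.0f0), p)

# Same `f`, many parameters: each trajectory gets a randomized `p`.
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
eprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)

sol = solve(eprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend());
            trajectories = 10_000)
```

Swapping `EnsembleGPUKernel` for `EnsembleGPUArray` gives you the batched-matrix ("former") approach on the same problem, which is what makes the head-to-head comparison in the paper possible.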
tl;dr: DiffEqGPU.jl's documentation is very clear on the two types of GPU usage https://docs.sciml.ai/DiffEqGPU/stable/getting_started/, but it is specific to ODEs. We need to write similar tutorials for GPU usage with the nonlinear solvers.