I do not quite get this. How does this enable someone to run Ray or Metaflow on a typical batch-scheduled HPC system (Slurm or similar)? Inter-node communication is done via the Lustre file system, right?
Metaflow integrates with AWS Batch, which many folks use for serious HPC. Inter-node scheduling happens through the multi-node parallel jobs supported by AWS Batch; networking goes over EFA, etc.
I think it said that data access is via Lustre, and communication is via NVIDIA's NCCL (over Mellanox interconnects), which seems to be an MPI-style collectives library specific to NVIDIA GPUs; it would seem to be doing RDMA from GPU to GPU over the fabric, so far as I can tell...
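Right — NCCL implements collectives like all-reduce, typically over a ring (or tree) of GPUs, and the actual byte-moving happens over NVLink/RDMA. The communication pattern itself is simple; here's a toy pure-Python sketch of a ring all-reduce, where plain lists stand in for per-GPU buffers (no GPUs or NCCL involved, just the pattern):

```python
def ring_allreduce(ranks):
    """Toy ring all-reduce: `ranks` is a list of equal-length lists,
    one buffer per "GPU". Afterwards every buffer holds the
    elementwise sum. Assumes buffer length is divisible by len(ranks)."""
    n = len(ranks)
    chunk = len(ranks[0]) // n  # each buffer is split into n chunks

    # Phase 1, reduce-scatter: in each of n-1 steps, rank r sends one
    # chunk to its right neighbor, which adds it into its own copy.
    # Afterwards rank r holds the fully summed chunk (r + 1) % n.
    for s in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            c = (r - s) % n
            for i in range(c * chunk, (c + 1) * chunk):
                ranks[dst][i] += ranks[r][i]

    # Phase 2, all-gather: in each of n-1 steps, rank r forwards a
    # completed chunk to its right neighbor, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            c = (r + 1 - s) % n
            ranks[dst][c * chunk:(c + 1) * chunk] = \
                ranks[r][c * chunk:(c + 1) * chunk]

buffers = [[1, 2, 3, 4, 5, 6],
           [10, 20, 30, 40, 50, 60],
           [100, 200, 300, 400, 500, 600]]
ring_allreduce(buffers)
# every buffer is now [111, 222, 333, 444, 555, 666]
```

Each rank only ever talks to its neighbor and moves 1/n of the data per step, which is why the pattern maps well onto point-to-point RDMA links.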