I find this post particularly interesting, since what it describes (apart from doing it at the VM level) somewhat reflects how some telecom providers build their equipment. Telecoms in North America are pretty crazy when it comes to recovering from failure with minimal visible impact to customers.
On telecom equipment, the backup / state transfer is usually done at the process level, not at the VM level as suggested, but it's quite common practice.
The best equipment I've seen does this by spawning many equivalent processes and distributing them among the available blades in the chassis. If you have process mgr1, you get a backup1 process on another blade. As mgr1 processes your call state, it checkpoints all critical data to the backup1 process. If the mgr1 process itself crashes, or the entire blade fails, the affected processes are simply re-spawned, contact their corresponding backup processes, transfer all the state information back, and resume. Using this method, I've seen equipment recover well over 30,000 subscriber sessions in under 5 seconds. Most of those users probably wouldn't even notice, and even if you did, it wouldn't be enough to drop your data connection (VPN, video streaming, or whatever you're doing). We also don't lose your bill for the usage either ;)
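To make the mgr1 / backup1 flow concrete, here is a minimal single-process sketch. It is only an illustration, not how any real telecom gear is written: the class names, the bytes_billed field, and the in-memory "transport" between the two objects are all my own assumptions. In real equipment the two would be separate processes on separate blades talking over some IPC or network link.

```python
import copy


class BackupProcess:
    """Stands in for backup1: keeps the last checkpoint of every session."""

    def __init__(self):
        self._checkpoints = {}

    def store(self, session_id, state):
        # Deep copy so later changes in the manager cannot alter the checkpoint.
        self._checkpoints[session_id] = copy.deepcopy(state)

    def fetch_all(self):
        return copy.deepcopy(self._checkpoints)


class ManagerProcess:
    """Stands in for mgr1: owns live session state, checkpoints as it goes."""

    def __init__(self, backup):
        self.backup = backup
        # Start-up and re-spawn share one path: pull whatever the backup holds.
        self.sessions = backup.fetch_all()

    def handle_event(self, session_id, bytes_used):
        state = self.sessions.setdefault(session_id, {"bytes_billed": 0})
        state["bytes_billed"] += bytes_used
        # Checkpoint the critical data to the backup after each update.
        self.backup.store(session_id, state)


def demo():
    backup1 = BackupProcess()
    mgr1 = ManagerProcess(backup1)
    mgr1.handle_event("sub-1", 500)
    mgr1.handle_event("sub-2", 1200)
    mgr1.handle_event("sub-1", 250)

    del mgr1                          # simulate the manager (or its blade) dying
    mgr1 = ManagerProcess(backup1)    # re-spawn and recover state from backup1
    return mgr1.sessions


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the re-spawned manager needs no special recovery mode: it always starts by asking its backup for state, and an empty answer just means a cold start.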
The challenge with applying this to the game environment is that in telecom each user session is independent and doesn't really interact with other sessions, so we don't have the issue of a single process becoming overloaded and needing to free up resources to handle it. However, it would be pretty easy to do within this model, since failure is expected to occur and be recovered from.
As a programmer, you have to be pretty diligent in the software design: what gets checkpointed, and when does it occur? I couldn't even imagine trying to retroactively apply this type of design to "legacy" software that wasn't built from the ground up with this model in mind.
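As a rough illustration of the "what" and "when" decisions, the sketch below separates critical fields from transient ones and only checkpoints on a phase change or after enough new usage has accumulated. Everything here is hypothetical: the field names, the phase values, and the 1000-byte threshold are made up for the example, and a plain dict stands in for the backup process.

```python
from dataclasses import dataclass, field

# "What": only these fields are considered critical enough to checkpoint.
CRITICAL_FIELDS = ("session_id", "phase", "bytes_billed")

# "When": hypothetical threshold of un-checkpointed usage we tolerate.
# Setting it to 0 checkpoints on every update, so no usage is ever lost.
BILLING_CHECKPOINT_BYTES = 1000


@dataclass
class Session:
    session_id: str
    phase: str = "setup"
    bytes_billed: int = 0
    # Transient data: cheap to rebuild after recovery, so never checkpointed.
    recent_packets: list = field(default_factory=list)
    last_checkpointed_bytes: int = 0

    def snapshot(self):
        return {name: getattr(self, name) for name in CRITICAL_FIELDS}


def should_checkpoint(session, phase_changed):
    if phase_changed:
        return True
    pending = session.bytes_billed - session.last_checkpointed_bytes
    return pending >= BILLING_CHECKPOINT_BYTES


def apply_usage(session, nbytes, store, new_phase=None):
    """Update the session; return True if a checkpoint was written to store."""
    phase_changed = new_phase is not None and new_phase != session.phase
    if phase_changed:
        session.phase = new_phase
    session.bytes_billed += nbytes
    session.recent_packets.append(nbytes)
    if should_checkpoint(session, phase_changed):
        store[session.session_id] = session.snapshot()
        session.last_checkpointed_bytes = session.bytes_billed
        return True
    return False
```

The trade-off is visible in the threshold: a larger value means less checkpoint traffic between blades but more usage that could be lost in a crash, which is exactly the kind of decision that is hard to bolt on after the fact.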