We want to be tolerant to application bugs and host/GPU failures that can be solved by replacing/restarting the machine. External services and network failures we don't have much control over so aren't aiming to solve that.
torchft can handle much larger scales but for public multi-day demonstration run this is what we had available. Point of this blog was to demonstrate correctness of the quorum algorithm and recovery with a stock PyTorch stack and not so much peak flops.
Stay tuned though -- planning on doing some much larger demos on B200s!
Why isnt there more investments into semi-synchronous training - is it that the convergence is iffy ? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.
Recently there's been a lot of interest and improvements in semi-synchronous training. The Streaming DiLoCo paper came out this year and is a big step forward for datacenter semi-sync.
Historically it's been limited to areas like federated learning for low power/low network training but with the massive increase in number of GPUs it's becoming relevant even for training in datacenters.
It is another variable ML researchers have to tune so does add some complexity and I expect most folks just aren't familiar with it yet.
On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust w/ GRPC and the front-end is typed Python with Pyre since it has to interact with PyTorch and model code.
thanks !, I am curious how this relates to the recent "monarch" announcement - which has similar goals of facilitating large scale fault tolerant training [1].
We're working on making these composable. torchft is largely focused on the model integration and algorithms where as Monarch is handling more of the orchestration/monitoring. They operate at a bit of a different layer but the plan is to have torchft have the fault tolerant algorithms that can be used both in Monarch or a standard PTD job
It seems to work just fine with SIP enabled. I just switched and it seems to be a lot better than Amethyst. Amethyst had a lot of issues with focus follows mouse and dropdown dialogs that seems to just work with Yabai
Seems like SIP is only needed for system dialogs etc so has the same limitations as Amethyst
Supercharger auth is between the car and the charger and doesn't require an internet connection. I get billed the normal way via my Tesla account since the VIN is registered
There's some functionality loss but it's mostly been mitigated. I have a custom app I wrote since I can't use the stock app.
The one feature I miss is that there's no voice commands since that requires Tesla's servers but at the same time I also haven't been bothered enough to plug in a custom backend
It’s ok, the voice commands are barely understood anyway. At least in the UK they aren’t. Gets it drastically wrong and messes up your navigation destination, because you asked it to open the glovebox “navigating to Columbia”
Yeah, we're adding people slowly because decentralized authorities like the one that tailnet lock implements can have nasty failure modes, e.g. some bug that prevents any new addition to the tailnet at all and forces manual recovery on each of your devices separately. So, we're putting miles on it with a little care, and making sure folks who sign up are aware of the current limitations and risks.
If you're excited about tailnet lock and want to get on the alpha sooner rather than later, feel free to drop me an email. As Dave mentioned we are slowly crunching through the waitlist to get some miles in, but I'm also happy to take on enthusiastic testers ahead of that!
As someone that used to use SSH port forwarding, I have a recommendation that may be a suitable alternative to the lack of port forwarding in Mosh, as well as being an alternative to port forwarding over SSH. Wireguard! This is what I do instead of port forwarding over SSH since quite a while back now.
I run a Wireguard VPN on a VPS, and have machines connect to that VPN. This allows me to reach the machines on the VPN from almost anywhere in the world. Recently I changed the port that Wireguard is listening on to port 443 UDP, which also allows me to connect to my VPN from a few public WLANs that are very restrictive on which ports they allow outbound traffic to.
Wireguard is super easy to configure and run, and very secure.
Definitely give Wireguard a go. It's open source and awesome.
I think you could setup something like this on the fly too without root access. I’m not entirely sure, but a while back fly.io published [1] an article talking about how they use wireguard-go [2] to do something similar in user space. I might even try this too…
you can compile them yourself or if you want to skip the step I recently set up GitHub actions to compile linux binaries of this [1][2], tested by a sample of 1 so no guarantees it works, was planning on doing a tap PR/tap of it at some point
also the official developers have been involved a project to solve this while improving the whole-agent approval things also https://github.com/StanfordSNR/guardian-agent , but I couldn't get it to work which is why I tried the fork and got that working
I'm confused. I read the whole thing but couldn't find the specific reason for why it's not been merged. But I assume it's because of the things that were pointed out in the code review comments?
Also, the issue you linked is about SSH Agent forwarding, not port forwarding.
Yes you are 100 correct, I mixed up port and agent forwarding, I’ve needed both at different times and last time it was agent forwarding so got confused.
My understanding is that the maintainers prefer doing one thing well (and securely). Which to be honest is something I really appreciate even if it means I might have to figure out some agent and port forwarding workaround :-/ at least I don’t have to worry about if my version of mosh will work with whatever the server runs
They measure torque on the wheel from the drivers hand. It is possible to fool via defeat devices etc (ex www dot autopilotbuddy dot com).
There is a WIP system that uses the selfie camera to monitor the driver but it's still possible to fool (image taped in front or block it with tape etc) so unlikely it can catch all cases of drivers being willfully being dangerous. https://twitter.com/greentheonly/status/1379928419136339969
For specific types of failures check out the section on "Reliability and Operational Challenges" from the Llama 3 paper https://ai.meta.com/research/publications/the-llama-3-herd-o...