
d4l3k · 2025-06-27T18:37:21 1751049441

We want to be tolerant to application bugs and host/GPU failures that can be solved by replacing/restarting the machine. We don't have much control over external services and network failures, so we aren't aiming to solve those.

For specific types of failures check out the section on "Reliability and Operational Challenges" from the Llama 3 paper https://ai.meta.com/research/publications/the-llama-3-herd-o...

d4l3k · 2025-06-27T17:10:11 1751044211

Let me know how it goes! If you're interested in chatting / run into any problems feel free to reach out via the links in my profile

d4l3k · 2025-06-27T01:45:22 1750988722

Hey Tim, how's it going?

Interested in lending PyTorch some compute? :)

torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog was to demonstrate correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak flops.

Stay tuned though -- planning on doing some much larger demos on B200s!

d4l3k · 2025-06-27T00:16:01 1750983361

Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

bwfan123 · 2025-06-27T15:01:17 1751036477

Why isn't there more investment in semi-synchronous training? Is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.

d4l3k · 2025-06-27T17:07:10 1751044030

Recently there's been a lot of interest and improvements in semi-synchronous training. The Streaming DiLoCo paper came out this year and is a big step forward for datacenter semi-sync.

Historically it's been limited to areas like federated learning for low power/low network training but with the massive increase in number of GPUs it's becoming relevant even for training in datacenters.

It is another variable ML researchers have to tune so does add some complexity and I expect most folks just aren't familiar with it yet.
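
For readers new to the idea, here is a minimal sketch of one semi-synchronous round in the spirit of local SGD / DiLoCo, in pure Python. It is not torchft's or the Streaming DiLoCo paper's implementation: all function names are hypothetical, the toy quadratic loss is an assumption, and the outer update is plain SGD, where DiLoCo-style methods typically use a momentum-based outer optimizer. `inner_steps` is the extra variable to tune mentioned above.

```python
from typing import Callable, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def local_sgd(params: Vector, grad_fn: GradFn, steps: int, lr: float) -> Vector:
    """Run `steps` plain SGD updates on a private copy of the parameters."""
    local = list(params)
    for _ in range(steps):
        grads = grad_fn(local)
        local = [p - lr * g for p, g in zip(local, grads)]
    return local


def semi_sync_round(
    global_params: Vector,
    grad_fns: List[GradFn],
    inner_steps: int,
    inner_lr: float,
    outer_lr: float = 1.0,
) -> Vector:
    """One round: each replica trains alone, then all replicas sync once."""
    pseudo_grads = []
    for grad_fn in grad_fns:
        local = local_sgd(global_params, grad_fn, inner_steps, inner_lr)
        # Pseudo-gradient: how far this replica moved away from the global params.
        pseudo_grads.append([g - l for g, l in zip(global_params, local)])
    n = len(pseudo_grads)
    # The only communication step: average the pseudo-gradients across replicas.
    avg = [sum(col) / n for col in zip(*pseudo_grads)]
    # Outer step (plain SGD here; with outer_lr=1.0 this is parameter averaging).
    return [p - outer_lr * d for p, d in zip(global_params, avg)]


def make_quadratic_grad(target: float) -> GradFn:
    """Gradient of the toy loss 0.5 * (x - target)^2, applied per coordinate."""
    return lambda params: [p - target for p in params]


def train(rounds: int = 20, inner_steps: int = 8) -> Vector:
    """Two replicas with different data (targets 1.0 and 3.0) share one model."""
    params = [0.0]
    grad_fns = [make_quadratic_grad(1.0), make_quadratic_grad(3.0)]
    for _ in range(rounds):
        params = semi_sync_round(params, grad_fns, inner_steps, inner_lr=0.1)
    return params
```

In this toy setup the replicas communicate once every `inner_steps` updates instead of on every step, and the shared parameter settles between the two replicas' targets.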

On "typed language": all of torchft is typed! The coordination/quorum layers are written in Rust w/ gRPC and the front-end is typed Python with Pyre since it has to interact with PyTorch and model code.

bwfan123 · 2025-06-28T18:37:43 1751135863

Thanks! I am curious how this relates to the recent "monarch" announcement - which has similar goals of facilitating large scale fault tolerant training [1].

[1] https://github.com/pytorch-labs/monarch/issues/175#issuecomm...

d4l3k · 2025-07-07T15:49:55 1751903395

We're working on making these composable. torchft is largely focused on the model integration and algorithms whereas Monarch is handling more of the orchestration/monitoring. They operate at a bit of a different layer but the plan is to have torchft have the fault tolerant algorithms that can be used both in Monarch or a standard PTD job.

d4l3k · on Nov 30, 2023

It seems to work just fine with SIP enabled. I just switched and it seems to be a lot better than Amethyst. Amethyst had a lot of issues with focus follows mouse and dropdown dialogs that seem to just work with Yabai

Seems like SIP is only needed for system dialogs etc so has the same limitations as Amethyst

d4l3k · on Aug 3, 2023

If the map can't talk to Tesla it'll use Google maps directly. I usually don't allow connections to Tesla on my rooted Model 3

adamgamble · on Aug 3, 2023

I also would like to subscribe to your newsletter.

d4l3k · on Aug 3, 2023

I've got a blog if you're interested haha https://fn.lc/post/

I've been hacking on my car and creating my own self driving models

Code is at https://github.com/d4l3k/torchdrive

seanthemon · on Aug 3, 2023

Very cool, am going to eat this up. FYI some of your images won't load for me, shoots me a 502 here https://fn.lc/post/diy-self-driving/

d4l3k · on Aug 3, 2023

Not sure why they aren't loading, seem to be fine now

They're also at https://github.com/d4l3k/fn.lc/tree/master/static%2Fdiy-self...

acer589 · on Aug 3, 2023

Is that legal?

kortilla · on Aug 4, 2023

Is getting married at 15 in Georgia?

malwrar · on Aug 3, 2023

How does this work with their charging network? Are you still able to use their chargers, or are you stuck with home charging & third parties?

d4l3k · on Aug 3, 2023

Supercharger auth is between the car and the charger and doesn't require an internet connection. I get billed the normal way via my Tesla account since the VIN is registered

wholinator2 · on Aug 3, 2023

Oh no, don't give them ideas. It'll become the HP instant ink of car charging

nikau · on Aug 4, 2023

Your L2 charging wire is low on copper, please replace the entire cable.

zoover2020 · on Aug 3, 2023

How did you root yours? Did you lose out on any functionality?

d4l3k · on Aug 3, 2023

There's some functionality loss but it's mostly been mitigated. I have a custom app I wrote since I can't use the stock app.

The one feature I miss is that there's no voice commands since that requires Tesla's servers but at the same time I also haven't been bothered enough to plug in a custom backend

lrem · on Aug 3, 2023

wait

So the company that goes "we don't need physical buttons since we have voice commands" also goes "you don't need those in underground parkings"?!

majikandy · on Aug 3, 2023

It’s ok, the voice commands are barely understood anyway. At least in the UK they aren’t. Gets it drastically wrong and messes up your navigation destination, because you asked it to open the glovebox “navigating to Columbia”

judge2020 · on Aug 4, 2023

Are there api keys for google maps in the car? Or does it emulate some client like a browser or android phone?

d4l3k · on Jan 18, 2023

I just tried to set this up and couldn't. Seems like it's invite only with a waitlist :/

dave_universetf · on Jan 18, 2023

Yeah, we're adding people slowly because decentralized authorities like the one that tailnet lock implements can have nasty failure modes, e.g. some bug that prevents any new addition to the tailnet at all and forces manual recovery on each of your devices separately. So, we're putting miles on it with a little care, and making sure folks who sign up are aware of the current limitations and risks.

ikiris · on Jan 18, 2023

Oh is that all the problem is?

Anyone with automated deployments and self provisioning should be fine with that risk. I thought it was a lot more premature than this.

rollcat · on Jan 18, 2023

Good ops is more than automated deployments. Complex systems have complex failure modes.

tailscaletom · on Jan 18, 2023

If you're excited about tailnet lock and want to get on the alpha sooner rather than later, feel free to drop me an email. As Dave mentioned we are slowly crunching through the waitlist to get some miles in, but I'm also happy to take on enthusiastic testers ahead of that!

You can email me at tom@ (tailscale dot com)

d4l3k · on Aug 12, 2021

Adding port forwarding to Mosh has a $600 bounty -- highest OSS bounty I've ever seen

https://www.bountysource.com/issues/4471419-ssh-port-forward... https://github.com/mobile-shell/mosh/issues/337

retrir · on Aug 12, 2021

On high bounties, Qubes OS has a $6500 bounty for GNOME support https://www.bountysource.com/issues/31778112-add-support-for...

codetrotter · on Aug 12, 2021

As someone that used to use SSH port forwarding, I have a recommendation that may be a suitable alternative to the lack of port forwarding in Mosh, as well as being an alternative to port forwarding over SSH. Wireguard! This is what I do instead of port forwarding over SSH since quite a while back now.

I run a Wireguard VPN on a VPS, and have machines connect to that VPN. This allows me to reach the machines on the VPN from almost anywhere in the world. Recently I changed the port that Wireguard is listening on to port 443 UDP, which also allows me to connect to my VPN from a few public WLANs that are very restrictive on which ports they allow outbound traffic to.

Wireguard is super easy to configure and run, and very secure.

Definitely give Wireguard a go. It's open source and awesome.

mbreese · on Aug 12, 2021

I think you could set up something like this on the fly too without root access. I’m not entirely sure, but a while back fly.io published [1] an article talking about how they use wireguard-go [2] to do something similar in user space. I might even try this too…

[1] https://fly.io/blog/ssh-and-user-mode-ip-wireguard/

[2] https://git.zx2c4.com/wireguard-go/about/

gnyman · on Aug 12, 2021

there is a fork with port forwarding support https://github.com/rinne/mosh and a PR with a long discussion https://github.com/mobile-shell/mosh/pull/696 on why it's not merged

you can compile them yourself or if you want to skip the step I recently set up GitHub actions to compile linux binaries of this [1][2], tested by a sample of 1 so no guarantees it works, was planning on doing a tap PR/tap of it at some point

also the official developers have been involved in a project to solve this while also improving the whole agent approval thing https://github.com/StanfordSNR/guardian-agent , but I couldn't get it to work which is why I tried the fork and got that working

[1] https://github.com/gnyman/mosh/actions/runs/1068715036 [2] https://github.com/gnyman/mosh/actions/runs/1068715035

codetrotter · on Aug 12, 2021

> a PR with a long discussion https://github.com/mobile-shell/mosh/pull/696 on why it's not merged

I'm confused. I read the whole thing but couldn't find the specific reason for why it's not been merged. But I assume it's because of the things that were pointed out in the code review comments?

Also, the issue you linked is about SSH Agent forwarding, not port forwarding.

gnyman · on Aug 12, 2021

Yes you are 100% correct, I mixed up port and agent forwarding, I’ve needed both at different times and last time it was agent forwarding so got confused.

There is another issue for port forwarding https://github.com/mobile-shell/mosh/issues/337 but no PR that I’m aware of.

Regarding why it hasn’t been merged, there is a comment on the port forwarding issue which sums it up quite well I think https://github.com/mobile-shell/mosh/issues/337#issuecomment...

My understanding is that the maintainers prefer doing one thing well (and securely). Which to be honest is something I really appreciate even if it means I might have to figure out some agent and port forwarding workaround :-/ at least I don’t have to worry about if my version of mosh will work with whatever the server runs

wngr · on Aug 12, 2021

Lack of SSH agent forwarding is unfortunately the deal breaker for me..

d4l3k · on April 18, 2021

They measure torque on the wheel from the driver's hand. It is possible to fool it via defeat devices etc (ex www dot autopilotbuddy dot com).

There is a WIP system that uses the selfie camera to monitor the driver but it's still possible to fool (image taped in front or block it with tape etc) so it's unlikely it can catch all cases of drivers being willfully dangerous. https://twitter.com/greentheonly/status/1379928419136339969

d4l3k · on April 18, 2021

They are working on a camera based solution though it's imperfect. You can see examples of it running at https://twitter.com/greentheonly/status/1379928419136339969