
And there his account went poof, thanks for archiving.


They were quoted on multiple news sites including Ars Technica. I would imagine they were not authorized to post that information. I hope they don't lose their job.

Shareholders and other business leaders I'm sure are much happier reporting this as a series of unfortunate technical failures (which I'm sure is part of it) rather than a company-wide organizational failure. The fact they can't physically badge in the people who know the router configuration speaks to an organization that hasn't actually thought through all its failure modes. People aren't going to like that. It's not uncommon to have the datacenter techs with access and the actual software folks restricted, but that being the reason one of the most popular services in the world has been down for nearly 3 hours now will raise a lot of questions.

Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.


Facebook should have had a panic room.

Operations teams normally have a special room with a secure connection for situations like this, so that production can be controlled in the event of bgp failure, nuclear war, etc. I could see physical presence being an issue if their bgp router depends on something like a crypto module in a locked cage, in which case there's always helicopters.

So if anything, Facebook's labor policies are about to become cooler.


Yup, it's terrifying how much is, ultimately, dependent on dongles and trust. I used to work at a company with a billion or so in a bank account (obviously a rather special type of account), which was ultimately authorised by three very trusted people who were given dongles.


What did the dongles do?


Sorry, I should have been clearer - the dongles controlled access to that bank account. It was a bank account for banks to hold funds in. (Not our real capital reserves, but sort of like a current account / checking account for banks.)

I was friends with one of those people, and I remember a major panic one time when 2 out of 3 dongles went missing. I'm not sure if we ever found out whether it was some kind of physical pen test, or an astonishingly well-planned heist which almost succeeded - or else a genuine, wildly improbable accident.


I would be absolutely shocked if they didn't.

The problem is when your networking core goes down, even if you get in via a backup DSL connection or something to the datacenter, you can't get from your jump host to anything else.


It helps if your DSL line is bridging at layer 2 of the OSI model using rotated PSKs, so it won't be impacted by DNS/BGP/auth/routing failures. That's why you need to put it in a panic room.


That model works great, until you need to ask for permission to go into the office, and the way to get permission is to use internal email and ticketing systems, which are also down.


Operations teams don't need permission from some apparatchik to enter the office when production goes down. If they can't get in, they drill.


> nuclear war

I think you need some convincing to keep your SREs on-site in case of a nuclear war ;)


Hey, if I can take the kids and there’s food for a decade and a bunker I’m probably in ;)


I'm not sure why shareholders are lumped in here. A lot of the reason companies do the secret squirrel routine is to hide their incompetence from the shareholders.


That is what I meant, although you have lots of executives and chiefs who are also shareholders.


> an organization that hasn't actually thought through all its failure modes

Thinking through every potential thing that could happen is impossible


You don't need to consider 'what if a meteor hit the data centre and also it was made of cocaine'. You do need to think through "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."


In a company the size of Facebook, "everything is turned off" has never happened since before the company was founded 17 years ago. This makes it very hard to be sure you can bring it all back online! Every time you try it, there are going to be additional issues that crop up, and even when you think you've found them all, a new team that you've never heard of before has wedged itself into the data-center boot-up flow.

The meteor isn't made of cocaine, but four of them hitting at exactly the same time is freakishly improbable. There are other, bigger fish to fry, so we're going to treat four simultaneous meteors as impossible. Which is great, until one day five of them hit at the same time.


>we're going to treat four simultaneous meteors as impossible. Which is great, until one day five of them hit at the same time.

I think that suggests that there were not bigger fish to fry :)

I take your point on priorities, but in a company the size of Facebook, a team dedicated to understanding the challenges around 'from scratch' kickstarting of the infrastructure could perhaps be funded as part of BCP planning - this is a good time to have a binder with, if not perfectly up-to-date data, pretty damned good indications of a process to get things working.


>> we're going to treat four simultaneous meteors as impossible. Which is great, until one day five of them hit at the same time.

> I think that suggests that there were not bigger fish to fry :)

I can see this problem arising in two ways:

(1) Faulty assumptions about failure probabilities: You might presume that meteors are independent, so simultaneous impacts are exponentially unlikely. But really they are somehow correlated (meteor clusters?), so simultaneous failures suddenly become much more likely.

(2) Growth of failure probabilities with system size: A meteor hit on earth is extremely rare. But in the future there might be datacenters across the whole galaxy, so there's a datacenter being hit every month or so.

In real, active infrastructure there are probably even more pitfalls, because estimating small probabilities is really hard.
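
A toy illustration of (1), with completely made-up numbers: under independence, four simultaneous failures are effectively impossible, but even a tiny common-cause probability (say, one bad config push hitting everything at once) completely dominates the estimate.

    # Toy numbers, for illustration only.
    p = 1e-3   # daily failure chance of any one of four redundant links
    q = 1e-5   # daily chance of a common cause (e.g. one bad config push)

    p_independent = p ** 4               # ~1e-12: "can't happen"
    p_correlated = q + (1 - q) * p ** 4  # ~1e-5: dominated by the common cause

    print(f"independent: {p_independent:.1e}")
    print(f"correlated:  {p_correlated:.1e}")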


> "how do I get this back online in a reasonable timeframe from a starting point of 'everything is turned off and has the wrong configuration'."

The electricity people have a name for that: black start (https://en.wikipedia.org/wiki/Black_start). It's something they actively plan for, regularly test, and once in a while, have to use in anger.


It's a process I'm familiar with gaming out. For our infrastructure, we need to discuss and update our plan for this from time to time, from 'getting the generator up and running' through to 'accessing credentials when the secret server is not online' and 'configuring network equipment from scratch'.


I love that when you had to think of a random improbable event, you thought of a cocaine meteor. But ... hell YES!


Luckily you don't need to do that exhaustively: all you have to do is cover the general failure case. What happens when communications fail?

This is something that most people aren't naturally good at; it tends to come from experience.


Right, but imagining that DNS goes down doesn’t take a science fiction author.


Of course you can’t think of every potential scenario possible, but an incorrect configuration and rollback should be pretty high in any team’s risk/disaster recovery/failure scenario documentation.


This is true, but it's not an excuse for not preparing for the contingencies you can anticipate. You're still going to be clobbered by an unanticipated contingency sooner or later, but when that happens, you don't want to feel like a complete idiot for failing to anticipate a contingency that was obvious even without the benefit of hindsight.


> I hope they don't lose their job.

I hope they do.

#1 it's a clear breach of corporate confidentiality policies. I can say that without knowing anything about Facebook's employment contracts. Posting insider information about internal company technical difficulties is going to be against employment guidelines at any Big Co.

In a situation like this that might seem petty and cagey. But zooming out and looking at the bigger picture, it's first and foremost a SECURITY issue. Revealing internal technical and status updates needs to go through high-level management, security, and LEGAL approvals, lest you expose the company to increased security risk by revealing gaps that do not need to be publicized.

(Aside: This is where someone clever might say "Security by obscurity is not a strategy". It's not the ONLY strategy, but it absolutely is PART of an overall security strategy.)

#2 just purely from a prioritization/management perspective, if this was my employee, I would want them spending their time helping resolve the problem, not posting about it on Reddit. This one is petty, but if you're close enough to the issue to help, then help. And if you're not, don't spread gossip - see #1.


You're very, very right - and insightful - about the consequences of sharing this information. I agree with you on that. I don't think you're right that firing people is the best approach.

Irrespective of the question of how bad this was, you don't fix things by firing Guy A and hoping that the new hire Guy B will do it better. You fix it by training people. This employee has just undergone some very expensive training, as the old meme goes.


I feel this way about mistakes, and fuckups.

Whoever is responsible for the BGP misconfiguration that caused this should absolutely not be fired, for example.

But training about security, about not revealing confidential information publicly, etc is ubiquitous and frequent at big co's. Of course, everyone daydreams through them and doesn't take it seriously. I think the only way to make people treat it seriously is through enforcement.


I feel you're thinking through this with a "purely logical" standpoint and not a "reality" standpoint. You're thinking worst case scenario for the CYA management, having more sympathy for the executive managers than for the engineer providing insight to the tech public.

It seems like a fundamental difference of "who gives a shit about corporate" from my side. The level of detail provided isn't going to get nation-states anything they didn't already know.


Yeah but what is the tech public going to do with these insights?

It's not actionable, it's not whistleblowing, it's not triggering civic action, or offering a possible timeline for recovery.

It's pure idle chitchatter.

So yeah, I do give a shit about corporate here.

Disclosure: While I'm an engineer too, I'm also high enough in the ladder that at this point I am more corporate than not. So maybe I'm a stooge and don't even realize it.


Facebook, the social media website, is used almost exclusively for 'idle chitchatter', so you may want to avoid working there if your opinion of the user is so low. (Actually, you'll probably fit right in at Facebook.)

It's unclear to me how a 'high enough in the ladder' manager doesn't realize that there are easily a dozen people who know the situation intimately but who can't do anything until a system they depend on is back up. "Get back to work" is... the system is down, what do you want them to do, code with a pencil and paper?

ramenporn violated the corporate communication policy, obviously, but the tone and approach of a good manager to an IC who was doing this online isn't to make it about corporate vs. them/the team; it's in fact to encourage them to do more such communication, just internally. (I'm sure there was a ton of internal communication; the point is to note where ramenporn's communicative energy was coming from, and to nurture that rather than destroy it in the process of chiding them for breaking policy.)


> Edit: I also hope this doesn't damage prospects for more Work From Home. If they couldn't get anyone who knew the configuration in because they all live a plane ride away from the datacenters, I could see managers being reluctant to have a completely remote team for situations where clearly physical access was needed.

You're conflating working remotely ("a plane ride away") and working from home.

You're also conflating the people who are responsible network configuration, and for coming up with a plan to fix this; and the people who are responsible for physically interacting with systems. Regardless of WFH those two sets likely have no overlap at a company the size of Facebook.


There could be something in the contract that requires all community interaction to go via official PR channels.

It's innocuous enough, but leaking info, no matter what, will be a problem if it's stated in their contract.


100%! Comms will want to proof any statement made by anybody, along with legal, to ensure that there is no D&O liability for securities fraud.


> an organization that hasn't actually thought through all its failure modes

Move Fast and Break Things!


I came here to move fast and break things, and I'm all out of move fast.


In their defense they really lived up to their mission statement today.


I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID


> I doubt WFH will be impacted by this - not an insider but seems unlikely that the relevant people were on-site at data centers before COVID

I think the issue is less "were the right people in the data center" and more "we have no way to contact our co-workers once the internal infrastructure goes down". In a non-WFH setup you physically walk to your co-worker's desk and say "hey, fb messenger is down and we should chat, what's your number?". This proves that self-hosting your infra (1) is dangerous and (2) makes you susceptible to super-failures if comms go down during WFH.

Major tech companies (GAFAM+) all self-host and use internal tools, so they're all at risk of this sort of comms breakdown. I know I don't have any co-workers' numbers (except one from WhatsApp, which if I worked at FB wouldn't be useful right now).


Apple is all on Slack.


But is it a publicly hosted Slack, or does Apple host it themselves?


I don't think it is possible to self-host Slack.


Amazon has a privately managed instance.


Most of the stuff was probably implemented before COVID anyways.

They will fix the issue and add more redundant communication channels, which is either an improvement or a non-event for WFH.

And Zuck is slowly moving (dogfooding) company culture to remote too with their Quest work app experiments


They must have been moving very fast!


Shoestring budget on a billion-dollar product. You get what you deserve.


> I hope they don't lose their job.

FB has such poor integrity, I'd not be surprised if they take such extreme measures.


It is a matter of preparation. You can make sure there are KVMoIPs or other OOB technologies available on site to allow direct access from a remote location. In the worst case, a technician has to know how to connect the OOB device or press a power button ;)
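
A minimal sketch of what that remote path can look like, assuming plain IPMI rather than whatever FB actually runs (the address and credentials below are placeholders); the important property is that the OOB network must not depend on production DNS, auth, or routing:

    # Hypothetical out-of-band power check via a BMC; placeholder address/credentials.
    import subprocess

    def power_status(bmc_ip, user, password):
        out = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", bmc_ip, "-U", user, "-P", password,
             "chassis", "power", "status"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    print(power_status("192.0.2.10", "oob-admin", "not-a-real-password"))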


I'm not disagreeing with you; however, clearly (if the reddit posts were legitimate) some portion of their OOB/DR procedure depended on a system that's down. From old coworkers who are at FB, their internal DNS and logins are down. It's possible that the username/password/IP of an OOB KVM device is stored in some database that they can't log in to. And the fact FB has been down for nearly 4 hours now suggests it's not as simple as plugging in a KVM.


I was referring to the WFH aspect the parent post mentioned. My point was that the admins could get the same level of access as if they were physically on site, assuming the correct setup.


Pushshift maintains archives of Reddit. You can use camas reddit search to view them.

Comments by u/ramenporn: https://camas.github.io/reddit-search/#{%22author%22:%22rame...
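
If camas is slow, you can also query Pushshift's public comment search directly; a rough sketch (endpoint and fields as publicly documented, no guarantee it holds up under load):

    # Rough sketch of a direct Pushshift query for that account's comments.
    import requests

    resp = requests.get(
        "https://api.pushshift.io/reddit/search/comment/",
        params={"author": "ramenporn", "size": 100, "sort": "desc"},
        timeout=30,
    )
    resp.raise_for_status()
    for c in resp.json()["data"]:
        print(c["created_utc"], c["body"][:80])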


PushShift is one of the most amazing resources out there for social media data, and more people should know about it.


Can you recommend similar others (or maybe how to find them)? I learned of PushShift because snew, an alternative reddit frontend showing deleted comments, was making fetch requests and I had to whitelist it in uMatrix. Did not know about Camas until today.


If it was actually someone in Facebook, their job is gone by now, too.



