Hacker News
How Verizon fixed a recent backbone provider issue (edgecast.com)
78 points by thehelix112 on Feb 6, 2015 | 26 comments


The cynic in me wonders if this blog post was written solely to serve as positive PR for Verizon before the FCC vote, in the face of their horrible transit/peering infrastructure management (i.e., letting links run hot so they can charge transit providers more money for each increase in commit, or letting peering ports run very hot because they don't want to spend the few thousand to add another 10G, as they'd rather be paid for transit).

Edgecast is owned by Verizon, so cynical-me is having a hard time believing this is not a plant.


No company is all bad. I'm sure there are plenty of techs at Verizon and its subsidiaries that are just trying to do the best job they possibly can and are excited and rightfully proud to diagnose and fix such a problem. Looks like they did some good detective work and deserve the bragging rights.

We see similar posts pop up once in a while with other such situations (CloudFlare springs to mind). It just happens to be Verizon in this case.


> No company is all bad.

I think there are a few that are. I see very little good in a company like Philip Morris. They manufacture a delivery system for an addictive drug that, as a side effect, results in countless cancer deaths.

I couldn't possibly see myself working for a company like that.

OTOH under the right circumstances I'd work for Verizon. Their evil is venial compared to merchants of death. But a man's gotta eat, and that's why we prostitute ourselves out to these evil behemoths. And why we rationalize that we're "just trying to do the best job we possibly can".


If we injected politics into every tech article, there would be no tech articles here. Article on how Graph Search works under the hood? "But did you know Facebook makes their money peddling consumerism to kids?" Etc. At the end of the day, the technology involved in flinging all those terabits of Facebook/Twitter/Netflix content around the aether is legitimately interesting, whatever the FCC vote may be.


... dude, it's a company blog post. They're writing about what they do in their NOC. It's basic SEO and marketing, not some telco conspiracy.


Given the "being good stewards of the Internet" headline: yeah, total submarine.

http://www.paulgraham.com/submarine.html


You're not the only cynic on HN. When I read this I immediately thought of the somewhat different "infrastructure management" that occurred when Verizon extorted Netflix a while ago.

That blog post would have been written something like this:

Verizon: Those sure are nice packets you're asking us to deliver to our mutual customers. It'd be a shame if something happened to them at the interface between our networks.

Netflix: Ouch. Ouch. Ouch. Thank you for bending us over and screwing us. Here, take some of our money. Please make the pain stop.


They're quite up-front about EdgeCast being owned by Verizon (see the large logo at the top left of the page and the lead sentence).

This is a pretty good write-up of what happens behind the scenes when debugging a network issue.


As someone who lives in the UK and purchases bandwidth all over the world, I say this:

That's your fucking job, what do you want, a medal?

Seriously, the level of crap that Verizon, Level 3 and AT&T put out is immense. I was trying to get a 100 meg line in downtown Redwood City (this was Q4 2013). At first I was told that the exchange was full. After that they said I could have a bonded T3. After much screaming and shouting they decided that they would do me a favour and provision a fibre line (bear in mind this was part of a large global account, with MPLS and other such niceties).

In the end it took 3 months of epic hassle (this was without wayleave) just to get to the point of connection. After another 6 weeks of pointless meandering, I had fibre.

However, because of the level of skill at the NOC, it was another 3 fucking weeks to get it lit properly (15% packet loss is not acceptable, by the way).

The worst part of this is the cost: $4500 a month for a steaming pile of shite, backed by people too thick to open doors effectively.

In London this is how it went down: phone up $provider, I want a 1 gig line please.

$provider: sure, that might be up to 90 days, pending legals and survey

Me: ok

$provider: (week later) survey is done, line should be lit in a month

$provider: oh, and that's £1500 a month

In conclusion, just fuck right off, get off your arse, and fucking do what we all pay you for: provide some fucking bandwidth.


If they're provisioning fiber lines and not documenting loss, doing OTDR, etc., they're doing it wrong. It should not have gone live and been handed to the customer without passing a fucking smell test.

edit: my past experience with CenturyLink was always great -- they provided us with extremely detailed documents of the provisioned line with loss, bandwidth tests, etc.


I've blogged about an almost exact replica of this in 2012: http://mina.naguib.ca/blog/2012/10/22/the-little-ssh-that-so...

Funny enough, I'm a direct client of EdgeCast and I believe I witnessed a case of file corruption on their layer last week.


I actually had to stop for a second and wonder if they had found (or been notified of) the exact same issue and done their own write-up. I wasn't aware your post was from 2012; I first saw it in a recent HN repost.


As a (smaller) CDN, we routinely solve these sorts of problems. It never occurred to me to blog about it in such a manner. But then again, we suck at PR.


Yes, as Kalleboo said, please do. I can't speak for others, but personally I find such postmortems to be among the most interesting kind of articles on here.


Please blog about it! It's usually an interesting read. People eat up posts like this and the ones from CloudFlare.


Full disclosure: thehelix112's profile indicates that it belongs to "Directory of Security Development (http://www.edgecast.com)".


As a guy who has done some hardware for router interfaces, I found this an interesting post regardless of the source.

There are lots of built-in monitoring functions for the hardware interfaces. For example, the laser driver for the transmitter often drops in power before failure, so there is a hardware function built in on the receiver side to report the average laser power. If the power drops below a certain threshold, an alarm goes off in the monitoring software.
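As a rough illustration of that kind of threshold alarm, here is a minimal sketch; read_rx_power_dbm() is a hypothetical placeholder for however a given platform exposes the optic's diagnostics, and the -14 dBm threshold is made up purely for the example:

    # Minimal sketch of a receive-power alarm. The helper and threshold
    # below are placeholders, not any vendor's actual API or spec value.
    RX_POWER_ALARM_DBM = -14.0

    def read_rx_power_dbm(port):
        # A real implementation would read the optic's reported average
        # receive power (e.g. from the module's digital diagnostics) here.
        raise NotImplementedError

    def check_rx_power(port):
        power = read_rx_power_dbm(port)
        if power < RX_POWER_ALARM_DBM:
            print("ALARM: port %s rx power %.1f dBm below threshold" % (port, power))
        return power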

The author is pointing out an interesting failure mode that was not automatically caught. As a shot in the dark: in the receiver, the hardware is spec'd to receive 31 zeros in a row (IIRC, which by the way goes back to the early SONET specs). If the detector, for example, is degrading in a way that causes a particular string of zeros (or some other fixed pattern) to fail, you would see it show up much like the test that was run.

What's interesting is that the errors did not light up alarms in the monitoring. Perhaps CRC errors have a threshold, and the bit pattern causing the failure was showing up statistically below the threshold.
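To put toy numbers on that idea (nothing here comes from the article; it just shows how a pattern-dependent fault can sit below a ratio-based alarm):

    # Assume the alarm only fires above one errored frame per million.
    ALARM_THRESHOLD = 1e-6

    def crc_alarm(frames_seen, crc_errors):
        return (crc_errors / frames_seen) > ALARM_THRESHOLD

    # Ten billion frames, of which 2,000 hit the "bad" bit pattern and
    # fail CRC: a ratio of 2e-7, so no alarm fires, yet transfers over
    # the link still see corrupted payloads often enough to notice.
    print(crc_alarm(frames_seen=10_000_000_000, crc_errors=2_000))  # False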

Anyway, I'm sure there's some hardware failure-analysis guy out there who would love to look harder at that link.


Self-submission/promotion is not an issue here as long as it's not illegitimately seeded with votes.


Nonetheless, I think it would be better if the title had been worded such that this relationship was made clear.


No one does that except for Show HN, and that's only for the debut of a new project.


One of the issues with newer switching/routing gear (if you want to call it an issue...) is that they have to work at such incredible speeds that they are unable to actually keep up with verifying packet integrity before shooting it down into the next hop.

If you want 10 Gbit/sec line rate, one of the ways to get there is to simply forgo any and all verification that packets are not corrupted and shove them out of an Ethernet port as fast as possible.

I recently ran into this at work: we had some issues with corrupted packets, and we ended up tracing it down to a failing Twinax cable. But every single switch in the path to the server happily forwarded the corrupt packets.

Luckily, Cisco has some internal counters that show issues like that, and after tracing it down through multiple switches/routers we found the culprit and fixed the issue!
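For anyone curious what that kind of hunt can look like, here is a rough sketch (not the commenter's actual method) that polls the standard IF-MIB::ifInErrors counter on each hop using pysnmp's classic high-level API and flags interfaces whose error count is still climbing; the hostnames, interface indexes, and community string are placeholders:

    import time
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    def if_in_errors(host, if_index, community="public"):
        # One SNMP GET of IF-MIB::ifInErrors for the given interface index.
        err_indication, err_status, _, var_binds = next(getCmd(
            SnmpEngine(), CommunityData(community),
            UdpTransportTarget((host, 161)), ContextData(),
            ObjectType(ObjectIdentity("IF-MIB", "ifInErrors", if_index))))
        if err_indication or err_status:
            raise RuntimeError("SNMP query to %s failed" % host)
        return int(var_binds[0][1])

    # Placeholder path: (device, ifIndex) pairs along the route to the server.
    hops = [("switch1.example.net", 10), ("switch2.example.net", 3)]
    before = {hop: if_in_errors(*hop) for hop in hops}
    time.sleep(60)
    for hop in hops:
        delta = if_in_errors(*hop) - before[hop]
        if delta:
            print("%s ifIndex %d: +%d input errors in 60s" % (hop[0], hop[1], delta))

The earliest hop in the path that shows climbing input errors generally sits just downstream of the failing segment, which is roughly how a bad cable like that Twinax gets localized.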


> ... they have to work at such incredible speeds that they are unable to actually keep up with verifying packet integrity before shooting it down into the next hop.

Yep: http://en.wikipedia.org/wiki/Cut-through_switching


[Full disclosure: I am an employee at Verizon/EdgeCast. Wasn't sure how to pop that in the title, to be honest.] We have updated the blog post to clarify some of the technical questions that have come up in this discussion, particularly the confusion over whether Verizon/EdgeCast was the (Tier 1) transit provider in question.


I realize Verizon bought Edgecast last year, but why is the post being promoted by Edgecast and not Verizon themselves? Edgecast doesn't own or operate a backbone; they are an edge provider, so this is a completely separate concern. This is typical dirty Verizon/big-telco PR in my opinion.

Edgecast is a great CDN and company; I'm curious to see how this new ownership plays out.


Oh come on. It's an interesting article.


Great, now if only Verizon could fix their extreme LTE flakiness / slowness (no pun intended) in the SF Bay Area...



