
> Code quality was atrocious. We had one enormous Java method (>1000 lines) which would take care of nearly every single request coming into our service... With only about 7-8 unit tests

AWS is pretty reliable for the most part, so I am pretty surprised that the code quality is that bad.



AWS has the advantage of having so many engineers behind the scenes available for firefighting, plus a culture of pushing them harder than is reasonable, that as a customer the problems sort of disappear from view. They simply have to occasionally make trade-offs between unrealistic development/feature-request goals and firefighting whenever the firefighting is needed. This also acts as another form of pressure to work even more to meet timeline goals.

Don't tell your developers that you expect them to always be behind; just keep them infinitely queued up with work and constantly in emergency mode, and they won't have much time to think about what's really going on and how efficiency is being pushed at the cost of their sanity.


> AWS is pretty reliable for the most part, so I am pretty surprised that the code quality is that bad.

I'm not totally surprised because of two factors: very stable product definitions and lots and lots of users.

A number of years back, I was talking with people at a famous and popular site with a broad audience. I asked them how much unit testing they did. They said that particular isolated pieces sometimes had tests. But most of the user-facing stuff didn't because they had one-button rollout and one-button rollback. Instead of bothering with unit tests, they'd just frequently release changes, watch the metrics and the customer support queue, and quickly roll back if they'd introduced a bug.


For very, very popular services, a second of being live will exercise more code paths and edge cases than even the most dedicated testing team could ever dream of.

We hear a hell of a lot about testing, but the most fundamental piece of software quality nowadays is the release strategy: running on tee'd live production traffic, canarying, metrics and alerting, quick rollbacks, etc.
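
To make the canarying/rollback part concrete, here's a minimal sketch of what an automated canary gate can look like. The interfaces, names, and thresholds are all hypothetical; real canary-analysis systems compare many more signals than a single error rate.

    // Hypothetical canary gate: compare the canary's error rate against the
    // stable fleet and roll back automatically if it regresses.
    public class CanaryGate {

        interface Metrics {
            double errorRate(String deploymentGroup); // e.g. errors / requests over the last few minutes
        }

        interface Deployer {
            void promote(String version);
            void rollback(String version);
        }

        private final Metrics metrics;
        private final Deployer deployer;
        private final double maxAbsoluteErrorRate = 0.01; // made-up threshold
        private final double maxRegressionFactor = 2.0;   // canary may not be 2x worse than baseline

        public CanaryGate(Metrics metrics, Deployer deployer) {
            this.metrics = metrics;
            this.deployer = deployer;
        }

        public void evaluate(String candidateVersion) {
            double baseline = metrics.errorRate("stable");
            double canary = metrics.errorRate("canary");

            boolean unhealthy = canary > maxAbsoluteErrorRate
                    || canary > baseline * maxRegressionFactor;

            if (unhealthy) {
                deployer.rollback(candidateVersion); // fast, automatic, no human in the loop
            } else {
                deployer.promote(candidateVersion);  // widen the rollout
            }
        }
    }

The point is that the rollback decision is automated and fast, not that the heuristic is sophisticated.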


That's an overly general statement. Can you do that for front-end code that stores all of its state elsewhere? Sure. Can you do it for a storage system? Absolutely freaking not. If you introduce a bug that loses or corrupts data, there's no going back. You will have committed the worst sin that somebody in that specialty can commit. Better to test as much as you can, at every level. Other kinds of code are often somewhere in between.

Also, even if it's true that being live will exercise more edge cases etc., it's a terrible way to test changes during early development. For one thing, there's no isolation. It becomes harder to determine which of several recent changes caused a problem, and that burden unfairly falls on the person who's on call instead of the person who introduced the error. And decent unit/functional tests allow "dumb" mistakes (we all make them) to be caught earlier than waiting in a deploy queue, allowing faster iteration. "Most recent change probably caused the problem" is a very useful heuristic, but the more low-assurance changes you allow in, the less useful it becomes.

To drive the point home even further: I have found data-loss bugs in focused testing that didn't show up in prod for months. I know because in many cases I was able to add logging for the preconditions when I fixed the bug. No logs for months, then some completely unrelated and completely valid change by another engineer tickles the preconditions and BAM. That would have been an absolute nightmare for other members of my team, possibly even after I was gone. Based on those experiences, I will never believe that foregoing systematic early tests can be valid. The systems most of us work on are too complex for that.
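
To make the "log the preconditions" idea concrete, here's a hypothetical sketch of that kind of instrumentation (the class and the offsets are made up, not the actual bug):

    // Hypothetical sketch of leaving a loud breadcrumb behind a data-loss fix:
    // the precondition that used to lose data is now logged if it ever recurs.
    import java.util.logging.Logger;

    class WriteBuffer {
        private static final Logger LOG = Logger.getLogger(WriteBuffer.class.getName());

        void flush(long committedOffset, long flushedOffset) {
            if (flushedOffset > committedOffset) {
                // This was the precondition for the original data-loss bug. The fix
                // handles it, but log loudly so any new path that recreates the
                // state shows up in prod immediately instead of months later.
                LOG.severe("flushedOffset " + flushedOffset + " is ahead of committedOffset "
                        + committedOffset + " -- data-loss precondition hit");
            }
            // ... corrected flush logic goes here
        }
    }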

"Test in prod" only works for trivial code and/or trivial teams. Not in the grown-up world.


Everyone should be testing in prod, in the sense that you release code and use metrics and monitoring to show that everything is working.

Testing in production is not “let’s see if this will work”; it is “we will release and validate that everything is working as expected.”

People need to get past the image of old-school cowboys who jump on prod to see if something works.


Yes, everyone should release code and watch metrics, etc., but I think that's at the very edge of what "testing" encompasses. Between model checking, traditional forms of testing, and shadow-traffic testing (which can test higher per-server load than prod), finding something after deploy should be like a parachute failure. Yes, those happen, and yes, there should be a reserve, but if it happens more than once in a blue moon you have a process problem somewhere (quite likely between teams/services, but still).


Tangential: where can I learn more about shadow-traffic testing? Books, blogs, tools, etc.


Cindy Sridharan often writes well about these topics.

https://copyconstruct.medium.com/testing-in-production-the-s...

For this one I'd start about half way down, under the heading

"Shadowing (also known as Dark Traffic Testing or Mirroring)"

Unfortunately the terminology is a bit fragmented - shadowing, mirroring, teeing, dark traffic (ick), ad nauseam.

Whatever you call it, it's often pretty high overhead to add the infrastructure, unless you're already using some sort of "service mesh" (Envoy/Istio/Caddy/whatever) that supports it. Even then, if you're dealing specifically with a storage system then there can be some thorny issues - idempotent vs. non-idempotent vs. destructive requests, requests which require other objects (files/objects or directories/buckets) to exist or be in specific states, etc. I'm not going to pretend it's easy.

If you can do it, though, it can be an incredibly valuable tool. There ain't nothing like the real traffic, baby. ;) My favorite feature, which I alluded to earlier, is that you can shadow traffic from a larger production cluster onto a smaller shadow cluster and give it a serious stress test. All sorts of bugs tend to fall out that way. The one thing you can't really catch, even with a good shadow, is interactions with other services - including things like permissions or quotas. But if those are the only things you have to shake out in true production, you're doing well.
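
If it helps to see the shape of it without a service mesh, here's a deliberately bare-bones, hypothetical tee in Java: serve from the primary, asynchronously mirror a copy of the request to the shadow cluster, and discard the shadow's response. The endpoints and structure are assumptions for illustration, not any particular product's API.

    // Hypothetical bare-bones shadow tee: serve from the primary, mirror a copy
    // of the request to the shadow cluster, and discard the shadow's response.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    class ShadowTee {
        private final HttpClient client = HttpClient.newHttpClient();
        private final String primaryBase = "http://primary.internal"; // assumed endpoints
        private final String shadowBase = "http://shadow.internal";

        String handle(String path, String body) throws Exception {
            HttpRequest primary = HttpRequest.newBuilder(URI.create(primaryBase + path))
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpRequest mirror = HttpRequest.newBuilder(URI.create(shadowBase + path))
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            // Mirror asynchronously; the caller never waits on, or sees, the shadow,
            // and shadow failures must never affect the production response.
            client.sendAsync(mirror, HttpResponse.BodyHandlers.discarding())
                  .exceptionally(e -> null);

            // Only the primary's response is returned to the client.
            return client.send(primary, HttpResponse.BodyHandlers.ofString()).body();
        }
    }

Note that this naive version mirrors writes too, which runs straight into the idempotency and state issues mentioned above; real setups usually filter to safe requests or point the shadow at a sandboxed copy of the data.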


Teeing/dark-launch/dual-write strategies solve most of the problem for databases. Sure, you run into concerns when changing the framework that manages that, but that's usually a far smaller surface area than your entire storage layer.
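
As a hypothetical sketch of the dual-write shape (not any particular framework): writes go to both stores, reads are served from the current one, and divergence is logged rather than surfaced.

    // Hypothetical dual-write wrapper for a storage migration: writes go to both
    // stores, reads are served from the current one, and divergence is only logged.
    import java.util.Objects;
    import java.util.logging.Logger;

    interface KeyValueStore {
        void put(String key, String value);
        String get(String key);
    }

    class DualWriteStore implements KeyValueStore {
        private static final Logger LOG = Logger.getLogger(DualWriteStore.class.getName());
        private final KeyValueStore current;   // source of truth
        private final KeyValueStore candidate; // new system being validated

        DualWriteStore(KeyValueStore current, KeyValueStore candidate) {
            this.current = current;
            this.candidate = candidate;
        }

        @Override public void put(String key, String value) {
            current.put(key, value);       // the real write; must succeed
            try {
                candidate.put(key, value); // best effort; never fail the request over it
            } catch (RuntimeException e) {
                LOG.warning("candidate write failed for " + key + ": " + e);
            }
        }

        @Override public String get(String key) {
            String value = current.get(key);
            try {
                if (!Objects.equals(value, candidate.get(key))) {
                    LOG.warning("read divergence for key " + key);
                }
            } catch (RuntimeException e) {
                LOG.warning("candidate read failed for " + key + ": " + e);
            }
            return value;                  // callers only ever see the current store
        }
    }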

That said, you should have tests anyway.


It's a very short-sighted view of testing, although I'm not surprised SREs would say it. The biggest problem with software deployment is that it is owned and managed by people who have no vested interest in developer productivity, including devops engineers.

A major goal of any org should be developer productivity; otherwise you are just hemorrhaging money and talent. When I say developer productivity, I mean: How confidently and quickly can I make a shippable, rollback-free change to a unit of software?

If you are the Dos Equis man of testing ("I don't always test my code, but when I do, I do it in production"), then you can't confidently make any change without risking a production outage, so you play lots of games, like you mentioned, around canarying, rolling out to a small percentage of users, etc., but at the end of the day your developer productivity has absolutely tanked.

The goal of any system maintenance should be that a developer can quickly make and test a change locally and be highly confident that the change is correct. The canarying, phased rollouts, and other such systems should not be the primary means of testing code correctness.


Yeah, I really appreciate excellent rollout strategies, although I suspect a lot of them are more developed out of self defense by SRE teams. I see it as a series of safety nets: I'm still going to write tests for my code so that I don't have far to fall if I make a mistake. But I also want a safe rollout so if I miss the first net I don't splatter on the pavement.

And I totally agree with you about developer productivity. It's just not a consideration in most places. For example, in a factory or a restaurant, meetings are things that happen rarely and in constrained time slots, because everybody realizes that production is primary. But in most software companies, actually getting work done is second priority to meetings.


Agreed. I was an SRE for over a year and the philosophy is that anything that is shipped can be broken. SRE is all about detecting, limiting, and mitigating damage. I think this is the right philosophy for SREs but should not be the total picture in the org.

I also agree, in that I anecdotally often see a disregard for automated testing. I am still trying to understand how to eliminate this tendency. I know that in every software project I've had a major hand in building, I've helped make automated testing, with a heavy emphasis on unit testing, a major part of team culture, and I've always felt the tests more than paid for themselves over time, even in the relative short term.


For sure. I'm still trying to understand it too. Two things that have helped me:

To contain time pressure, I like a kanban board with small units of work. If the team has a history of steady delivery of small lumps of useful stuff, managers are more willing to trust that we know what we're doing.

To mitigate the natural human lack of humility, I start with the rule that every bug requires a test before fixing. People may act as if they won't make mistakes, but it's much harder to claim that when we have a live bug. And then I like to introduce pair programming. Collaboratively adding failing tests and fixing them (as with ping-pong pairing) makes it fun.
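
A made-up example of the "test before fixing" rule, in the JUnit style: the regression test is written first, fails against the buggy code, and then stays in the suite forever. The invoice domain here is purely illustrative.

    // Made-up example: the regression test is written first (it failed against the
    // buggy code) and the class under test carries the fix.
    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.ArrayList;
    import java.util.List;
    import org.junit.jupiter.api.Test;

    class Invoice {
        private record Line(String name, int cents, boolean refunded) {}
        private final List<Line> lines = new ArrayList<>();

        void addLineItem(String name, int cents) {
            lines.add(new Line(name, cents, false));
        }

        void refundLineItem(String name) {
            lines.replaceAll(l -> l.name().equals(name) ? new Line(l.name(), l.cents(), true) : l);
        }

        int totalCents() {
            // The bug: refunded items used to be included in this sum.
            return lines.stream().filter(l -> !l.refunded()).mapToInt(Line::cents).sum();
        }
    }

    class InvoiceTest {
        @Test
        void refundedLineItemsAreExcludedFromTheTotal() {
            Invoice invoice = new Invoice();
            invoice.addLineItem("widget", 1000); // amounts in cents
            invoice.addLineItem("gadget", 500);
            invoice.refundLineItem("gadget");

            // This assertion failed before the fix; now it pins the behavior down.
            assertEquals(1000, invoice.totalCents());
        }
    }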

What approaches do you use to shift things over?


I find it's much easier for greenfield projects than existing projects because you can lead by example. Also, with these projects there are opportunities to define team rules and culture for the project from the beginning.

Perhaps the greatest inertial force against improving test automation is the truth that any change to an in-production product incurs risk, coupled with the fact that adding tests often requires refactoring to make the code more testable. Techniques like writing tests as you refactor never get employed because there is resistance to any refactoring occurring at all in the first place. The general principle followed is: "make the smallest change possible to accomplish the objective".

I feel like overcoming these forces requires bravery from management or the team, coupled with a longer-term vision for the project. For some projects that are limping along on life support, it may not even be worth it to improve code quality. However, for most other projects I believe there is a better balance to be had between not unnecessarily breaking the product and not being afraid to make relatively risky changes to improve maintainability of the product.


Makes total sense. I do most of my work on greenfield projects for that reason.

But I think you're right. There's really no low-risk path when you have a poorly tested code base. You can keep letting productivity decline, which guarantees eventual project failure. You can do a giant rewrite, which is hugely risky. Or you can gradually dig yourself out.

There's still risk there, of course. But it's in smaller, more manageable lumps. It seems like the clearly superior path to me.


If the release/rollback process is fast enough, and your detection of anomalies is fast enough, you can still have great productivity, and few relevant outages, when testing in production. Hell, there are situations where testing outside of production is never going to cut it, as generating sufficient load of the right shape would cost you a whole lot more engineering time than the consequences of failure.

That said, the tradeoffs are different for different companies, and different services in the same company: Within the same team at $large_company, I owned code where testing in production, via deployments and an amazing feature flag system, was better than unit tests, while there were other areas where the build system would dedicate many CPU-hours to testing before any release. To be able to have that flexibility though, you need to know your systems, know your problems, and have great tooling for both testing in production and extremely parallelized test suites. Small and medium sized companies might not have either alternative, and we had both!

So what I'd say is that any general rule about what should be the primary means of testing code correctness is not going to lead to optimal productivity, even more so if you don't have top-quality tooling across every possible dimension. It's perfectly OK to argue about specific examples, but passing judgement on these kinds of things without having the entire story of what's there is just hubris.


I unfortunately can't agree with that sentiment. If I have to rely on production traffic to test my feature, then, at minimum, my feedback cycle is:

1) Make change. Think really hard about it to make sure it's correct.

2) Put out a code review.

3) Get approval. Merge the change.

4) CI pipeline builds and deploys to prod.

5) Absence of alerts must mean it works?

Even if you have no QA environment and nothing between you and prod, I've rarely seen deployment to prod take less than 30 minutes. That's an hour feedback cycle. Contrast with:

1) write code.

2) write unit tests.

3) run tests locally.

The feedback cycle, especially when you get iterative, can get as low as single-digit seconds. I run my tests and see a bug. I fix the bug, then re-run the tests. Similarly, for a more complex feature, I can break the feature down into multiple cycles of build, test, verify.

And that's not even accounting for the overhead of managing feature flags, which is not free. In the best case, you need to at least release a second PR to remove the feature flag when the feature is successful. At my previous employer, this step was often forgotten and resulted in real, consequential technical debt as it became harder to figure out how the product behaved based on which feature flags were turned off or on.
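
To illustrate the debt (hypothetical flag and service names): each flag is an extra branch, and the follow-up change that deletes the flag and the dead branch is exactly the step that gets forgotten.

    // Hypothetical flag guard: every flag is an extra branch, and the follow-up
    // change that deletes the flag and the dead branch is easy to forget.
    class CheckoutService {

        interface FeatureFlags {
            boolean isEnabled(String flagName);
        }

        record Cart() {}
        record Receipt() {}

        private final FeatureFlags flags;

        CheckoutService(FeatureFlags flags) {
            this.flags = flags;
        }

        Receipt checkout(Cart cart) {
            if (flags.isEnabled("new-tax-engine")) { // made-up flag name
                return checkoutWithNewTaxEngine(cart);
            }
            return checkoutLegacy(cart); // dead code once the flag is 100% on
        }

        private Receipt checkoutWithNewTaxEngine(Cart cart) { /* ... */ return new Receipt(); }
        private Receipt checkoutLegacy(Cart cart) { /* ... */ return new Receipt(); }
    }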

If you have experience that leads you to believe that production testing can be more productive than local automated testing, I at least have never seen it happen, and I find it difficult to even imagine it being true.


Tell that to the millisecond of testing in production that makes the MRI fry the patient's brain, to the one that trades one trillion instead of one thousand naked puts, to the nuclear armageddon launch check that canaries humanity...


Depends on if you are serving ads over cat pictures or routing air traffic. Different solutions for different problems.


>For very, very popular services, a second of being live will exercise more code paths and edge cases than even the most dedicated testing team could ever dream of.

Most of the code we care about is there to handle anomalous situations. That AZ going down a week or two ago is a good example. It's when stuff like that happens that a bunch of code springs to life to keep things running. And indeed, things didn't exactly roll over just fine for us.


I think one thing I learned from AWS is that there's so much hidden away from the customer. There definitely were (and probably still are) many issues which the customers won't actively experience. Reliability doesn't necessarily equate to good standards and good practice.

But yes, from a customer point of view, AWS is pretty nice.


Reminds me of that old quote by John Godfrey Saxe:

”Laws, like sausages, cease to inspire respect in proportion as we know how they are made.”


"I just had one for breakfast." -- Jim Hacker


AWS is very big, culturally, on making sure that all the bugginess from shitty code is not shown externally to the customer. Externally it might look like everything is fine to you, but internally AWS is a massive, leaky cargo ship with thousands of engineers running around 24/7 with duct tape and band-aids to plug the leaks.


> AWS is pretty reliable for the most part

In telecom or traditional mainframes, for example, the compute unit itself was expected to be reliable. Individual elements of AWS are not pretty reliable in that context. Check out the single host EC2 SLA.

However, today most large or even medium scale software assumes unreliable individual elements and has redundancy at the program level. For that purpose, AWS core services are pretty reliable.


Way back I used to work in telecommunications at a place that provided POTS service. They are two completely different worlds. Software engineers act as if 5 9's is a badge of honor, when really it isn't. When you are responsible for something that people use to dial 911, something that can make the difference between life and death, a few minutes of downtime doesn't cut it.


Right, and even five nines would be impressive compared to:

AWS will use commercially reasonable efforts to ensure that each individual Amazon EC2 instance (“Single EC2 Instance”) has an Hourly Uptime Percentage of at least 90% of the time in which that Single EC2 Instance is deployed during each clock hour (the “Hourly Commitment”). In the event any Single EC2 Instance does not meet the Hourly Commitment, you will not be charged for that instance hour of Single EC2 Instance usage.

This essentially forces the use of distributed computing for even small businesses.
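
(To put rough numbers on it, assuming instance failures are independent, which they aren't entirely: at 90% per instance, one box is down about 10% of the time, two redundant boxes are both down about 0.1 × 0.1 = 1% of the time, and three about 0.1%. You only claw your way back to respectable availability by running several instances behind a load balancer, i.e. by going distributed.)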


EC2 is absolutely not meant for this, though. Use an abstraction layer like Heroku if you aren't going to understand what you're getting into.

The number of times I've had to 'advise' small businesses that are somehow running their small-business site off a single EC2 instance's ephemeral boot volume is atrocious.


I don’t have any experience with Heroku, but what most small businesses need is a (perhaps simulated) reliable box on a fast network. As glorious as the Paxos-based present is, it’s overkill to the point of distraction for most businesses. The whole attraction of the cloud for them is not needing to hire sysadmins. Replacing that requirement with needing a devops team is even worse.


This is indeed surprising. Any time we have slowness issues, the usual recommendation would be to throw resources at the problem: increase CPU, add more memory, etc. We used to lament that we should instead spend time debugging the problem and fixing the actual issue. We then used to say that at places like AWS and the other biggies they'd probably be following some excellent best practices, and that we should also strive to reach that level of excellence.


I think that highly depends on the service. The new App Runner service, for instance, is a wild ride of bugginess, lack of testing, and incorrect documentation.


For something heavily used, like EC2 and load balancing, perhaps, but I am still experiencing PTSD from my last CloudFormation encounter.



