Many years ago, I
was an engineer at a large software company, and helped to launch a new
product. I spent months going to preparation meetings, and filling out various
readiness checklists. People looked over our code for security issues, and we
had to do “load testing” (although, in my experience, the only real load testing
you can do is with live users). And, after what seemed like an eternity, we
were finally ready to launch.
We went live, and were instantly overwhelmed by the traffic we received. A security issue popped up due to the load - basically, if the auth server fell over, we would let the user in (this was my fault - I didn’t know any better). We ran out of disk space within a few days, and had to switch over to larger servers (which required manual intervention). Overall, it took about a month to get our servers stabilized, and while nothing catastrophic happened, we couldn’t exactly say that we were prepared for launch. So what was the purpose of all these checklists?
The Problem With Checklists
Well, in my experience, most safeguards involve making a list of all of the things that went wrong in the past, and preventing those from happening again. And that’s great - if you make your list long enough, you will prevent most of the common mistakes. As a result, large companies tend to have lots of red tape, and to launch anything, you need to clear all of it. This has a way of stifling innovation, or at the very least, of dragging the outliers back toward the mean.

And there’s an added problem: no matter how many safeguards you put in place, you will never prevent every bad thing from happening. This is the whole thesis of the book "The Black Swan," which argues that while the chance of any specific outlier happening is pretty small, the total sum of outliers (i.e. the long tail) is actually pretty significant (if you’ve read the book, you won’t be surprised by anything I’m going to say from here on).
So how can you guard against things that you can’t even predict? I mean, if you are a new startup these days, you probably finish coding your MVP, push it to Heroku, and then post it to Hacker News/Facebook/Reddit as quickly as possible. There are a million startups doing exactly what you are doing, so time is of the essence. You probably don’t have time for launch checklists - heck, a lot of early-stage startups don’t even bother with testing (we can debate the merits of that later). Anything that reduces your velocity is probably impeding your ability to succeed, right? And how could you possibly predict which one of a million bad things will happen to your company? After all, the most important thing is getting people to use your product.
So, here’s what I would recommend - think of the most common classes of things that could go wrong, and then figure out ways to mitigate the damage of these outliers. In general, these would be:
- My server goes down, either because I get too much load, or because a component fails. The load situation probably isn’t going to happen at first (you might want to focus more on the situation where no one comes), although there might be pieces of infrastructure that won’t handle even a minimal amount of load. You should know what your weakest points are, and how you are going to handle them going down or underperforming. Also consider what happens if a service provider isn’t up to spec. Along these lines, I’ve had a lot less stress when I’ve used known service providers (e.g. Heroku, WPEngine, Posthaven) than when I’ve tried to host things myself. Yes, Heroku is down/slow sometimes, but less frequently than your server will be if you don’t have a full-time site reliability engineer.
- I get hacked. Probably not going to happen at first, but by the time that it happens, it might be a big deal (see Snapchat or Target). Try to make sure that, even if someone compromises your production database, they can’t get any payment information or cleartext passwords (and salt your hashes).
- Some "idiot" on my team accidentally deletes the production database (and I put it in quotes, because intelligent people screw up every once in a while, and that’s fine, so long as it’s only every once in a while). This actually happened to me recently - good thing we had a fairly recent backup. And you bet I put a much more aggressive backup scheme in place once we lost six hours of user data and had to apologize. There’s a subtler version of this: I push a totally broken build, and an irreversible database migration means I can’t roll back to the old build. This is like a server failure, except that you can’t just spin up a new database server unless you have a backup of the data (you should have a mitigation plan for anything that’s irreplaceable). In general, the solution is a fairly frequent backup that’s stored offsite.
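To make "fairly frequent backup" concrete, here is a minimal sketch of one piece of a backup scheme: the retention policy that decides which old snapshots are safe to prune. The function name and policy (keep everything from the last day, plus one backup per day for a week) are my own illustration, not a recommendation from any particular tool - how you actually dump and ship the data depends on your database and storage provider.

```python
from datetime import datetime, timedelta

def backups_to_prune(timestamps, now, keep_hours=24, keep_days=7):
    """Given the timestamps of existing backups, return those safe to
    delete: keep every backup from the last `keep_hours` hours, plus
    the newest backup per day for the last `keep_days` days."""
    recent_cutoff = now - timedelta(hours=keep_hours)
    daily_cutoff = now - timedelta(days=keep_days)
    keep = set()
    newest_per_day = {}
    for ts in timestamps:
        if ts >= recent_cutoff:
            keep.add(ts)  # keep every recent backup
        elif ts >= daily_cutoff:
            day = ts.date()
            if day not in newest_per_day or ts > newest_per_day[day]:
                newest_per_day[day] = ts  # keep the newest one per day
    keep.update(newest_per_day.values())
    return sorted(t for t in timestamps if t not in keep)
```

The point of writing the policy down as code is that pruning is exactly where the "idiot" failure mode lives - you want the logic that deletes backups to be boring, tested, and never improvised at 2am.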
There are probably a bunch of other things you should account for. If you depend on external servers or services, try to understand what their going down will do to you. It may take you down too, and that might be ok (just put up a big fail whale), but make sure there isn’t any unsafe failure behavior (e.g. your auth server goes down and people can suddenly view the contents of every account). While you aren’t going to make your server hacker-proof, you should think through the consequences of a security mistake and plan accordingly (e.g. you forget to auth-protect a new page that you put on the server) - in general, it’s better to blanket-deny access than to blanket-grant it.
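That "blanket-deny" principle is the fix for the exact bug I shipped at launch (auth server falls over, everyone gets in). A minimal sketch, assuming a hypothetical `check_token` call to your auth service - the point is that only an explicit yes grants access, and every error, timeout, or ambiguous answer denies it:

```python
def is_authorized(token, check_token):
    """Fail closed: return True only on a positive answer from the
    auth service. Any exception or ambiguous response denies access."""
    try:
        # `check_token` is a stand-in for your real auth-service call.
        return check_token(token) is True
    except Exception:
        # Auth service down or misbehaving: deny, never grant.
        return False
```

The failure-mode version of this bug is easy to write by accident: wrap the auth call in a try/except that logs and continues, and you have silently converted an outage into an open door.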
And one of the most important things I learned at that big company was to put a service in place that emails you whenever a request returns a 500 error (there are a number of good services for this). Same goes for server monitoring - it’s better to get an email that your database server is down than to hear from an angry user who can’t access his data.
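The shape of that alerting hook is small enough to sketch. This is an illustrative wrapper, not any particular service’s API - `notify` stands in for whatever sends the email or Slack message:

```python
import traceback

def alert_on_error(handler, notify):
    """Wrap a request handler so that any unhandled exception fires an
    alert with the traceback (via `notify`, an assumed callable) and
    returns a generic 500 response instead of crashing the worker."""
    def wrapped(request):
        try:
            return handler(request)
        except Exception:
            notify("500 on %r\n%s" % (request, traceback.format_exc()))
            return (500, "Internal Server Error")
    return wrapped
```

In practice you would hand this job to an error-tracking service rather than roll your own, but the contract is the same: the first person to learn about a 500 should be you, not your users.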
A lot of good companies go further than just having a plan - they actually stage failures, and try to recover from them. I’ve heard that the Heroku team used to play a game where someone would take down pieces of their service in a creative way, and then other people would have to figure out how to recover. Apparently, the Obama campaign’s IT infrastructure stayed up through Hurricane Sandy because they had already staged the contingency where the eastern seaboard went down.
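You can play a toy version of that game in code. The sketch below is my own illustration of the idea, not how Heroku or the Obama campaign actually did it: a handler treats a dependency as optional and degrades gracefully, and a "game day" harness randomly knocks the dependency over and checks that no request blows up.

```python
import random

def fetch_recommendations(user_id, backend):
    """Recommendations are nice-to-have: if the backend is down,
    fall back to an empty list rather than failing the whole page."""
    try:
        return backend(user_id)
    except ConnectionError:
        return []  # degrade gracefully instead of erroring out

def game_day(handler, backend, failure_rate=0.5, seed=0):
    """Stage failures: call the handler repeatedly while the backend
    randomly 'goes down', and return every result so you can check
    that none of the requests blew up."""
    rng = random.Random(seed)
    def flaky(user_id):
        if rng.random() < failure_rate:
            raise ConnectionError("staged outage")
        return backend(user_id)
    return [handler(u, flaky) for u in range(100)]
```

The real exercise is organizational as much as technical - the value is in discovering which "optional" dependency turns out to be load-bearing before a hurricane does it for you.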
Planning For Your Business
Finally, there are the disasters that are specific to your company’s business. For example, what happens if your competitors send you a cease and desist? What if Google drops you from their index? What happens if your sole data supplier revokes your access or pulls your contract (hint - never single-source anything if you can help it)? What happens if a founder walks out one day (this happens all the time)? While you aren’t going to predict all of these, you can probably figure out the major classes of disasters, and take some steps to prepare for their possible (likely?) occurrence. For example, if the Internet goes down, I’m going to walk up and down Valencia Street passing out paper copies of my blog post to all the hipsters sitting in coffee shops.
In general, while you aren’t going to be prepared for everything, intelligent preparation goes a long way. Emphasize velocity above everything else, but have a mitigation plan for your most common disasters - it can save you a lot of stress later on. Because we’ve all been in the situation where we’re running around trying to fix an issue, and it’s always worse when users are screaming at us over email/Facebook/Twitter/phone and we have no idea how we’re going to get through this one.