Stephan Schmidt

Our Fetish with Failover and Redundancy

How failover makes things worse

There is this fetish in our industry with failover. Is 99.9+% availability important for planes, nuclear plants, and hospitals? Yes. Is it important for your startup that offers houses online? No. Our reflex as engineers is to constantly optimize availability, often without being in touch with the business.
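To put a number on what such a target allows, the downtime budget is simple arithmetic. A minimal sketch in Python; the function name and the five-minute outage length used for comparison are only for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year that an availability target still allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_minutes_per_year(target)
        # Express the budget as a count of five-minute outages for comparison.
        print(f"{target}% -> {budget:.0f} min/year, "
              f"about {budget / 5:.0f} five-minute outages")
```

At 99.9% the budget is roughly 526 minutes a year, which is about a hundred five-minute outages.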

And while some failover is cheap, as with applications and microservices, some failover is expensive, as with databases and message queues.

From my experience, 5-minute outages have no impact on revenue if you’re small and they do not happen too often. If they do have an impact, the value proposition of your startup might not be good enough.

But even if we get failover working, e.g., with databases, it often creates more problems than it solves. When I worked at one tech company, the big-vendor DB cluster locked up because of problems in the interaction between nodes. At another tech company, running MySQL failover, we had many more false failovers than real ones. Chances are that the failover does not work correctly because of its complexity, or that it flaps between two masters.

With one of my clients, the database had failover set up, but the second master didn’t take over because the failover hadn’t been tested for some months and wasn’t working correctly.

Lastly, in many scenarios where a database goes down, there is an external reason for it. This might be a misconfigured router, excessive traffic, or a bug in an application that sends too many writes. The database goes down, but the external reason doesn’t go away. So after the failover, the new master goes down too. The same holds in a cluster: when one server fails due to excessive traffic, the remaining nodes have to absorb its share, cannot handle the traffic either, and die too.
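The cascade can be shown with a toy model. This is a sketch under loud assumptions, not a model of any real database: load is spread evenly across nodes, a node dies the moment its share exceeds its capacity, and the traffic itself never drops. The function name and numbers are made up for illustration.

```python
def surviving_nodes(nodes: int, capacity_per_node: float, total_load: float) -> int:
    """Return how many nodes are left once overloaded nodes have dropped out.

    Assumes load is spread evenly and that the load of a dead node
    moves to the remaining ones unchanged.
    """
    while nodes > 0 and total_load / nodes > capacity_per_node:
        nodes -= 1  # one overloaded node dies, the rest inherit its traffic
    return nodes


if __name__ == "__main__":
    # Three nodes that can take 100 units each, under normal and excessive load.
    print(surviving_nodes(3, 100, 250))
    print(surviving_nodes(3, 100, 310))
    # A standby taking over from a master that died under the same load.
    print(surviving_nodes(1, 100, 310))
```

With 250 units of load all three nodes survive. With 310 units the first node fails, the remaining two get 155 each, and the cluster ends up with zero nodes. The standby in the last line fares no better, because the cause of the outage is still there.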

For startups, it’s often more productive to invest in faster restarts and incident detection than in failover. If restarts were very fast, and a failed database were down for only 1 second before it was back up, many would have no need for failover at all.
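The effect of restart time can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The failure rate of once every 30 days and the three recovery times below are assumptions chosen for illustration, not measurements:

```python
def availability(mtbf_seconds: float, mttr_seconds: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_seconds / (mtbf_seconds + mttr_seconds)


if __name__ == "__main__":
    mtbf = 30 * 24 * 3600  # assumed: one database failure every 30 days
    recovery_times = {
        "1 second restart": 1,
        "5 minute recovery": 5 * 60,
        "1 hour recovery": 3600,
    }
    for label, mttr in recovery_times.items():
        print(f"{label}: {availability(mtbf, mttr) * 100:.5f}% available")
```

Under these assumptions a one-second restart stays above 99.9999%, while a five-minute recovery lands just under 99.99%, all without any failover machinery.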

I’m not arguing against failover per se; I use pg_auto_failover myself. But for many startups, high availability should come much later and not be engineering’s first reflex when deciding what to work on. As a fetish, it’s not helpful.