Our fetish with failover | Amazing CTO

Stephan Schmidt

Amazing CTO


There is this fetish in our industry with failover. Is 99.9+% availability important for planes, nuclear plants and hospitals? Yes. Is it important for your startup that sells houses online? No. Our reflex as engineers is to keep optimizing availability, often on our own and out of touch with the business.

And while some failover is cheap (applications and microservices), some failover is expensive (databases and message queues).

In my experience, 5-minute outages have no impact on revenue if they do not happen too often. If they do have an impact, the value proposition of your startup might not be good enough.
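To put numbers on this, here is a minimal back-of-the-envelope sketch. It assumes a 365-day year and one 5-minute outage per month as an illustrative frequency; the function names are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be down at a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def achieved_availability(outages_per_year: int, minutes_per_outage: float) -> float:
    """Availability in percent, given how often and how long the service is down."""
    downtime = outages_per_year * minutes_per_outage
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.9)
    print(f"99.9% allows about {budget:.1f} minutes of downtime per year")
    print(f"That is roughly {budget / 5:.0f} outages of 5 minutes each")

    # Assumption for illustration: one 5-minute outage every month.
    monthly = achieved_availability(outages_per_year=12, minutes_per_outage=5)
    print(f"One 5-minute outage per month still gives {monthly:.3f}% availability")
```

Under these assumptions, a 5-minute outage every month uses only a fraction of the 99.9% downtime budget, and no failover is involved at all.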

But even if we get failover working, e.g. with databases, it often creates more problems than it solves. When I was working at one tech company, the Oracle cluster locked up because of problems in the interaction between nodes. When I was working at another tech company with MySQL failover, we had many more false failovers than real ones. Chances are that the failover does not work correctly because of its complexity, or that it flickers between two masters.

At one of my clients, the database had failover set up, but the second master didn’t take over because the failover hadn’t been tested for some months and wasn’t working correctly.

Lastly, in many scenarios where a database goes down, there is an external reason for it. This might be a misconfigured router, excessive traffic or a bug in an application that sends too many writes. The database goes down, but the external reason doesn’t go away. So after the failover the new master goes down too. In a cluster, when one server fails because of excessive traffic, the remaining nodes will certainly not handle the traffic on their own and will die too.
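The cluster case can be sketched as a toy model. It assumes that load is spread evenly across the surviving nodes and that a node dies as soon as its share exceeds its capacity; real clusters are messier, and the numbers below are invented for illustration.

```python
def surviving_nodes(total_load: float, node_capacity: float, nodes: int) -> int:
    """Return how many nodes are left once overloaded nodes have stopped dying.

    Toy model: load is shared evenly, and a node fails whenever its
    share of the load exceeds its capacity.
    """
    while nodes > 0 and total_load / nodes > node_capacity:
        nodes -= 1  # one overloaded node dies, the rest inherit its share
    return nodes


if __name__ == "__main__":
    # Three nodes at roughly 83% utilization each: the cluster is stable.
    print(surviving_nodes(total_load=250, node_capacity=100, nodes=3))

    # The same traffic after a single node is lost: 125 per node, so the
    # next node dies, and then the last one.
    print(surviving_nodes(total_load=250, node_capacity=100, nodes=2))

    # Excessive traffic from the start: every node is overloaded.
    print(surviving_nodes(total_load=330, node_capacity=100, nodes=3))
```

In this model the traffic never shrinks, so every node lost makes the situation worse for the ones that remain. Failing over to another node does not remove the cause.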

For startups it’s often more productive to invest in faster restarts and incident detection than in failover. If restarts were very fast and a database went down for only 1 second before coming back, many would have no need for failover at all.

I’m not arguing against failover per se; I use pg_auto_failover myself. But for many startups high availability should come much later and not be engineering’s first reflex when deciding what to work on. As a fetish it’s not helpful.