Our Fetish with Failover and Redundancy
How failover makes things worse
There is this fetish in our industry with failover. Is 99.9+% availability important for planes, nuclear plants, and hospitals? Yes. Is it important for your startup that offers houses online? No. Our reflex as engineers is to constantly optimize availability - often without being in touch with what the business needs.
And while some failover is cheap - for applications and microservices - some failover is expensive, for example for databases and message queues.
In my experience, 5-minute outages have no impact on revenue as long as you’re small and they do not happen too often. If they do have an impact, the value proposition of your startup might not be good enough.
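To put numbers on this, here is a minimal sketch of the arithmetic behind availability targets. The function names and the example of one 5-minute outage per month are my own illustration, not measurements from a real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that an availability target still allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(outages_per_year: int, minutes_per_outage: float) -> float:
    """Availability in percent, given a number of equally long outages per year."""
    downtime = outages_per_year * minutes_per_outage
    return 100 * (1 - downtime / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows {budget:.1f} minutes of downtime per year")

    # Hypothetical example: one 5-minute outage every month.
    print(f"12 outages of 5 minutes: {availability_pct(12, 5):.3f}% available")
```

Under these assumptions, a 99.9% target leaves roughly 8.8 hours of downtime per year, and a 5-minute outage every month adds up to only one hour, which still sits above 99.9%.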
But even if we get failover working, e.g., for databases, it often creates more problems than it solves. At one tech company I worked at, the big-vendor DB cluster locked up because of problems in the interaction between its nodes. At another, running MySQL failover, we had many more false failovers than real ones. Chances are that the failover does not work correctly because of its complexity, or that it flickers between two masters.
With one of my clients, the database had a failover, but the second master didn’t take over because the failover hadn’t been tested for some months and wasn’t working correctly.
Lastly, in many scenarios where a database goes down, there is an external reason for it. This might be a misconfigured router, excessive traffic, or a bug in an application that sends too many writes. The database goes down, but the external reason doesn’t go away. So after the failover, the new master goes down too. In a cluster, when one server goes down due to excessive traffic, the remaining nodes will definitely not handle that traffic on their own and will die too.
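The cluster case can be sketched as a toy model. It assumes traffic is spread evenly across nodes and that a node dies as soon as its share exceeds its capacity. The function name and all numbers are hypothetical, and real clusters are messier than this.

```python
def surviving_nodes(nodes: int, capacity_per_node: float, traffic: float) -> int:
    """Return how many nodes are left once an overload cascade has settled.

    Toy model: traffic is spread evenly across the nodes. While each node's
    share exceeds its capacity, one more node dies and the survivors have to
    carry the same total traffic.
    """
    while nodes > 0 and traffic / nodes > capacity_per_node:
        nodes -= 1
    return nodes


if __name__ == "__main__":
    # Three nodes at 100 requests/s each, hit by 330 requests/s:
    # the whole cluster goes down, one node after the other.
    print(surviving_nodes(3, 100, 330))

    # 250 requests/s is fine for three nodes ...
    print(surviving_nodes(3, 100, 250))

    # ... but after losing a single node, the same traffic kills the rest.
    print(surviving_nodes(2, 100, 250))
```

In this model the result is always either all nodes or none: once one node is overloaded, every survivor gets a larger share of the same traffic, so the cause of the first failure takes down the others as well.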
For startups, it’s often more productive to invest in faster restarts and incident detection than in failover. If restarts were very fast, so that a failing database was down for only 1 second before coming back, many would have no need for failover at all.
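A rough way to see why detection and restart speed matter is to treat every outage as detection time plus restart time. The sketch below assumes exactly that and nothing more. The incident counts and timings are made-up illustrations, not data from a real system.

```python
def yearly_downtime_seconds(incidents_per_year: int,
                            detect_s: float,
                            restart_s: float) -> float:
    """Total downtime per year if each incident lasts detection + restart time."""
    return incidents_per_year * (detect_s + restart_s)


if __name__ == "__main__":
    # Hypothetical: 12 incidents a year, 1 minute to notice, 4 minutes to restart.
    slow = yearly_downtime_seconds(12, detect_s=60, restart_s=240)

    # Hypothetical: same incidents, noticed after 5 seconds, restarted in 1 second.
    fast = yearly_downtime_seconds(12, detect_s=5, restart_s=1)

    print(f"slow detection and restart: {slow / 60:.1f} minutes per year")
    print(f"fast detection and restart: {fast / 60:.1f} minutes per year")
```

With these assumed numbers, the same twelve incidents cost an hour per year in the slow case and a little over a minute in the fast one, without any failover in place.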
I’m not arguing against failover per se; I use pg_auto_failover myself. But for many startups, high availability should come much later and should not be engineering’s first reflex about what to work on. As a fetish, it’s not helpful.
As a CTO, Interim CTO, CTO Coach - and developer - Stephan has seen many technology departments in fast-growing startups. As a kid, he taught himself coding in a department store around 1981 because he wanted to write video games. Stephan studied computer science with distributed systems and artificial intelligence at the University of Ulm. He also studied philosophy. When the internet came to Germany in the 90s, he worked as the first coder in several startups. He has founded a VC-funded startup, worked in VC-funded, fast-growing startups with architecture, processes and growth challenges, worked as a manager for ImmoScout and as a CTO of an eBay Inc. company. After his wife successfully sold her startup, they moved to the sea and Stephan took up CTO coaching. You can find him on LinkedIn, on Mastodon or on Twitter @KingOfCoders