Stephan Schmidt

Self-Healing Code for Startup CTOs and Solo Founders

How self-healing systems make you more productive


Solo founders and early startup CTOs have one thing in common: they lack time. Lots of it. Too many things going on everywhere.

One way to gain time is to implement a self-healing architecture. Self-healing architectures and patterns mean code that gets an application working again after it has stopped working. Self-healing code is different from defensive programming: instead of trying to prevent every failure up front, it recovers after something has gone wrong.

When code and systems heal themselves, there are fewer problems while scaling. When I was CTO we often had scaling problems where systems would fail, and because they weren’t self-healing, we had to restart the systems and then restart other systems which depended on the first ones. Self-healing systems prevent incidents. A self-healing system may recover from a usage spike or resource constraints where a normal system would lead to a site outage and an incident. Outages and crises happen when smaller problems overlap. When you stand at the beach, waves splash your knees. Then waves overlap and splash your face. Dealing with small problems is paramount so they don’t aggregate. Self-healing systems will do that for you.

Small Problems lead to large Problems

Outages and incidents are stressful and take time to fix, time the solo founder and early startup CTO don’t have. The time they gain from self-healing systems can then be applied to delivering customer value and gaining traction.

Patterns for Self-Healing

The prime example of a self-healing system is Erlang and its runtime. Erlang uses a supervisor hierarchy to detect faulty components and restart them. But even if you don’t use Erlang, there are ways to build self-healing systems in every framework and programming language.

Common patterns for self-healing:

  • Circuit breakers
  • Restart failing components
  • Failover for 3rd-party services
  • Degraded service, e.g. read-only mode
  • Timeouts

Circuit breakers

Circuit breakers are a pattern where a connection is stopped when there are too many errors. Suppose a database is overloaded and queries are no longer going through, with existing connections hanging. The circuit breaker trips (opens) and closes all connections to the database. Then it waits for some time before trying to connect again. With all open connections closed, the database can regain its health and heal. After the database has healed, the circuit breaker closes again and re-establishes connections and normal operation.
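A minimal sketch of the idea in Python, not tied to any specific library; the class name, the thresholds and the wrapped function in the usage note below are illustrative assumptions:

    import time

    class CircuitBreaker:
        """Stops calling a failing resource after too many errors,
        then allows a trial call again after a cooldown period."""

        def __init__(self, max_failures=5, reset_timeout=30.0):
            self.max_failures = max_failures    # errors before the breaker opens
            self.reset_timeout = reset_timeout  # seconds to wait before retrying
            self.failures = 0
            self.opened_at = None               # None means the breaker is closed

        def call(self, func, *args, **kwargs):
            now = time.monotonic()
            if self.opened_at is not None:
                if now - self.opened_at < self.reset_timeout:
                    # Still open: fail fast instead of hammering the resource.
                    raise RuntimeError("circuit open, not calling the resource")
                # Cooldown over: let one trial call through ("half-open").
            try:
                result = func(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.opened_at is not None or self.failures >= self.max_failures:
                    self.opened_at = now  # (re)open the breaker
                raise
            # Success: close the breaker and reset the error count.
            self.opened_at = None
            self.failures = 0
            return result

In practice you would wrap every database or HTTP call in something like breaker.call(run_query, ...), where run_query stands in for your own query function.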

Restart failing programs

Not every language is Erlang. But think about the concept: just let it crash. When using systemd on Linux, you can let it manage restarting your application and its readiness to accept connections. When the application gets into a state where it can’t read a file or queries to the database no longer work, just stop it. When it’s run under systemd, the daemon will simply start it again; your application re-reads its configuration and re-establishes connections to microservices and databases. If the environment is sound again, the application will work. systemd even supports socket activation. This means systemd owns the port, e.g. port 443 for HTTPS. While your executable is down, incoming connections queue up on the socket, and as soon as your application is ready to accept requests again, they are handed over to it. Customers are not impacted this way.
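A minimal sketch of the corresponding systemd units, assuming a hypothetical binary at /usr/local/bin/myapp; Restart=, RestartSec= and ListenStream= are standard systemd directives:

    # /etc/systemd/system/myapp.service
    [Unit]
    Description=My application

    [Service]
    ExecStart=/usr/local/bin/myapp
    # Restart whenever the process exits with an error, after a short wait.
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

    # /etc/systemd/system/myapp.socket
    [Socket]
    # systemd owns port 443; connections queue here while the app restarts.
    ListenStream=443

    [Install]
    WantedBy=sockets.target

Note that socket activation only works if the application accepts the listening socket handed over by systemd instead of opening the port itself.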

Add some randomness (jitter) to the restart timing to prevent a stampede of servers. If all servers restart at the same second and then try to regain resources and connections at the same time, that stampede will again kill the shared resource.
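A minimal sketch of jittered reconnects in Python, assuming a placeholder connect callable that raises ConnectionError while the shared resource is still down:

    import random
    import time

    def reconnect_with_jitter(connect, base_delay=1.0, max_delay=60.0):
        """Retry a connection with exponential backoff plus random jitter,
        so restarted servers don't all hit the shared resource at once."""
        delay = base_delay
        while True:
            try:
                return connect()
            except ConnectionError:
                # Sleep a random fraction of the current delay, then back off further.
                time.sleep(random.uniform(0, delay))
                delay = min(delay * 2, max_delay)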

Failover for 3rd-party services

Often third-party dependencies are a reliability liability. If your payment provider is down, or their API is, then your company is not earning money. Depending on the business model this is harmless or crucial. Having a second payment provider as a fallback enables you to continue doing business even though your primary provider is down. This might not be driven by the business side (they usually swap payment providers for better terms), but needs to be driven by the CTO as a business continuity / disaster recovery (BC/DR) issue.
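A minimal sketch of the failover logic in Python; the provider objects and their charge() method are hypothetical stand-ins for your payment integrations:

    def charge(order, providers):
        """Try each payment provider in order and fall back to the next one
        when a provider is down. `providers` is a list of client objects
        that expose a charge() method."""
        last_error = None
        for provider in providers:
            try:
                return provider.charge(order)
            except Exception as error:  # in practice: catch the provider's specific errors
                last_error = error      # remember the failure and try the next provider
        raise RuntimeError("all payment providers failed") from last_error

Called as charge(order, [primary_provider, backup_provider]), with both client names as placeholders for your actual integrations.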

Degraded features, e.g. read-only mode

If a service depends on other systems to provide its value to customers, and those dependencies have problems, it’s often better to degrade features gracefully instead of stopping the service. Hacker News has a read-only mode for maintenance. If you’re a marketplace, a read-only mode can still provide a lot of value to customers even when your main database is down. Or remove the button for a broken feature from the website instead of showing a button that leads to a non-working system.
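A minimal sketch of a read-only switch in Python; the request handling is reduced to a stub to show the idea:

    read_only = False  # flipped to True when the primary database is unhealthy

    def handle_request(method: str, path: str) -> tuple[int, str]:
        """Return (HTTP status, body). Writes are rejected while in read-only
        mode; reads are served from a replica or cache (a stub here)."""
        if read_only and method in ("POST", "PUT", "DELETE", "PATCH"):
            return 503, "We are in read-only mode, please try again in a few minutes."
        return 200, f"served {path} from a read replica"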

Timeouts

Everything that does IO (and even long computations) should have a timeout. A connection that hangs and doesn’t have a timeout blocks resources, and a surge of these hanging connections might bring down the service. So everything needs a timeout. When connections time out, they free their resources so other functionality keeps working.

Be careful: one connection often has different timeouts on different levels. Be sure to set and align them all: TCP/IP timeouts, HTTP connection timeouts, database connection timeouts.
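A minimal sketch in Python, assuming the requests library; it sets the connect and read timeouts separately, two of the levels mentioned above:

    import requests

    CONNECT_TIMEOUT = 3.0   # seconds to establish the TCP connection
    READ_TIMEOUT = 10.0     # seconds to wait for response data

    def fetch(url: str) -> str:
        # Without a timeout, a hanging connection blocks a worker forever.
        response = requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
        response.raise_for_status()
        return response.text

Database drivers and connection pools usually have their own connect and query timeout settings that need to be configured separately.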

Conclusion

Handholding systems, fixing outages and developing around scaling pains take time and detract from delivering customer value and getting traction. Self-healing systems result in less maintenance effort and therefore more time for developing features, which is crucial for solo entrepreneurs and early startup CTOs.
