System failure is inevitable so design for a fast recovery

1st April 2012
Simon Baker

So you’re trying to build a reliable website or application. What do you need to think about?

People often focus their efforts on improving the availability of systems. Architects design more complex systems, sysadmins buy and build more equipment, and developers build more complicated code – all in the name of eliminating single points of failure and improving the Mean Time Between Failures (MTBF).

The success of this approach varies from one application or company to another. It will, however, always result in more complex systems, which are not only more expensive to manage but also inherently introduce new mechanisms for failure.

Attempts at increasing availability don’t actually eliminate failures from the system. Failures are inevitable; they just manifest in new and different forms.

Just ask the likes of Google, Facebook, Amazon, Twitter, Blackberry, Microsoft, or actually any of the big successful online companies. Despite having the smartest engineers and enormous budgets, they have all experienced catastrophic failures resulting in hours or even days of downtime in recent years.

So with the realization that failure is inevitable, what should you do?

Think more about recovery. Ask yourself:

  • How much downtime is acceptable?
  • How much data loss is acceptable?

The answers to these questions determine your Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) respectively. Unlike availability terms like the dreaded “five nines”, RTO and RPO figures can actually help you make technical decisions on meeting objectives.
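To make that concrete, here is a minimal sketch in Python of how RTO and RPO figures can be checked against a design. The function names, the split of recovery into detect, restore and verify phases, and the idea that worst-case data loss is about one backup interval are assumptions for illustration, not something from the talk.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Assumes a failure just before the next backup loses one full interval of data."""
    return backup_interval <= rpo


def meets_rto(detect: timedelta, restore: timedelta, verify: timedelta,
              rto: timedelta) -> bool:
    """Assumes recovery time is the sum of detecting, restoring and verifying."""
    return detect + restore + verify <= rto
```

With a four-hour RPO, for example, nightly backups fail the first check and hourly backups pass it, which is exactly the kind of decision a "five nines" target cannot make for you.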

Ask further questions like:

  • How often do I need to back up data?
  • Is my data stored on independent systems?
  • How does my application behave when dependent components go down?
  • Can I recover from hardware failures as well as corrupted data failures?
  • How can I automate the recovery of a component in the system?
  • Can I recover from a catastrophic failure? And how fast can I do this?
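One way to approach the question about dependent components going down is to decide up front what the application does instead. The sketch below is one possible shape for that in Python: retry the dependency a few times, then fall back to a degraded answer such as cached data. `call_with_fallback` and its parameters are hypothetical names chosen for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T],
                       attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Try the primary dependency a few times, then degrade to the fallback."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Pause between attempts, but not after the final failed one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    # Every attempt failed: return the degraded result instead of crashing.
    return fallback()
```

A real system would probably catch narrower exception types and log each failure, but even this much forces you to answer what "degraded" means for each dependency.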

Improving your ability to recover from failures, and in particular reducing your Mean Time To Recover (MTTR), is an important but often overlooked part of building applications.
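A quick way to see why MTTR matters is the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below applies it in Python; the hour figures are made-up numbers for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures only: a failure every 1000 hours, 10 hours to recover.
baseline = availability(1000, 10)
doubled_mtbf = availability(2000, 10)  # fail half as often
halved_mttr = availability(1000, 5)    # recover twice as fast
```

Here `doubled_mtbf` and `halved_mttr` come out the same, since 2000/2010 and 1000/1005 are the same fraction. Under this model, recovering twice as fast buys as much availability as failing half as often.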

With limited time and money, should you focus on improving availability? Or focus on being able to successfully recover from failures?

Well, they’re not mutually exclusive, so it’s a matter of finding the right balance between the two. Each application and business is different, so different factors come into play. It’s just all too common to see people and companies neglect recovery.

From the Energized Work TekTalk on 21 March 2012 | #ewtektalk
