Designing DIA note 2 -- Hardware, software, human errors

1.2.1 Hardware Faults

Problem
In large data centers, hardware faults happen all the time (hard disk crashes, faulty RAM, power grid blackouts, wrongly connected network cables).
For example, hard disks are reported to have a mean time to failure (MTTF) of about 10 to 50 years, so on a storage cluster with 10,000 disks we should expect, on average, about one disk to die per day.
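
The one-disk-per-day figure follows from simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes failures are independent and spread evenly over the MTTF, and the function name `expected_failures_per_day` is made up for illustration.

```python
# Rough estimate of how many disks fail per day in a cluster.
# Assumption: failures are independent and evenly spread over the MTTF.
DAYS_PER_YEAR = 365


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected number of disk failures per day for a given MTTF."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


if __name__ == "__main__":
    for mttf in (10, 50):
        rate = expected_failures_per_day(10_000, mttf)
        print(f"MTTF {mttf} years: {rate:.2f} failures/day")
    # MTTF 10 years: 2.74 failures/day
    # MTTF 50 years: 0.55 failures/day
```

With 10,000 disks the estimate lands between roughly 0.5 and 2.7 failures per day, which is where "about one disk per day" comes from.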

Solution
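Add redundancy to individual hardware components (e.g., RAID for disks, dual power supplies, backup generators) to reduce the failure rate of the system. Until recently, this was sufficient for most applications.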

However,

  • Data volumes and applications’ computing demands have increased, so more applications run on larger numbers of machines, which proportionally increases the rate of hardware faults
  • On some cloud platforms (e.g., AWS), it’s common for VM instances to become unavailable without warning, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability

Hence, there’s a trend toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.
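
One simple form of software fault tolerance is to keep the same data on several machines and fail over when one of them is unreachable. The following is a minimal sketch of that idea, not a technique prescribed by the book: the replicas are modeled as plain Python callables, and `read_with_failover`, `dead_node`, and `healthy_node` are hypothetical names invented for this example.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in order and return the first successful result."""
    failures = 0
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            # Assumption: ConnectionError stands in for "machine unavailable".
            failures += 1
    raise AllReplicasFailed(f"{failures} replicas failed for key {key!r}")


def dead_node(key: str) -> str:
    """Hypothetical replica that has been lost."""
    raise ConnectionError("node unreachable")


def healthy_node(key: str) -> str:
    """Hypothetical replica that is still serving requests."""
    return f"value-of-{key}"


if __name__ == "__main__":
    # The first machine is gone, but the read still succeeds.
    print(read_with_failover([dead_node, healthy_node], "user:42"))
```

The caller never notices the lost machine as long as at least one replica answers, which is the property that lets a system tolerate the loss of entire machines.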

1.2.2 Software Errors

Problem
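Software errors are systematic errors within the system. They are harder to anticipate than hardware faults, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults. Examples: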

  • A bug that causes every instance of an app server to crash when given a bad input
  • A runaway process that uses up some shared resources – CPU time, memory, disk space or network bandwidth
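  • A service that the system depends on that slows down, becomes unresponsive, or returns corrupted responses
  • Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults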

Solution
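No quick solution, but many small things can help:

  • Think carefully about the assumptions and interactions in the system
  • Test thoroughly
  • Isolate processes
  • Allow processes to crash and restart
  • Measure, monitor, and analyze system behavior in production
  • Have the system constantly check its own guarantees while running and raise an alert on any discrepancy (e.g., in a message queue, the number of incoming messages should equal the number of outgoing messages)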
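
The constant-check idea at the end of the list above can be sketched with a message queue that verifies its own bookkeeping. This is an illustrative toy, assuming an in-memory queue; the class `CheckedQueue` and its method names are hypothetical and do not come from any real queueing library.

```python
from collections import deque


class CheckedQueue:
    """In-memory queue that counts messages in and out and checks itself."""

    def __init__(self) -> None:
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, msg) -> None:
        self._items.append(msg)
        self.enqueued += 1

    def get(self):
        msg = self._items.popleft()
        self.dequeued += 1
        return msg

    def check_invariant(self) -> bool:
        """Every message ever enqueued is either delivered or still waiting."""
        return self.enqueued == self.dequeued + len(self._items)


if __name__ == "__main__":
    q = CheckedQueue()
    for msg in ("a", "b", "c"):
        q.put(msg)
    q.get()
    print(q.check_invariant())  # True: 3 in, 1 out, 2 waiting

    q._items.pop()  # simulate a bug that silently drops a message
    print(q.check_invariant())  # False: a real system would raise an alert here
```

In a running system the check would be invoked periodically, and a `False` result would trigger an alert rather than a print.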

1.2.3 Human Errors

Problem
Humans design, build, and operate software systems, and even with the best intentions they are known to be unreliable.

Solution

  • Design systems in a way that minimizes opportunities for error
  • Decouple the places where people make the most mistakes from the places where they can cause failures (e.g., provide non-production sandbox environments where people can experiment safely)
  • Test thoroughly at all levels (automated testing)
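  • Allow quick and easy recovery from human errors to minimize their impact (e.g., fast rollback of configuration changes, gradual rollout of new code)
  • Set up detailed and clear monitoring (telemetry), such as performance metrics and error rates
  • Practice good management and training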
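
As an illustration of quick recovery from a human error, the sketch below keeps every applied configuration so that a bad change can be undone in one step. It is a simplified assumption of how fast rollback might look, not a real deployment tool; `VersionedConfig` and the `max_connections` setting are made-up names.

```python
class VersionedConfig:
    """Keeps every applied configuration so a bad change can be undone."""

    def __init__(self, initial: dict) -> None:
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        """Return a copy of the configuration currently in effect."""
        return dict(self._history[-1])

    def apply(self, changes: dict) -> None:
        """Record a new version with the given changes merged in."""
        self._history.append({**self._history[-1], **changes})

    def rollback(self) -> None:
        """Drop the latest version, unless it is the initial one."""
        if len(self._history) > 1:
            self._history.pop()


if __name__ == "__main__":
    cfg = VersionedConfig({"max_connections": 100})
    cfg.apply({"max_connections": 0})  # operator typo
    cfg.rollback()                     # one step back to the known-good state
    print(cfg.current)                 # {'max_connections': 100}
```

Because old versions are never overwritten, undoing a mistake is as cheap as making it.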

1.2.4 How important is Reliability?

Reliability is important for all kinds of applications, not only critical ones. Sometimes we may choose to sacrifice reliability to reduce development cost (e.g., for a prototype project), but we should be very conscious of when we are cutting corners.

Reference
Designing Data-Intensive Applications by Martin Kleppmann

    Original author: 星辰破
    Original URL: https://www.jianshu.com/p/8b4acc0d7667