1.2.1 Hardware Faults
Problem
In large data centers, hardware faults happen all the time: hard disks crash, RAM becomes faulty, the power grid has a blackout, someone unplugs the wrong network cable.
For example, hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years => in a cluster of 10,000 disks, expect on average about 1 disk to die per day
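The "one disk per day" figure can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes failures are independent and occur at a constant rate, and the function name is made up for illustration.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent failures at a constant rate."""
    days_per_year = 365
    return num_disks / (mttf_years * days_per_year)


# The two ends of the reported MTTF range, for a cluster of 10,000 disks.
low = expected_failures_per_day(10_000, 50)   # optimistic MTTF of 50 years
high = expected_failures_per_day(10_000, 10)  # pessimistic MTTF of 10 years
print(f"{low:.2f} to {high:.2f} failures per day")  # 0.55 to 2.74 failures per day
```

So "about one disk per day" sits in the middle of the range.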
Solution
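Add redundancy to individual hardware components (e.g. RAID disks, dual power supplies, backup generators) to reduce the failure rate of the system (sufficient for most cases until recently).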
However,
- data volumes and applications’ computing demands have increased => applications run on more machines => the rate of hardware faults increases proportionally
- on some cloud platforms (e.g. AWS), it’s common for VM instances to become unavailable without warning, as the platforms are designed to prioritize flexibility and elasticity over single-machine reliability
Hence, there’s a trend toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.
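One simple form of software fault tolerance is failing over to another replica when a machine is lost. A minimal sketch, assuming each replica is just a callable; `ReplicaUnavailable`, `read_with_failover` and the two example replicas are hypothetical names, not from the book.

```python
class ReplicaUnavailable(Exception):
    """Raised by a replica that cannot serve the request (hypothetical)."""


def read_with_failover(replicas, key):
    """Try each replica in turn and return the first successful answer."""
    errors = []
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaUnavailable as exc:
            errors.append(exc)  # remember the failure and try the next replica
    raise ReplicaUnavailable(f"all {len(replicas)} replicas failed: {errors}")


def dead_replica(key):
    # Simulates a machine that disappeared without warning.
    raise ReplicaUnavailable("machine lost")


def healthy_replica(key):
    return f"value-for-{key}"


result = read_with_failover([dead_replica, healthy_replica], "user:42")
print(result)  # value-for-user:42
```

The caller never notices the lost machine as long as at least one replica is still healthy.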
1.2.2 Software Errors
Problem
Systematic errors within the system. They are hard to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults. Examples:
- A bug that causes every instance of an app server to crash when given a bad input
- A runaway process that uses up some shared resources – CPU time, memory, disk space or network bandwidth
- A service that the system depends on that slows down, becomes unresponsive or returns corrupted responses
- Cascading failures, where a small fault in one component triggers faults in other components
Solution
No quick solution, but small things can help:
- careful design: think through assumptions and interactions in the system
- thorough testing
- process isolation
- allowing processes to crash and restart
- monitoring
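- constant self-checks: if the system is expected to provide a guarantee (e.g. in a message queue, the number of incoming messages equals the number of outgoing messages), it can check this while running and raise an alert on a discrepancy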
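As an illustration of a system that constantly checks itself, here is a small sketch of an in-memory queue that counts messages in and out and can verify that none were lost. `CheckedQueue` and its method names are made up for this example; a real system would run the check periodically and raise an alert instead of printing.

```python
from collections import deque


class CheckedQueue:
    """In-memory queue that tracks counters so it can verify its own guarantee."""

    def __init__(self):
        self._items = deque()
        self.enqueued = 0
        self.dequeued = 0

    def put(self, item):
        self._items.append(item)
        self.enqueued += 1

    def get(self):
        item = self._items.popleft()
        self.dequeued += 1
        return item

    def check_invariant(self) -> bool:
        """True if every message put in is either still queued or was taken out."""
        return self.enqueued == self.dequeued + len(self._items)


queue = CheckedQueue()
queue.put("a")
queue.put("b")
queue.get()
print(queue.check_invariant())  # True

# Simulate a bug that silently drops messages: the self-check now fails.
queue._items.clear()
print(queue.check_invariant())  # False
```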
1.2.3 Human Errors
Problem
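Humans are known to be unreliable, even with the best intentions. One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults played a role in only 10-25% of outages.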
Solution
- Design systems in a way that minimizes opportunities for error
- Decouple the places where people make the most mistakes from the places where they can cause failures (e.g. non-production sandbox environments)
- Test thoroughly at all levels (automated testing)
- Allow quick and easy recovery from human errors to minimize the impact (e.g. fast rollback of configuration changes, gradual rollout of new code)
- Detailed and clear monitoring (telemetry)
- Good management practices and training
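Fast rollback can be as simple as keeping every previous version of a configuration around. A minimal sketch, assuming the configuration fits in a dictionary; `VersionedConfig` and the `timeout_s` setting are hypothetical.

```python
class VersionedConfig:
    """Keeps the full history of a configuration so a bad change can be undone."""

    def __init__(self, initial: dict):
        self._history = [dict(initial)]

    @property
    def current(self) -> dict:
        # Return a copy so callers cannot modify the history by accident.
        return dict(self._history[-1])

    def apply(self, changes: dict) -> None:
        """Record a new version with the given changes applied."""
        new_version = self.current
        new_version.update(changes)
        self._history.append(new_version)

    def rollback(self) -> None:
        """Drop the latest version; the initial version is never dropped."""
        if len(self._history) > 1:
            self._history.pop()


config = VersionedConfig({"timeout_s": 30})
config.apply({"timeout_s": 0})  # an operator mistypes the timeout
config.rollback()               # one step restores the previous version
print(config.current)           # {'timeout_s': 30}
```

Because the old version is never overwritten, recovering from a mistake does not depend on anyone remembering what the previous value was.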
1.2.4 How important is Reliability?
Reliability is important for all kinds of applications, not only critical ones. Sometimes we may choose to sacrifice reliability to reduce development cost (e.g. a prototype project), but we should be very conscious of when we are cutting corners.
Reference
Designing Data-Intensive Applications by Martin Kleppmann