Tuesday, February 11, 2020

Designing Systems - Reliability


  A reliable system continues to work (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).


Fault - A component of the system deviates from its spec.

Failure - The system as a whole stops providing the required service to the user. The goal of fault-tolerant design is to keep faults from turning into failures.

Many critical bugs are due to poor error handling.

Hardware Faults

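Examples are hard disks crashing, RAM becoming faulty, and the network going down. Hard disks have a mean time to failure (MTTF) of about 10 to 50 years, so a cluster with 10,000 disks should expect on average about one disk failure per day. The usual response is to add redundancy to individual hardware components.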
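The "one per day" figure can be checked with a little arithmetic. The sketch below assumes disks fail independently at a constant rate, so the expected number of failures per day is the number of disks divided by the MTTF in days. The function name is made up for illustration.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average disk failures per day, assuming independent failures at a constant rate."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# Both ends of the 10 to 50 year MTTF range, for a 10,000-disk cluster.
for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: about {rate:.2f} failures per day")
```

The two ends of the range come out to roughly 2.7 and 0.5 failures per day, which brackets the "about one per day" rule of thumb.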

Software Faults

Examples are bugs, runaway processes, and software that starts returning corrupted responses.

There is no quick fix for software faults. Things that help: thinking carefully about assumptions and interactions in the system, thorough testing, process isolation, and measuring, monitoring, and analyzing system behavior in production.
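Process isolation is the easiest of these to show in a few lines. This is a minimal sketch, not a production design: it runs a task in a child process with a time limit, so a crash or a runaway loop in the task cannot take the parent down with it. The helper name run_isolated and its return values are assumptions made for this example.

```python
import subprocess
import sys


def run_isolated(code: str, timeout_s: float) -> str:
    """Run Python code in a child process; return 'ok', 'failed', or 'timeout'."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    print(run_isolated("print('hello')", timeout_s=10.0))
    print(run_isolated("while True: pass", timeout_s=1.0))
```

The parent only sees a status string, so it can log the outcome and carry on whether the child succeeded, crashed, or had to be killed.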

Final thoughts

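  • Minimize opportunities for error with well-designed abstractions, APIs, and admin interfaces.
  • Provide sandbox environments for safe exploration.
  • Test thoroughly at all levels, from unit tests to integration tests to manual tests.
  • Build rollout and rollback systems so changes can be released gradually and reverted quickly.
  • Set up telemetry (detailed monitoring of metrics and error rates).
  • Invest in good management practices and training.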
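
The rollout, rollback, and telemetry points fit together, and a small sketch may make that concrete. Everything here is hypothetical: the stage percentages, the error-rate threshold, and the error_rate_at callback (a stand-in for a real telemetry query) are assumptions for illustration only.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Widen exposure stage by stage; roll back to 0% at the first unhealthy stage.

    Returns (outcome, final_percent_of_traffic).
    """
    current = 0
    for percent in stages:
        current = percent
        # Hypothetical telemetry check: observed error rate at this exposure level.
        if error_rate_at(percent) > max_error_rate:
            return ("rolled_back", 0)
    return ("completed", current)


if __name__ == "__main__":
    # Fake telemetry: the new version starts misbehaving at 25% of traffic.
    def fake_error_rate(percent: int) -> float:
        return 0.001 if percent < 25 else 0.08

    print(staged_rollout([1, 5, 25, 100], fake_error_rate, max_error_rate=0.01))
```

The point of the design is that a bad release only ever reaches a small fraction of users before the telemetry check stops it.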