In distributed systems, choosing the timeout after which a node is declared dead is a delicate trade-off:

  • The system runs with reduced availability until the failure is detected, so a long timeout prolongs the outage.
  • Under high load, however, healthy nodes may appear unavailable merely because they are slow to respond, so a short timeout risks false positives.
    • Cordoning and replacing nodes under these conditions shifts their load onto the remaining nodes, which could exacerbate the problem.
    • In the worst case, this could lead to a cascading failure.
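
The trade-off above can be illustrated with a minimal sketch of a heartbeat-based detector. Everything here is hypothetical and not taken from any particular system: the class name HeartbeatDetector, the timeout parameter, and especially the max_dead_fraction guard, which is one assumed way to avoid cordoning most of the cluster when many nodes go quiet at once (a pattern that may indicate overload rather than real failures).

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived for more than `timeout` seconds."""

    timeout: float
    # Assumption for illustration: refuse to cordon if more than this
    # fraction of known nodes is suspected at the same time.
    max_dead_fraction: float = 0.5
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> list[str]:
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

    def to_cordon(self, now: float) -> list[str]:
        """Suspected nodes, or nothing if too many are suspected at once."""
        suspects = self.suspected(now)
        if len(suspects) > self.max_dead_fraction * len(self.last_seen):
            # Many simultaneous timeouts look like strain, not failure:
            # cordoning now would push more load onto the survivors.
            return []
        return suspects


detector = HeartbeatDetector(timeout=5.0)
for name in ("a", "b", "c", "d"):
    detector.heartbeat(name, now=0.0)
for name in ("b", "c", "d"):
    detector.heartbeat(name, now=8.0)

# Only "a" has been silent for longer than the timeout.
single_failure = detector.to_cordon(now=10.0)
# Later, every node has gone quiet, so the guard holds back.
mass_timeout = detector.to_cordon(now=20.0)
```

A shorter timeout makes suspected() react faster to a real failure, but also makes it more likely that a merely slow node is flagged. The guard in to_cordon() is only one possible mitigation, and the threshold would need tuning for a real deployment.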