In distributed systems, the timeout after which a node is declared dead is a sensitive choice:
- If the timeout is long, detection is slow, and the system stays in a reduced-availability state until the failure is detected.
- If the timeout is short, a node under high load may appear unavailable merely because it is slow to respond.
- Cordoning and replacing nodes under these conditions could exacerbate the problem, since their load shifts onto the nodes that remain.
- In the worst case, cordoning under these conditions could lead to a cascading failure.
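
The trade-off in the list above can be made concrete with a minimal sketch of a fixed-timeout heartbeat check. The `Node` class, the `is_suspected_dead` function, and the timestamps are all hypothetical and chosen only for illustration; real failure detectors are usually more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # seconds, measured on the monitor's clock


def is_suspected_dead(node: Node, now: float, timeout: float) -> bool:
    """Return True if no heartbeat has arrived within `timeout` seconds."""
    return now - node.last_heartbeat > timeout


# Hypothetical scenario, observed at t = 10 s:
crashed = Node("crashed", last_heartbeat=0.0)        # really stopped at t = 0
overloaded = Node("overloaded", last_heartbeat=7.0)  # alive, but heartbeats are delayed by load

now = 10.0
for timeout in (2.0, 30.0):
    print(
        f"timeout={timeout}s",
        f"crashed suspected={is_suspected_dead(crashed, now, timeout)}",
        f"overloaded suspected={is_suspected_dead(overloaded, now, timeout)}",
    )
```

With the 2-second timeout the crashed node is caught, but the overloaded node is also flagged, which is the false positive that can trigger unnecessary cordoning. With the 30-second timeout the overloaded node is spared, but at t = 10 s the crashed node has still not been detected.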