Reliability
Reliability is the ability of a system to remain operational over time. It is measured as the probability that a system will not fail to perform its intended tasks over a specified time interval.
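As a rough illustration of the probability in this definition, the sketch below assumes a constant failure rate, which gives R(t) = exp(-t / MTBF). The exponential model, the function name, and the parameter names are assumptions made for this example; the text above does not prescribe any particular model.

```python
import math


def reliability(hours: float, mtbf_hours: float) -> float:
    """Probability of running for `hours` without a failure.

    Assumes a constant failure rate (exponential model), where
    `mtbf_hours` is the mean time between failures. Both the model
    and the names are illustrative assumptions.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return math.exp(-hours / mtbf_hours)
```

Under this assumption, the probability of surviving an interval falls as the interval grows relative to the mean time between failures.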
Common Causes:
- Not monitoring all relevant information and systems.
- Not processing pending queued requests after the system recovers.
- Lacking a failover plan.
Points to be considered:
- Identify ways to detect failures and automatically initiate a failover, or redirect load to a spare or backup system.
- Consider implementing code that uses alternative systems when it detects a specific number of failed requests to an existing system.
- Consider how you can take the system offline but still process pending queued requests.
- Implement store-and-forward or cached message-based communication systems that allow requests to be stored when the target system is unavailable, and replayed when it is back online.
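The points above about failing over after a number of failed requests, and about storing requests for later replay, can be sketched together as follows. This is a minimal single-process illustration, not a definitive implementation: the class name FailoverSender, its parameters, and the use of ConnectionError to signal an unavailable system are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Any, Callable, Deque


class FailoverSender:
    """Hypothetical helper: sends to a primary system, switches to a backup
    after `max_failures` consecutive primary failures, and stores requests
    that cannot be delivered so they can be replayed later."""

    def __init__(self, primary: Callable[[Any], None],
                 backup: Callable[[Any], None], max_failures: int = 3):
        self.primary = primary
        self.backup = backup
        self.max_failures = max_failures
        self.failures = 0                    # consecutive primary failures
        self.pending: Deque[Any] = deque()   # stored requests, oldest first

    def _deliver(self, request: Any) -> bool:
        # Once the failure threshold is reached, route to the backup.
        # A fuller design would also probe the primary and fail back.
        use_backup = self.failures >= self.max_failures
        target = self.backup if use_backup else self.primary
        try:
            target(request)
        except ConnectionError:
            if not use_backup:
                self.failures += 1
            return False
        if not use_backup:
            self.failures = 0  # a primary success resets the count
        return True

    def send(self, request: Any) -> bool:
        """Return True if delivered now, False if stored for later replay."""
        if self._deliver(request):
            return True
        self.pending.append(request)
        return False

    def replay(self) -> int:
        """Resend stored requests in order; stop at the first failure.
        Returns the number of requests delivered."""
        delivered = 0
        while self.pending and self._deliver(self.pending[0]):
            self.pending.popleft()
            delivered += 1
        return delivered
```

A caller would pass two callables that raise ConnectionError when their system is unavailable, call send() for each request, and call replay() once a target is reachable again. Note that in this sketch stored requests can be delivered after newer ones, so a system that depends on strict ordering would need additional handling, and a durable store would be needed for requests to survive a restart.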