Reliability

Reliability is the ability of a system to remain operational over time. It is measured as the probability that the system performs its intended functions without failure over a specified time interval.
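
As a rough illustration of that probability, the sketch below assumes a constant failure rate (the exponential model), which the text above does not prescribe. Under that assumption, the probability of no failure over an interval is exp(-failure_rate * interval). The function name and the figures used (a mean time between failures of 1,000 hours, an interval of 100 hours) are hypothetical and chosen only for the example.

```python
import math


def reliability(failure_rate: float, interval: float) -> float:
    """Probability of no failure during `interval`.

    Assumes a constant failure rate (exponential model); real systems
    may not follow this model.
    """
    if failure_rate < 0 or interval < 0:
        raise ValueError("failure_rate and interval must be non-negative")
    return math.exp(-failure_rate * interval)


# Hypothetical figures: one failure per 1,000 hours on average,
# evaluated over a 100-hour interval.
print(round(reliability(1 / 1000, 100), 3))  # 0.905
```

The longer the interval or the higher the failure rate, the lower the resulting probability, which matches the definition of reliability as something measured over a specified period.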

Common causes of poor reliability:

  • Failing to monitor all systems and the information they report.
  • Failing to process pending queued requests after the system recovers.
  • Lacking a failover plan.

Points to be considered:

  • Identify ways to detect failures and automatically initiate a failover, or redirect load to a spare or backup system.
  • Consider implementing code that uses alternative systems when it detects a specific number of failed requests to an existing system.
  • Consider how you can take the system offline but still process pending queued requests.
  • Implement store-and-forward or cached message-based communication systems that allow requests to be stored when the target system is unavailable, and replayed when it comes back online.
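
Two of the points above, switching to an alternative system after a set number of failed requests and storing requests for later replay, can be sketched together. This is a minimal illustration and not a production design. The class ResilientSender, its method names, the max_failures threshold, and the use of ConnectionError to signal an unavailable system are all assumptions made for the example. The primary and backup systems are represented as plain callables.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

# Hypothetical type: a "system" is any callable that takes a request
# and returns a response, raising ConnectionError when unavailable.
System = Callable[[str], str]


class ResilientSender:
    """Fails over to a backup after repeated primary failures and
    stores failed requests so they can be replayed later."""

    def __init__(self, primary: System, backup: System, max_failures: int = 3):
        self.primary = primary
        self.backup = backup
        self.max_failures = max_failures
        self.failures = 0                    # consecutive primary failures
        self.pending: Deque[str] = deque()   # store-and-forward buffer

    def _target(self) -> System:
        # Use the backup once the failure threshold has been reached.
        if self.failures >= self.max_failures:
            return self.backup
        return self.primary

    def send(self, request: str) -> Optional[str]:
        """Send a request; on failure, store it and return None."""
        target = self._target()
        try:
            response = target(request)
        except ConnectionError:
            if target is self.primary:
                self.failures += 1
            self.pending.append(request)     # keep for later replay
            return None
        if target is self.primary:
            self.failures = 0                # a success resets the count
        return response

    def replay(self) -> List[str]:
        """Replay stored requests in order, stopping at the first failure.

        Requests that still fail remain in the buffer. Failures during
        replay are not counted toward the failover threshold.
        """
        responses: List[str] = []
        while self.pending:
            request = self.pending[0]
            try:
                responses.append(self._target()(request))
            except ConnectionError:
                break
            self.pending.popleft()           # remove only after success
        return responses

    def primary_recovered(self) -> None:
        """Route new requests to the primary again."""
        self.failures = 0
```

In this sketch a caller would invoke send() for each request, call replay() once a target is reachable again, and call primary_recovered() after a monitoring check confirms that the primary is healthy. Keeping the buffer in memory is a simplification. A real store-and-forward system would normally persist pending requests so that they survive a restart.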