Outlines computer system reliability, including series, parallel components, and N-modular redundancy.
Concepts
Faults in a computer system are usually caused by the aging and degradation of its instruments and components. Tracking component lifespans and the resulting failure statistics shows that every component in a system passes through three stages: a start-up stage, a normal working stage, and an aging stage.
In the start-up stage, a component may not yet operate stably, so its failure rate is relatively high. In the second stage it operates normally and stably, with a lower failure rate. In the third stage, aging causes the failure rate to rise again and system reliability decreases. This pattern is known as the bathtub curve. Components should therefore be kept in the second stage for as long as possible.
The reliability of a computer system is the probability that it operates normally from the moment it starts, at time $t = 0$, until a specified time $t$. This probability is written $R(t)$. The failure rate of a computer system is the fraction of components that fail per unit time, relative to the total number of components. It is denoted by $\lambda$. When $\lambda$ is constant, the relationship between reliability and failure rate can be expressed as
$$R(t) = e^{-\lambda t}$$
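As a minimal sketch of the constant-failure-rate model, the following Python function evaluates $R(t) = e^{-\lambda t}$. The function name `reliability` and the sample figures (a failure rate of 1e-4 per hour over 1000 hours) are illustrative assumptions, not values from the source.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability of operating normally from time 0 to time t,
    assuming a constant failure rate (exponential model)."""
    if failure_rate < 0 or t < 0:
        raise ValueError("failure_rate and t must be non-negative")
    return math.exp(-failure_rate * t)


# Hypothetical component: 1e-4 failures per hour, observed for 1000 hours.
r = reliability(1e-4, 1000.0)
print(f"R(1000 h) = {r:.4f}")  # prints R(1000 h) = 0.9048
```

The failure rate and the time must use the same unit (here, hours), since only their product enters the formula.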
In computer systems, the time during which the system operates normally between two consecutive failures is termed the system's uptime. The average of all such uptime durations is referred to as the Mean Time Between Failures (MTBF). Its relationship with the failure rate $\lambda$ is expressed as:
$$\mathrm{MTBF} = \frac{1}{\lambda}$$
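One way to see where the relationship between MTBF and a constant failure rate $\lambda$ comes from, assuming the exponential model $R(t) = e^{-\lambda t}$, is to take the mean uptime as the integral of the reliability function over all time:
$$\mathrm{MTBF} = \int_0^{\infty} R(t)\,dt = \int_0^{\infty} e^{-\lambda t}\,dt = \left[-\frac{1}{\lambda} e^{-\lambda t}\right]_0^{\infty} = \frac{1}{\lambda}$$
For example, a hypothetical failure rate of $\lambda = 10^{-4}$ per hour corresponds to an MTBF of $10^{4}$ hours.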
To express the maintainability of a computer, we commonly use the Mean Time to Repair (MTTR), which reflects the efficiency of maintenance. It is the average duration from the occurrence of a failure to the restoration of operation. In computer systems, we also use "availability" to denote the system's operational efficiency: the probability that the computer system is functioning normally at any given time. Denoting availability by $A$, it can be expressed as:
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$$
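The two quantities can be combined in a few lines of Python. This is a sketch under the constant-failure-rate assumption (MTBF as the reciprocal of the failure rate, availability as MTBF divided by MTBF plus MTTR). The function names and the sample numbers are hypothetical.

```python
def mtbf(failure_rate: float) -> float:
    """Mean Time Between Failures for a constant failure rate."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


def availability(mtbf_value: float, mttr_value: float) -> float:
    """Fraction of time the system is expected to be operational."""
    if mtbf_value <= 0 or mttr_value < 0:
        raise ValueError("mtbf_value must be positive, mttr_value non-negative")
    return mtbf_value / (mtbf_value + mttr_value)


# Hypothetical system: 1e-4 failures per hour, 10 hours to repair on average.
m = mtbf(1e-4)
a = availability(m, 10.0)
print(f"MTBF = {m:.0f} h, availability = {a:.4f}")
```

With these figures the availability is 10000 / 10010, which is roughly 99.9%. Shortening the repair time raises availability even when the failure rate is unchanged.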
The reliability measurement of a computer typically relies on the aforementioned concepts.
Reliability of series components
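In a series system, $N$ components are connected end to end, and the whole system operates normally only when every component functions properly. If the reliabilities of these components are $R_1, R_2, \ldots, R_N$ and their failure rates are $\lambda_1, \lambda_2, \ldots, \lambda_N$, then:
The overall reliability of the system is:
$$R = R_1 \times R_2 \times \cdots \times R_N = \prod_{i=1}^{N} R_i$$
The overall failure rate of the system is:
$$\lambda = \lambda_1 + \lambda_2 + \cdots + \lambda_N = \sum_{i=1}^{N} \lambda_i$$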
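The series rules (multiply the component reliabilities, add the component failure rates) can be sketched in Python as follows. The function names and the sample values are illustrative assumptions.

```python
import math
from typing import Sequence


def series_reliability(reliabilities: Sequence[float]) -> float:
    """Reliability of a series system: every component must work."""
    return math.prod(reliabilities)


def series_failure_rate(failure_rates: Sequence[float]) -> float:
    """Failure rate of a series system: component rates add up."""
    return sum(failure_rates)


# Three hypothetical components, each with reliability 0.9.
print(f"{series_reliability([0.9, 0.9, 0.9]):.3f}")  # prints 0.729

# The two rules agree under the exponential model: multiplying
# exp(-rate_i * t) over all components equals exp(-(sum of rates) * t).
rates = [1e-4, 2e-4, 3e-4]
t = 1000.0
product = series_reliability([math.exp(-lam * t) for lam in rates])
combined = math.exp(-series_failure_rate(rates) * t)
print(math.isclose(product, combined))  # prints True
```

The example shows why long series chains are fragile: three components at 0.9 each already bring the system down to about 0.73.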
Reliability of parallel components
In a parallel system, $N$ components are connected in parallel, and the whole system operates normally as long as at least one component functions properly. Only one component is strictly necessary; the remaining components are redundant. The more redundant components there are, the higher the system reliability. Assuming the reliabilities of these components are $R_1, R_2, \ldots, R_N$:
The overall reliability of the system is calculated as:
$$R = 1 - (1 - R_1)(1 - R_2) \cdots (1 - R_N) = 1 - \prod_{i=1}^{N} (1 - R_i)$$
Assuming every component in the system has the same failure rate $\lambda$, the overall failure rate of the system, taken as the reciprocal of the system MTBF, is:
$$\mu = \frac{1}{\dfrac{1}{\lambda} \sum_{j=1}^{N} \dfrac{1}{j}}$$
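A short Python sketch of the parallel case follows. It assumes independent components, and it treats the overall failure rate as an effective value, defined as the reciprocal of the system MTBF, where the MTBF of N identical parallel components with failure rate λ is (1/λ)(1 + 1/2 + ... + 1/N). The true hazard rate of a parallel system is not constant over time. Function names and sample values are hypothetical.

```python
import math
from typing import Sequence


def parallel_reliability(reliabilities: Sequence[float]) -> float:
    """Reliability of a parallel system: it fails only if every
    component fails (components assumed independent)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def parallel_failure_rate(failure_rate: float, n: int) -> float:
    """Effective failure rate (1 / MTBF) of n identical parallel
    components that each have the given constant failure rate."""
    if failure_rate <= 0 or n < 1:
        raise ValueError("failure_rate must be positive and n at least 1")
    harmonic = sum(1.0 / j for j in range(1, n + 1))
    return failure_rate / harmonic


# Three hypothetical components, each with reliability 0.9.
print(f"{parallel_reliability([0.9, 0.9, 0.9]):.3f}")  # prints 0.999

# Two components at 1e-4 failures per hour: the harmonic sum is 1.5.
print(f"{parallel_failure_rate(1e-4, 2):.3e}")
```

Because the harmonic sum grows slowly, each additional redundant component lowers the effective failure rate by less than the one before it.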
N-modular redundancy system
An N-modular redundancy system consists of $N$ (where $N = 2n + 1$) identical subsystems and one voting unit. When the system is operational, all subsystems compute results simultaneously, and the voting unit outputs the result produced by the majority as the final outcome. This setup masks failures caused by random factors and so improves the reliability of the system. If at least $n + 1$ subsystems function properly, the entire system operates normally.
When a module in the system fails, the remaining modules continue to provide service, preserving the system's availability. The level of redundancy in an N-modular redundancy system is determined by the value of N. A higher N means greater redundancy, and it also means greater reliability provided each module's own reliability is above 0.5.
Assuming the voter is entirely reliable and each subsystem has a reliability of $R_0$, the reliability of an N-modular redundancy system with $N = 2n + 1$ subsystems is calculated as:
$$R = \sum_{i=n+1}^{N} \binom{N}{i} R_0^{\,i} (1 - R_0)^{N-i}$$
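The majority-vote calculation can be sketched in Python as a binomial sum over every case in which a majority of the modules is working. The sketch assumes a perfectly reliable voter, independent modules, and an odd number of modules. The function name and the sample values are illustrative.

```python
import math


def nmr_reliability(module_reliability: float, n_modules: int) -> float:
    """Reliability of an N-modular redundancy system with a perfect voter.

    The system works when a majority of the modules work, that is,
    at least n + 1 of N = 2n + 1 modules.
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("n_modules must be a positive odd integer")
    if not 0.0 <= module_reliability <= 1.0:
        raise ValueError("module_reliability must lie in [0, 1]")
    majority = n_modules // 2 + 1
    return sum(
        math.comb(n_modules, i)
        * module_reliability ** i
        * (1.0 - module_reliability) ** (n_modules - i)
        for i in range(majority, n_modules + 1)
    )


# Triple modular redundancy (N = 3) with hypothetical module reliabilities.
print(f"{nmr_reliability(0.9, 3):.3f}")  # prints 0.972
print(f"{nmr_reliability(0.4, 3):.3f}")  # prints 0.352
```

The second call illustrates a limit of the scheme: when each module is less reliable than 0.5, majority voting makes the system worse than a single module.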
In the United States' lunar exploration satellites, N-modular redundancy systems are utilized to construct the computer systems onboard the satellite to mitigate failures in the space environment.