Monday, February 18, 2019

System Reliability

People are surrounded by devices of all kinds. Reliability of the device is one of the key aspects of the user's experience of a device, particularly over the long term. This also has an implication on the general opinion (positive or negative) that the user forms about the device, its brand & manufacturer. An understanding of reliability is thus important for the manufacturer and the user.

Reliability numbers are worked initially at the design phase by the manufacturer. Explicit targets for the product are set which govern the design choices. Later several rounds of testing done by the manufacturer and/ or the certifying authority mostly before device roll-out to ascertain the actual numbers. In certain cases these may need to be re-looked at due to unexplained failures, manufacturing defects, etc. while the device is in-service. Such evaluations can be performed during routine maintenance of the device or via explicit recall of the device to the designated service station. Data collected is analyzed to understand & resolve the underlying issues in the device and the causes of failures.

Reliability Analysis

There are some standard methods adopted by the manufacturers (OEMs), etc. to calculate reliability numbers of the device. These include among others quantitative techniques such as capturing Mean Time to Failure (MTTF), Mean Time Between Failure (MTBF) and Mean Time to Repair (MTTF) at the device and/ or its sub-components level. MTTF is a measure of the time (or number of cycles, runs, etc.) at which the device is likely to fail (failure rate), while MTBF is the equivalent value for repairable devices that accounts for the interval between failure incidents. MTTR is the corresponding time spent in repair. For repairable systems:
   MTBF = MTTF + MTTR

These numbers are aggregates applicable to a general population of devices and not at one specific device level. So a device with MTBF value of 30,000 hours, implies that a population of size 30 devices are likely to run for 1000 hours on an average, collectively clocking 30K device hours.

For an exponential Reliability R(t) = exp(-t/MTBF), probability of a specific device surviving upto its rated t=MTBF is:
 R(t) = exp(MTBF/MTBF) = exp(-1) = 36.8%

For repairable systems, another term used often is Availability.
 Availability = System Up Time/ (System Up Time + System Down Time)

For mission critical systems that can not accept any downtime, Availability equals Reliability!

Weibull Analysis

Statistical techniques such as the Weibull Analysis is also very common for reliability computations. Weibull analysis makes use of data from failed as well as non-failed devices to work out device lifespan & reliability. A set of samples of the device`are observed under test conditions & a statistical distribution (model) is fitted to the data collected from these test samples. The fitted model is thereafter used to make predictions about the reliability of the entire population of devices which would be operating under the real world conditions.

The Weibull model uses three parameters for β: Shape (shape of the distribution), η: Scale parameter (spread), γ: Location (Location in time). Interestingly, the Weibull model is able to nicely capture the standard U-shaped, Bath Tub reliability curve typically seen over various device lifespans. In the early life-span of a device (testing, acceptance stage) the failure & defect rates are high (& has a β < 1). As these get fixed, the failure rate drops quickly to the steady operation ready, Useful Life stage.

In the Useful Life (β = 1) stage the device is stable & ready to roll-out to the end-user. Defects in this second stage are mainly due to design issues, operation, human errors, unexpected failures, etc. Finally, the device enters the Wear-out phase (β > 1), where the device or certain sub-components start showing natural wear & tear. Repairs & maintenance help to keep the device in good working shape for a while. Finally, there comes a time when the repairs are no longer viable due to costs or other reasons & then the device is taken out of service. Decisions around scheduled inspections, overhauls, etc. can be planned based on the different stage of the device life cycle & the corresponding values of β. 

There are other exponential distributions such as Poisson, Rayleigh, Gamma, Beta, etc. which are applied to specific types of devices, domains and failures cases. Selection of the appropriate distribution is important for a proper reliability analysis.

Sampling and Confidence Levels

Once devices are live, actual on-ground analysis can also be done for certain categories of devices. Data can be collected from a representative sample of devices operating on ground. Techniques from statistics for reliably sampling & deriving confidence intervals for an underlying population can be applied for this purpose.

The analysis is typically done for a Binomial population of devices where a certain p% of the population (devices) are expected to fail, while (1-p)% are expected to operate fine (without failure). Assuming a confidence interval of c (tolerance interval), the sample size n is worked out by taking a Normal approximating for the Binomial distribution (simplifying the calculations):

    n = Z2 X (p) X (1 - p)/ c2


where Z is constant chosen based on the desired confidence value from the Standard Normal Curve. Z = 1.96 for Confidence 95%, 2.58 for 99%, and so on.

   (E.g. 1) For example, if for a certain device, 4% devices are expected to fail, p=0.04:

      (1.a) With 99% confidence level, for a 1% confidence interval, c=0.01:
n = 2.58*2.58*0.04*(1-0.04)/(0.01*0.01) = 2,556 samples are needed

    (1.b) For a tighter 0.1% confidence interval, c=0.001:
n = 2.58*2.58*0.04*(1-0.04)/(0.001*0.001) = 255,605 samples (100x more than (1.a))are needed

   (1.c) Similarly, for a higher confidence level of 99.99% (Z=3.891), at the same 1% confidence level:
n = 3.891*3.891*0.04*(1-0.04)/(0.01*0.01) = 5814 samples (more than (1.a)) are needed

The above sample size estimator assumes a very large, or infinite population. In case of finite sized population, the following correction is applied to the cases above:

   n_finite = n / (1 + (n-1)/size_pop)

  (1.b.1) Applying the correction to the case (1.b) above, assuming a total population of a fixed 30,000 devices only:
n_finite = 255605/ (1 + (255605-1)/30000) = 28,400 devices, which need to be sampled to achieve a 0.1% confidence interval (tolerance) at the 99% confidence levels.

As discussed earlier, the reliability trends for devices tend to fit lifetime dependent exponential distributions such as Weibull better. Confidence levels in such cases are worked out accordingly using the appropriate distribution. For instance with a small constant failure rate (λ) expected, an exponential or a Poisson reliability model is a better approximation to Binomial than Normal. The confidence interval for λ is worked out as a Chi-Square distribution with 2n degrees of freedom, where n is a count of failures seen over time in the sampled set of devices.

Redundancy

Some systems need high fault tolerance. Reliability for such systems can be improved by introducing redundant systems in parallel, thereby replacing the Single Point of Failure (SPOF). When one device fails an alternate one can perform the job in its place.

Reliability of the redundant system:
    R = 1 - p1 X p2 X .. X pk

   where p1,..,pk are the probability of failure of the backup redundant systems.

    (E.g. 2) In the above example where the single device system with a failure rate p=0.04, & a reliability of 96% (1-0.04), if we introduce an identical redundant/ backup device also with p=0.04, reliability goes up to R = 1 - 0.04*0.04  = 99.84%.

k-out-of-n Systems

An alternate set-up is a consensus based (k out of n) system. In this set-up, the system fails only when more than the quorum number (k, typically 50%) of devices fail. The reliability of the quorum system is:
   R_quorum_system = 1 - probability of more than k (quorum) device failures

The reliability is maximized for a majority quorum, i.e. k = n/2+1. 

Monitoring Systems

Another typical approach is to introduce monitoring systems. The monitoring systems can be in the form of a sensor (optical, non-optical), a logger, a heart-beat polling unit, a human operator, or a combination of these. Whenever the monitoring system finds the primary system faltering, it raise alarms for corrective measures to be taken which may include stopping/ replacing the faulty device and/ or switching over to a backup system if available.

The reliability of the monitoring systems is assumed to be much higher than the underlying system being monitored, ideally a 100%. Monitoring systems are taken to be operating in sequence to the underlying system, so the reliability of the overall system is:
    R_monitored_system = R_device X R_monitoring 

In other words, a failure in either the device or the monitor or both, will result in failure of the system, increasing the overall chances of failure. Yet, monitoring systems are effective on ground since they are the first line of defense for the system. They are able to raise alarms for the human operator to intervene early (lowering MTTR).

In certain set-ups the monitoring system are also enabled to automatically switch over to a backup device when there is a failure with the primary device. This helps reduce the down time (MTTR) to a negligible value, if not zero. With a system that has redundant devices & a single monitoring system the SPOF shifts to the monitoring system. A further refinement to the system design (such as Zab, Paxos, etc.) entails setting up the monitoring system in a k-of-n, typically majority, quorum. All decisions regarding the state of the underlying devices is taken by the quorum. The majority quorum is also resilient upto k=n/2 failures of the monitoring system.

Through good system design & thought, the reliability at the system level can be significantly boosted even if the sub-components are less reliable. Design & engineering teams must possess sound reliability analysis skills to be able to build world class products. An awareness of reliability aspects also helps the end-user to decide on the right device that suits their requirements & continues to function properly over its lifespan.

No comments:

Post a Comment