Reliability - why systems fail

Introduction

Modern electronic systems are complex, and there are many ways in which they can malfunction or fail completely. This article explores some common causes of system failure in the design and operation of both the hardware and the software of electronic devices.

Hardware

Hardware is the term used to describe the electronic components and circuitry of a device.

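Plotted against time, the failure rate of electronic hardware normally follows what is called a bathtub curve: high at first, low and roughly constant through normal use, then rising again as parts wear out. The shape and size of the curve are primarily determined by three elements: infant mortality, random failure and product lifetime.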
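The bathtub shape can be sketched numerically by adding three failure-rate terms, one per element. The sketch below assumes a Weibull model for the falling and rising parts, which is a common textbook choice rather than anything stated in this article. Every parameter value and function name is hypothetical and chosen only to make the shape visible.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_hours):
    """Illustrative failure rate (failures per hour) at age t_hours > 0.

    All numbers are made-up assumptions, not data for any real product.
    """
    # Infant mortality: shape < 1 gives a rate that falls with age.
    infant = weibull_hazard(t_hours, shape=0.5, scale=50_000.0)
    # Random failure: a constant background rate.
    random_rate = 1e-6
    # Wear out: shape > 1 gives a rate that rises with age.
    wear_out = weibull_hazard(t_hours, shape=5.0, scale=100_000.0)
    return infant + random_rate + wear_out


if __name__ == "__main__":
    for age in (10, 1_000, 40_000, 80_000, 120_000):
        print(f"{age:>7} h: {bathtub_hazard(age):.2e} failures/hour")
```

With these assumed values the rate starts high, drops through mid-life and climbs again towards the end, which is the bathtub profile described above.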

Infant Mortality

A newly manufactured product failing either immediately or after a very short time of usage is known as infant mortality. This may be due to manufacturing defects in the assembly of the Printed Circuit Boards (PCBs), component failure or installation issues.

Random Failure

During the normal working life of a product, a very small percentage of units will fail at random. This may be exacerbated by misuse or by using the product outside of its specification. For a well-designed and well-manufactured device the random failure rate should be very low, and companies will normally monitor this failure rate to identify potential design issues.

Product Lifetime

This is determined by the lifetime of the components used in a product. Electronic components will have a Mean Time To Failure (MTTF) specified in hours. The average lifetime of an electronic product will be less than the smallest MTTF of any component in the device. This phase is also known as Wear Out.
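A rough sketch of why the product's average lifetime falls below the smallest component MTTF: assume the product fails as soon as any one component fails, and that each component fails independently at a constant rate of 1 / MTTF. Under those assumptions the failure rates add. The constant-rate model is a simplification introduced here, and the component names and figures are hypothetical.

```python
def series_mttf(component_mttf_hours):
    """Estimate system MTTF when every component is required to work.

    Assumes independent components with constant failure rates,
    so the system rate is the sum of the component rates.
    """
    if not component_mttf_hours:
        raise ValueError("at least one component is required")
    if any(mttf <= 0 for mttf in component_mttf_hours.values()):
        raise ValueError("MTTF values must be positive")
    total_rate = sum(1.0 / mttf for mttf in component_mttf_hours.values())
    return 1.0 / total_rate


# Hypothetical parts list with MTTF in hours.
components = {
    "electrolytic_capacitor": 50_000.0,
    "voltage_regulator": 200_000.0,
    "microcontroller": 1_000_000.0,
}

if __name__ == "__main__":
    print(f"Weakest component: {min(components.values()):.0f} h")
    print(f"Estimated system MTTF: {series_mttf(components):.0f} h")
```

The estimate is dominated by the least reliable part, but every additional component pulls the figure down a little further.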

Other issues

Electromagnetic Interference (EMI)

Electromagnetic Interference is a term given to the electrical noise that is emitted or radiated by electrical circuits. If the emitted noise level is high enough it may cause incorrect operation in another device. Regulations exist that specify the maximum electrical noise a device may radiate and also a level of received interference that it must be able to handle.

Usage limits

Some components are limited to a certain number of uses. Flash memory, for example, can only be written a limited number of times, after which a write may not succeed. This makes it very difficult to estimate the lifetime of such a component unless its typical usage can be ascertained.
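Once a typical usage figure is available, a first-order lifetime estimate is simple arithmetic. The sketch below assumes writes are spread evenly over the whole memory (ideal wear levelling) and uses a write amplification factor for the extra internal writes a device may perform. The function name, parameters and example numbers are all hypothetical.

```python
def flash_lifetime_years(capacity_bytes, endurance_cycles,
                         bytes_written_per_day, write_amplification=1.0):
    """Estimate the years until the flash write limit is reached.

    Assumes ideal wear levelling, so every cell is written equally often.
    """
    if bytes_written_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and amplification must be positive")
    # Total bytes that can be written before cells reach their limit.
    total_writable = capacity_bytes * endurance_cycles
    # Bytes physically written to the flash each day.
    physical_per_day = bytes_written_per_day * write_amplification
    return (total_writable / physical_per_day) / 365.0


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Hypothetical device: 4 MiB flash rated for 10,000 writes per cell,
    # logging 50 MiB per day with a write amplification of 2.
    years = flash_lifetime_years(4 * MIB, 10_000, 50 * MIB, 2.0)
    print(f"Estimated flash lifetime: {years:.2f} years")
```

The same device logging a tenth as much data would last ten times longer, which is why the estimate is only as good as the usage figure behind it.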

Use outside of specification

Use of a product outside of its specification may cause premature device failure due to stress on the components. For example, running a device at a higher voltage than specified may cause overheating, and use in areas of very high humidity may cause corrosion of the circuit board.

Software and firmware

Firmware is the name given to dedicated software running in an electronic product, in contrast to software running on a general-purpose computing device.

Unexpected inputs

Input from the user or from communications channels such as WiFi may cause the device's firmware to misbehave. Many discovered security vulnerabilities are due to unexpected or malformed input. One example is sending more data than expected, which can corrupt the contents of the device's working memory, a problem called a buffer overflow.
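The usual defence is to check every length before data is copied. Python is memory-safe, so it cannot demonstrate a real overflow. The sketch below only illustrates the checks that firmware written in a language such as C would need to perform itself. The frame layout (one length byte followed by the payload), the buffer size and the function name are all hypothetical.

```python
BUFFER_SIZE = 16  # hypothetical size of the fixed receive buffer


def parse_frame(frame: bytes) -> bytes:
    """Return the payload of a frame laid out as [length byte][payload].

    Raises ValueError instead of trusting the sender's length field.
    """
    if len(frame) < 1:
        raise ValueError("empty frame")
    declared = frame[0]
    payload = frame[1:]
    # Reject anything that would not fit in the fixed-size buffer.
    if declared > BUFFER_SIZE:
        raise ValueError("declared length exceeds buffer size")
    # Reject frames whose length field disagrees with the data received.
    if declared != len(payload):
        raise ValueError("declared length does not match data received")
    buffer = bytearray(BUFFER_SIZE)
    buffer[:declared] = payload  # safe: declared <= BUFFER_SIZE
    return bytes(buffer[:declared])


if __name__ == "__main__":
    print(parse_frame(b"\x03abc"))
    try:
        parse_frame(bytes([40]) + b"x" * 40)
    except ValueError as error:
        print(f"rejected: {error}")
```

Rejecting the frame outright is a deliberate choice in this sketch. Silently truncating oversized input would keep the memory intact but could still leave the device acting on incomplete data.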

Specifications

Software specifications are extremely important; if there is no detailed explanation of how a feature should work then it cannot be verified that it performs its task correctly. For high-reliability systems, firmware will normally be developed in conformance with a standard such as DO-178B for aviation or IEC 61508 for safety-critical systems.

Errata

Processors are becoming more complex, and bugs in their implementation are becoming more common. Issues found in shipping devices are commonly flagged as errata in the processor's documentation. These errata can range from small deviations from the specification to completely missing modes of operation. Large microprocessors may have some of their more complex operations specified in proprietary manufacturer-written programs, called microcode, running on the processor. As they are programs, this microcode may also contain bugs.

Spectre

This is the name given to a range of security vulnerabilities linked to the speculative execution feature in high-performance processors. When an executing program could take one of two code paths depending upon some computed condition, some processors will, for higher performance, begin executing ahead before that condition is fully computed, discarding the work if it turns out not to be needed. Flaws in how processors implemented this feature allowed them to speculatively execute sections of code that they should not, bypassing the operating system's security.