Troubleshooting is a form of problem solving most often applied to the repair of failed products or processes. It is a logical, systematic search for the source of a problem so that the problem can be solved and the product or process made operational again. Troubleshooting is needed to develop and maintain complex systems in which the symptoms of a problem can have many possible causes. It is used in many fields, such as engineering, system administration, electronics, automotive repair, and diagnostic medicine. Troubleshooting first requires identifying the malfunction(s) or symptoms within a system. Experience is then commonly used to generate possible causes of the symptoms. Determining which cause is most likely is often a process of elimination, in which potential causes are ruled out. Finally, troubleshooting requires confirmation that the solution restores the product or process to its working state.

In general, troubleshooting is the identification or diagnosis of "trouble" in a system caused by a failure of some kind. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining the causes of these symptoms.

A system can be described in terms of its expected, desired or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. For example, selecting the "print" option in a computer application is intended to result in a hardcopy emerging from some specific device. Any unexpected or undesirable behavior is a symptom. Troubleshooting is the process of isolating the specific cause or causes of the symptom. Frequently the symptom is a failure of the product or process to produce any results (nothing was printed, for example).

The methods of forensic engineering are especially useful in tracing problems in products or processes, and a wide range of analytical techniques is available to determine the cause or causes of specific failures. Corrective action can then be taken to prevent further failures of a similar kind. Preventative action is possible using failure mode and effects analysis (FMEA) and fault tree analysis (FTA) before full-scale production, and these methods can also be used for failure analysis.

Reproducing symptoms

One of the core principles of troubleshooting is to reproduce the problems that users experienced, and then to reliably isolate and resolve them. Considerable effort and emphasis is often placed on reproducibility, that is, on finding a procedure that reliably induces the symptom.

Once this is done, systematic strategies can be employed to isolate the cause or causes of the problem. The resolution generally involves repairing or replacing the components that are at fault.

Half-Split Method

Half-splitting is a troubleshooting technique that reduces the average number of measurements needed to isolate the faulty stage or component. Consider the eight-stage path shown in the figure below and the technique explained in the following paragraphs.

[Figure: eight-stage path with test points labeled A through I (image TM-9-254)]

The first measurement using half-splitting would be made at point "E" (the middle of the faulty path). If the signal is okay at point "E", the path to the left of point "E" is good and the problem lies between points "E" and "I". Thus one measurement has reduced the size of the faulty path by one-half (hence half-splitting). The next measurement would be made at point "G", again splitting the faulty path in half. If the measurement at point "G" is bad (no signal), the next measurement would be made at point "F". This splitting of the faulty path in half continues until the faulty stage is isolated.
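The procedure above can be sketched in Python as a binary search over test points. This is only an illustrative sketch: the function name half_split and the callback signal_ok are hypothetical, test points are numbered 0 (A) through 8 (I), and the sketch assumes a single faulty stage, a good signal at the input and a bad signal at the output.

```python
import string


def half_split(n_stages, signal_ok):
    """Return (faulty_stage, probed_points) for a path of n_stages stages.

    Test point 0 is the input (assumed good) and test point n_stages is
    the output (assumed bad). signal_ok(point) is a hypothetical callback
    that performs one measurement and returns True if the signal is good.
    """
    good, bad = 0, n_stages          # last known-good and first known-bad points
    probed = []
    while bad - good > 1:
        mid = (good + bad) // 2      # measure in the middle of the suspect path
        probed.append(mid)
        if signal_ok(mid):
            good = mid               # fault lies to the right of mid
        else:
            bad = mid                # fault lies at or to the left of mid
    return bad, probed               # stage number `bad` sits between good and bad


# Simulated fault in stage 6 (between points F and G): the signal is good
# at every test point before the output of the faulty stage.
faulty = 6
stage, probed = half_split(8, lambda point: point < faulty)
labels = [string.ascii_uppercase[p] for p in probed]
print(stage, labels)  # 6 ['E', 'G', 'F']
```

With eight stages the search needs three measurements, whereas checking the stages one after another could need up to seven.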

Intermittent symptoms

Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent. In electronics this is often the result of components that are thermally sensitive (since the resistance of a circuit varies with the temperature of the conductors in it). Compressed air can be used to cool specific spots on a circuit board, and a heat gun can be used to raise their temperature; troubleshooting of electronic systems therefore frequently entails applying these tools in order to reproduce a problem.

In computer programming, race conditions often lead to intermittent symptoms that are extremely difficult to reproduce. Various techniques can be used to force the particular function or module to be called more rapidly than it would be in normal operation (similar to "heating up" a component in a hardware circuit), while other techniques can be used to introduce greater delays in, or force synchronization among, other modules or interacting processes.
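As a minimal sketch of forcing synchronization to expose a race, the Python example below holds every thread at a barrier between its read and its write of a shared counter, so the lost update happens on every run rather than only occasionally. The names Counter, unsafe_increment and run are hypothetical and chosen for illustration; only the standard threading module is used.

```python
import threading


class Counter:
    """Shared state with no locking (deliberately unsafe)."""

    def __init__(self):
        self.value = 0


def unsafe_increment(counter, gate=None):
    current = counter.value          # read
    if gate is not None:
        gate.wait()                  # hold here until every thread has read
    counter.value = current + 1      # write back a possibly stale value


def run(n_threads, force_race):
    counter = Counter()
    # The barrier releases only when all n_threads threads are waiting on it.
    gate = threading.Barrier(n_threads) if force_race else None
    threads = [
        threading.Thread(target=unsafe_increment, args=(counter, gate))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


# With the barrier, every thread reads 0 before any thread writes,
# so three of the four increments are lost.
print(run(4, force_race=True))   # 1
# Without the barrier the result is usually 4, but that is not guaranteed.
print(run(4, force_race=False))
```

Once the symptom can be induced on demand like this, a candidate fix (for example, guarding the read and write with a lock) can be checked against the same forced schedule.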

Intermittent issues can be defined thus: an intermittent fault is one that occurs irregularly or inconsistently, not at regular intervals, or that comes and goes.

In particular, there is a distinction between frequency of occurrence and a "known procedure to consistently reproduce" an issue. For example, knowing that an intermittent problem occurs "within" an hour of a particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an hour, does not constitute a "known procedure", even if the stimulus does increase the frequency with which the symptom is observed.

Nevertheless, troubleshooters must sometimes resort to statistical methods, and can only find procedures that increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible. In such cases, even when the symptom seems to disappear for significantly longer periods, there is low confidence that the root cause has been found and that the problem is truly solved.
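To illustrate why confidence stays low, the following Python sketch estimates how likely a run of symptom-free trials would be even if nothing had been fixed. It rests on simplifying assumptions that are not from the text above: each trial is independent, and before the change the symptom appeared with a fixed probability p_fail per trial. The function names are hypothetical.

```python
import math


def chance_of_clean_run(p_fail, n_trials):
    """Probability of n_trials symptom-free trials if the fault is still present."""
    return (1.0 - p_fail) ** n_trials


def trials_needed(p_fail, confidence):
    """Smallest number of clean trials for which a still-present fault would
    have shown itself at least once with the given probability."""
    if not (0.0 < p_fail < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("p_fail and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# A symptom that used to appear in about 1 trial out of 10:
print(chance_of_clean_run(0.1, 10))   # about 0.35, so ten clean trials prove little
print(trials_needed(0.1, 0.95))       # 29
```

Under these assumptions, the rarer the symptom was to begin with, the more symptom-free trials are needed before its absence says much about the fix.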

Also, tests may be run to stress certain components to determine if those components have failed.

Multiple problems

Isolating single component failures which cause reproducible symptoms is relatively straightforward.

However, many problems only occur as a result of multiple failures or errors. This is particularly true of fault tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and fail-over to a system may also be subject to failure, and enough different component failures in any system will "take it down."

Even in simple systems the troubleshooter must always consider the possibility that there is more than one fault. Serial substitution, in which each component is replaced in turn and the new component is swapped back out for the old one when the symptom is found to persist, can fail to resolve such cases. More importantly, replacing any component with a defective one can actually increase the number of problems rather than eliminating them.
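The failure of swap-back substitution with two faults can be seen in a toy Python model. The model is an assumption made for illustration only: the system works exactly when every component is good, the spare part is known to be good, and all function names are hypothetical.

```python
def system_works(parts):
    """Toy model: the system works only if every component is good."""
    return all(parts)


def substitute_with_swap_back(components):
    """Try a known-good spare in each position, restoring the old part
    whenever the symptom persists. Returns (parts, fixed_at_index)."""
    parts = list(components)
    for i in range(len(parts)):
        old = parts[i]
        parts[i] = True              # fit the known-good spare
        if system_works(parts):
            return parts, i
        parts[i] = old               # symptom persists: put the old part back
    return parts, None


def substitute_and_keep(components):
    """Same search, but every replacement is left in place."""
    parts = list(components)
    for i in range(len(parts)):
        parts[i] = True
        if system_works(parts):
            return parts, i
    return parts, None


one_fault = [True, False, True, True]
two_faults = [True, False, True, False]

print(substitute_with_swap_back(one_fault)[1])    # 1
print(substitute_with_swap_back(two_faults)[1])   # None
print(substitute_and_keep(two_faults)[1])         # 3
```

Leaving replacements in place resolves the two-fault case in this model, but only because the spare is assumed good; with a defective spare it would add a fault instead.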

While we talk about "replacing components", the resolution of many problems involves adjustment or tuning rather than "replacement". For example, intermittent breaks in conductors, or "dirty or loose contacts", might simply need to be cleaned and/or tightened. All discussion of "replacement" should be taken to mean "replacement or adjustment or other maintenance".

Common Problems

  • Loose PCBs
  • Bad or cold solder joint
  • Low or damaged battery
  • Burnt components
  • Loose or open wire
  • Shorted power supplies
  • Open ground or neutral
  • Tripped circuit breaker
  • Open fuse
  • Open run


See also

Problem Solving