Reliability: Prepare Your Plant for a Machine Learning Future – Start Saving Your Failure Data!

Author: Jon Herlocker


Equipment reliability management tools are far more automated and intelligent than in years past, but whether and how well they work for any one company depends on an enduring challenge: available data. It is well established that machine learning (ML) from sensor condition data can speed up problem identification and analysis, but poor or absent failure data can slow ML as well as human diagnostics and decision making. Users with solid failure data can make better and faster decisions based on optimized ML findings, leaving more time for beneficial root cause analysis and continuous improvements in maintenance and reliability initiatives.

Best data practices enable the best ML techniques

Of the two main ML approaches, supervised learning and unsupervised learning, the former is more effective and actionable because it is based on previous experience. Just as a human operator learns from past failures, a supervised learning algorithm is taught to recognize the signature of an impending failure from multiple training examples. Training the algorithm effectively requires keeping records of past failures, including their descriptions and the associated sensor data leading up to each failure.
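
As a rough illustration of why failure records matter for supervised learning, the sketch below pairs sensor readings with recorded failure times to produce labeled training examples. It is a minimal example under stated assumptions, not a prescribed method: the function name `label_readings`, the three-hour lead time, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def label_readings(readings, failures, lead_time):
    """Label each (timestamp, value) reading for supervised training.

    A reading gets label 1 if it falls within `lead_time` before any
    recorded failure, and 0 otherwise. All names here are hypothetical.
    """
    labeled = []
    for ts, value in readings:
        pre_failure = any(f - lead_time <= ts < f for f in failures)
        labeled.append((ts, value, 1 if pre_failure else 0))
    return labeled


# Hypothetical data: ten hourly readings and one failure logged at hour 8.
start = datetime(2024, 1, 1)
readings = [(start + timedelta(hours=h), 0.1 * h) for h in range(10)]
failures = [start + timedelta(hours=8)]

labeled = label_readings(readings, failures, timedelta(hours=3))
labels = [label for _, _, label in labeled]
print(labels)
```

Without the failure timestamp, every reading would carry the same label and there would be nothing for the algorithm to learn from, which is the practical reason to keep failure records alongside sensor history.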

The alternative, unsupervised learning, does not require labeled failure examples, but its results are significantly less effective and less actionable. Its algorithms look for patterns and structure in data in an exploratory manner. Truly unsupervised learning can only detect deviations from normal, where normal is something that has been observed before. It cannot easily explain why a condition is happening or why it might be important. Therefore, to get the most value from an ML investment for reliability improvements, it is important to amass failure data and enable supervised learning.
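
To make "deviation from normal" concrete, here is a minimal sketch that flags readings falling far from a previously observed baseline. A simple three-standard-deviation rule stands in for a real unsupervised method, and the function name `find_deviations`, the threshold, and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev


def find_deviations(baseline, new_readings, threshold=3.0):
    """Return readings more than `threshold` standard deviations
    away from the mean of the baseline ("normal") data.

    Hypothetical helper: a simple stand-in for an unsupervised detector.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [x for x in new_readings if abs(x - mu) > threshold * sigma]


# Hypothetical sensor values observed during normal operation.
baseline = [50.0, 51.0, 49.0, 50.5, 49.5, 50.0, 50.2, 49.8]
new_readings = [50.1, 49.7, 58.0, 50.3]

flagged = find_deviations(baseline, new_readings)
print(flagged)
```

The output says only that a reading is unusual. It carries no failure mode, cause, or recommended action, which is the limitation described above.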

Importantly, sensor-related failure data is not the only ingredient for success. Companies can and should also prepare for ML by saving failure metadata, including standardized failure codes/modes, cause codes, solution codes, and unstructured failure information or observations in the work order. The more information available, the more prescriptive the failure predictions can be.
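
One way to picture the failure metadata listed above is as a single structured record per failure. The sketch below is illustrative only: the class name `FailureRecord`, its field names, and the example codes are hypothetical and do not come from any particular maintenance system or coding standard.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class FailureRecord:
    """Hypothetical record of one failure, as captured from a work order."""
    asset_id: str
    failed_at: datetime
    failure_code: str   # standardized failure code/mode
    cause_code: str     # standardized cause code
    solution_code: str  # standardized solution code
    notes: str = ""     # unstructured observations from the work order


# Example values are invented for illustration.
record = FailureRecord(
    asset_id="PUMP-101",
    failed_at=datetime(2024, 1, 1, 8, 0),
    failure_code="BRG-WEAR",
    cause_code="LUBE-LOSS",
    solution_code="REPLACE-BRG",
    notes="Operator reported rising vibration two days before failure.",
)

# A flat dictionary is easy to store next to the sensor history.
row = asdict(record)
print(row["failure_code"], row["cause_code"], row["solution_code"])
```

Keeping the timestamp and asset identifier in the same record is what allows the coded fields to be joined back to the sensor data leading up to the failure.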

Organizations that are best at capturing failure metadata have reliability best practices embedded in their culture. For example:

  • Failure Modes and Effects Analysis (FMEA) proactively evaluates equipment and components for failure risks and codifies the potential causes and effects.
  • Root Cause Failure Analysis (RCFA or RCA) looks back reactively at failures to determine the root causes and documents how to prevent them in the future.
  • Criticality Analysis serves to identify and prioritize the equipment and systems most critical to the business and to align maintenance and recovery efforts, as well as failure data collection efforts, accordingly.

The knowledge base gleaned from these valuable exercises supports prescriptive recommendations for maintenance actions and continuous improvement of reliability practices. Combined with ML, they offer a wealth of potential for improving reliability while mitigating aging workforce and aging equipment concerns. Those who choose to make the investment should aim high and make saving failure data a core priority.