Maintaining oil and gas machinery is expensive—but predictive models can help engineers minimize repairs and downtime.
In this week’s episode of Experiment Exchange, “How Accenture Minimizes Downtime with Predictive Maintenance Models,” join Michael McCourt as he interviews Shayan Mortazavi and Alex Lowden, data scientists in Accenture’s Industrial Analytics Group. They develop predictive maintenance models that minimize system downtime, and in this episode they discuss the complications of building these models, such as limited access to failure data and the massive number of available features, as well as the need for explainability and interpretability. They also share how SigOpt’s parallelism feature allowed them to accelerate model development.
Q: Recently, you gave a talk at our SigOpt Summit, “A Novel Framework for Predictive Maintenance Using Deep Learning and Reliability Engineering.” Can you tell us more about that project and what you’ve been working on?
Shayan Mortazavi: The core objective of the project was to address the problem of maintenance by using the data that offshore assets in the oil and gas domain collect on a daily basis to monitor the integrity of their assets, control the process, and ensure safety. We tried to find a new application for this data to build predictive models, basically to provide early warnings and predict upcoming failures of various parts of these critical machines.
Q: How has maintenance been conducted historically, versus today?
Alex Lowden: Typically, it used to be a lot more reactive. Something would go wrong and you’d have potentially more downtime, or you’d have regularly scheduled maintenance based on big aggregate statistics such as mean time between failures. The engineers would plan and say, “Okay, we need to check on this equipment once a year. We need to replace it once every couple of years.” So it was either schedule-based planned maintenance, or reactive-based.
Q: Is this purely to save money, or are there other benefits to doing this maintenance?
AL: There are a few goals. The main goal is safety. Especially in oil and gas, incidents can have catastrophic consequences. The second goal is maximizing uptime, i.e. minimizing the amount of unplanned downtime.
SM: Adding to that, a recent trend is reducing carbon emissions, with oil and gas being one of the major sources. By addressing the problem of maintenance, you avoid a lot of unplanned events that would otherwise force operators to flare off the excess oil and gas coming from the reservoir. With some sort of preventive or predictive tool, they can reduce or even eliminate those carbon emissions.
AL: Just to add as well, we don’t see this as limited to oil and gas. We’ve done projects across many different types of industries. Failures in oil and gas typically have high consequences, given the assets are offshore, so that tends to be where a lot of the focus is. But we see projects across all types of domains, really anywhere there’s heavy industry. It’s not just limited to oil and gas.
Q: Within oil and gas, how do the operators interact with the models that you’ve developed?
SM: I think one of the novelties is the way that oil and gas is leading other industries, because it’s very much safety-driven. Operators are heavily regulated by the HSE, the Health and Safety Executive. That’s one of the drivers for putting a lot of money into research on incorporating reliability engineering practices. The other difference in oil and gas is that predictive tools are very much shaped by the dynamics of offshore fields, in terms of the complexity of the mixture of fluids that comes onto these platforms to be processed. There’s no standardized way to receive these products, so operation is not homogeneous across the field. Over an asset’s 10-20 year lifetime, the reservoir pressure depletes and the composition of the oil and gas changes, and that has consequences for the operation of the heavy-duty critical machinery.
Q: How is the system that you’ve built eventually deployed? Are people just trusting the system, or is there still a human element in there?
AL: The tool generates insights. You have these LSTMs, which we imagine as mini experts; each one tracks part of a system and gives you projections of its health. As soon as there are deviations, those are flagged in the form of insights, which feed through to an operator. The operator looks at the tool, say, every morning: “Okay, there was one insight overnight.” They can click on that insight, add comments, tag other people, and potentially create a work order. They can verify whether it’s a genuine insight; if not, it gets logged. The tool is built so that they can filter down through the detail within the system to get to the cause very quickly. These mini-expert LSTMs operate at such a granular level that operators can very quickly narrow down where the issue is coming from. Then, if it’s a genuine insight, they’ll straightaway issue a work order through the tool, which is fully integrated with their work order management system.
Q: It sounds like interpretability is a key element, is that right?
SM: Very much so. One of the key design strategies on the drawing board from day one was to build this from the bottom up, so that through means of interrogation you can get to the bottom of the problem. That, of course, allows interpretability because you are, as Alex said, building these mini experts as a building block of where things can go wrong. On top of that, we have this engineering logic layer that makes sense of all of these little insights.
Q: Tell us more about the architecture of your system.
AL: One of the biggest challenges of building such a system is the sheer volume of LSTMs you need to maintain. One thing you realize quite quickly is that, with the type of time series you’re modeling, you don’t need incredible depth in some of the networks. The aim of these mini experts is to map normal to normal: no matter what they take in, they’re only capable of producing normal, so performance is evaluated against healthy behavior.
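As a rough illustration of this “model normal to normal” idea, here is a minimal sketch. All names and numbers are hypothetical: a trivial moving-average predictor stands in for a mini-expert LSTM trained only on healthy data, and readings whose residual against the healthy prediction exceeds a threshold are flagged as candidate insights.

```python
# Hypothetical sketch of the "model normal to normal" idea:
# a model trained only on healthy data reconstructs the signal,
# and large residuals are flagged as candidate insights.

def healthy_model(window):
    """Stand-in for a mini-expert LSTM: predicts the next value
    from a window of recent readings (here, a simple mean)."""
    return sum(window) / len(window)

def flag_insights(signal, window_size=3, threshold=3.0):
    """Return indices where the observed value deviates from the
    healthy-behavior prediction by more than `threshold`."""
    insights = []
    for i in range(window_size, len(signal)):
        predicted = healthy_model(signal[i - window_size:i])
        residual = abs(signal[i] - predicted)
        if residual > threshold:
            insights.append(i)
    return insights

# A steady temperature trace with one sudden excursion.
readings = [20.0, 20.1, 19.9, 20.0, 26.5, 20.1, 20.0]
print(flag_insights(readings))  # the excursion at index 4 is flagged
```

In the production system each flagged index would instead surface as an insight in the operator-facing tool; the point here is only the shape of the logic, not the model.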
We found generally that certain types of architecture were better suited to certain types of signal. Once we had a feel for that, training became a lot faster and we could restrict tuning to some of the key hyperparameters you would typically consider, making the process more efficient. Say we’re looking at a temperature signal: we would say we need between 25-50 neurons in this layer. If it’s a vibration signal, maybe we need slightly more, say 100-200. Then we could point SigOpt in that direction and narrow the search for this set of parameters. That helped us a lot.
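The narrowing described above can be sketched as a lookup from signal type to tuning bounds. This is a hypothetical illustration only; the dictionary names and exact ranges are assumptions, with the parameter-dict shape loosely modeled on how Bayesian optimizers such as SigOpt accept bounded integer parameters.

```python
# Hypothetical sketch: per-signal-type hyperparameter search ranges,
# used to narrow the optimizer's search space (ranges are illustrative).

SEARCH_RANGES = {
    # slowly varying signals need fewer neurons and shorter windows
    "temperature": {"units": (25, 50), "lookback": (10, 50)},
    # high-frequency signals need wider networks and longer windows
    "vibration": {"units": (100, 200), "lookback": (50, 200)},
}

def search_space_for(signal_type):
    """Return (min, max) tuning bounds for a given sensor type,
    in a shape an optimizer could consume."""
    ranges = SEARCH_RANGES[signal_type]
    return [
        {"name": name, "type": "int", "bounds": {"min": lo, "max": hi}}
        for name, (lo, hi) in ranges.items()
    ]

print(search_space_for("temperature"))
```

With 40-50 sensor LSTMs per piece of equipment, this kind of prior per signal type is what keeps the total tuning budget tractable.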
Q: What other SigOpt features did you make use of?
SM: As Alex mentioned, one of the big challenges was monitoring a wide variety of sensor data with different temporal patterns. Because we couldn’t follow one blueprint for optimization, we had to focus on different hyperparameters of an LSTM depending on the signal type. In that case, SigOpt certainly came in very handy, because we could quickly get, for example, a prioritization of which hyperparameters are most relevant for a sensor type, which helped us narrow down our guess range for similar sensor data. Typically we were optimizing about 40 or 50 sensor LSTMs per piece of equipment, so narrowing down the span of hyperparameters was key for us. Knowing which hyperparameters to focus on was one of the major uses of SigOpt.
AL: Another feature that was helpful for us was multimetric optimization. Without it, this work would have been impossible, because there was a real tradeoff between some of these parameters. For example, with an LSTM there’s a pretty strong correlation between lookback and training time, but not necessarily with performance. Understanding this dynamic by comparing two metrics at once during hyperparameter optimization helped us avoid pursuing networks with an unnecessarily large lookback. In an optimization loop you can easily get stuck on a network with a large lookback, and all of a sudden you’ve lost two days. So that was a huge use to us.
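The tradeoff described here is the classic two-metric (Pareto) picture that multimetric optimization exposes. The sketch below is a hypothetical stand-in, not SigOpt’s implementation: each candidate has a validation error and a training time, both to be minimized, and configurations dominated on both metrics (such as a huge-lookback network with no accuracy gain) are discarded.

```python
# Hypothetical sketch of a two-metric tradeoff: keep only candidates
# that are not dominated on both validation error and training time.

def pareto_front(candidates):
    """Keep candidates for which no other candidate is at least as
    good on both metrics and strictly better on one (minimizing both)."""
    front = []
    for c in candidates:
        dominated = any(
            o["error"] <= c["error"] and o["time"] <= c["time"]
            and (o["error"] < c["error"] or o["time"] < c["time"])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

# Illustrative candidates: lookback drives training time up,
# but past a point brings no accuracy gain.
candidates = [
    {"lookback": 20,  "error": 0.30, "time": 1.0},
    {"lookback": 50,  "error": 0.20, "time": 2.5},
    {"lookback": 200, "error": 0.21, "time": 12.0},  # big lookback, no gain
]
print([c["lookback"] for c in pareto_front(candidates)])  # [20, 50]
```

The large-lookback candidate drops out because a cheaper network matches it on error, which is exactly the “don’t lose two days on a big lookback” behavior described above.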
We also found the parallelism feature useful for scaling up hyperparameter optimization jobs across the compute we had available. This feature made that process trivial, which was critical as we were training and optimizing many models at once.
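To make the parallelism point concrete, here is a hedged sketch of evaluating several suggested configurations at once. The toy objective and configuration tuples are assumptions; in SigOpt’s case the service proposes the configurations and each worker reports its observation back, but the concurrency shape is similar to a plain thread pool.

```python
# Hypothetical sketch of parallel hyperparameter evaluation: several
# suggested configurations are trained concurrently, as one would do
# when scaling tuning jobs across available compute.

from concurrent.futures import ThreadPoolExecutor

def train_and_score(config):
    """Stand-in for training one LSTM and returning its validation
    score; in practice this would be a full training run."""
    units, lookback = config
    # Toy objective: peaks at units=64, lookback=30.
    return 1.0 / (1 + abs(units - 64) + abs(lookback - 30))

# Configurations an optimizer might suggest in one parallel batch.
suggestions = [(32, 20), (64, 30), (128, 60), (64, 50)]

# Evaluate all suggested configurations concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, suggestions))

best = suggestions[scores.index(max(scores))]
print(best)  # (64, 30) scores highest with this toy objective
```

When each evaluation is a multi-hour training run rather than an arithmetic toy, running the batch concurrently is what makes tuning many models at once feasible.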
Q: What are you working on next?
SM: We’re busy working on a similar project within this domain of predictive maintenance, but on a slightly different type of problem—purely looking to the future and estimating how much time a piece of equipment has until its ultimate failure. This is very complex, because we’re talking about nonlinear-to-nonlinear and nonlinear-to-linear mappings. The input space is very noisy, with lots of uncertainty.
AL: One of the things we mentioned in the talk was some of the common problems we see in maintenance. One of them is mean time to failure, and another is optimization. That’s actually part of what we’re working on now: time to failure—how long do you have until critical failure? It’s something that has been consuming us for the past couple of years.
Q: As you’re developing these tools to be used by experts in their domain, have you seen any cultural friction in interacting with folks who may not have as much familiarity or comfort using these AI-based tools?
SM: I think we can talk from experience, really. One of the assets we built for presented a bit of a paradox: it took longer for the users to adopt the tool and change their workflows. The operators on that specific asset were quite new, just 1-2 years in place, and were still getting a handle on the operation of the asset.
On the more mature assets that have been there for 10-15 years, the operators have more mature processes and procedures. They’ve gone through the pains, restrictions, and limitations of more schedule-based maintenance or condition-based monitoring tools with lots of false alarms. So they’re actually the ones who come to us and ask for something better—in my experience.
AL: Within the Industrial Analytics team, adoption is considered on all projects, as is change management. A change manager is embedded within the team who helps put together training material, and who keeps an eye on what needs to be done and the implications of introducing such a tool.
There are various things that we usually do to help with that, like building a tool with domain experts embedded in the team so it’s not the usual six data scientists working in isolation for two years. One of your team members is an engineer who knows the facility, so they’re involved in all the decisions from the start.
Q: What does it take to make a tool successful in the domain that you’re working on?
AL: There’s another interesting problem that is talked about less: overconfidence in AI. People sometimes simply trust the predictions and will make large decisions based on them. It’s made Shayan and me realize the importance of factoring uncertainty into your modeling, translating it for users, and making sure that whoever is using the tool knows how to read the uncertainty in its predictions.
SM: We give this message from day one to set expectations: this is not an exact predictive tool. It has to be looked at with uncertainty.
From SigOpt, Experiment Exchange is a podcast where we find out how the latest developments in AI are transforming our world. Host and SigOpt Head of Engineering Michael McCourt interviews researchers, industry leaders, and other technical experts about their work in AI, ML, and HPC — asking the hard questions we all want answers to. Subscribe through your favorite podcast platform including Spotify, Google Podcasts, Apple Podcasts, and more!