Free Course Video #2:

The Principles of an Effective Maintenance Program

This video is from Lesson 3 of Module 2 of the course, Developing and Improving Preventive Maintenance Programs (PM100).

Key points

In this lesson, we’ll delve deep into the Principles that underpin an Effective Maintenance Program. This is probably the most important lesson in this whole course.

If you can really understand and apply the principles of an effective preventive maintenance program, you are going to succeed on your Road to Reliability™.

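The key points I want students to take away from this lesson are:

  • Most failures are not age-related; instead, they have a constant probability of failure
  • You need to align your maintenance tasks to the characteristics of the failure mode you’re dealing with
  • You need to focus on the failure modes that matter so that you avoid wasting your scarce resources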

What you’ll learn

This lesson is part of Module 2 of the course which is really an introduction to the basics of Preventive Maintenance. Some of the other things we discuss in this module are:

  • The principles that underpin effective Preventive Maintenance so that you can apply these when you improve your own PM program
  • How the type of maintenance you do needs to align with the characteristics of the failure mode so that you develop effective PM programs
  • How to use Operating Context and Consequence of Failure in your decision making so you develop effective and efficient PM programs

Please note: if you are interested in the course in one of these languages either with subtitles or with a voiceover in your native language, please contact me directly. We are working hard on getting the course translated into all these languages, but this will take some time.
Video Transcript - LESSON 2.3

The Principles of an Effective Maintenance Program

All right, welcome to lesson three of module two, in which we’re going to delve deep into the principles that underpin an effective maintenance program. Now, heads up: for most people, this will be quite an intense lesson where we cover a lot of ground, so you may need to watch it more than once. And this is probably one of the most important lessons in this whole course. If you can really understand and apply the principles of an effective preventive maintenance program, you are definitely going to succeed on the road to reliability.


Now, although there is a lot to take in from this lesson, keep these three things in mind. First, most failures are not age-related; instead, they have a constant probability of failure. Second, you need to align your maintenance tasks to the characteristics of the failure mode you’re dealing with. That means that if it’s a random failure mode or a hidden failure mode, that is going to influence the type of maintenance you need to use. And key point number three is that you need to focus on the failure modes that matter, so that you avoid wasting your scarce resources doing maintenance that does not contribute to the success of your business. In the end, we all struggle with a limited number of resources, and the last thing you want to do is burden your scarce resources with PM tasks that just don’t add any value.


All right, let’s get started. So in the last lesson, we talked about RCM. Nowadays RCM is defined through international standards, but all modern-day RCM approaches can be traced back to the work done in the 1960s and 1970s that culminated in the Nowlan and Heap report in 1978. And that’s now a good 50 years ago. So you would expect any maintenance and reliability professional to be familiar with it: it’s been around long enough, it’s well documented, and it’s widely available. But unfortunately, I find that’s just not the case. The principles of modern maintenance, as developed on the journey to reliability-centered maintenance, are simply not well known or understood, let alone applied. In the rest of this lesson, I’m going to outline those key principles. These principles should underpin any sound maintenance program. One of the best summaries of these principles that I’ve found is actually in a United States Navy source.


And it’s the Navy RCM handbook. In the resources section of this lesson, I’m going to provide the link to it so you can find it. It is in the public domain, and it is worth downloading and reading.

All right, so let’s look at principle number one: accept failures. Not all failures can be prevented by maintenance. Some failures are a result of events outside our control. Think lightning strikes or flooding. For events like these, more or better maintenance makes no difference. Instead, the consequences of events like these should be mitigated through design. And maintenance can do little about failures that are a result of poor design, lousy construction, or bad procurement decisions.


In other cases, the impact of the failure could be low, so you simply accept the failure. Think of a light bulb that blows. So good maintenance programs do not try to prevent all failures. Good maintenance programs accept a level of failure and are prepared to deal with the consequences of those failures.

Principle number two: most failures are not age-related. As explained in lesson two of this module, when we looked at the development of reliability-centered maintenance by the airline industry, we learned that most failure modes that equipment experiences are not age-related. Instead, for most failure modes, the likelihood of occurrence is random, meaning that the likelihood of failure does not increase over time as the equipment gets older. Now, this initial research was done in the 1960s and was limited to airlines. In fact, it was initially limited to a single airline, United Airlines. It was later repeated by research done on another airline, and then repeated again by the United States Navy and submarine programs, and they all found very similar results.


And that is what’s shown in this chart. What I show here are the six failure patterns that were identified as part of the development of RCM by Nowlan and Heap. Now, you can download this chart as a PDF from the resources section in this lesson, and I would highly recommend you do that, print it off, and display it somewhere in or around your workstation so that you can refer to it. So next time you’re talking to people about failure modes and preventive maintenance, you can look at the chart and have a conversation about what type of failure mode and failure pattern it is. Now, this is a fairly complex chart with a lot of information, so let me explain what you see here. First of all, you see the six failure patterns that Nowlan and Heap identified as part of their groundbreaking work developing RCM.


In the next four columns, right next to those failure patterns, you see how often each failure pattern was observed in different research studies. The first column, titled UAL 1968, is the original United Airlines research conducted by Nowlan and Heap in 1968. That was the research that underpinned the development of RCM. The next column, titled Broberg 1973, is another airline study, conducted a few years after the RCM study, which came to essentially the same conclusions about the distribution of failure patterns. The third column, titled MSP 1982, is a more recent study conducted by the United States Navy. And the fourth column, SUBMEPP 2001, is the latest study that I’ve seen quoted in the literature and is a study on US submarines.


Now, you don’t really need to remember all the detail, other than that the first two columns are airline-related and the third and fourth are Navy and submarine-related. When we look at some of the differences in a minute, you’ll see that the origin of each study, meaning the equipment and the operating environment, does play a role in some of the findings. All right, let’s move on. In the top section of the chart, you see the age-related failure patterns, which are patterns A, B, and C, and these patterns do show that the likelihood of failure increases over time.


The bottom section of the chart shows the non-age-related failures, or the random failure patterns, D, E, and F, where in essence, the likelihood of failure stays constant over the operating life of the equipment. All right, so let’s have a look at what the data from these studies is actually telling us. First, let’s consider the original data set from United Airlines as gathered by Nowlan and Heap in 1968. It shows that most failure modes occur randomly, and you can see this at the bottom of the table. You see age-related at 11%, which is failure patterns A, B, and C added together, versus random at 89%, which is failure patterns D, E, and F combined. So almost 90% of the equipment in the United Airlines study showed a random likelihood of failure.
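As a small worked check of that arithmetic, the sketch below adds up the per-pattern shares for the UAL 1968 column. The individual percentages (A 4%, B 2%, C 5%, D 7%, E 14%, F 68%) are the figures commonly quoted from the Nowlan and Heap report. They are not stated in this transcript, so treat them as an assumption and check them against the chart in the resources section.

```python
# Share of each failure pattern in the UAL 1968 study, in percent.
# Assumption: these per-pattern values are the commonly quoted figures;
# the transcript itself only gives the 11% / 89% totals.
UAL_1968 = {"A": 4, "B": 2, "C": 5, "D": 7, "E": 14, "F": 68}

AGE_RELATED_PATTERNS = ("A", "B", "C")
RANDOM_PATTERNS = ("D", "E", "F")


def combined_share(study, patterns):
    """Add up the percentage share of the given failure patterns."""
    return sum(study[pattern] for pattern in patterns)


age_related_pct = combined_share(UAL_1968, AGE_RELATED_PATTERNS)
random_pct = combined_share(UAL_1968, RANDOM_PATTERNS)

print(f"Age-related (A+B+C): {age_related_pct}%")
print(f"Random (D+E+F): {random_pct}%")
```

The same function can be pointed at the other columns of the chart once you have typed in their values.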


So what I was saying is, apart from showing that most failures occur randomly, it was also highlighting that infant mortality is common and that it typically persists. That means that the probability of failure only becomes constant after a significant amount of time in service.


Now, one thing I want to say here is: don’t interpret curves D, E, and F to mean that some items never degrade or wear out. Everything degrades with time. That’s life, right? But many items degrade so slowly that wear-out is not a practical concern. These items do not reach the wear-out zone in normal operating life. So what do these patterns tell us about our maintenance programs? Well, historically, maintenance was done in the belief that the likelihood of failure increased over time. That was second-generation maintenance thinking.


It was thought that well-timed maintenance could reduce the likelihood of failure. Well, it turns out that for the majority of equipment, this is simply not the case. For the majority of equipment, which has a constant probability of failure, there is no point in doing a time-based life-renewal task like servicing or replacement. It simply makes no sense to spend maintenance resources to service or replace an item whose reliability has not degraded and whose reliability cannot be improved by that maintenance task. In practice, that means that the majority of our equipment would benefit from some form of condition monitoring, and only a small minority can be effectively managed by time-based replacement or overhaul. Yet most of our preventive maintenance programs are full of time-based replacements and time-based overhauls. So there’s some food for thought.
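To see why a time-based replacement does nothing for an item with a constant probability of failure, here is a minimal sketch. It assumes a Weibull life model, which is my choice for illustration and not something the lesson prescribes: a shape of 1.0 gives a constant failure rate, while a shape above 1 gives wear-out. The scale, ages, and window are made-up numbers.

```python
import math


def survival(age, scale, shape):
    """Weibull probability of surviving to `age` (assumed model)."""
    return math.exp(-((age / scale) ** shape))


def failure_prob_in_window(age, window, scale, shape):
    """P(fail within the next `window` hours | survived to `age`)."""
    return 1.0 - survival(age + window, scale, shape) / survival(age, scale, shape)


SCALE_HOURS = 10_000   # hypothetical characteristic life
WINDOW_HOURS = 500     # hypothetical look-ahead window

for age in (0, 1_000, 5_000):
    random_p = failure_prob_in_window(age, WINDOW_HOURS, SCALE_HOURS, shape=1.0)
    wearout_p = failure_prob_in_window(age, WINDOW_HOURS, SCALE_HOURS, shape=3.0)
    print(f"age {age:>5} h: random {random_p:.4f}, wear-out {wearout_p:.4f}")
```

With shape 1.0 the chance of failing in the next window is the same for a brand-new item and an old one, so swapping the old one for a new one buys nothing. With shape 3.0 the chance grows with age, which is the only situation where a time-based replacement can pay off.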


Now, let’s have a look at the main differences between what the airline studies found and what the United States Navy and the US submarine studies found. The first major difference occurs with failure patterns B and C. In the airline studies, both these patterns were observed relatively infrequently, but the US Navy and the US submarine studies found that a sizeable amount of equipment failures did show either a clear period of wear-out towards the end of life or a steady degradation over time. This is typically attributed to the fact that United States Navy ships and submarines operate in a much more corrosive environment than airlines. And of course, corrosion is typically an age-related failure mode.


Now, the other big difference is the low amount of infant mortality, which, from what I see in the various literature sources, has been put down to the extensive testing that both the Navy and the submarine environments require before equipment is put into service. Personally, I also wonder whether the very strict maintenance regime on US submarines around quality may also have led to fewer maintenance-induced errors. But to be honest, I have no evidence for that statement. So let’s go with what the literature tells us, which is that the reduction in infant mortality is most likely due to the extensive testing that equipment in both the United States Navy and submarine environments was subjected to.


If there’s one thing you take away from this chart for now, it should be that between 70 and 90% of failure modes are not age-related, meaning that the likelihood that these failure modes occur does not increase over time. Instead, the likelihood of failure is constant over time for most failure modes. In other words, most failures occur randomly. And as I mentioned before, that has major implications for our preventive maintenance program, because if most failure modes are not age-related, then we should be addressing these failure modes with some type of condition assessment, some type of condition-based maintenance. And so a very large part of your preventive maintenance program should be condition-based maintenance. So make sure you download this chart from the resources section of this lesson, print it out, and hang it somewhere near your workstation. And next time you have a conversation with someone about a failure mode on a piece of equipment, have a think about what type of failure pattern that failure mode is likely to exhibit and what that might mean for the maintenance you need to apply.


All right, time to move on to principle number three: some failures matter more than others. Pretty clear, but not always practiced. So when you’re deciding on whether to do a maintenance task, you need to consider the consequences of not doing it. What would be the consequence of letting that specific failure mode occur? What would happen? Would it really matter? Avoiding that consequence is, in essence, the benefit of your maintenance, the return on your investment in doing maintenance, because that’s exactly how maintenance should be seen: as an investment. You incur a maintenance cost in return for a benefit in sustained safety and reliability. And as with all good investments, the benefit should outweigh the original investment. So understanding the consequences of failures is key to developing a good maintenance program, one with a good return on investment.


And just as not all failure modes have the same probability, not all failure modes have the same consequence. Are you worried if one of the lights in the general lighting on your factory floor fails? No, probably not. Are you worried if your site siren fails or one of your production-critical pumps fails? Well, yeah, of course you are. So some failures matter more than others, and it is your job to understand which ones matter and which ones matter less, and then adapt your preventive maintenance program to this difference.


Consider a leaking tank. How worried would you be? Well, that depends, right? It depends on what’s in the tank, where the tank is, and many more factors. So let’s say it was a potable water tank used in a non-critical service with a minor pinhole leak. Okay, so it needs to be fixed and we want to prevent it, but would it be a major problem? Probably not. We could probably run with a minor leak of potable water for quite some time, or we could make a quick temporary repair. Now, let’s say it was still a water tank, but the leak was actually quite severe and the water was used for firefighting purposes in a hazardous facility. Okay, so now we’re definitely more concerned. That’s not something we want to deal with for too long, if at all. Now, imagine this: the leak was still quite severe, but the tank actually contained a highly flammable liquid. Okay, well, in this case, we’ve got a major issue on our hands. It’s something we need to fix urgently, pretty much immediately, but it’s also a scenario we would really want to avoid.


Now, apart from the consequences of failure, you also need to think about the likelihood of the failure actually occurring. Maintenance tasks should really be developed for dominant failure modes only: those failure modes that occur frequently, and those that would have serious consequences but are maybe less frequent or even rare. Avoid assigning maintenance to non-credible failure modes unless the consequences are really, really, really severe. And you want to avoid analyzing non-credible failure modes if you can, because it just eats up your scarce resources for no return. A maintenance program needs to consider both the consequence and the likelihood of failures. And since risk is the combination of likelihood and consequence, we can actually conclude that a good maintenance program is risk-based. Good maintenance programs use the concept of risk to assess where we use our scarce resources to get the greatest benefit, the biggest return on our investment.
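The idea of combining likelihood and consequence can be sketched in a few lines. This is only an illustration under assumptions I am adding: the 1 to 5 scoring scales, the multiplication rule, the threshold of 8, and the scores given to each failure mode are all hypothetical and would come from your own risk matrix in practice.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    likelihood: int   # hypothetical scale: 1 (rare) to 5 (frequent)
    consequence: int  # hypothetical scale: 1 (negligible) to 5 (severe)

    @property
    def risk(self) -> int:
        # Assumption: risk is scored as likelihood x consequence.
        return self.likelihood * self.consequence


def warrants_pm_task(fm: FailureMode, risk_threshold: int = 8) -> bool:
    """Dominant failure modes: high risk, or rare but with severe consequences."""
    return fm.risk >= risk_threshold or fm.consequence >= 5


failure_modes = [
    FailureMode("General lighting bulb blows", likelihood=4, consequence=1),
    FailureMode("Production-critical pump seal leaks", likelihood=3, consequence=4),
    FailureMode("Flammable liquid tank leaks", likelihood=1, consequence=5),
]

# Rank by risk so scarce resources go to the failure modes that matter most.
ranked = sorted(failure_modes, key=lambda fm: fm.risk, reverse=True)
for fm in ranked:
    decision = "assign a task" if warrants_pm_task(fm) else "accept the failure"
    print(f"{fm.name}: risk {fm.risk} -> {decision}")
```

Note the second condition in `warrants_pm_task`: a pure risk score would rank the flammable tank leak barely above the light bulb, so the sketch keeps rare but severe failure modes in scope explicitly.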


All right, on to principle number four. Principle number four states that parts might wear out, but your equipment breaks down. So what does that mean? Well, a part is usually a simple component, something that has relatively few failure modes. Examples are the timing belt in a car, the roller bearing on a drive shaft, or the cable on a crane. Simple items often provide early signals of potential failure if you know where to look. And so we can often design a condition-based maintenance task to detect potential failure early on and take action prior to failure. And for those simple items which do wear out, there’ll be a strong increase in the probability of failure past a certain age. If we know the typical wear-out age of a component, we can schedule a time-based task to replace it before it fails. But when it comes to complex items, complex pieces of equipment made of many simple components, things are different.


All those simple components have their own failure modes with their own failure patterns. Because complex items, complex pieces of equipment, have so many varied failure modes, they typically do not exhibit a wear-out age. Their failures do not tend to be a function of age but occur randomly, and their probability of failure is generally constant. That’s represented by curves E and F. Most modern machinery consists of many, many components, and it needs to be treated as a complex item. That means no clear wear-out age, and without a clear wear-out age, performing time-based overhauls is just ineffective. It’s a waste of your scarce resources. Only where we can really prove that an item has a wear-out age does performing a time-based overhaul or a component replacement make sense. And if not, you need to look towards condition-based maintenance.


Principle number five states that hidden failures must be found. Hidden failures are failures that remain undetected during normal operation. They only become evident when you need the item or piece of equipment to work, and it doesn’t. This is often referred to as failure on demand. The only other time when a hidden failure becomes evident is when you actually conduct a test to reveal the failure. That’s a failure-finding task. Hidden failures are often associated with equipment with protective functions, something like a [inaudible 00:17:58]. Protective functions like these are not normally active. They’re only required to function by exception, to protect your people from injury or death, to protect the environment from a major impact, or to protect your assets from major damage. And that means we pretty much always conduct failure-finding tasks on equipment with protective functions. To be clear, a failure-finding task does not prevent a failure. Instead, a failure-finding task does exactly what its name implies. It seeks to find a failure, a failure that has already happened but just hasn’t been revealed to us. It has remained hidden.


So we must go and find those hidden failures and fix them before the equipment is required to operate. Otherwise, we would have a failure on demand, and a potentially dangerous situation can occur as our protective function is not functioning when required.

Principle six tells us that identical equipment does not mean identical maintenance. Just because you’ve got two pieces of equipment that are the same, that does not mean they need the same maintenance. In fact, they may need completely different maintenance types. The classic example is two exactly identical pumps in a duty-standby setup: same manufacturer, same model, and both pumps process the exact same fluid under the same operating conditions, but pump A is the duty pump and pump B is the standby. Pump A normally runs, and pump B is only used when pump A fails.


When it comes to failure modes, pump B has an important hidden failure mode: it might not start on demand. In other words, when pump A fails or is under maintenance, you suddenly find that pump B won’t start. Now, pump B doesn’t normally run, so you wouldn’t know it couldn’t start until you actually came to start it. That’s the classic definition of a hidden failure mode, and hidden failure modes like this require a failure-finding task, i.e. you go and test to see that pump B will start. But you don’t need to do this for pump A, because pump A is always running. So when building a maintenance program, you must consider the operating context. If you run identical equipment in a different way or under different operating conditions, it may require different maintenance. And a difference in criticality can also lead to different maintenance needs: safety-critical or production-critical equipment may need more monitoring and testing than the same equipment in a low-criticality service.
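The duty-standby example can be written down as a small sketch of how operating context drives task selection. Everything here is illustrative: the `Pump` fields, the role names, and the task descriptions are hypothetical placeholders, not entries from any real maintenance library.

```python
from dataclasses import dataclass


@dataclass
class Pump:
    tag: str
    model: str
    role: str  # operating context: "duty" or "standby"


def maintenance_tasks(pump: Pump) -> list:
    """Pick tasks from the operating context, not from the model alone."""
    # Hypothetical task that applies to this pump model in any role.
    tasks = ["condition-based inspection"]
    if pump.role == "standby":
        # A standby pump can fail to start without anyone noticing,
        # so it needs a failure-finding task to reveal that hidden failure.
        tasks.append("failure-finding task: periodic start test")
    return tasks


pump_a = Pump(tag="P-101A", model="X200", role="duty")
pump_b = Pump(tag="P-101B", model="X200", role="standby")

for pump in (pump_a, pump_b):
    print(pump.tag, pump.role, maintenance_tasks(pump))
```

Both pumps share the same model, yet the function returns different task lists, which is the point a generic library strategy can miss if it keys only on equipment type.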


Look, it’s really important to reinforce this point that identical equipment may need different maintenance. This point is far too often forgotten, or just simply ignored for convenience, but if you ignore this basic concept, you could find yourself facing some critical failures, especially if you’re using a library approach for preventive maintenance strategies. Now, I’m all for using generic maintenance strategies that you roll out across many different pumps, but you really do need to take into consideration different operating contexts and how they may impact the maintenance you need to do.


All right, principle number seven: you can’t maintain your way to reliability. Now, I love this quote. It came from Terrence O’Hanlon and it is so very true. Maintenance can only preserve your equipment’s inherent design reliability and performance. If the equipment’s inherent reliability or performance is poor, doing more maintenance will not help. No amount of maintenance can raise the inherent reliability of a design. To improve poor reliability or poor performance that’s due to poor design and poor construction, you need to change the design or you need to fix the installation. Simple. And when you encounter failures, defects that relate back to those design issues, you need to eliminate them. Sure, the more proactive and more efficient approach is to ensure that the design is right to begin with, but all plants start up with design defects, even proactive plants. And that’s why the most reliable plants in the world have an effective defect elimination program in place.


Principle eight tells us that good maintenance programs don’t waste your scarce resources. It seems pretty obvious, but when I’m reviewing PM programs, I often find many, many tasks that add very little value, tasks that waste resources and actually reduce reliability and availability. It’s so common for people to say, “Well, while we’re here and doing this, let’s also check this, it only takes five minutes, and maybe also do this.” But it’s five minutes here and five minutes there, every week, every month. And suddenly, we’ve wasted a huge amount of time and potentially introduced a lot of defects that can impact equipment reliability down the line.


Another source of waste in our preventive maintenance program is trying to maintain a level of performance and functionality that we don’t actually need. You see, equipment is often designed to do more than what it is required to do in its actual operating conditions. And as maintainers, we should be very careful about maintaining to design capacities versus operating capacities. In most cases, we should maintain our equipment to deliver operating requirements. Maintenance done to ensure equipment capacity greater than actually needed is simply a waste of resources. Another quite common cause of waste is assigning multiple tasks to a single failure mode. It’s wasteful, and it actually makes it hard to determine which task is actually effective. In general, I recommend you stick to the rule of one single effective task per failure mode as much as you can. Only for very high consequence failure modes should you consider having multiple diverse tasks for a single failure mode.


Most organizations have more maintenance to do than resources to do it with. Use resources on unnecessary maintenance, and you’re just not completing necessary maintenance. And not completing necessary maintenance, or completing it late, increases the risk of failures. When the unnecessary maintenance is intrusive, it gets even worse. Experience shows that intrusive maintenance leads to increased failures because of human error. That could be because of simple mistakes, defective materials or parts, or errors in technical documentation. And a lot of maintenance is done with equipment offline, so doing unnecessary maintenance can actually increase production losses too. So make sure you remove all that unnecessary maintenance from your system, make sure you have clear and legitimate reasons for every task in your maintenance program, and make sure you link all those tasks to a dominant failure mode.


And finally, principle nine tells us that good maintenance programs become better maintenance programs. The most effective preventive maintenance programs are dynamic. They are changing and they’re continuously improving, always making better use of our scarce resources, always becoming more effective at preventing those failures that matter to our business. When improving your preventive maintenance program, you need to understand that not all improvements have the same leverage. First, focus on eliminating unnecessary maintenance tasks. This not only eliminates the direct maintenance labor and materials, but it also removes the effort required to plan, schedule, manage, and report on this work.


Second, change time-based overhauls or replacements into condition-based tasks. Instead of replacing a component every so many hours, use a condition monitoring technique to assess how much life the component has left, and only replace the component when actually required. And third, extend task intervals. Now, you can do this based on analyzing your data, you can do this based on talking to your operators and maintainers, or you can use good engineering judgment. Remember to observe the results. The shorter the current interval, the greater the impact when extending that interval. For example, changing a daily task to a weekly task reduces the required PM workload for that task by more than 80%. Changing task frequencies is often the simplest and one of the most impactful improvements that you can make to your preventive maintenance program.
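The daily-to-weekly figure is easy to check. The sketch below assumes the workload for a task is simply proportional to how often it is executed, so the reduction depends only on the ratio of the old interval to the new one. The function name and the comparison case are my own additions for illustration.

```python
def workload_reduction(old_interval_days: float, new_interval_days: float) -> float:
    """Fraction of task executions removed by extending the interval.

    Assumption: workload is proportional to the number of executions.
    """
    return 1.0 - old_interval_days / new_interval_days


# Daily task changed to weekly: 1 execution instead of 7 per week.
daily_to_weekly = workload_reduction(1, 7)

# Adding the same 6 days to a 30-day task has far less effect.
monthly_plus_six_days = workload_reduction(30, 36)

print(f"Daily to weekly: {daily_to_weekly:.1%} fewer executions")
print(f"30 days to 36 days: {monthly_plus_six_days:.1%} fewer executions")
```

Six of every seven executions disappear in the first case, which is about 86% and matches the "more than 80%" in the lesson. The second case shows why short intervals are the best place to start: the same six-day extension on a monthly task removes only about one execution in six.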


All right, so that brings me to the end of this first lesson. Now, I know I’ve covered a lot in this lesson and that some of these topics can be a little bit hard to grasp, but don’t worry. We’re going to cover all of this in more detail as we go through the course. And don’t forget, you can always watch this video again. You can review the slides or even listen to the audio on your way to work. All right, in terms of summarizing what we’ve talked about in this lesson, keep these three things in mind.


First, most failures are not age-related; instead, they have a constant probability of failure. Second, you need to make sure you align your maintenance tasks to the characteristics of the failure mode you’re dealing with. So if it’s a random failure mode or a hidden failure mode, that’s going to influence the type of maintenance you need to use. And the third key point is that you need to focus on the failure modes that matter for your business, and you need to make sure you avoid wasting resources. We all struggle with a limited number of resources, especially in this day and age. And the last thing you want to do is to burden your scarce resources with preventive maintenance tasks that don’t add any value.

PM100: Developing and Improving Preventive Maintenance Programs

Achieve higher reliability and availability whilst doing less maintenance. Acquire the knowledge and tools you need to create a highly effective and efficient Preventive Maintenance Program.
