I’ve heard from users new to the automation profession that it can be overwhelming to learn the product technologies used in industrial automation while also learning the underlying concepts and delivering results on the job. This is the first post of our Automation 101 blog series, which is designed to help professionals who are new to the industrial automation space, whether they are at the start of their careers or moving into the operations technology (OT) world from an IT or other background.
In this post, I am going to discuss redundancy. The term is often used broadly, and sometimes it can be a simple application, but it can also quickly unravel into a complex discussion with lots of technical details, more than can be covered in a single blog post. Whenever a client contacts us to discuss redundancy, there are many questions we ask to understand what redundancy means in their operation.
My focus will be on providing insight into the reasons why redundancy is used, the business factors that drive how far you go with redundancy, the types or levels found in automation systems, and considerations in the implementation of automation software systems.
What is redundancy? Redundancy is about having backup components in a system that can take over if and when the primary components fail. But a backup of what components? I would say any component that is critical to your operations and could be a single point of failure that could stop your entire process.
Some of the components we have seen people implement in redundant pairs include but are not limited to:
- Physical Networks
- Network Interface Cards (NIC)
- Input sensors
- Output control devices
- OPC Server software
- Physical servers
- HMI/SCADA operator stations & servers
- Power sources/UPS
Points of Failure, Bumps, and Consequences
The goal with redundancy is to eliminate single points of failure and provide reliable uptime for your process. When thinking through your needs for redundancy, you should review the system as a whole and understand the consequences if any one piece fails. The primary consequences of unplanned downtime include lost production capacity, scrap product, and risks to worker and facility safety, among others. The higher the business costs of these factors, the more likely it is that redundancy will be essential to maximizing production for your systems.
There’s no limit to what you can spend on redundancy, but every decision must start with considering what the price of failure will be without redundant systems in place.
A common term in redundancy is “bump”. A “bump” is an interruption in the process, such as an unplanned line stoppage or machine shutdown. Every industrial process is different in terms of what the consequences of a bump are. If you are in a continuous sheet steel production process with a constant flow of input raw materials and output of a continuous sheet of steel, with downstream machines that slit the steel into different sizes, you can imagine how a stoppage of any one part could cause havoc!
Continuous processes will often use accumulators that can buffer up a certain amount of product to allow for short stops or pauses, but even those can only go so far. In paper production, which involves a continuous feed of wet paper pulp onto a web moving at extremely high speeds, a bump of more than a few hundred milliseconds might be unacceptable. The shorter the “bump” that your process can withstand, the more important redundancy will be in your automation system.
Another factor that affects redundancy requirements is how long it takes to restart the production process if it stops. There are some continuous processes that can take hours, or even days, to restart after a shutdown. The longer the time to restart, the more important redundancy will be.
Similarly, process shutdown requirements matter. There are processes in chemicals and oil refining that cannot be shut down in a disorderly manner without catastrophic consequences. With those systems, you will find specialist suppliers that make triple redundant safety shutdown systems with triple redundant inputs and outputs and complex 2 out of 3 voting schemes to ensure that, no matter what, the process is brought down in an orderly fashion! Taken to its logical extent, a fully redundant system has no single points of failure. Each hardware or software component would need to be redundant or, as required, support the redundant architecture.
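To make the 2 out of 3 voting idea mentioned above concrete, here is a minimal sketch. The function names are hypothetical, and the median selection for analog readings is an assumption about one common way to combine three transmitters; real safety systems implement voting in certified hardware and firmware, not in a script like this.

```python
from typing import Sequence


def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Return True when at least two of the three channels demand a trip."""
    return (a and b) or (a and c) or (b and c)


def median_of_three(readings: Sequence[float]) -> float:
    """Pick the middle value of three analog readings (assumed selection rule).

    A single transmitter that fails high or low is ignored because the
    middle value always comes from one of the two that still agree.
    """
    if len(readings) != 3:
        raise ValueError("exactly three readings are required")
    return sorted(readings)[1]


# One channel failing "stuck on" does not trip the process by itself.
print(vote_2oo3(True, False, False))
# Two channels agreeing does trip it.
print(vote_2oo3(True, False, True))
# A transmitter failed high (999.0) is outvoted by the other two.
print(median_of_three([101.2, 999.0, 100.8]))
```

The point of the scheme is that one faulty channel can neither cause a false shutdown nor block a real one.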
As a controls engineer, it’s your job to seek to understand what the right level of redundancy is for your process, and then identify the system components that are most likely to cause failure and that will need to be redundant to achieve the desired level of system reliability.
Failover, Failback, Visibility
Once you know what you want to make redundant, you next have to decide when and how you fail over between systems. What signals and criteria will you use to determine that a primary system is not available? Can those signals tell you fast enough to ensure you fail over without causing an unacceptable bump in your process? Picking the right criteria and signals is a careful balance. If you pick things that are too sensitive, you could get false positives and generate unnecessary failovers. If they aren’t sensitive enough, you’ll detect real failures too late and get avoidable bumps.
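The sensitivity trade-off described above can be sketched with a heartbeat check that only declares the primary failed after several consecutive missed checks. The class name, the 0.5 second timeout, and the threshold of three misses are all illustrative assumptions, not values from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverDetector:
    timeout_s: float = 0.5    # assumed: silence longer than this counts as a miss
    misses_to_trip: int = 3   # assumed: consecutive misses needed to fail over
    _misses: int = field(default=0, init=False)

    def check(self, seconds_since_heartbeat: float) -> bool:
        """Record one periodic check; return True when failover should trigger."""
        if seconds_since_heartbeat > self.timeout_s:
            self._misses += 1
        else:
            self._misses = 0  # a healthy check clears earlier misses
        return self._misses >= self.misses_to_trip


detector = FailoverDetector(timeout_s=0.5, misses_to_trip=3)
# A single late heartbeat is tolerated; three in a row trigger failover.
for age in (0.1, 0.6, 0.1, 0.6, 0.7, 0.8):
    print(age, detector.check(age))
```

Lowering `misses_to_trip` makes detection faster but more prone to false positives. Raising it does the opposite, and the worst-case detection time grows to roughly the check interval multiplied by the threshold, which has to fit inside the bump your process can tolerate.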
You will need to decide which system will make the decision to fail over to the backup system. Where that decision occurs will be specific to each process and application and to the acceptable bump time.
For something so critical, you’ll want to ensure the right people know that a failover has occurred, when, and why, so that they can take action to fix the primary system and prepare for possible bigger issues if the backup system then fails before the primary is returned to service. To do this, the systems that make the failover decisions should provide data that can be monitored by alarming systems that generate email, SMS, or other callouts to the right people. Those systems may even have their own built-in ability to generate notifications. The key is that you must have visibility into the system status at all times, and some means of proactive notification of an abnormal state.
Next you will need to decide how you know the primary system is back. Do you want the primary system, when it returns, to become the new secondary system, or do you want the process to automatically fail back to the primary once the primary has been back for a specified period of time? Would you rather be notified that the primary is back and then manually fail back to it? These are all decisions you’ll want to make before you start choosing software and hardware components.
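The three failback choices above can be written down as an explicit policy, which is a useful exercise before evaluating products. This is only a sketch: the policy names, the `should_fail_back` function, and the 300 second default hold time are hypothetical.

```python
from enum import Enum


class FailbackPolicy(Enum):
    STAY_ON_BACKUP = "stay"  # the returned primary becomes the new secondary
    AUTO = "auto"            # fail back once the primary has been healthy long enough
    MANUAL = "manual"        # notify, then wait for an operator to confirm


def should_fail_back(
    policy: FailbackPolicy,
    primary_healthy_for_s: float,
    hold_time_s: float = 300.0,      # assumed stability period
    operator_confirmed: bool = False,
) -> bool:
    """Decide whether to switch back to the primary under the chosen policy."""
    if policy is FailbackPolicy.AUTO:
        return primary_healthy_for_s >= hold_time_s
    if policy is FailbackPolicy.MANUAL:
        return operator_confirmed and primary_healthy_for_s > 0
    return False  # STAY_ON_BACKUP never switches back by itself


# Primary has only been back for a minute, so automatic failback waits.
print(should_fail_back(FailbackPolicy.AUTO, primary_healthy_for_s=60))
print(should_fail_back(FailbackPolicy.AUTO, primary_healthy_for_s=600))
```

The hold time exists so that a primary that keeps dropping in and out does not cause the process to switch back and forth repeatedly.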
Next, let’s explore a few redundancy areas that we commonly hear about from users as they integrate their software applications into their automation systems.
The goal of network redundancy is to prevent loss of connectivity to other systems over the network. Whether this matters to you depends on the type of system you are running. A standalone HMI talking to a single PLC with a direct connection to the PLC, that otherwise only has network access as a “nice to have” feature, may not need redundancy.
Systems that have multiple operator stations, multiple PLCs connected over Ethernet, PLCs connected to drives over Ethernet, multiple servers, or inter-system communications may have serious redundancy requirements. If losing network connectivity means your process could stop or go out of control, and the costs associated with that are unacceptable, then you’ll need a redundant network.
That may involve redundant network wiring, switches, possibly a ring network, and any other network infrastructure needed to ensure that network traffic keeps flowing even if one piece of the network fails. The goal is to provide redundant communications paths.
At the PC or server level, you may need redundant network cards (NICs) in the computer. This way, if the primary NIC fails, your network traffic can use the backup NIC.
Whatever level of redundancy you have, your software used for data collection, logging, and reporting will need to be able to work with your architecture smoothly through failovers. Some parts of the network infrastructure are transparent to the computer and software and may not need any special capability in the software. Typically redundant network switches fall into this category.
When you start having redundant network cards in the computer, and redundant network paths with different IP addresses to reach the same target device, your data collection software such as OPC server software may need to have settings to know when it should failover to the backup network.
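As a rough illustration of what "knowing when to fail over to the backup network" involves, the sketch below tries an ordered list of network paths to the same device and uses the first one that answers. The function name, IP addresses, and port are placeholders, and real OPC server software exposes this as configuration rather than code.

```python
import socket
from typing import Callable, Sequence, Tuple

Address = Tuple[str, int]


def connect_with_fallback(
    paths: Sequence[Address],
    timeout_s: float = 1.0,
    connect: Callable = socket.create_connection,  # injectable for testing
):
    """Try each network path in order; return (connection, path) for the first success."""
    errors = []
    for path in paths:
        try:
            return connect(path, timeout=timeout_s), path
        except OSError as exc:
            # Covers refused connections, unreachable networks, and timeouts.
            errors.append(f"{path}: {exc}")
    raise ConnectionError("all paths failed: " + "; ".join(errors))


# Placeholder addresses: the same device reached over two separate networks.
DEVICE_PATHS = [("10.0.1.10", 502), ("10.0.2.10", 502)]


def fake_connect(address, timeout):
    """Stand-in connector that simulates the primary network being down."""
    if address[0] == "10.0.1.10":
        raise OSError("network unreachable")
    return "connected"


print(connect_with_fallback(DEVICE_PATHS, connect=fake_connect))
```

Note that the connection timeout on the primary path adds directly to the bump time, so it has to be chosen with the process tolerance in mind.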
Control Hardware Redundancy
It is common for PLCs running tasks critical to production to also be redundant. The idea is to provide redundancy for the control devices. Typically this is an area where your data collection software needs to also support redundancy.
If you're using OPC server software and have redundant PLCs, then the software communicating with those PLCs must be able to support this architecture. The software needs to know which PLC is primary and which is secondary, what condition will cause the failover to the secondary PLC, and when to switch back to the primary PLC.
Ideally the software will also provide ways for you to monitor and display in an HMI, SCADA, or alarming system which PLC, primary or secondary, the OPC server is communicating with.
Software & IT Hardware Systems Redundancy
We do not want to stop here and let the computer that our OPC server is running on be the single point of failure. This means we need redundant OPC servers on redundant PCs. For this to work, our OPC clients need to be able to support redundant OPC servers.
Supporting redundant OPC servers means that the OPC client – your HMI/SCADA system, Alarming system, MES system or other systems – will need to know which OPC server to communicate with. You’ll also need to decide if you want both OPC servers to be polling your devices at the same time, or if you want one OPC server to be polling, and the backup on standby but not polling. Through management of the active state of items in the OPC servers, which is the responsibility of the OPC client in the OPC client-server relationship, it’s possible to achieve both of these scenarios.
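The two polling scenarios above can be sketched as a small decision about which server's items are active. The `ServerLink` class and `select_server` function are hypothetical stand-ins, not part of any OPC toolkit; in a real client the `items_active` flag would correspond to setting the active state of the items or groups on each server connection.

```python
from dataclasses import dataclass


@dataclass
class ServerLink:
    """Stand-in for a client's connection to one OPC server (hypothetical)."""
    name: str
    items_active: bool = False  # True means this server polls the devices


def select_server(
    primary: ServerLink,
    backup: ServerLink,
    primary_ok: bool,
    poll_both: bool,
) -> ServerLink:
    """Set item active states and return the server the client should read from."""
    source = primary if primary_ok else backup
    standby = backup if primary_ok else primary
    source.items_active = True
    # Either keep the standby polling too, or leave it connected but idle.
    standby.items_active = poll_both
    return source


opc_a = ServerLink("opc-a")
opc_b = ServerLink("opc-b")

# Backup on standby, not polling: only the current source loads the devices.
print(select_server(opc_a, opc_b, primary_ok=True, poll_both=False).name)
# Primary lost: the backup's items are activated and it becomes the source.
print(select_server(opc_a, opc_b, primary_ok=False, poll_both=False).name)
```

Polling from both servers usually gives a faster switch because the backup already has current values, at the cost of doubling the communication load on the devices. Keeping the backup idle halves that load but means the first values after a failover arrive only after the backup starts polling.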
Most OPC clients do not handle these details of redundancy management. Some HMI/SCADA systems support redundancy but sometimes it involves scripting and other custom written code. So in many cases you need a piece of software to manage and optimize the connections to your two OPC servers. Your OPC client talks to the redundancy management software as if it is the actual OPC server.
Clearly there are plenty of details associated with these configurations that are beyond what I’ve been able to cover in today’s blog. Redundancy can be simple, or it can quickly become complex, depending on your business needs. It is important to think it through, know your business needs, and understand what your current applications can and can’t handle with regard to redundancy.
I'd like to hear about your experiences and challenges. Please do comment below – what types of redundancy are you using? What types of failure are you trying to protect against and why? What areas around redundancy would you like us to explore in more detail in future blog posts?