Anticipating failures within a redundant environment

There are three important potential failures in a redundant environment:


Failure of the primary OpenSwitch

The failure of the primary OpenSwitch, which means the loss of the connection between the two RCM instances and the primary OpenSwitch server, causes the following changes to the environment:

Note: The RCM administrative login does not roll over during an OpenSwitch server failure. See “Setting up the sql.ini or interfaces file” for more information about connection rollover.


Failure of the secondary OpenSwitch

The failure of the secondary OpenSwitch, which means the loss of the connection between the two RCM instances and the secondary OpenSwitch server, causes the following changes to the environment:


Failure of the primary and redundant RCM instances

Failure of an RCM instance is unlikely; however, you should be prepared for it because it can mean the loss of the environment's failover capability. If the primary RCM fails, your environment may no longer be able to fail over in a catastrophic situation, because the primary RCM is no longer running and no longer aware of the status of the system. Similarly, if the redundant RCM fails, the RCM may lose its overall ability to detect a failure of the primary OpenSwitch server, because the redundant RCM can no longer assume control of failover if the primary OpenSwitch server fails.

To gain some protection from an RCM failure, you must set the COORD_MODE parameter to “ALWAYS” in the OpenSwitch configuration file. This ensures that any logins to an OpenSwitch server fail after an RCM failure. This login failure notifies users of a problem so that you can take steps to recover, such as stopping and restarting servers.

See “Setting up the sql.ini or interfaces file” for more information about OpenSwitch configuration parameters.
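
As a pre-deployment check, the COORD_MODE setting described above could be verified with a short script. The following Python sketch is only an illustration under stated assumptions: it assumes the OpenSwitch configuration file uses "NAME = value" lines, that "#" begins a comment, and that the parameter appears under a [CONFIG] section header. The function names are hypothetical helpers, not part of OpenSwitch or the RCM. Confirm the actual file syntax in the OpenSwitch documentation before relying on it.

```python
def read_coord_mode(config_text):
    """Return the COORD_MODE value found in the configuration text, or None.

    Assumption: parameters are written as "NAME = value" and "#" starts
    a comment. Section headers and other lines without "=" are skipped.
    """
    for raw_line in config_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments (assumed syntax)
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().upper() == "COORD_MODE":
            # Normalize optional quotes and letter case before returning.
            return value.strip().strip('"').upper()
    return None


def protects_against_rcm_failure(config_text):
    """True only when COORD_MODE is set to ALWAYS, as this section requires."""
    return read_coord_mode(config_text) == "ALWAYS"


# Hypothetical configuration fragment, used only to exercise the check.
sample_config = """
[CONFIG]
COORD_MODE = ALWAYS   # refuse logins when no RCM is available
"""

if __name__ == "__main__":
    if protects_against_rcm_failure(sample_config):
        print("COORD_MODE is ALWAYS")
    else:
        print("COORD_MODE is not ALWAYS; logins may succeed after an RCM failure")
```

A check like this only confirms the setting in the file. It does not confirm that a running OpenSwitch server has loaded that configuration.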