Anticipating failures within a redundant environment

There are three important potential failures in a redundant environment:

Failure of the primary OpenSwitch
Failure of the secondary OpenSwitch
Failure of the primary or redundant RCM

Failure of the primary OpenSwitch

The failure of the primary OpenSwitch, which means the loss of the connection between the two RCM instances and the primary OpenSwitch server, causes the following changes to the environment:

After trying to reestablish the connection and failing, the primary RCM instance ceases execution.

After trying to reestablish the connection to the primary OpenSwitch and failing, the redundant RCM instance assumes control of the failover operation.

Users who connect to the environment through the primary OpenSwitch server (both application end users and DSS users) lose their connection to the primary OpenSwitch server and must log in again.

When these users log in again, they are connected to the secondary OpenSwitch server because it is the next query entry in the sql.ini (Windows) or interfaces (UNIX) file record that describes the primary OpenSwitch server to these users. Listing multiple query entries in that record lets user logins roll over seamlessly from the primary to the secondary OpenSwitch server.

See “Setting up the sql.ini or interfaces file” for more information.

The RCM administrative login does not roll over during an OpenSwitch server failure. See “Setting up the sql.ini or interfaces file” for more information about connection rollover.
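The rollover described above can be sketched as a walk through the query entries in file order. This is only an illustration, not the client library's actual connection logic: the host names, port numbers, and the first_reachable helper are hypothetical, and reachability is simulated rather than tested over the network.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical (host, port) pairs standing in for the query lines of one
# server record; real values come from your sql.ini or interfaces file.
Address = Tuple[str, int]


def first_reachable(query_entries: Sequence[Address],
                    is_reachable: Callable[[Address], bool]) -> Optional[Address]:
    """Return the first address, in file order, that accepts a connection."""
    for address in query_entries:
        if is_reachable(address):
            return address
    return None  # every listed server is unavailable


# Record that describes the primary OpenSwitch server to users:
# the primary is listed first, the secondary second.
primary_entry = [("osw-primary", 4100), ("osw-secondary", 4200)]

# Simulate the primary OpenSwitch server being down.
down = {("osw-primary", 4100)}
target = first_reachable(primary_entry, lambda addr: addr not in down)
print("Login is routed to:", target)
```

In this sketch the order of the entries is the whole policy: a user who logs in again lands on whichever listed server answers first.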

Failure of the secondary OpenSwitch

The failure of the secondary OpenSwitch, which means the loss of the connection between the two RCM instances and the secondary OpenSwitch server, causes the following changes to the environment:

After trying to reestablish the connection and failing, the primary RCM instance notes the failure of the secondary OpenSwitch server in its log.

After trying to reestablish the connection to the secondary OpenSwitch and failing, the redundant RCM instance ceases execution.

Because DSS users connect to the environment through the secondary OpenSwitch server, they lose their connection to the secondary OpenSwitch server and must log in again.

When these users log in again, they are typically connected through the primary OpenSwitch server because it is the next query entry in the sql.ini (Windows) or interfaces (UNIX) file record that describes the secondary OpenSwitch server to these users. Listing multiple query entries in that record lets user logins roll over seamlessly to the primary OpenSwitch server.

See “Setting up the sql.ini or interfaces file” for more information.

The RCM administrative login does not roll over during an OpenSwitch server failure. See “Setting up the sql.ini or interfaces file” for more information.
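The RCM reactions described in the two OpenSwitch failure scenarios can be collected in a small lookup table. This is a summary aid only: the enum and function names are hypothetical and do not correspond to any RCM interface, and each reaction applies only after the RCM instance has tried and failed to reestablish its connection.

```python
from enum import Enum


class Rcm(Enum):
    PRIMARY = "primary RCM"
    REDUNDANT = "redundant RCM"


class OpenSwitch(Enum):
    PRIMARY = "primary OpenSwitch"
    SECONDARY = "secondary OpenSwitch"


# (RCM instance, failed OpenSwitch server) -> reaction described in this section.
REACTIONS = {
    (Rcm.PRIMARY, OpenSwitch.PRIMARY): "ceases execution",
    (Rcm.REDUNDANT, OpenSwitch.PRIMARY): "assumes control of the failover operation",
    (Rcm.PRIMARY, OpenSwitch.SECONDARY): "notes the failure in its log",
    (Rcm.REDUNDANT, OpenSwitch.SECONDARY): "ceases execution",
}


def reaction(rcm: Rcm, failed: OpenSwitch) -> str:
    """Look up how an RCM instance reacts to the loss of an OpenSwitch server."""
    return REACTIONS[(rcm, failed)]


for (rcm, failed), action in REACTIONS.items():
    print(f"{failed.value} fails: {rcm.value} {action}")
```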

Failure of the primary or redundant RCM instance

Failure of an RCM instance is unlikely, but you should prepare for it because it can cost the environment its failover capability. If the primary RCM fails, the environment may no longer be able to fail over in a catastrophic situation, because the primary RCM is no longer running and no longer aware of the status of the system. Similarly, if the redundant RCM fails, the environment can lose its overall ability to detect a failure of the primary OpenSwitch server, because the redundant RCM cannot then assume control of failover.

To gain some protection from an RCM failure, set the COORD_MODE parameter to “ALWAYS” in the OpenSwitch configuration file. This ensures that any logins to an OpenSwitch server after an RCM failure are rejected. The login failure notifies users of a problem so that you can take steps to recover, such as stopping and restarting servers.

See “Setting up the sql.ini or interfaces file” for more information about OpenSwitch configuration parameters.
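The effect of COORD_MODE on logins can be pictured with a toy model. It assumes that, with COORD_MODE set to ALWAYS, a login succeeds only while an RCM instance is connected to the OpenSwitch server, which is how the paragraph above reads. The login_allowed function is hypothetical, and its handling of other COORD_MODE values is a placeholder, because this section does not describe them.

```python
def login_allowed(coord_mode: str, rcm_connected: bool) -> bool:
    """Toy model: with COORD_MODE ALWAYS, a login needs a connected RCM."""
    if coord_mode.strip().upper() == "ALWAYS":
        return rcm_connected
    # Assumption: other modes are outside the scope of this sketch, so it
    # simply lets the login through. Check the OpenSwitch documentation
    # for their real behavior.
    return True


for rcm_connected in (True, False):
    outcome = "accepted" if login_allowed("ALWAYS", rcm_connected) else "rejected"
    print(f"COORD_MODE=ALWAYS, RCM connected={rcm_connected}: login {outcome}")
```

The rejected login is the signal: users report that they cannot log in, which tells you that the RCM is down and that recovery steps are needed.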