Reconnecting clients when a node loses power

If the network cable is removed from a machine, or if the node to which a client is connected loses power, the client-side socket can no longer reach the server. The client socket waits, without result, either for a reply from the server or for the cluster to issue a send operation.

In the situation in Figure 2-1, the cluster “big_cluster” consists of “Node1” and “Node2”, on which instances “ASECE1” and “ASECE2” are running, respectively. A client application is connected to instance “ASECE1” running on “Node1”.

If the power is disconnected from “Node1”, the client application waits for contact from the node. The only way to avoid this situation is to configure the client application to assume the node is down after a specified amount of time. It then connects to another node in the cluster.
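The idea of giving up on a node after a fixed time and moving to the next one can be sketched at the plain socket level. This is only an illustration in Python, not the Client-Library mechanism: the function name `connect_with_failover` and its parameters are hypothetical, and a real client would rely on the timeout properties described later in this section.

```python
import socket

def connect_with_failover(nodes, connect_timeout=5.0):
    """Return a socket connected to the first reachable node.

    nodes is an ordered list of (host, port) pairs (hypothetical values).
    """
    last_error = None
    for host, port in nodes:
        try:
            # create_connection raises OSError (timeouts included) on failure
            return socket.create_connection((host, port), timeout=connect_timeout)
        except OSError as exc:
            last_error = exc  # remember why this node failed, then try the next
    raise ConnectionError(f"no node reachable: {last_error}")
```

A call such as `connect_with_failover([("Node1", 5000), ("Node2", 5000)])` would try “Node1” first and fall back to “Node2”; the port number here is an assumption made for the example.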

Figure 2-1: Unplugged node

Image shows a client connecting to an unplugged cluster

The operating system network detects a crash, disconnects the clients, and fails over the sockets from the remote side of the connection.

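To reduce the time required to detect that a cluster has lost a host, or that a public network has been disconnected from a node running an instance, you can set TCP keepalive to a shorter value or set the client’s timeout value, as described in the following sections.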

Setting TCP keepalive to a shorter value

The TCP keepalive mechanism eventually marks the client socket as failed. However, because the default keepalive value is long (on some systems as long as two hours), it may be three or more hours before the client-side sockets fail over. Setting keepalive to a small value (several minutes) may not be practical for large organizations, but you can set it to a period that is appropriate for your site and that works with the HAFAILOVER capabilities.

Set TCP keepalive on client machines. The appropriate values vary, depending on the operating system you use. See your client’s operating system documentation for more information.
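The settings discussed here are system-wide operating system parameters. For comparison, an application that owns its own socket can usually request keepalive for that one connection. The following Python sketch assumes such an application; the function name and the default values are hypothetical, and it does not describe how Client-Library configures its sockets.

```python
import socket

def enable_keepalive(sock, idle_s=120, interval_s=30, probes=4):
    """Turn on keepalive for one socket and, where supported, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* constants are platform-specific (all three exist on Linux);
    # skip any that this platform's socket module does not define.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

On Linux, these per-socket options override tcp_keepalive_time, tcp_keepalive_intvl, and tcp_keepalive_probes for that socket only, so other applications on the machine are unaffected.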

If you are testing for client timeouts, set the values for the parameters in the first two columns of Table 2-1 to a few minutes, and set the values for the parameters in the third column to a low number.

Table 2-1: Setting TCP keepalive

| Operating system | Time the parameter waits before probing the connection | Time between probes | Maximum time or number of attempts to probe a connection before dropping it |
| --- | --- | --- | --- |
| Solaris | N/A | tcp_keepalive_interval (milliseconds) | N/A |
| Linux | tcp_keepalive_time (seconds) | tcp_keepalive_intvl (seconds) | tcp_keepalive_probes (absolute number) |
| Windows XP | KeepAliveTime (milliseconds) | KeepAliveInterval (milliseconds) | TcpMaxDataRetransmissions (absolute number) |
| HP-UX | tcp_time_wait_interval (milliseconds) | tcp_keepalive_interval (milliseconds) | tcp_keepalive_kill (milliseconds) |
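To see how the three Linux parameters combine, the time before a dead connection is dropped is roughly the initial wait plus one probe interval for each unanswered probe. The short Python calculation below is a sketch under that assumption; the function name is hypothetical, and the second set of values is only an example of test settings, not a recommendation.

```python
def keepalive_detection_seconds(idle_s, interval_s, probes):
    """Approximate worst-case seconds before keepalive drops a dead peer."""
    return idle_s + interval_s * probes

# Common Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75,
# tcp_keepalive_probes=9
print(keepalive_detection_seconds(7200, 75, 9))  # 7875 seconds, about 2 h 11 min

# Example test settings: wait 2 minutes, probe every 30 seconds, 4 probes
print(keepalive_detection_seconds(120, 30, 4))   # 240 seconds
```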

Setting the client’s timeout value

There are two timeout properties you can set for Client-Library program connections:

CS_LOGIN_TIMEOUT determines how long the client waits for a response to a login request.

CS_TIMEOUT determines how long the client waits for a server response to a command before it times out.

Based on how you configure the timeout event, the client either fails or fails over to another node.

You must set the CS_TIMEOUT and CS_LOGIN_TIMEOUT parameters or the isql -t and -l parameters for clients to fail over during a sudden loss of power to the node.
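As an illustration of the isql alternative, the Python sketch below assembles an isql command line with both timeouts set explicitly. The helper name `isql_command`, the login name, and the timeout values are assumptions made for the example; isql prompts for the password when none is supplied.

```python
import shlex

def isql_command(server, user, login_timeout_s, command_timeout_s):
    """Build an isql argument list with explicit login and command timeouts."""
    return [
        "isql",
        "-S", server,
        "-U", user,
        "-l", str(login_timeout_s),    # login timeout, in seconds
        "-t", str(command_timeout_s),  # command timeout, in seconds
    ]

# Print the command line for a 30-second login and 60-second command timeout
print(shlex.join(isql_command("big_cluster", "sa", 30, 60)))
```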

For more information about the Client-Library parameters CS_TIMEOUT and CS_LOGIN_TIMEOUT, see the Client-Library/C Reference Manual. For information about CS_HAFAILOVER, see “Client/server interaction”.

For information about using the isql -t and -l parameters, see the Adaptive Server Enterprise Utility Guide.