Loss detection after rebuilding stable queues

To determine if any messages could not be recovered after the stable queues were rebuilt, the Replication Server performs loss detection. By checking Replication Server loss-detection messages, you can determine what kind of user intervention, if any, is necessary to restore all data to the system.

Replication Server detects two types of losses after rebuilding stable queues:

SQM loss, which refers to data lost between two Replication Servers, detected at the next downstream site
DSI loss, which refers to data lost between a Replication Server and a replicate database that the Replication Server manages

Both kinds of loss detection are addressed in the following sections.

If all data is available, no intervention is necessary and the replication system can return to normal operations. For example, if you know that the save interval for the route or connection is longer than the duration of the failure, you can recover all messages with no intervention. However, if the save interval is not set or is shorter than the failure, some messages may be lost.

A Replication Server that has detected a loss does not accept messages from the source. Loss detection prevents the source from truncating its stable queues. For example, if Replication Server RS2 detects that replicate data server DS2.RDB has lost data from primary data server DS1.PDB, Replication Server RS1 cannot truncate its queues until you decide how to handle the loss. As a result, RS1 may run out of stable storage. Once the “Checking Loss” message has been reported, you can choose to ignore losses for a source and destination pair, even before a loss is actually detected.

SQM loss between two Replication Servers

Every time you rebuild stable queues during a recovery procedure, Replication Server requests backlogged messages from sites that send its distributions. If the Replication Server manages primary databases, it instructs their RepAgents to send messages from the beginning of the online transaction logs. The backlogged messages repopulate the emptied stable queues.

Replication Server enables loss detection mode at the sites that have a direct route from the Replication Server whose queues you are rebuilding. In Figure 7-3, Replication Server RS3 detects losses if you rebuild the queues of Replication Server RS2. Similarly, RS2 detects losses if you rebuild the queues of Replication Server RS1.

Figure 7-3: Replication system loss detection example

Figure 7-3 illustrates an example of replication system loss detection. It consists of a primary database, a replicate database, and three Replication Servers (RS1, RS2, and RS3), each with its own RSSD. Loss detection mode is enabled at the sites that have a direct route from the Replication Server whose queues you are rebuilding. In the example shown, RS3 detects losses if you rebuild the queues of RS2. Similarly, RS2 detects losses if you rebuild the queues of RS1.

When you execute the rebuild queues command at RS2, RS3 performs loss detection for all primary databases whose updates are routed to RS3 through RS2. RS3 logs messages for each of these databases. If you rebuild queues at RS3, no SQM loss detection is performed, because there are no routes originating from RS3.

Replication Server detects loss by looking for duplicate messages. If RS3 receives a message that it had received before the rebuild queues command, then no messages were lost. If the first message RS3 receives after rebuild queues has not been seen before, then either messages were lost, or no messages were in the stable queue.
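The duplicate-message rule can be modeled in a few lines. The following Python sketch is only a conceptual illustration and not Replication Server's actual implementation; the function name and the use of plain strings as message identifiers are assumptions made for the example.

```python
def check_loss(seen_before_rebuild, first_message_after_rebuild):
    """Model the duplicate-message rule for one source/destination pair.

    seen_before_rebuild: set of message IDs the destination had already
        received before rebuild queues was executed (hypothetical IDs).
    first_message_after_rebuild: ID of the first message received after
        the rebuild, or None if nothing has arrived yet.
    """
    if first_message_after_rebuild is None:
        # Nothing received yet, so the check is still in progress.
        return "Checking Loss"
    if first_message_after_rebuild in seen_before_rebuild:
        # A duplicate arrived: the backlog overlaps what was already seen.
        return "No Loss"
    # A never-seen message arrived first: messages were lost, or the
    # stable queue held no messages to resend.
    return "Loss Detected"


seen = {"msg-101", "msg-102", "msg-103"}
print(check_loss(seen, "msg-102"))  # No Loss
print(check_loss(seen, "msg-110"))  # Loss Detected
```

Note that the model cannot distinguish a real loss from an empty queue, which is the same ambiguity described above.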

If there are no messages in the stable queue from a specific source, RS3 still reports a loss for that source, because it has no duplicate message to use for comparison. You can prevent this false loss detection by creating a heartbeat with an interval that is less than the save interval. This guarantees that there will always be at least one message in the stable queue.
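To see why a heartbeat shorter than the save interval avoids the false report, consider the following simplified Python model. It assumes that a message stays in the stable queue for exactly the save interval after it is sent and that all times are in minutes; both are simplifications for illustration, and the function names are hypothetical.

```python
def retained_messages(message_times, now, save_interval):
    """Return the send times of messages still held in the queue at `now`."""
    return [t for t in message_times if now - save_interval <= t <= now]


def heartbeat_times(interval, until):
    """Return the times at which a heartbeat is sent, starting at 0."""
    return list(range(0, until + 1, interval))


save_interval = 60   # minutes
failure_time = 500   # minute at which the queues are rebuilt

# Without a heartbeat: the last real transactions were sent long ago.
quiet_source = [100, 200]
print(retained_messages(quiet_source, failure_time, save_interval))  # []

# With a 30-minute heartbeat (shorter than the save interval).
with_heartbeat = quiet_source + heartbeat_times(30, failure_time)
print(retained_messages(with_heartbeat, failure_time, save_interval))  # [450, 480]
```

In this model, an empty result corresponds to the case where no duplicate can arrive and a loss is reported even though no data was lost.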

SQM example

When RS3 performs SQM loss detection for the rebuilt RS2, it writes messages to its log file similar to the following “Checking Loss” examples. These messages mark the beginning of the loss detection process. Subsequent messages report the results. Each message contains a source and destination pair.

The first example message indicates that RS3 is checking loss for the RSSD at RS3 from the RSSD at RS2:

Checking Loss for DS3.RS3_RSSD from DS2.RS2_RSSD
date=Nov-01-95 10:15 am
qid=0x01234567890123456789

The second example message indicates that RS3 is checking loss for the replicate database RDB at RS3, from the primary database PDB at RS1:

Checking Loss for DS3.RDB from DS1.PDB
date=Nov-01-95 11:00am
qid=0x01234567890123456789

The third example message indicates that RS3 is checking loss for the RSSD at RS3 from the RSSD at RS1:

Checking Loss for DS3.RS3_RSSD from DS1.RS1_RSSD
date=Nov-01-95 10:00am
qid=0x01234567890123456789

RS3 reports whether it detects a loss. For example, the results of such loss-detection tests might read as follows:

No Loss for DS3.RS3_RSSD from DS2.RS2_RSSD

Loss Detected for DS3.RDB from DS1.PDB

No Loss for DS3.RS3_RSSD from DS1.RS1_RSSD
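When a log contains many such results, it can help to extract the pairs that need attention. The following Python sketch assumes the result lines have the wording shown in the examples above; the function name is hypothetical, and real log lines may carry additional prefixes, which the pattern tolerates by searching within each line.

```python
import re

# Matches the SQM result wording shown in the examples above.
RESULT_RE = re.compile(r"(No Loss|Loss Detected) for (\S+) from (\S+)")


def parse_sqm_results(log_text):
    """Return a list of (source, destination, loss_detected) tuples."""
    results = []
    for line in log_text.splitlines():
        match = RESULT_RE.search(line)
        if match:
            status, destination, source = match.groups()
            results.append((source, destination, status == "Loss Detected"))
    return results


sample = """\
No Loss for DS3.RS3_RSSD from DS2.RS2_RSSD
Loss Detected for DS3.RDB from DS1.PDB
No Loss for DS3.RS3_RSSD from DS1.RS1_RSSD
"""

losses = [(src, dst) for src, dst, lost in parse_sqm_results(sample) if lost]
print(losses)  # [('DS1.PDB', 'DS3.RDB')]
```

The pattern is case-sensitive, so it does not match the “Checking Loss” lines or the DSI messages, which use different wording.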

DSI loss between a Replication Server and its databases

Some messages in Replication Server queues are destined for databases, rather than for other Replication Servers. The DSI performs loss detection in a way that is similar to stable queue loss detection.

If you rebuild queues at a Replication Server that has no originating routes, no SQM loss detection is performed, but the Replication Server performs DSI loss detection for its messages.

DSI example

The DSI at Replication Server RS2 generates the following message for the RSSD at RS2:

DSI: detecting loss for database DS2.RS2_RSSD from origin DS1.RS1_RSSD
date=Nov-01-95 10:58pm
qid=0x01234567890123456789

When retained messages begin arriving from previous sites, the DSI determines whether a loss occurred by checking whether it has already seen the first message from the origin. If it detects no loss, a message similar to the following one is generated:

DSI: no loss for database DS2.RS2_RSSD from origin DS1.RS1_RSSD

If the DSI does detect a loss, a message like the following one is generated:

DSI: loss detected for database DS2.RS2_RSSD from origin DS1.RS1_RSSD
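The DSI messages for one origin and database pair move from “detecting loss” to either “no loss” or “loss detected”. The following Python sketch tracks the latest state per pair, assuming the message wording shown in the examples above; the function name is hypothetical.

```python
import re

# Matches the three DSI message forms shown in the examples above.
DSI_RE = re.compile(
    r"DSI: (detecting loss|no loss|loss detected) "
    r"for database (\S+) from origin (\S+)"
)


def dsi_loss_states(log_lines):
    """Return {(origin, database): latest_state} from DSI log lines."""
    states = {}
    for line in log_lines:
        match = DSI_RE.search(line)
        if match:
            state, database, origin = match.groups()
            # Later messages for the same pair overwrite earlier ones.
            states[(origin, database)] = state
    return states


log_lines = [
    "DSI: detecting loss for database DS2.RS2_RSSD from origin DS1.RS1_RSSD",
    "date=Nov-01-95 10:58pm",
    "qid=0x01234567890123456789",
    "DSI: loss detected for database DS2.RS2_RSSD from origin DS1.RS1_RSSD",
]
print(dsi_loss_states(log_lines))
# {('DS1.RS1_RSSD', 'DS2.RS2_RSSD'): 'loss detected'}
```

A pair whose state is still “detecting loss” has not yet received a message from its origin, so its result is pending.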

Handling losses

When Replication Server detects a loss, no further messages are accepted on the connection to the SQM or the DSI.

For example, when RS3 detects an SQM message loss for the RDB database from the PDB database, it rejects all subsequent messages from the PDB database to the RDB database.

Recovering a loss

To recover the loss, you need to choose one of the following options:

Ignore the loss and continue, even though some messages may be lost. You can use the rs_subcmp program with the -r flag to reconcile primary and replicate data.

To run rs_subcmp, see “Subscription comparison procedure”. See also Chapter 11, “Managing Subscriptions,” in the Replication Server Administration Guide Volume 1, and Chapter 7, “Executable Programs,” in the Replication Server Reference Manual for more information about the rs_subcmp command.
Ignore the loss, then drop and re-create the subscriptions.
Recover by replaying transactions from offline logs (primary Replication Server loss only). In this case, you are not ignoring the loss.

Ignoring a loss

You must execute an ignore loss command in the following situations:

If you choose to recover the lost messages by re-creating subscriptions or replaying logs.
For an SQM loss, at the Replication Server that reported that loss, to force the Replication Server to begin accepting messages again. For example, to ignore a loss at Replication Server RS3 detected from DS1.PDB, enter the following command at RS3:
```
ignore loss from DS1.PDB to DS3.RDB
```
For a DSI loss, at the database on the Replication Server where the loss was detected. For example, to ignore a loss reported in DS2.RS2_RSSD from origin DS1.RS1_RSSD, enter the following command at RS2:
```
ignore loss from DS1.RS1_RSSD to DS2.RS2_RSSD
```
For both an SQM loss and a DSI loss that are detected by the Replication Server at the destination of the route when you rebuild two Replication Servers in succession.

In this case, you need to execute ignore loss twice, once for SQM losses and once for DSI losses. The ignore loss command that you execute to ignore DSI loss at the destination Replication Server is the same command you use to ignore SQM loss from the previous site.
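When several losses are reported, the corresponding commands can be generated from the source and destination pairs. The following Python sketch uses only the `ignore loss from ... to ...` form shown above; the function names are hypothetical, and the script only builds the command text for you to review and enter at the appropriate Replication Server.

```python
def ignore_loss_command(source, destination):
    """Build the ignore loss command for one source/destination pair."""
    return f"ignore loss from {source} to {destination}"


def ignore_loss_commands(reported_losses):
    """Build one command per reported loss, in the order reported.

    Pairs are deliberately not de-duplicated: when both an SQM loss and
    a DSI loss are reported for the same pair, the command is needed
    once for each.
    """
    return [ignore_loss_command(src, dst) for src, dst in reported_losses]


reported = [
    ("DS1.PDB", "DS3.RDB"),            # SQM loss reported at RS3
    ("DS1.RS1_RSSD", "DS2.RS2_RSSD"),  # DSI loss reported at RS2
]
for command in ignore_loss_commands(reported):
    print(command)
# ignore loss from DS1.PDB to DS3.RDB
# ignore loss from DS1.RS1_RSSD to DS2.RS2_RSSD
```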