SQM Loss Between Two Replication Servers

Learn how Replication Server detects data loss between two Replication Servers.

Every time you rebuild stable queues during a recovery procedure, Replication Server requests backlogged messages from sites that send its distributions. If the Replication Server manages primary databases, it instructs their RepAgents to send messages from the beginning of the online transaction logs. The backlogged messages repopulate the emptied stable queues.

Replication Server enables loss detection mode at the sites that have a direct route from the Replication Server whose queues you are rebuilding. In the figure, Replication Server RS3 detects losses if you rebuild the queues of Replication Server RS2. Similarly, RS2 detects losses if you rebuild the queues of Replication Server RS1.

Figure 7-3: Replication System Loss Detection Example
The figure shows a primary database, a replicate database, and three Replication Servers (RS1, RS2, and RS3), each with its own RSSD.

When you execute the rebuild queues command at RS2, RS3 performs loss detection for all primary databases whose updates are routed to RS3 through RS2. RS3 logs messages for each of these databases. If you rebuild queues at RS3, no SQM loss detection is performed, because there are no routes originating from RS3.
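The rule above can be sketched in a few lines of Python. This is only an illustration: the route table and the function name loss_detecting_sites are hypothetical, and the routes are assumed to be RS1 to RS2 and RS2 to RS3, as the example implies.

```python
# Hypothetical route table for the example: each key has a direct
# route to every site in its list (assumed: RS1 -> RS2 -> RS3).
DIRECT_ROUTES = {
    "RS1": ["RS2"],
    "RS2": ["RS3"],
    "RS3": [],  # no routes originate from RS3
}


def loss_detecting_sites(rebuilt_site, routes=DIRECT_ROUTES):
    """Return the sites that would enable loss detection mode when
    queues are rebuilt at rebuilt_site, that is, the sites with a
    direct route from it."""
    return list(routes.get(rebuilt_site, []))


for site in ("RS1", "RS2", "RS3"):
    detectors = loss_detecting_sites(site)
    print(site, "->", detectors if detectors else "no SQM loss detection")
```

Under these assumptions, rebuilding at RS2 yields RS3 as the only detecting site, and rebuilding at RS3 yields none.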

Replication Server detects loss by looking for duplicate messages. If RS3 receives a message that it had received before the rebuild queues command, then no messages were lost. If the first message RS3 receives after rebuild queues has not been seen before, then either messages were lost, or no messages were in the stable queue.
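The duplicate test described above can be modeled as follows. This is a simplified sketch, not Replication Server's internal implementation: it assumes the destination keeps a set of identifiers for messages it received before the rebuild, and the name check_loss and the sample identifiers are hypothetical.

```python
def check_loss(seen_before_rebuild, first_after_rebuild):
    """Classify the outcome from the first message received after
    rebuild queues.

    seen_before_rebuild: set of identifiers of messages received
        before the rebuild (an assumption of this sketch).
    first_after_rebuild: identifier of the first message received
        after the rebuild.
    """
    if first_after_rebuild in seen_before_rebuild:
        # A duplicate arrived, so the backlog overlaps what was
        # already received: nothing was lost.
        return "no loss"
    # Never seen before: either messages were lost, or the stable
    # queue held no messages to compare against.
    return "loss detected"


seen = {"0x0101", "0x0102", "0x0103"}
print(check_loss(seen, "0x0102"))  # prints "no loss"
print(check_loss(seen, "0x0200"))  # prints "loss detected"
```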

Even if the stable queue held no messages from a specific source, RS3 reports a loss for that source, because it has no duplicate message to use for comparison. You can prevent this false loss detection by creating a heartbeat with an interval that is shorter than the save interval. This ensures that the stable queue always contains at least one message.

SQM Example

When RS3 performs SQM loss detection for the rebuilt RS2, it writes messages similar to the following "Checking Loss" examples to its log file. These messages mark the beginning of the loss detection process. Subsequent messages report the results. Each message contains a source and destination pair.

The first example message indicates that RS3 is checking loss for the RSSD at RS3 from the RSSD at RS2:

    Checking Loss for DS3.RS3_RSSD from DS2.RS2_RSSD
    date=Nov-01-95 10:15 am
    qid=0x01234567890123456789

The second example message indicates that RS3 is checking loss for the replicate database RDB at RS3, from the primary database PDB at RS1:

    Checking Loss for DS3.RDB from DS1.PDB
    date=Nov-01-95 11:00am
    qid=0x01234567890123456789

The third example message indicates that RS3 is checking loss for the RSSD at RS3 from the RSSD at RS1:

    Checking Loss for DS3.RS3_RSSD from DS1.RS1_RSSD
    date=Nov-01-95 10:00am
    qid=0x01234567890123456789

RS3 reports whether it detects a loss. For example, the results of such loss-detection tests might read as follows:

    No Loss for DS3.RS3_RSSD from DS2.RS2_RSSD
    Loss Detected for DS3.RDB from DS1.PDB
    No Loss for DS3.RS3_RSSD from DS1.RS1_RSSD
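
When many source and destination pairs are involved, it can help to scan the result messages with a small script. The following Python sketch assumes that each result appears on its own line exactly as in the examples above; real log lines may carry additional prefixes, so the pattern would need adjusting. The names RESULT_RE and parse_results are hypothetical.

```python
import re

# Matches result lines shaped like the examples above (an assumption;
# an actual log file may prefix each line with other fields).
RESULT_RE = re.compile(
    r"^(?P<result>No Loss|Loss Detected) for (?P<dest>\S+) from (?P<source>\S+)$"
)


def parse_results(lines):
    """Return (source, destination, loss_detected) for each result line."""
    results = []
    for line in lines:
        match = RESULT_RE.match(line.strip())
        if match:
            lost = match["result"] == "Loss Detected"
            results.append((match["source"], match["dest"], lost))
    return results


sample = [
    "No Loss for DS3.RS3_RSSD from DS2.RS2_RSSD",
    "Loss Detected for DS3.RDB from DS1.PDB",
    "No Loss for DS3.RS3_RSSD from DS1.RS1_RSSD",
]

for source, dest, lost in parse_results(sample):
    if lost:
        print("Loss reported for", dest, "from", source)
```

For the sample lines, only the DS1.PDB to DS3.RDB pair is flagged.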