Recovering from partition loss or failure

When a Replication Server detects a failed or missing partition, it shuts down the stable queues that are using the partition and logs messages about the failure. Restarting Replication Server does not correct the problem. You must drop the damaged partition and rebuild the stable queues.

Complete recovery depends on the volume of messages cleared from the queue and on how soon you apply the recovery procedure after the failure occurs. If a Replication Server maintains minimal latency in the replication system, only the most recent messages are lost when its queues are rebuilt.

If a partition fails in a primary Replication Server, you can usually resend lost messages from their source using an off-line database log. If partitions fail in a replicate Replication Server, you need to recover from the stable queue of the upstream Replication Server.

In some cases, using an off-line log may be the only way you can recover your messages. If the Replication Server has suspended routes or connections, or if a network or data server connection goes down, a backlog may have accumulated in the Replication Server stable queues. Unless you have specified a save interval setting that can cover the backlog, your chance of recovering these messages decreases with time. Source Replication Servers may have already deleted messages from their stable queues and may have truncated the database logs.

Note: For details on setting and displaying the save interval for recovery purposes, see “Recovering from partition loss or failure”.
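The effect of the save interval on recoverability can be illustrated with a small toy model. The sketch below is not how Replication Server is implemented. It assumes a simplified retention rule in which the upstream site keeps each message for `save_interval` minutes after delivering it downstream. The names `Message` and `split_recoverable` and all of the timestamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    seq: int           # hypothetical sequence number, for illustration only
    delivered_at: int  # minute at which the upstream site delivered the message


def split_recoverable(backlog, now, save_interval):
    """Split a backlog into (recoverable, lost) messages.

    Assumed toy rule: the upstream site still holds a message if it was
    delivered no more than `save_interval` minutes before `now`.
    """
    cutoff = now - save_interval
    recoverable = [m for m in backlog if m.delivered_at >= cutoff]
    lost = [m for m in backlog if m.delivered_at < cutoff]
    return recoverable, lost


if __name__ == "__main__":
    # Four backlogged messages, delivered at minutes 0, 20, 40, and 55.
    backlog = [
        Message(seq=i, delivered_at=t)
        for i, t in enumerate([0, 20, 40, 55], start=1)
    ]
    # Run the recovery at minute 60 and again at minute 90.
    for now in (60, 90):
        ok, lost = split_recoverable(backlog, now, save_interval=30)
        print(f"minute {now}: recoverable={[m.seq for m in ok]} "
              f"lost={[m.seq for m in lost]}")
```

With a 30-minute save interval, recovery at minute 60 can still retrieve messages 3 and 4 from the upstream queue, while recovery at minute 90 finds none of them. Under this model, anything the upstream site no longer holds has to come from an off-line database log.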

Table 7-2 summarizes when to use and where to locate the appropriate recovery procedure for partition loss or failure.

Table 7-2: Overview of symptoms and procedures

| Symptom | Use this procedure |
|---|---|
| Replication Server detects a lost, damaged, or failed stable queue. | “Procedure for recovering from partition loss or failure”. |
| Message loss occurred because a backlog existed in the failed Replication Server and there were insufficient messages saved at the previous site. | “Message recovery from off-line database logs”. |
| In addition to message loss, database logs have been truncated. Either the secondary truncation point is invalid or the dbcc settrunc('ltm', 'ignore') command was executed to truncate log records that have not been transferred by RepAgent to the Replication Server. | Use “Truncated message recovery from the database log” to recover the database log. Then use “Message recovery from off-line database logs” to rebuild the stable queues and recover lost messages. |
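The selection logic in Table 7-2 can also be written as a short decision function, which may help when scripting a runbook. This is only a sketch of the table. The `Symptoms` class, its field names, and `select_procedures` are hypothetical, and the function detects nothing itself: an operator supplies the symptoms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptoms:
    # All fields are hypothetical flags set by an operator, not values
    # reported by Replication Server.
    queue_failure_detected: bool  # lost, damaged, or failed stable queue
    messages_lost: bool = False   # backlog existed, too few messages saved upstream
    logs_truncated: bool = False  # database logs were truncated as well


def select_procedures(s):
    """Return the procedure titles from Table 7-2, in the order to apply them."""
    if s.messages_lost and s.logs_truncated:
        # Recover the database log first, then rebuild queues and recover messages.
        return [
            "Truncated message recovery from the database log",
            "Message recovery from off-line database logs",
        ]
    if s.messages_lost:
        return ["Message recovery from off-line database logs"]
    if s.queue_failure_detected:
        # Also reached when logs are truncated without message loss,
        # a combination that Table 7-2 does not list separately.
        return ["Procedure for recovering from partition loss or failure"]
    return []


if __name__ == "__main__":
    case = Symptoms(queue_failure_detected=True, messages_lost=True,
                    logs_truncated=True)
    for step, title in enumerate(select_procedures(case), start=1):
        print(f"{step}. {title}")
```

The function checks the most severe combination first, so the two-step sequence in the last table row takes precedence over the single-procedure rows.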