If the Replication Server has suspended routes, or if a network or data server connection is down, a backlog of messages may accumulate in the Replication Server stable queues. The chance of recovering these messages decreases with time: source Replication Servers may already have deleted the messages from their stable queues, and database logs may already have been truncated.
When you set the save_interval for each route between Replication Servers, you allow each Replication Server to retain messages for a minimum period of time after the next site in the route acknowledges that it has received the messages. The availability of these messages increases the chance of recovering online messages after queues are rebuilt.
For example, in Figure 7-1, Replication Server TOKYO_RS maintains a direct route to MANILA_RS, and MANILA_RS maintains a direct route to SYDNEY_RS.
TOKYO_RS retains messages for a period of time after MANILA_RS has received them. If MANILA_RS experiences a partition failure, it requires TOKYO_RS to resend the backlogged messages. Similarly, MANILA_RS can retain messages to allow SYDNEY_RS to recover from failures.
When all of the messages stored on a stable queue segment are at least as old as the save_interval setting, Replication Server deletes the segment so it can be reused.
Figure 7-1: Save interval example
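The segment-reuse rule described above can be modeled with a short Python sketch. It is an illustration only, not Replication Server code: the Message class, the acked_at field, and the segment_is_reusable function are hypothetical names, and the sketch assumes that a message's age is measured from the time the next site in the route acknowledged it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Message:
    # Hypothetical field: when the next site acknowledged the message.
    # None means the message has not been acknowledged yet.
    acked_at: Optional[datetime]


def segment_is_reusable(messages: List[Message],
                        save_interval_minutes: int,
                        now: datetime) -> bool:
    """Return True only if every message on the segment was acknowledged
    at least save_interval_minutes ago (assumed interpretation of the rule)."""
    keep_for = timedelta(minutes=save_interval_minutes)
    return all(
        m.acked_at is not None and now - m.acked_at >= keep_for
        for m in messages
    )


now = datetime(2024, 1, 1, 12, 0)
old = Message(acked_at=now - timedelta(minutes=90))
recent = Message(acked_at=now - timedelta(minutes=30))

print(segment_is_reusable([old, old], 60, now))     # every message is old enough
print(segment_is_reusable([old, recent], 60, now))  # one message is still too new
```

In this model a single recent message keeps the whole segment from being reused, which is why a longer save_interval increases the space held by backlogged messages.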
To set the save_interval for a route, execute the alter route command at the source Replication Server. For example, in the replication system in Figure 7-1, the following command, executed at TOKYO_RS, saves for one hour (60 minutes) any messages destined for MANILA_RS:
alter route to MANILA_RS set save_interval to '60'
By default, the save_interval is set to 0 minutes. For systems with low volume, this may be an acceptable setting for recovery, because Replication Server does not delete messages immediately after receiving acknowledgment from destination servers. Instead, it deletes messages periodically in large chunks.
However, to accommodate the volume and activity of sites that receive distributions from the Replication Server and to increase the chance of full recovery from database or partition failures, you may want to change the save_interval setting.
In case of a partition failure on the stable queues, be sure your setting allows adequate time to restore your system. Consider also the size of the partitions that are allocated for backlogged messages. Partitions must be large enough to hold the extra messages.
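As a rough illustration of how the save_interval affects partition sizing, the following Python sketch estimates the extra stable queue space needed to hold retained messages. The function name, the message rate, and the average message size are hypothetical values chosen for the example. The estimate ignores per-message overhead and how messages are packed into segments, so treat it as a first approximation rather than a sizing formula.

```python
import math


def extra_queue_space_mb(msg_rate_per_sec: int,
                         avg_msg_bytes: int,
                         save_interval_minutes: int) -> int:
    """Estimate the megabytes of messages retained during the save interval.

    Assumes a steady message rate and a fixed average message size;
    both are simplifications made for this example.
    """
    retained_bytes = msg_rate_per_sec * avg_msg_bytes * save_interval_minutes * 60
    return math.ceil(retained_bytes / (1024 * 1024))


# Hypothetical workload: 50 messages per second, 500 bytes each,
# with save_interval set to '60' (one hour).
print(extra_queue_space_mb(50, 500, 60))  # prints 86
```

With these assumed figures, one hour of retained messages occupies about 86 MB in addition to the space needed for the normal backlog.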
Refer to the capacity planning guidelines in the Replication Server Design Guide for help in determining queue space requirements.