13045: replication suspended because RSSD restarted

Symptom

These error messages are displayed in the Replication Server error log:

E. 2006/06/13 14:50:16. ERROR #13045 SQT(101:1 DIST westss.eastlp) -
seful/cm.c(3914)
Failed to connect to server 'westss' as user 'westrs_rssd_prim'. See CT-Lib
and/or server error messages for more information.
I. 2006/06/13 14:50:17. Trying to connect to server 'westss' as user
'westrs_rssd_prim' ......

After the Adaptive Server with the RSSD has been started again, the following error messages are displayed in the Replication Server error log:

E. 2006/06/13 17:04:52. ERROR #1027 dSUB( ) -
seful/cm.c(3909)
Open Client Client-Library error: Error: 84083972,
Severity 5 -- 'ct_connect():
network packet layer: internal net library error: Net-
Lib protocol driver call to connect two endpoints
failed', Operating System error 0 -- 'Socket connect
failed - errno 146 Connection refused'.
E. 2006/06/13 17:04:52. ERROR #13045 dSUB( ) -
seful/cm.c(3914)
Failed to connect to server 'westss' as user 'amerttp'.
See CT-Lib and/or server error messages for more
information.
I. 2006/06/13 17:04:52. Trying to connect to server
'westss' as user 'westrs_rssd_prim' ......
E. 2006/06/13 17:04:57. ERROR #1027 dSUB( ) -
seful/cm.c(3909)
Open Client Client-Library error: Error: 84083972,
Severity 5 -- 'ct_connect():
network packet layer: internal net library error: Net-
Lib protocol driver call to connect two endpoints
failed', Operating System error 0 -- 'Socket connect
failed - errno 146 Connection refused'.
E. 2006/06/13 17:05:56. ERROR #13043 USER(westss_ra) - ul/cmapp.c(888)
Failed to execute the 'USE westss_rssd' command on
server 'westss'. See CT-Lib and SQL Server error
messages for more information.
E. 2006/06/13 17:05:56. ERROR #1028 USER(westss_ra) -
ul/cmapp.c(888)
Message from server: Message: 911, State 2, Severity 11
-- 'Attempt to locate entry in sysdatabases for database
'westss_rssd' by name failed - no entry found under that
name. Make sure that name is entered properly.'.
I. 2006/06/13 17:05:56. Message from server: Message:
5701, State 1, Severity 10 -- 'Changed database context to 'master'.'.
E. 2006/06/13 17:05:56. ERROR #13045 USER(westss_ra) - seful/cm.c(3318)
Failed to connect to server 'westss' as user
'westrs_rssd_prim'. See CT-Lib and/or server error
messages for more information.
E. 2006/06/13 17:05:56. ERROR #1028 USER(westss_ra) -
seful/cm.c(3318)
Message from server: Message: 911, State 2, Severity 11
-- 'Attempt to locate entry in sysdatabases for database
'westss_rssd' by name failed - no entry found under that
name. Make sure that name is entered properly.'.
I. 2006/06/13 17:05:56. Message from server: Message:
5701, State 1, Severity 10
-- 'Changed database context to 'master'.'.
E. 2006/06/13 17:05:56. ERROR #13043 dREC(dREC) -
ul/cmapp.c(888)
Failed to execute the 'USE westss_rssd' command on
server 'westss'. See CT-Lib and SQL Server error
messages for more information.

Explanation

The Adaptive Server that controls the RSSD was shut down and restarted while the Replication Server was running. As a result, the DIST and SQT threads for the databases controlled by the Replication Server were terminated. Replication to those databases stopped and does not resume automatically, even after the RSSD becomes available again.

Running the admin who_is_down command at the Replication Server shows that both the DIST and SQT threads are down:

Spid    Name      State      Info
----    ------    -------    ----------------------
        DIST      Down       westernDS.westDB
        SQT       Down       105:1 westernDS.westDB

Solution

To solve the problem:

  1. At the Replication Server, execute resume distributor for each database to resume SQT and DIST threads.

  2. Run admin who_is_down at the Replication Server to verify that the SQT and DIST threads for each database are no longer reported as down.
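
As a minimal sketch of the two steps above, the commands might be entered as follows in an isql session connected to the Replication Server. The session itself and the "go" batch terminator are assumptions about how the commands are issued. The database westernDS.westDB is taken from the sample admin who_is_down output in the Explanation; substitute each data_server.database pair that your own output lists as down.

```
-- Step 1: restart the distributor for the affected database.
-- Repeat this command once for every database whose DIST and SQT
-- threads are listed as Down.
resume distributor westernDS.westDB
go

-- Step 2: list the threads that are still down. The DIST and SQT
-- entries for westernDS.westDB should no longer appear.
admin who_is_down
go
```

If the DIST or SQT entries still appear after the resume, check the Replication Server error log for new connection errors against the RSSD before retrying.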