Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang

Asset ID:	1-72-1002125.1
Update Date:	2009-12-01
Keywords:

Solution Type Problem Resolution Sure

Solution 1002125.1 : Sun Fire[TM] 15K/12K/E20K/E25K Servers: RCM Daemon hanging, causing DR operations to hang

Related Items


Sun Fire E25K Server
 Sun Fire 15K Server

PreviouslyPublishedAs
203025

Symptoms
A remote Dynamic Reconfiguration (DR) operation (from the System Controller) or a
local DR operation (from the domain) works fine until a DR operation stops
responding. The $SMSLOGGER/domain_Id/messages file then reports messages like:

   DCA/DCS communication error

 and/or

   dca[...]-S(): [... ERR DCSInterface.cc 378] message receive failed:
   DCSInterface :: receiveResponse errCode:502

In some cases, it may not be possible to kill the associated process (cfgadm,
rcfgadm, deleteboard, showdevices).
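
The two log signatures quoted above can be checked for with a small helper. This is only a sketch: the function name has_dr_comm_error is hypothetical, and the file argument is whatever messages file applies to the domain being investigated.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given messages file
# contains either of the two signatures quoted in the Symptoms section.
# Two plain grep calls are used, with output discarded, so that no
# extended-regex or quiet option is required from grep.
has_dr_comm_error() {
    grep 'DCA/DCS communication error' "$1" >/dev/null ||
        grep 'receiveResponse errCode:502' "$1" >/dev/null
}
```

A typical call would be: has_dr_comm_error "$SMSLOGGER/domain_Id/messages" && echo "DCA/DCS error logged"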

Note :
For background information on the DR mechanism, please refer to:

Technical Instruction <Document: 1003582.1> - What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation

Note :
If none of the symptoms described in this article apply, and the rcm_daemon
stack does not match the signature shown in this article, then you are more
likely facing a different issue.
See the References section below for more troubleshooting steps.

Resolution
In the case of a remote DR operation (from the SC), trussing the command
confirms that it is waiting for a Domain Configuration Agent (dca), which is
itself waiting for a Domain Configuration Server (dcs).
For Example :

A showdevices command is waiting for an update from the dca process, via the
door to dca and the pipe (scdrN) to dca. A truss shows:

   18541/4:    0.3145 creat("/var/opt/SUNWSMS/SMS1.4.1/pipes/C/scdr0", 0666) = 8
   18541/4:    0.3149 pipe()                              = 8 [9]
   [...]
   18541/4:    1.5339 ioctl(8, I_RECVFD, 0xFE77BF24)   (sleeping...)
   18541/4:                fd=9     uid=11    gid=20

   18541/4:    0.3517 open("/var/opt/SUNWSMS/SMS1.4.1/doors/H/dca", O_RDONLY) = 7
   18541/4:        door_call(7, 0x00048CD8)        (sleeping...)
   18541/4:        door_call(7, 0x00048CD8)        (sleeping...)

Then, the dca process is waiting for a dcs process:

   29675/232:    13.3519 poll(0xFE3FBAF0, 1, 43200000)   (sleeping...)
   29675/232:            fd=12 ev=POLLIN rev=0

The nature of fd=12 can be determined by using the pfiles command :

# pfiles 29675
29675:  dca -d C
[...]
12: S_IFSOCK mode:0666 dev:308,0 ino:30682 uid:0 gid:0 size:0
O_RDWR
sockname: AF_INET 10.2.1.1  port: 39601
peername: AF_INET 10.2.1.4  port: 665

fd=12 is the socket connection between dca and dcs.
Note that dcs always uses TCP port 665, as shown by the following :

# grep sun-dr /etc/inetd.conf
sun-dr  stream  tcp     wait    root    /usr/lib/dcs    dcs
sun-dr  stream  tcp6    wait    root    /usr/lib/dcs    dcs
# grep sun-dr /etc/services
sun-dr          665/tcp                 # Remote Dynamic Reconfiguration
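
The pfiles check above can be sketched as a filter that lists the file descriptors whose peer is the sun-dr port. The function name dcs_peer_fds is hypothetical, and the parsing assumes the pfiles layout shown in this article (an "N: S_IF..." line followed by a "peername: ... port: N" line).

```shell
#!/bin/sh
# Hypothetical helper: reads pfiles output on stdin and prints every file
# descriptor whose peername line ends with port 665 (sun-dr).
dcs_peer_fds() {
    awk '
        # "12: S_IFSOCK ..." starts a new descriptor; "12:" + 0 yields 12.
        /^ *[0-9]+: S_IF/ { fd = $1 + 0 }
        # The port number is the last field of the peername line.
        /peername:/ && $NF == 665 { print fd }
    '
}
```

For the dca process shown above, "pfiles 29675 | dcs_peer_fds" would list fd 12.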

In this situation, it is likely that many dcs processes are running on the
domain, most of them stuck waiting for the rcm_daemon.

This situation may be easily confirmed by :
* trussing the dcs process(es),
* getting a pstack(1) output from the rcm_daemon process.

Trussing the dcs process(es) should confirm that they are all waiting for an
update from the Reconfiguration Coordination Manager daemon (rcm_daemon).

For Example :

# ptree 432
155   /usr/sbin/inetd -s
432   dcs
7122  dcs

   7122/1:    10.2993 open("/var/run/rcm_daemon_door", O_RDONLY)    = 8
   7122/1:    55.7444 door_call(8, 0xFFBEE978)        (sleeping...)

# pfiles 7122
7122:   dcs
[...]
8: S_IFDOOR mode:0400 dev:305,0 ino:40644 uid:0 gid:1 size:0
O_RDONLY  door to rcm_daemon[7053]

This confirms that dcs is waiting for the rcm_daemon via the door between the 2
processes.
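
As a sketch of the same check, the PID of the rcm_daemon behind the door can be pulled out of the pfiles output. The function name rcm_door_pid is hypothetical, and the pattern assumes the "door to rcm_daemon[PID]" wording shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads pfiles output on stdin and prints the PID of
# the rcm_daemon that a door descriptor points to, if there is one.
rcm_door_pid() {
    sed -n 's/.*door to rcm_daemon\[\([0-9][0-9]*\)\].*/\1/p'
}
```

For the dcs process shown above, "pfiles 7122 | rcm_door_pid" gives the PID to pass to pstack in the next step.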

Getting a pstack output from the rcm_daemon should report the following stack :

# pgrep rcm_daemon
7053
# pstack 7053

7053:   /usr/lib/rcm/rcm_daemon
[...]
-----------------  lwp# 5 / thread# 4  --------------------
ff09f3d8 lwp_mutex_lock (ff29cd10)
ff287698 fork1    (ff29c000, a, 35cc8, ff29d670, 534d, 1) + 50
0001ac1c run_script (0, 35cc0, 0, 0, 2, 35ca0) + 154
0001b4c4 do_cmd   (30910, fea0b62c, 30910, fea0b62c, 0, 35ca0) + 34
0001bf2c script_register_interest (35cd8, ffffffff, 0, 35ca0, 354c0, 0) + 98
000173fc rcmd_db_sync (308e8, 35c68, ffffffff, 19598, 19fbc, 0) + 7c
000195c0 rcmd_thr_incr (30e06, 89200, 6, fea0b798, 35f80, 0) + c4
00012bd8 event_service (fea0bc50, fea0bc54, 0, fea0bc88, 0, 0) + f4
ff2b40dc door_service (31f28, ff2c6000, b0, 31f28, 0, 4) + 64
ff09c9ec _door_return (0, 38, e0000, 1, 11, 72636d2e) + 68
[...]

This thread is blocked in the kernel, waiting for a lock.
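
The stack signature above can be matched mechanically. The following is a minimal sketch, not a definitive detector: the function name has_hang_signature is hypothetical, and it only checks that lwp_mutex_lock, fork1 and run_script all appear within one thread section of the pstack output, assuming the layout shown above and a POSIX awk.

```shell
#!/bin/sh
# Hypothetical helper: reads pstack output on stdin and succeeds if a single
# thread section contains lwp_mutex_lock, fork1 and run_script frames.
has_hang_signature() {
    awk '
        # A "-----  lwp# N / thread# N  -----" line starts a new thread.
        /^-+ +lwp#/ { in_lock = in_fork = in_script = 0 }
        # The function name is the second field of each stack frame line.
        $2 == "lwp_mutex_lock" { in_lock = 1 }
        $2 == "fork1"          { in_fork = 1 }
        $2 == "run_script"     { in_script = 1 }
        in_lock && in_fork && in_script { found = 1 }
        END { exit !found }
    '
}
```

A typical call would be: pstack 7053 | has_hang_signature && echo "rcm_daemon matches the hang signature"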

Killing the rcm_daemon should clear the hang: the next DR operation should
complete successfully, but the same symptom may come back.

On systems running rcm_daemon patch 116991-03 or later, rcm_daemon is linked with the alternate libthread.

Relief/Workaround
rcm_daemon can be started with the alternate libthread (/usr/lib/lwp) with a script:

#!/bin/sh
LD_LIBRARY_PATH=/usr/lib/lwp
export LD_LIBRARY_PATH
LD_LIBRARY_PATH_64=/usr/lib/lwp/64
export LD_LIBRARY_PATH_64

/usr/lib/rcm/rcm_daemon

or from the command line:

# pkill -9 rcm_daemon
# LD_LIBRARY_PATH=/usr/lib/lwp \
  LD_LIBRARY_PATH_64=/usr/lib/lwp/64 \
  /usr/lib/rcm/rcm_daemon

On the Solaris[TM] 8 Operating System, to verify that the rcm_daemon is using
the alternate libthread, run pldd(1) against the process. It should report
"/usr/lib/lwp/libthread.so.1" instead of "/usr/lib/libthread.so.1".

For Example :

# pgrep rcm
7204
# pldd 7204
7204:   /usr/lib/rcm/rcm_daemon
/usr/lib/libgen.so.1
/usr/lib/libelf.so.1
/usr/lib/libdl.so.1
/usr/lib/libcmd.so.1
/usr/lib/libdoor.so.1
/usr/lib/librcm.so.1
/usr/lib/lwp/libthread.so.1
/usr/lib/libnvpair.so.1
/usr/lib/libdevinfo.so.1
/usr/lib/libnsl.so.1
/usr/lib/libsocket.so.1
/usr/lib/libc.so.1
/usr/lib/libmp.so.2
/usr/platform/sun4u-us3/lib/libc_psr.so.1
/usr/lib/rcm/modules/SUNW_cluster_rcm.so
/usr/lib/rcm/modules/SUNW_dump_rcm.so
/usr/lib/rcm/modules/SUNW_filesys_rcm.so
/usr/lib/rcm/modules/SUNW_ip_rcm.so
/usr/lib/rcm/modules/SUNW_network_rcm.so
/usr/lib/rcm/modules/SUNW_swap_rcm.so
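
The pldd verification above can be sketched as a one-line filter. The function name uses_alt_libthread is hypothetical; it simply looks for the alternate library path in the pldd output.

```shell
#!/bin/sh
# Hypothetical helper: reads pldd output on stdin and succeeds if the
# process has loaded the alternate libthread from /usr/lib/lwp.
# The default /usr/lib/libthread.so.1 does not match this pattern.
uses_alt_libthread() {
    grep '/usr/lib/lwp/libthread\.so\.1' >/dev/null
}
```

A typical call would be: pldd 7204 | uses_alt_libthread || echo "rcm_daemon is still using the default libthread"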

For more details about the Alternate Libthread, see:
Sun Alert <Document: 1000512.1> Applications Linking to libthread May Hang

Additional Information
More technical details are available in:

* CR4825286 - RCM Daemon hanging causing DR operations to hang

In summary, this situation is due to a problem in the Solaris 8 OS libthread.

Pending a fix for the original root cause, SunOS 5.8 rcm_daemon patch 116991-03 has been released; it links rcm_daemon with the alternate libthread.
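
Whether a domain already has the patch can be checked from the installed patch list. The sketch below is an assumption-laden illustration: the function name has_rcm_patch is hypothetical, and it assumes patch list lines of the form "Patch: 116991-NN ..." as printed by "showrev -p".

```shell
#!/bin/sh
# Hypothetical helper: reads a patch list on stdin (assumed format:
# "Patch: 116991-NN ...") and succeeds if patch 116991 is installed at
# revision 03 or later.
has_rcm_patch() {
    awk '
        $1 == "Patch:" && $2 ~ /^116991-[0-9]+$/ {
            split($2, part, "-")          # part[2] is the revision
            if (part[2] + 0 >= 3) found = 1
        }
        END { exit !found }
    '
}
```

A typical call would be: showrev -p | has_rcm_patch || echo "patch 116991-03 or later not found"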

Product
Sun Fire E25K Server
Sun Fire 15K Server

Internal Comments
The following is strictly for the use of Sun employees:

References :

* Problem Resolution <Document: 1008803.1> - Sun Fire[TM] 12K/15K: showdevices can hang if sd.conf is large or misleading
* Problem Resolution <Document: 1008805.1> - Sun Fire[TM] 12K/15K/E20K/E25K: Remote Dynamic Reconfiguration (DR) generates "DCA/DCS Communication Error" and showdevices is 'Unable to get device information from domain'
* Technical Instruction <Document: 1003582.1> - What Happens in a Sun Fire[TM] 15K/12K DR Slot0 Detach Operation
* Technical Instruction <Solution: 208934> - Sun Fire[TM] 15K/12K Servers: SMS's signature of aborted DR operations
* Technical Instruction <Document: 1004922.1> - Sun Fire[TM] servers: Trouble-shooting RCM failures events in DR operations
* Problem Resolution <Document: 1009124.1> - Sun Fire[TM] 12K/15K: showdevices takes a long time to return
* FAQ: 2957 - Why do we have two set of thread libraries on Solaris 8?
* Problem Resolution <Document: 1012320.1> - Sun Fire[TM] 12K/15K/20K/25K: Domain reports "sun-dr/tcp: bind: Address already in use"

stephane.dutilleul@sun.com

Also, see: Internal BugID 6234740 - libthread`_co_timerset() may attempt to acquire _calloutlock twice

errCode:502, rcm, dcs, showdevices, rcm_daemon, DCA/DCS communication error
Previously Published As
80582

Change History
Date: 2006-01-19
User Name: 7058
Action: Update Canceled
Comment: *** Restored Published Content *** SSH AUDIT
Version: 0
Date: 2006-01-19
