Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1018867.1
Update Date: 2011-02-04
Keywords:

Solution Type: Problem Resolution Sure

Solution  1018867.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: Mailbox Framework Failures, Even Without DR Operation  


Related Items
  • Sun Fire E25K Server
  • Sun Fire 12K Server
  • Sun Fire E20K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
230667


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server - Version: Not Applicable and later    [Release: N/A and later]
Sun Fire E25K Server - Version: Not Applicable and later    [Release: N/A and later]
All Platforms

Symptoms


Sun Fire[TM] 12K/15K domains can log the following warnings even when no
Dynamic Reconfiguration (DR) operation is in progress:
drmach: [ID 757311 kern.warning] WARNING: mboxsc_putmsg failed: 0x91
drmach: [ID 355606 kern.warning] WARNING: cmd =0xb, exb = 0, slot = 0
drmach: [ID 702911 kern.warning] WARNING: Mailbox framework failure: outgoing
drmach: [ID 109634 kern.warning] WARNING: reinitializing DR mailbox
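
To check whether a domain has logged this sequence, the warnings can be counted in its messages file. The following is a minimal sketch written for a POSIX shell. It assumes the domain logs to the Solaris default /var/adm/messages; the function name count_mbox_warnings is illustrative only.

```shell
#!/bin/sh
# Count the drmach mailbox warnings in a syslog file.
# Assumption: the domain logs to /var/adm/messages unless a file is given.
count_mbox_warnings() {
    file=${1:-/var/adm/messages}
    # grep -c prints the number of matching lines (0 when there are none);
    # "|| true" keeps a zero-match exit status from aborting a strict shell.
    putmsg=$(grep -c 'drmach: .*mboxsc_putmsg failed' "$file" || true)
    failure=$(grep -c 'drmach: .*Mailbox framework failure' "$file" || true)
    reinit=$(grep -c 'drmach: .*reinitializing DR mailbox' "$file" || true)
    echo "mboxsc_putmsg failed: $putmsg"
    echo "Mailbox framework failure: $failure"
    echo "reinitializing DR mailbox: $reinit"
}
```

Usage on a domain: count_mbox_warnings /var/adm/messages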

Cause

See the Solution section.

Solution

Resolution Steps

- Check whether any cfgadm-based programs are running on the Sun Fire 12K/15K domain(s) and triggering the warnings.
Examples:

  • cfgadm: Configuration and Service Tracker ('/opt/SUNWcst/bin/cstd -b')
  • libcfgadm: Sun Management Center ('esd')
  • cfgadm: Sun Explorer, when it runs the cfgadm command
  • ...
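
One way to see which of these consumers are active on a domain is to filter the process list for the names given above. This is a sketch, assuming the daemons appear in ps output as cstd and esd; the function name find_cfgadm_users is illustrative. A one-shot cfgadm run (for example from Sun Explorer) is visible only while it executes.

```shell
#!/bin/sh
# Filter a process listing read from stdin for the cfgadm consumers named
# in this document: cstd (CST), esd (Sun Management Center), and cfgadm.
# Note: on Solaris, grep -E may require /usr/xpg4/bin/grep (or use egrep).
find_cfgadm_users() {
    # Match the name only as a whole word or final path component.
    grep -E '(^|[ /])(cstd|esd|cfgadm)( |$)'
}
```

Usage on a domain: ps -ef | find_cfgadm_users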

- Check for signs of lock contention reported on the System Controllers (SCs).
Locking issues appear in the platform messages files /var/opt/SUNWSMS/SMS/adm/platform/messages* as:

  • "attempting to lock Global I2C at ..."
  • "Failed(1215) to get system lock SC at ..."
  • "No Functioning Network to Remote SC" from remote SC ...

- Search the knowledge database for known hwad lock-contention issues, then apply the fix or stop and restart the SMS daemons.
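
The lock-contention messages listed above can be searched for in one pass. This is a minimal sketch that defaults to the platform messages path given in this document; the function name scan_lock_contention is illustrative, and the second pattern deliberately omits the "Failed(1215)" prefix on the assumption that the numeric code could vary.

```shell
#!/bin/sh
# Search SMS platform logs on the SC for the lock-contention messages.
# With no arguments, use the path glob given in this document.
# Note: on Solaris, grep -F may require /usr/xpg4/bin/grep (or use fgrep).
scan_lock_contention() {
    if [ "$#" -eq 0 ]; then
        set -- /var/opt/SUNWSMS/SMS/adm/platform/messages*
    fi
    # -F treats each line of the pattern list as a fixed string.
    grep -F 'attempting to lock Global I2C
to get system lock SC
No Functioning Network to Remote SC' "$@"
}
```

Usage on the SC: scan_lock_contention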

Any SMS version can be affected. It is strongly recommended that you upgrade to SMS 1.6 and, until then, apply any available patches for the version you are currently running.

Relief/Workaround

A short-term workaround may be to fail over the SCs.

Follow these steps to perform the failover safely and avoid an uncontrolled state in which no SC is available.

  • Check that a spare SC is available and healthy:
    • showfailover -v
  • Make sure there are no more files to propagate from the main SC to the spare SC:
    • showdatasync
  • Force a failover to the spare SC:
    • setfailover force
  • As soon as the new main SC is ready and the new spare SC has been reset, enable failover:
    • setfailover on
  • Check that everything is back to a normal state:
    • showfailover -v

You can rerun the same procedure to fail back to the former main SC.
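
The failover sequence can also be wrapped in a small script that pauses for operator confirmation between steps. This is a sketch only: it uses just the commands listed in this document, does not parse their output (the operator judges whether each result looks healthy), and defaults to a dry run that only prints the commands. It assumes every command can be issued from one SMS session, which may not hold once the SC roles swap, so treat it as a checklist rather than automation. The names run_step, sc_failover, and DRY_RUN are illustrative.

```shell
#!/bin/sh
# Guided SC failover using only the SMS commands named in this document.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

run_step() {
    desc=$1
    shift
    echo "== $desc"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
        return 0
    fi
    "$@" || { echo "step failed: $*" >&2; return 1; }
    # The operator inspects the output; this sketch does not interpret it.
    printf 'Continue? [y/N] '
    read -r answer
    [ "$answer" = "y" ]
}

sc_failover() {
    run_step "Check that the spare SC is available and healthy" showfailover -v &&
    run_step "Check that no files remain to propagate" showdatasync &&
    run_step "Force a failover to the spare SC" setfailover force &&
    run_step "Enable failover once both SCs are ready" setfailover on &&
    run_step "Confirm that the state is back to normal" showfailover -v
}
```

Calling sc_failover with the default DRY_RUN=1 prints the five steps in order without touching the SCs.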

Additional Information

Even when no DR operation is in progress, cfgadm or libcfgadm can be
in use. Using cfgadm or libcfgadm involves communication with the System
Controller via mailboxes and requests to hwad, which manages the locks that
protect hardware accesses.

If hwad experiences lock contention for any reason (for example, under
certain loads it takes a long time to access shared hardware resources
such as the DARBs), it can be slow to respond to requests, including
requests from cfgadm.

Sun Fire 12K/15K domains that run programs based on 'cfgadm' or
'libcfgadm' are prone to these warnings because such programs issue
requests on a regular basis.

The lock contention is not necessarily a permanent situation. It can be a
consequence of hwad being very busy at that moment.
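
To judge whether the contention is transient, it can help to see how the warnings are distributed over time. This is a sketch, assuming the standard syslog line prefix (month, day, HH:MM:SS, hostname) that the excerpts in this document omit; the function name mbox_failures_per_hour is illustrative.

```shell
#!/bin/sh
# Count "Mailbox framework failure" warnings per hour.
# Assumption: each line starts with "Mon DD HH:MM:SS host ...".
mbox_failures_per_hour() {
    # -h suppresses file name prefixes when several files are given.
    grep -h 'Mailbox framework failure' "$@" |
    awk '{ n[$1 " " $2 " " substr($3, 1, 2) ":00"]++ }
         END { for (k in n) print k, n[k] }' |
    sort   # lexical order, not strictly chronological across months
}
```

Usage on a domain: mbox_failures_per_hour /var/adm/messages*. A few isolated hours with hits suggest a busy hwad at those times; hits in every hour suggest a persistent problem.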


@Internal Comments
For Internal Use Only

References and bug IDs:
Escalations: 549864, 543000
Internal Bug Reports: 4931012 - memory allocation failure from hwad
Internal Bug Reports: 4966974 - WARNING: Mailbox framework failure:outgoing
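Failure code 0x91 is hexadecimal for 145, which Solaris <sys/errno.h> defines as ETIMEDOUT (connection timed out). ERESTART (restartable system call) is 91 decimal, not 0x91.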
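
The warning prints the failure code in hexadecimal, while errno.h lists decimal values, so converting the code first avoids a misread. A small sketch; hex_to_dec is an illustrative name, and the header path in the usage line is the usual Solaris location, stated here as an assumption.

```shell
#!/bin/sh
# Convert the hexadecimal code from the mboxsc_putmsg warning to decimal
# so that it can be looked up in <sys/errno.h>.
hex_to_dec() {
    # printf accepts C-style integer constants, including the 0x prefix.
    printf '%d\n' "$1"
}
```

Usage: hex_to_dec 0x91 prints 145; the matching definition can then be found with grep -w 145 /usr/include/sys/errno.h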
This subject has been discussed on the starcat-support alias and logged to the Techmail Archives.
Note: SRDB 73129 was combined with this document.
stephane.dutilleul@sun.com

Product
Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server

Keywords

dr, mailbox, framework, failure, cfgadm, cst, lock, outgoing, 0x91, mboxsc,
hwad, 15K, 12K, SF15K, SF12K, Starcat, E25k, E20K

@Previously Published As 73116

@Change History
Date: 2010-05-04
User Name: Cootware
Action: Content Team Review
Comment: Updated products list, content, and keywords.

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.