Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution
1000008.1: Solaris Reboot Triggers Spurious SYSTEM Error in Adjacent Domain
Previously Published As: 200010
Product: Sun Fire 3800 Server, Sun Fire 4800 Server, Sun Fire 4810 Server, Sun Fire 6800 Server, Sun Fire E6900 Server, Sun Fire E4900 Server
Bug Id: <SUNBUG: 6300392>
Date of Workaround Release: 27-JUL-2005
Date of Resolved Release: 13-FEB-2006

Impact
Hardware error pause for AR L2CheckError may be asserted, causing an abrupt halt to processing within a domain, and hardware replacement will not resolve the issue.

Note: Internal testing has shown that L2CheckErrors of the type described in this alert can be reproduced with any firmware version lower than 5.19.7 or 5.20.3 by simulating an IO board DC-DC converter failure.

Contributing Factors
This issue can occur on the following platforms:
Notes:
To determine the version of ScApp on a system, run the following command from the platform shell:

    sc0:SC> showsc
    ...
    ScApp version: 5.19.4 Build_01
    RTOS version: 45

Symptoms
A Solaris reboot will cause the adjacent domain to fail with error pause. (Adjacent domains are those running within the same partition, either A and B or C and D.)

For a case where a Solaris reboot of Domain A causes a failure in Domain B, messages similar to the following may be seen on the SC Platform shell:

    Domain Reboot A: Initiating keyswitch: on, domain A.
    ErrorMonitor: Domain B has a SYSTEM ERROR
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf

These messages may be seen on the SC Domain B shell:

    ErrorMonitor: Domain B has a SYSTEM ERROR
    /N0/SB3 encountered the first error
    /N0/IB8 encountered the first error
    ArAsic reported first error on /N0/SB3
    /partition0/domain1/SB3/ar0:
    >>> L2CheckError[0x6150] : 0x01808100
        AccIncSyncErr [24:21] : 0xc accumulated incoming mismatch
        FE [15:15] : 0x1
        INCSyncErr [08:05] : 0x8 Ports [9:6] incoming mismatched against internal expected incoming
    ArAsic reported first error on /N0/IB8
    /partition0/domain1/IB8/ar0:
    >>> L2CheckError[0x6150] : 0x18189010
        CMDVSyncErr [12:09] : 0x8 Ports [9:6] command valid mismatched against internal expected command valid
        PreqSyncErr [04:01] : 0x8 Ports [9:6] prereq mismatched against internal expected prereq
        AccCMDVSyncErr [28:25] : 0xc accumulated valid command mismatch
        FE [15:15] : 0x1
        AccPreqSyncErr [20:17] : 0xc accumulated prerequisite mismatch
    [AD] Event: SF6800.ASIC.AR.CMDV_SYNC_ERR.102420cf

In this case, each SB and IB in the failing domain will report AR L2CheckError with either INCSyncErr or CMDVSyncErr. The adjacent domain which was being rebooted may reboot just fine.

Note: ArAsic indicates that this error was detected by the Address Repeater (AR) ASIC (Application-Specific Integrated Circuit) within the Sun Fireplane Switch.
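The field values in the sample messages follow directly from the raw L2CheckError register words. The sketch below is a minimal illustration of that bit-field extraction, not a ScApp tool: the field names and bit ranges are transcribed only from the sample output in this alert, so the table is an assumption and almost certainly not a complete register definition. The function and dictionary names are hypothetical.

```python
# Bit ranges (high, low) transcribed from the sample SC output in this alert.
# Assumption: this is only the subset of fields that appear in that output.
L2CHECKERROR_FIELDS = {
    "AccCMDVSyncErr": (28, 25),
    "AccIncSyncErr": (24, 21),
    "AccPreqSyncErr": (20, 17),
    "FE": (15, 15),
    "CMDVSyncErr": (12, 9),
    "INCSyncErr": (8, 5),
    "PreqSyncErr": (4, 1),
}


def extract_bits(value, high, low):
    """Return bits [high:low] of value as an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_l2checkerror(value):
    """Return a dict of the known fields that are non-zero in value."""
    decoded = {}
    for name, (high, low) in L2CHECKERROR_FIELDS.items():
        field = extract_bits(value, high, low)
        if field:
            decoded[name] = field
    return decoded


if __name__ == "__main__":
    # The two register words shown in the Domain B shell messages.
    for raw in (0x01808100, 0x18189010):
        fields = decode_l2checkerror(raw)
        print(hex(raw), {name: hex(v) for name, v in fields.items()})
```

For 0x01808100 this yields AccIncSyncErr 0xc, FE 0x1 and INCSyncErr 0x8, and for 0x18189010 it yields the five fields reported against /N0/IB8, matching the messages above. The mapping from a field value such as 0x8 to "Ports [9:6]" is not derived here because the alert does not define it.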
The AR L2CheckError indicates unexpected behavior of the switch's distributed arbitration protocol. The error is repeatable: a reboot of one domain causes the adjacent domain to fail, until the master system controller (SC) has been rebooted. A failover to the spare SC has the same effect as rebooting the master SC. Hardware replacement of the various FRUs which contain the Sun Fireplane Switch has no effect.

Workaround
To temporarily work around the described issue, reboot the primary SC with the "reboot" command.

Resolution
This issue is addressed on the following platforms:
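As a rough aid for checking a system against this alert, the sketch below parses the ScApp version line from showsc output and compares it with the 5.19.7 and 5.20.3 levels quoted in the Impact note. It is a sketch under stated assumptions: the note is read as "below 5.19.7 in the 5.19 train, below 5.20.3 in the 5.20 train", trains older than 5.19 are assumed to be exposed, and trains newer than 5.20 are assumed not to be. The function names are hypothetical and the alert does not confirm any of these readings.

```python
import re

# Levels quoted in the Impact note, keyed by firmware train (major, minor).
# Assumption: each level applies to its own train only.
REPRODUCED_BELOW = {(5, 19): 7, (5, 20): 3}


def parse_scapp_version(showsc_output):
    """Extract (major, minor, patch) from a 'ScApp version:' line."""
    match = re.search(r"ScApp version:\s*(\d+)\.(\d+)\.(\d+)", showsc_output)
    if match is None:
        raise ValueError("no ScApp version line found")
    return tuple(int(part) for part in match.groups())


def may_be_exposed(version):
    """Return True if the version is below the levels in the Impact note."""
    major, minor, patch = version
    train = (major, minor)
    if train in REPRODUCED_BELOW:
        return patch < REPRODUCED_BELOW[train]
    # Assumption: older trains are exposed, newer trains are not.
    return train < (5, 19)


if __name__ == "__main__":
    sample = "ScApp version: 5.19.4 Build_01\nRTOS version: 45"
    version = parse_scapp_version(sample)
    print(version, may_be_exposed(version))
```

With the showsc sample from this alert (5.19.4), the check reports the system as potentially exposed.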
Modification History
Date: 13-FEB-2006
Date: 05-DEC-2006
References
<SUNPATCH: 114526-09>
<SUNPATCH: 114527-04>

Previously Published As: 101819

Internal Comments
Bug(s) added per Thomas.Favara@sun.com

There are a number of problems which manifest as L2CheckError. The key to this one is:

1) The error is reported by the AR, not the SDC.
2) A reboot of one domain causes the other domain within the partition to error pause.
3) The condition persists until the SC is rebooted.

The more common type of AR L2CheckError includes DiffErr and is usually coupled with SafariPortError; in that case the parity error (SafariPortError with AdrPErr) is the cause of the problem. Another similar AR L2CheckError is described in SunAlert 101857.

If possible, please collect the following data for the problem described in this SunAlert. As these steps involve engineering mode commands, you will need to open an escalation with PTS to get an engineering mode password and to review the commands to be executed.

First, script (capture) the output of:

1. The tty of the master SC
2. A telnet session to the platform shell
3. A telnet session to the domain shell for each affected domain
4. Any platform shell session you will be using

Then, make a note of and then disable any domain error recovery. We also set diag-level to quick to minimize the time spent in POST while we gather data:

    for each domain:
      setupdomain -p boot
        diag-level = quick
        reboot-on-error = false
        hang-policy = notify
        OBP.error-reset-recovery = none
      showdomain (to confirm settings)

This will help to preserve the error condition and also prevent any ping-pong where the error recovery triggers another instance of the error in the adjacent domain.

With both domains in the affected partition up and running, gather the output of:

    showboards -ev
    showplatform -v
    for each SB and RB:
      dumpregs //sb1
      dumpregs //sb3
      etc...
      dumpregs //rp0
      dumpregs //rp1
      etc...
    for each IB:
      dumpregs //ib6/ar0
      dumpregs //ib6/sdc0
      dumpregs //ib6/dx0
      dumpregs //ib6/dx1
      dumpregs //ib8/ar0
      dumpregs //ib8/sdc0
      dumpregs //ib8/dx0
      dumpregs //ib8/dx1
      etc...
    nvci
    showdate
    history

Note that just using dumpregs //ibX will cause the Schizo ASIC on that IB to be scanned, and that will cause a domain failure. JTAG scan on a Schizo in an active domain remains a problem in Serengeti.

Now, start a trace of the Safari ASIC programming sequence. The messages produced will only appear on the involved tty/platform/domain shell; that is why we have scripted the output of all of the shells and the tty port.

    showdate
    print ConsoleComm.setDebugLevel(1)
    showdate

(It is important to issue a showdate every once in a while so we can sort out all of the output later on.)

Once the trace has started, issue a reboot in one of the affected domains in order to cause the error. Keep detailed notes about when the reboot was issued, on which domain, etc.

When the failure occurs, re-issue the dumpregs to get a picture of the ASICs in the failed state. (This is why we turned off the domain error recovery earlier.)

    showdate
    showboards -ev
    showplatform -v
    for each SB and RB:
      dumpregs //sb1
      dumpregs //sb3
      etc...
      dumpregs //rp0
      dumpregs //rp1
      etc...
    for each IB:
      dumpregs //ib6/ar0
      dumpregs //ib6/sdc0
      dumpregs //ib6/dx0
      dumpregs //ib6/dx1
      dumpregs //ib8/ar0
      dumpregs //ib8/sdc0
      dumpregs //ib8/dx0
      dumpregs //ib8/dx1
      etc...
    nvci
    showdate
    history

Now, reboot the SC to correct the problem. This will also turn off tracing and take you out of engineering mode.

    reboot (from the SC platform shell)

Bring the failed domain back up and attempt to recreate the problem. Once both domains are up and stable, collect a final set of dumpregs data:

    showdate
    showboards -ev
    showplatform -v
    for each SB and RB:
      dumpregs //sb1
      dumpregs //sb3
      etc...
      dumpregs //rp0
      dumpregs //rp1
      etc...
    for each IB:
      dumpregs //ib6/ar0
      dumpregs //ib6/sdc0
      dumpregs //ib6/dx0
      dumpregs //ib6/dx1
      dumpregs //ib8/ar0
      dumpregs //ib8/sdc0
      dumpregs //ib8/dx0
      dumpregs //ib8/dx1
      etc...
    nvci
    showdate
    history

Re-establish the customer's original domain recovery settings:

    for each domain:
      setupdomain -p boot
        diag-level
        reboot-on-error
        hang-policy
        OBP.error-reset-recovery
      showdomain (to confirm settings)

Again, it is very important to script all of the ScApp shells and take notes throughout the data collection so that the persons analyzing the data can follow the sequence of events.

Internal Contributor/submitter: Hal.Mounce@sun.com
Internal Eng Business Unit Group: SSG ES (Enterprise Systems)
Internal Eng Responsible Engineer: Hal.Mounce@sun.com
Internal Services Knowledge Engineer: david.mariotto@sun.com
Internal Escalation ID: 1-9647455, 1-10144987, 1-9568590, 1-10502561
Internal Resolution Patches: 114526-09, 114527-04
Internal Sun Alert Kasp Legacy ID: 101819

Internal Sun Alert & FAB Admin Info
Critical Category: Availability ==> Diagnosis
Significant Change Date: 2005-07-27, 2006-02-13
Avoidance: Patch, Workaround
Responsible Manager: Peter.Gonscherowski@sun.com
Original Admin Info:
[WF 05-Dec-2006, dave m: patch added, republish]
[WF 13-Feb-2006, Dave M: patch added, BugID removed per Mgmt and Tom Favara, rerelease]
[WF 27-Jul-2005, Dave M; correction to Synopsis]
[WF 27-Jul-2005, Dave M; corrections made per submitter; send for release]
[WF 26-Jul-2005, Dave M; sent for 24hr review]
[WF 25-Jul-2005, Dave M; corrected draft copy sent by submitter]
[WF 19-Jul-2005, Dave M; draft created]

Product_uuid
29d05214-0a18-11d6-92b2-a111614865b5|Sun Fire 3800 Server
29d3a694-0a18-11d6-92da-df959df44cdd|Sun Fire 4800 Server
29d6f808-0a18-11d6-8aa8-943929fbbdd8|Sun Fire 4810 Server
29da7938-0a18-11d6-8a41-9ed1ad6d6779|Sun Fire 6800 Server
4fe39727-0599-11d8-84cb-080020a9ed93|Sun Fire E6900 Server
bed24aa9-0598-11d8-84cb-080020a9ed93|Sun Fire E4900 Server
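The data collection procedure in the Internal Comments repeats the same command sequence three times (before the failure, in the failed state, and after the SC reboot). The sketch below shows one way to generate that sequence for a given set of boards so that the three passes stay identical. It only builds a list of strings and does not talk to an SC. The function name and the board numbers in the example are hypothetical, and reading the flattened "nvci showdate history" as three separate commands is an assumption.

```python
# ASICs dumped individually on each IB, as listed in the procedure.
IB_ASICS = ("ar0", "sdc0", "dx0", "dx1")


def collection_commands(system_boards, repeater_boards, io_boards):
    """Build the command list for one data collection pass."""
    commands = ["showdate", "showboards -ev", "showplatform -v"]
    commands += [f"dumpregs //sb{n}" for n in system_boards]
    commands += [f"dumpregs //rp{n}" for n in repeater_boards]
    for n in io_boards:
        # Deliberately never emit a bare "dumpregs //ibN": per the procedure,
        # that scans the Schizo ASIC and fails the domain.
        commands += [f"dumpregs //ib{n}/{asic}" for asic in IB_ASICS]
    # Assumption: these are three separate commands in the original text.
    commands += ["nvci", "showdate", "history"]
    return commands


if __name__ == "__main__":
    # Board numbers taken from the examples in the procedure.
    for line in collection_commands([1, 3], [0, 1], [6, 8]):
        print(line)
```

Generating the list once and replaying it for each pass makes the three captures easier to compare side by side.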
Attachments
This solution has no attachment