Document Audience:INTERNAL
Document ID:I0570-3
Title:Provide instructions and guidelines to CORRECTLY identify Ecache panic fault diagnosis.
Copyright Notice:Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:2000-07-07

---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)
FIN #: I0570-3
Synopsis: Provide instructions and guidelines to CORRECTLY identify Ecache panic fault diagnosis.
Create Date: Jul/08/00
Keywords: 

Provide instructions and guidelines to CORRECTLY identify Ecache panic fault diagnosis.

Top FIN/FCO Report: Yes
Products Reference: Ecache panic fault diagnosis
Product Category: Desktop / System CPU Module ; Server / System CPU Module
Product Affected: 
Mkt_ID   Platform    Model   Description              Serial Number
------   --------    -----   -----------              -------------
Systems Affected
----------------
  -       A12         ALL    Ultra 1                        -  
  -       A14         ALL    Ultra 2                        -
  -       A23         ALL    Ultra 60                       -
  -       A27         ALL    Ultra 80                       -  
  -       Netra t1    1120   Netra t1 Server                -
  -       Netra t1    1125   Netra t1 Server                -
  -       A26         ALL    Enterprise 250                 -
  -       A25         ALL    Enterprise 450                 -
  -       A34         ALL    Enterprise 220R                -
  -       A33         ALL    Enterprise 420R                -   
  -       E3000       ALL    Ultra Enterprise 3000          -
  -       E3500       ALL    Ultra Enterprise 3500          -
  -       E4000       ALL    Ultra Enterprise 4000          -
  -       E4500       ALL    Ultra Enterprise 4500          -
  -       E5000       ALL    Ultra Enterprise 5000          -
  -       E5500       ALL    Ultra Enterprise 5500          -
  -       E6000       ALL    Ultra Enterprise 6000          -
  -       E6500       ALL    Ultra Enterprise 6500          -
  -       E10000      ALL    Ultra Enterprise 10000         -
  -       E450-HPC    ALL    Ultra Enterprise 450 HPC       -
  -       E3500-HPC   ALL    Ultra Enterprise 3500 HPC      -
  -       E4500-HPC   ALL    Ultra Enterprise 4500 HPC      -
  -       E5500-HPC   ALL    Ultra Enterprise 5500 HPC      -
  -       E6500-HPC   ALL    Ultra Enterprise 6500 HPC      -
  -       E10000-HPC  ALL    Ultra Enterprise 10000 HPC     -
 
Mkt_ID   Platform   Model   Description   Serial Number
------   --------   -----   -----------   -------------
 
X-Options Affected
------------------

X2580A     -         -      400MHz UltraSPARC II Module 8MB cache  -
X2570A     -         -      400MHz UltraSPARC II Module 4MB cache  -
X2560A     -         -      336MHz UltraSPARC II Module 4MB Cache  -
X2550A     -         -      250MHz UltraSPARC II Module 4MB Cache  -
X2530A     -         -      250MHz UltraSPARC II Module 1MB Cache  -
X2510A     -         -      167MHz UltraSPARC II Module 1MB Cache  -
X2500A     -         -      167MHz UltraSPARC II Module .5MB Cache -
Parts Affected: 
Part Number   Description   Model
-----------   -----------   -----
 
501-2941-0X   167 MHz UltraSPARC II Module .5MB Cache     -
501-2959-0X   167 MHz UltraSPARC II Module 1MB Cache      -
501-4178-0X   250 MHz UltraSPARC II Module 1MB Cache      -
501-4249-0X   250 MHz UltraSPARC II Module 4MB Cache      -
501-4836-0X   250 MHz UltraSPARC II Module 4MB Cache      -
501-4363-0X   336 MHz UltraSPARC II Module 4MB Cache      -
501-2702-03   167 MHz UltraSPARC Module                   -
501-2942-0X   167 MHz UltraSPARC Module                   -
501-3041-0X   200 MHz UltraSPARC Module                   -
501-4791-0X   200 MHz UltraSPARC Module                   -
501-4278-0X   250 MHz UltraSPARC II Module                -
501-4857-0X   250 MHz UltraSPARC II Module                -
501-4196-0X   300 MHz UltraSPARC II Module                -
501-4849-0X   300 MHz UltraSPARC II Module                -
501-4363-0X   333 MHz UltraSPARC II Module                -
501-4363-0X   336 MHz UltraSPARC II Module                -
501-5129-0X   360 MHz UltraSPARC II Module                -
501-4781-0X   360 MHz UltraSPARC II Module                -
501-5237-0X   400 MHz UltraSPARC II Module                -
501-5541-0X   400 MHz UltraSPARC II Module                -
501-5682-0X   440 MHz UltraSPARC II Module                -
501-5539-0X   450 MHz UltraSPARC II Module                -
501-4995-0X   400 MHz UltraSPARC II Module                -
501-5425-0X   400 MHz UltraSPARC II Module                -
501-5585-0X   400 MHz UltraSPARC II Module                -
501-5235-0X   400 MHz UltraSPARC II Module                -
501-5661-0X   400 MHz UltraSPARC II Module                -
501-5762-0X   400 MHz UltraSPARC II Module                -
501-5344-0X   450 MHz UltraSPARC II Module                -
501-4477-0X   270 MHz UltraSPARC IIi Module               -
501-5039-0X   270 MHz UltraSPARC IIi Module               -
501-4379-0X   300 MHz UltraSPARC IIi Module               -
501-5040-0X   300 MHz UltraSPARC IIi Module               -
501-5090-0X   333 MHz UltraSPARC IIi Module               -
501-5568-0X   333 MHz UltraSPARC IIi Module               -
501-5148-0X   360 MHz UltraSPARC IIi Module               -
501-5222-0X   360 MHz UltraSPARC IIi Module               -
501-5149-0X   440 MHz UltraSPARC IIi Module               -
References: 
URL:   http://cte-www.uk/cgi-bin/afsr/afsr.pl
       http://cte-www.eng/cgi-bin/afsr/afsr.pl
Issue Description: 
Summary
=======
 
The current diagnostic procedure for identifying the faulty CPU module
from CPU Ecache parity errors is flawed and could easily lead to the
wrong CPU being replaced. This can result in an FFA outcome of "no
trouble found" (NTF), with the customer continuing to experience Ecache
parity panics.
 
An Ecache parity error in a particular CPU's asynchronous fault status
register (AFSR) does not necessarily indicate that this CPU's Ecache
module is faulty.
 
Detail
======
 
The Spitfire and Blackbird (UltraSPARC I/II) chips have five different
modes of Ecache parity failure:
 
Type    Solaris panic string
=====   ====================
ETP     Ecache Tag Parity Error
WP      [Ecache]Writeback Data Parity Error
EDP     Ecache SRAM Data Parity Error
CP      [Ecache]Copyout Data Parity Error
UE CP   UE Error: Ecache Copyout on CPUnn
 
Current Field procedures are as follows:
========================================
ETP, WP, EDP, CP	replace panicking CPU
UE CP			replace indicated CPU "CPUnn"
 
These procedures are based on the understanding that if a CPU's
asynchronous fault status register (AFSR) indicates that an Ecache
parity error has been detected, then that CPU's Ecache module was the
one that directly suffered the parity error.
 
This understanding turns out to be false: a parity error in one CPU's
AFSR can also indicate that the CPU module has detected an ECC error in
incoming data from the UPA bus. This in turn can indicate that an
Ecache parity error has been detected on another CPU.
 
Since the handling of this uncorrectable ECC error is done at a
relatively low priority (as a level 2 software interrupt), it is
possible for another thread on this CPU to interrupt or preempt this
handling, notice the bad Ecache and panic, giving a false failure
indication.
 
In order to be more precise about where the Ecache error occurred, it
is necessary for the detecting (panicking) CPU to query all the other
CPUs, retrieve their AFSRs, and decode them. Currently, Solaris does
not do this in every case (this will be corrected in a future version
of Solaris).  In the UE CP case, the other CPUs are queried, but their
AFSRs are not always correctly decoded.
 
For ETP, bits <19:16> of the AFSR contain the tag parity syndrome
(ETS), which indicates which groups of the 25-bit Ecache Tag bus
experienced the parity error(s).
 
For all but the ETP type, the bottom 16 bits of the AFSR contain the
data parity syndrome (P_SYND), which indicates which 8-bit groups of
the 128-bit Ecache Data bus experienced the parity error(s).
 
For a given CPU's AFSR, if only a single bit is set in P_SYND or ETS,
this indicates a single-bit parity error.  In this case it is likely
(although not guaranteed) that this CPU's Ecache is the one which
experienced the original parity error.
 
If many adjacent bits in P_SYND[15:8] or P_SYND[7:0] are set, it
becomes less likely that this CPU's Ecache is at fault.
 
Note, however, that even if a CPU's Ecache experiences a single-bit
parity error, it is quite likely that the error will never recur.
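 
To make the bit positions above concrete, the following minimal Python
sketch (illustrative only, not a supported Sun tool) extracts the two
syndrome fields and applies the rough "single bit vs. many bits"
reasoning just described.  Only the P_SYND [15:0] and ETS [19:16]
positions are taken from this FIN; the example AFSR value and the
function names are hypothetical.
 
    # Illustrative only: decode the two AFSR syndrome fields named in this
    # FIN (P_SYND in bits [15:0], ETS in bits [19:16]); the remaining AFSR
    # fields are not decoded here.

    def decode_syndromes(afsr):
        """Return (p_synd, ets) extracted from a 64-bit AFSR value."""
        p_synd = afsr & 0xFFFF          # data parity syndrome, bits [15:0]
        ets = (afsr >> 16) & 0xF        # tag parity syndrome, bits [19:16]
        return p_synd, ets

    def rough_assessment(afsr):
        """A single syndrome bit suggests the reporting CPU's own Ecache;
        many bits set make that progressively less likely (see text above)."""
        p_synd, ets = decode_syndromes(afsr)
        bits = bin(p_synd).count("1") + bin(ets).count("1")
        if bits == 1:
            return "single-bit syndrome: this CPU's Ecache is the likely source"
        return "multi-bit syndrome: this CPU's Ecache is less likely at fault"

    # Hypothetical AFSR with only P_SYND bit 8 set:
    print(rough_assessment(0x0000000000000100))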
 
A CPU receives data from the system via the UPA bus, which is protected
via ECC; the UDB-UPA data bus is 128 bits wide, and is split into two
halves, each half having its own ECC bits. The data is received onto
the CPU module by the two UltraSPARC Data Buffers (variously known as
UDB, SDB, BDB), each of which deals with one half of the UPA data &
ECC.
 
If one of the UDBs detects an uncorrectable ECC error for this incoming
data, it sends 8 bits of bad parity to the CPU, Ecache SRAM, or both.
This stops the bad data from being subsequently used by the system; the
resulting AFSR will indicate an uncorrectable error (UE).  (Bad parity
is seen by the CPU, but because the UDB is signaling a UE the CPU does
not set the LDP bit, nor does it store anything in P_SYND.)
 
The uncorrectable ECC error itself may have resulted from a variety of
causes.  One such cause is related to Ecache parity errors: when a UDB
reads data from the Ecache SRAM (or from the CPU itself) to send out
over the UPA bus and detects a parity error, it sends bad ECC over the
UPA bus to the requester.  This mechanism can be thought of as
notification of the bad parity to the other CPUs in the system.
 
NOTE: If a recordstop dump occurred on E10k systems, "redx" can usually
      be used to further narrow down the faulty CPU module to one of two
      possibilities.

Update for FIN I0570-2:
-----------------------
In this -2 revision, the following updates were made to FIN I0570-1:

  1) The CORRECTIVE ACTION has been updated as follows:
     . The recommendation "replace the CPU module along with the
       containing system board" has been changed to "replace just the
       CPU module."
     . The number of cases referenced has been reduced from 5 to 2.

  2) The Starfire-specific section has been moved from the CORRECTIVE
     ACTION section to the COMMENT section.

  3) The word "uncorrectable" has been added in front of "ECC error" in
     the final 2 paragraphs of the PROBLEM DESCRIPTION.

  4) Added the new secondary URL http://cte-www.eng/cgi-bin/afsr/afsr.pl,
     which will be developed as a backup to the existing primary URL
     http://cte-www.uk/cgi-bin/afsr/afsr.pl and contains the same AFSR
     Decoder information.

Update for FIN I0570-3:
-----------------------
In this -3 revision, the following changes have been made:

  1) The CORRECTIVE ACTION has been updated as follows:

     . The FRU replacement guideline in section A has been changed to
       reflect the most current guideline for the CPU module.  The
       original recommendation to replace the CPU module along with the
       containing system board is technically correct; however, it
       should be implemented only where possible.
Implementation: 
---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
          
   
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
          
                                 
         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
Enterprise Customers and authorized Field Service Representatives may
avoid the above mentioned problems by following the recommendations
as shown below:

A] Please note that all of the ACTIONs require the following guidelines
   to be followed:
  
      If this is the first such failure for this CPU, do nothing.
    
      If a second Ecache error occurs on the same CPU module, the error
      is no longer transient.  In that case, follow the instructions
      below:
    
      For Sunfire, replace the CPU assembly or, WHERE AVAILABLE, the 
      system board.
    
      For Starfire, blacklist the module, then replace the CPU assembly
      or, WHERE AVAILABLE, the system board when capacity is needed.
    
      PLEASE NOTE:  FULL SYSTEM BOARD LEVEL SUNFIRE AND STARFIRE FRUS
      WILL BE PHASED-IN DURING Q2FY01 (P/NS TBD).
    
      The above guidelines were extracted from the Best Practices website.
      To see the most current FRU replacement guidelines at any time, visit
      the website.
      
      FRU Replacement Guideline: for guidelines as to which FRU(s) to replace, 
      if any, please refer to the Best Practices website:

	   http://bestpractices.central
       
B]  To aid diagnosis of an Ecache-related panic, Enterprise Services engineers
    should refer to the Computer Systems CTE AFSR Decoder, available at either 
    of the following locations:

           http://cte-www.uk/cgi-bin/afsr/afsr.pl
        
           http://cte-www.eng/cgi-bin/afsr/afsr.pl

These web pages allow an AFSR taken from a Solaris panic string to be
entered, and return a concise summary.  They may ask you to e-mail the
full panic string to CTE (afsr-decode@uk.sun.com) for further
investigation.  In any event, they will refer you back to this FIN for
further action.

C] Recommended Action

    1) If only one or two bits are set in P_SYND, and:

	1a) CP is not set:

		- The panicking CPU is likely to be faulty.
		- ACTION: Use above guideline.

	1b) CP is set:

		- The CPU indicated in the panic message is likely to be faulty,
		NOT the panicking CPU.  That is, if the message is
		"Ecache Copyout on CPUx", then CPUx is the faulty CPU.
		- ACTION: Use above guideline.

    2) If three or more bits are set in P_SYND, and:

	2a) UE is set, and P_SYND is 0x00ff or 0xff00:

		- The panicking CPU is almost certainly not faulty.
		- ACTION: Do NOT replace any CPUs.
	
	2b) otherwise:

		- The panicking CPU is probably not faulty.
		- ACTION: Do NOT replace any CPUs.

NOTE: In the latter two cases (2a & 2b), we have no way of telling which
      CPU *is* faulty; Solaris does not record the AFSRs for the other
      CPUs.  In these two cases it is recommended that the Field await a
      failure which can be correctly diagnosed before replacing any
      CPU; mass swap-outs may merely aggravate the problem.
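
The decision table above can be summarized in the following illustrative
Python sketch (not a supported Sun tool).  The UE and CP indications are
passed in as booleans because this FIN does not document their AFSR bit
positions; P_SYND occupies bits [15:0] of the AFSR as described in the
Issue Description.

    # Illustrative summary of section C only.  UE and CP are supplied by
    # the caller; P_SYND is the bottom 16 bits of the AFSR.

    def recommended_action(p_synd, cp_set, ue_set):
        bits_set = bin(p_synd & 0xFFFF).count("1")

        if bits_set in (1, 2):                            # case 1
            if not cp_set:                                # case 1a
                return ("Panicking CPU is likely faulty: "
                        "apply the FRU guideline in section A.")
            # case 1b: the CPU named in the panic message, not the
            # panicking CPU, is implicated ("Ecache Copyout on CPUx")
            return ("CPU indicated in the panic message is likely faulty: "
                    "apply the FRU guideline in section A.")

        if bits_set >= 3:                                 # case 2
            if ue_set and p_synd in (0x00FF, 0xFF00):     # case 2a
                return ("Panicking CPU is almost certainly not faulty: "
                        "do NOT replace any CPUs.")
            # case 2b
            return ("Panicking CPU is probably not faulty: "
                    "do NOT replace any CPUs.")

        return "No P_SYND bits set: not covered by the cases above."

    # Example: many adjacent P_SYND bits set together with a signalled UE
    # falls under case 2a.
    print(recommended_action(0xFF00, cp_set=False, ue_set=True))
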
Comments: 
Starfire-specific
-----------------
 
If a recordstop dump was generated on a Starfire system, it can be used
to narrow down the faulty CPU module to one of two possibilities.
 
The "wfail" command in "redx" identifies the XDB that detected incoming
bad ECC from a CPU's UDB, which in turn  identifies a pair of CPUs, one
of which will have suffered the original Ecache parity error.
 
In the example below, CPU31 panicked as follows:
 
panic[cpu31]/thread=0x71a15ba0: CPU31 Ecache SRAM Data Parity Error:
 
	AFSR 0x00000000 8060ff00 AFAR 0x0000000c 021d0a20
 
However, the "redx" utility shows the incoming UE ECC on XDB 7.0, which
implicates either CPU28 (proc 7.0) or CPU29 (proc 7.1) as having the
original Ecache parity error:
 
	ssp0:dom2% redx -lc
	redxl> dumpf load Edd-Record-Stop-Dump-02.11.08:03
	redxl> wfail
		
	LAARB 7     ErrorCSR3[63:0]: Hist: 0 N 0000    Flgs = 000 00100000
		ErrCSR3[20]: Recordstop Requested by XDB0 (LAARB)
	XDB   7.0   EccErrFlags[11:0] = 308
		EccFlg[3]: Uncorrectable error in  psi bus hi half, bits      
                           [143:72]
		EccFlg[11:8]: Error count = 3
	psi [143:72]= 75 00000000 00400000 (xmux_par[5:0]= 1D)  syn= 03: D 
	FAIL proc 7.0: Arbstop/Recordstop detected by xdb.
	FAIL proc 7.1: Arbstop/Recordstop detected by xdb.
		
 
In particular, the "syn= 03" indicates that the UDB->XDB UE ECC error
was deliberately generated by the CPU's UDB in response to detecting an
Ecache parity error during a copyout operation.  Only a "psi" reported
error can be trusted in this scenario.  In this example, we can say
that either CPU28 or CPU29 had the Ecache fault, based on XDB 7.0's
"psi" reported error.
 
Note of Caution: Conversely, an XDB could report an "ldat" error with a
syndrome of 03, which includes the same data pattern and xmux_par values.
In these cases, the XDB which reports the "ldat" error is the XDB for the
"victim" CPU.  For example, XDB 7.1 for CPU30 (proc7.2) and CPU31 (proc7.3)
reflects the error below:
 
	redxl> shxdb -e 7 1
        XDB	7.1	EccErrFlags[11:0] = B80
        	EccFlg[7]: Uncorrectable error in ldat bus hi half, bits
        		[143:72]
         	EccFlg[11:8]: Error count = B
	ldat[143:72]= 75 00000000 00400000 (xmux_par[5:0]= 1D)  syn= 03: D 
 
In essence, an "ldat" error reported by an XDB actually proves that the
CPUs it services are victims of another CPU's Ecache parity error, and
can therefore be used to exonerate the attached CPUs (CPU30 (proc7.2)
and CPU31 (proc7.3) in this example).
 
This second XDB report is not usually present in the wfail output, but
it can appear depending on other variables in a Starfire platform.  It
is also possible that a recordstop dump has no XDB-reported "psi" error,
but does include an XDB-reported "ldat" error with the 03 syndrome.  In
these rare cases, the "ldat" error exonerates the attached CPUs, and
might be traced back through the Centerplane X-Bar to the System Board
where the data originated.  However, it cannot be used to implicate
which CPU or CPUs on that System Board might have actually sourced the
data with the 03 syndrome.

Note: If returning the system board for E10000, it is still necessary to 
      complete the Fault Tag and E10K Failure Information Form (reference 
      process in FIN I0375-1).
 
--------------------------------------------------------------------------
Implementation Footnote: 
i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
    
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/
 
* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
  
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/
 
* From there, select the appropriate link to browse the FIN or FCO index.
 
Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
   
* From there, follow the hyperlink path of "Enterprise Services Documenta- 
  tion" and click on "FIN & FCO attachments", then choose the appropriate   
  folder, FIN or FCO.  This will display supporting directories/files for 
  FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@sdpsweb.EBay
---------------------------------------------------------------------------
Status: Inactive