Document Audience: INTERNAL
Document ID: I0958-1
Title: Replacement of 900MHz System Boards by 1200MHz boards in Sun Fire 12K/15K platforms may fail if the proper installation procedure is not followed.
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: 2003-05-02

---------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by SunService)
FIN #: I0958-1
Synopsis: Replacement of 900MHz System Boards by 1200MHz boards in Sun Fire 12K/15K platforms may fail if the proper installation procedure is not followed.
Create Date: May/02/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire 12K/15K
Product Category: Server / Service
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID     Platform     Model      Description          Serial Number
------     --------     -----      -----------          -------------
  -          F12K        ALL       Sun Fire 12000             -
  -          F15K        ALL       Sun Fire 15000             -


X-Options Affected:
-------------------
Mkt_ID        Platform   Model   Description                     Serial Number    
------        --------   -----   -----------                     ------------- 
X4006A           -         -     ASSY CPU 2PROC USIIIP 900+MHZ         -
X4007A           -         -     ASSY CPU 4PROC USIIIP 900+MHZ         -
Parts Affected: 
Part Number      Description                       Model
-----------      -----------                       -----
540-5051-05      ASSY CPU 2PROC USIIIP 900+MHZ       -
540-5052-06      ASSY CPU 4PROC USIIIP 900+MHZ       -
References: 
N/A
Issue Description: 
If a 900MHz System Board in a Sun Fire 12K/15K system is physically
replaced by a 1200MHz System Board too quickly, the System Management
Services (SMS) 'esmd' daemon will not be able to properly acknowledge
the change.  As a result, 'esmd' will fail the 1200MHz board.  This
failure could be interpreted as a hardware fault, resulting in
unnecessary replacement of the 1200MHz System Board.

This issue can occur with any Sun Fire 12K/15K system where a 900MHz
System Board is being upgraded to a 1200MHz System Board.

Upon removal of an existing 900MHz System Board, 'esmd' will
acknowledge and log the event in /var/adm/platform/messages.  Upon
insertion of the 1200MHz System Board, 'esmd' will acknowledge and log
the event. The timeframe required by 'esmd' to acknowledge each
individual event is thirty (30) seconds. 

For example:
  
   . 900MHz System Board is removed, and the event is logged:
  
       esmd[7167]: [0 4824421445907014 NOTICE Boards.cc 1646] CPU at 
                   SB16 removed
  
   . 1200MHz System Board is inserted, and the event is logged:
  
       esmd[7167]: [0 4824886762342552 NOTICE Cabinet.cc 860] CPU at 
                   SB16 inserted
  
If the 900MHz System Board is removed, and the 1200MHz board is
inserted in less than 30 seconds, the two events will be neither
acknowledged nor logged by 'esmd'.  After 'poweron' of the new System
Board, 'esmd' will report the following errors:
  
   esmd[23597]: [1919 4873257798654128 ERR DetectorV.cc 448] A low 
   voltage or power supply has been detected on Core0, located on CPU 
   at SB2.  The voltage detected is 1.36v; should be 1.53v to 1.70v. 
   PROCPAIR at SB2/PP0 is being removed from the domain and powered 
   off.  Check all hardware for the cause.

   esmd[23597]: [1919 4873258029915842 ERR DetectorV.cc 448] A low
   voltage or power supply has been detected on Core1, located on CPU
   at SB2.  The voltage detected is 1.37v; should be 1.53v to 1.70v.
   PROCPAIR at SB2/PP0 is being removed from the domain and powered
   off.  Check all hardware for the cause.

   esmd[23597]: [1919 4873258399829778 ERR DetectorV.cc 448] A low
   voltage or power supply has been detected on Core2, located on CPU
   at SB2.  The voltage detected is 1.37v; should be 1.53v to 1.70v.
   PROCPAIR at SB2/PP1 is being removed from the domain and powered
   off.  Check all hardware for the cause.

   esmd[23597]: [1919 4873258489747583 ERR DetectorV.cc 448] A low
   voltage or power supply has been detected on Core3, located on CPU
   at SB2.  The voltage detected is 1.36v; should be 1.53v to 1.70v.
   PROCPAIR at SB2/PP1 is being removed from the domain and powered
   off.  Check all hardware for the cause.

   esmd[23597]: [0 4873258534792163 NOTICE SysControl.cc 3358] 
                Component PROCPAIR at SB2/PP0 has been blacklisted
   esmd[23597]: [0 4873258549903380 NOTICE SysControl.cc 3358] 
                Component PROCPAIR at SB2/PP1 has been blacklisted
   esmd[23597]: [1930 4873258557519333 NOTICE SysControl.cc 4162] 
                PROCPAIR at SB2/PP0 has been powered off: ecode=0


POST will confirm that the components have been blacklisted by ASR:
-------------------------------------------------------------------
Reading system ASR blacklist file 
/etc/opt/SUNWSMS/config/asr/blacklist ...
portpair     2.0.0      # ESMD Low-Minumum Voltage 0321.0216.42
portpair     2.0.1      # ESMD Low-Minumum Voltage 0321.0216.42
slot         2.0        # ESMD Sensor Read Failure 0321.0230.57
-------------------------------------------------------------------
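
The blacklist entries shown by POST can be split into component type,
location, and reason with a few lines of Python.  This is only a
sketch: the three-field layout (type, dotted location, '#' comment) is
inferred from the sample output above, and parse_blacklist is a
hypothetical helper name, not an SMS utility.

```python
import re

def parse_blacklist(text):
    """Return a list of (kind, location, reason) tuples."""
    entries = []
    for line in text.splitlines():
        entry, _, comment = line.partition("#")
        fields = entry.split()
        # Keep only lines shaped like "<type> <n.n[.n]>"; banner lines,
        # separators, and blank lines are skipped.
        if len(fields) != 2 or not re.fullmatch(r"\d+(\.\d+)*", fields[1]):
            continue
        entries.append((fields[0], fields[1], comment.strip()))
    return entries

post_output = """\
Reading system ASR blacklist file
/etc/opt/SUNWSMS/config/asr/blacklist ...
portpair     2.0.0      # ESMD Low-Minumum Voltage 0321.0216.42
portpair     2.0.1      # ESMD Low-Minumum Voltage 0321.0216.42
slot         2.0        # ESMD Sensor Read Failure 0321.0230.57
"""
for kind, location, reason in parse_blacklist(post_output):
    print(kind, location, reason)
```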
 
The Environmental Status Monitoring Daemon (esmd) maintains a look-up
table that is populated with the Vcore values of the resident System
Boards. These values are:

    ----------------------------------------------------
   | Type |  Processor, clock speed           |  Vcore  |
   |------+-----------------------------------+---------|
   | CH   |  US III    750MHz, 900MHz         |  1.7100 |
   |------+-----------------------------------+---------|
   | CH+  |  US III Cu 900MHz, 1050MHz        |  1.6150 |
   |------+-----------------------------------+---------|
   | CH++ |  US III Cu 1050MHz, 1200MHz       |  1.3775 |
    ----------------------------------------------------

When a 900MHz System Board is removed without 'esmd' acknowledgment,
followed by insertion of a 1200MHz System Board which is also not
logged, the Vcore value in the look-up table is retained at 1.6150,
the value for the removed 900MHz board, instead of being updated to
1.3775 for the 1200MHz board.  'esmd' then checks the new board's
normal core voltage against the old limits, and the newly installed
System Board fails with "The voltage detected is 1.37v; should be
1.53v to 1.70v."
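
The arithmetic behind this false failure can be sketched in Python.
The nominal values come from the look-up table above.  The +/- 5%
window is an assumption inferred from the logged limits (1.53v to
1.70v around 1.6150); it is not taken from SMS documentation, and the
function names are illustrative.

```python
# Nominal Vcore values from the look-up table in this notice.
VCORE_NOMINAL = {"CH": 1.7100, "CH+": 1.6150, "CH++": 1.3775}

# Assumed tolerance: inferred from "should be 1.53v to 1.70v", which
# matches 1.6150 -/+ 5% after rounding to two decimals.
TOLERANCE = 0.05

def vcore_window(kind):
    """Return the (low, high) limits for a processor type."""
    nominal = VCORE_NOMINAL[kind]
    return (round(nominal * (1 - TOLERANCE), 2),
            round(nominal * (1 + TOLERANCE), 2))

def in_window(measured, kind):
    """True if the measured voltage lies inside the assumed window."""
    low, high = vcore_window(kind)
    return low <= measured <= high

# Stale table entry (900MHz board) versus the correct entry (1200MHz).
print(vcore_window("CH+"), in_window(1.37, "CH+"))
print(vcore_window("CH++"), in_window(1.37, "CH++"))
```

Under this assumption, a reading of 1.37v is out of range against the
retained 1.6150 entry but in range against the 1.3775 entry, which is
consistent with a healthy 1200MHz board being failed.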

The failure scenario can be avoided by allowing 'esmd' 30 seconds to
acknowledge each event: first the System Board removal, then the new
System Board insertion.  Both events can be verified in
/var/adm/platform/messages:
 
   esmd[7167]: [0 4824421445907014 NOTICE Boards.cc 1646] CPU at 
               SB16 removed
   esmd[7167]: [0 4824886762342552 NOTICE Cabinet.cc 860] CPU at 
               SB16 inserted
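
This verification can also be done with a short script.  The Python
sketch below checks that the most recent events for a board are a
removal followed by an insertion.  It assumes each esmd message
occupies a single line in the log (the entries above are wrapped for
display); the pattern and the function names are illustrative only.

```python
import re

# Matches the event text shown in this notice, e.g. "CPU at SB16 removed".
EVENT = re.compile(r"esmd\[\d+\]:.*\bCPU at (SB\d+) (removed|inserted)\b")

def board_events(log_text, board):
    """Return the ordered 'removed'/'inserted' events for one board."""
    events = []
    for line in log_text.splitlines():
        m = EVENT.search(line)
        if m and m.group(1) == board:
            events.append(m.group(2))
    return events

def swap_was_logged(log_text, board):
    """True if the last two events are a removal, then an insertion."""
    return board_events(log_text, board)[-2:] == ["removed", "inserted"]

log = (
    "esmd[7167]: [0 4824421445907014 NOTICE Boards.cc 1646] CPU at SB16 removed\n"
    "esmd[7167]: [0 4824886762342552 NOTICE Cabinet.cc 860] CPU at SB16 inserted\n"
)
print(swap_was_logged(log, "SB16"))
```

In practice log_text would be the contents of
/var/adm/platform/messages, read on the System Controller.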
Implementation: 
         ---
        |   |   MANDATORY (Fully Proactive)
         ---    
         
  
         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the above
mentioned problem. 

Use the following example procedure to correctly replace a 900MHz 
System Board with a 1200MHz System Board:
                     
   1.  From the SMS command line interface enter the command:

       poweroff sb16
      
   2.  SB16 is now powered off and ready for removal.

   3.  Physically remove the System Board from the platform.  'esmd' 
       will acknowledge and log this event in /var/adm/platform/messages.
       Allow 'esmd' 30 seconds to recognize and log this event, and
       confirm that the line below has been logged before continuing:
    
          esmd[7167]: [0 4824421445907014 NOTICE Boards.cc 1646] CPU at 
                      SB16 removed
    
   4. Physically install the 1200MHz System Board. 'esmd' will acknowledge 
      and log this event in /var/adm/platform/messages.  Allow 'esmd' 
      30 seconds to recognize and log this event, and confirm that the 
      entry below has been logged:
   
         esmd[7167]: [0 4824886762342552 NOTICE Cabinet.cc 860] CPU at 
                     SB16 inserted
   
   5. Board replacement is complete.
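
The waiting in steps 3 and 4 can be sketched as a small polling
helper in Python.  Everything here is hypothetical: wait_for_event is
not an SMS command, the 120 second timeout and 5 second poll interval
are arbitrary choices, and read_log is a caller-supplied function that
is assumed to return only the log text written since the step began,
with each esmd message on a single line.

```python
import time

def wait_for_event(read_log, board, action, timeout=120.0, poll=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until 'CPU at <board> <action>' is logged or time runs out."""
    needle = f"CPU at {board} {action}"
    deadline = clock() + timeout
    while True:
        if needle in read_log():
            return True          # esmd has logged the event
        if clock() >= deadline:
            return False         # not logged within the timeout
        sleep(poll)

# Demonstration with a canned reader instead of the real log file.
def fake_reader():
    return ("esmd[7167]: [0 4824421445907014 NOTICE Boards.cc 1646] "
            "CPU at SB16 removed\n")

print(wait_for_event(fake_reader, "SB16", "removed"))
```

A False result would mean the event was not seen in time, in which
case the next step of the procedure should not be started.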

If the failure scenario described above does occur, do not assume the
newly inserted 1200MHz System Board is faulty.  Instead, remove the
System Board, verify that 'esmd' has logged the removal event, then
insert the System Board and verify that 'esmd' has logged the insertion
event.  Finally, use the 'enablecomponent' command to remove the
components from the ASR blacklist.
Comments: 
None.

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Sun Services will attempt to contact   
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to finfco-manager@Sun.COM
--------------------------------------------------------------------------
Status: active