Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1239993.1
Update Date:2010-10-21
Keywords:

Solution Type  FAB (standard) Sure

Solution  1239993.1 :   SPARC Enterprise T5440 with at least one memory module having 12 DIMMs may experience intermittent silent HOST Hardware hangs at the OS level.  


Related Items
  • Sun SPARC Enterprise T5440 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive




In this Document
  Symptoms
  Changes
  Cause
  Solution


Oracle Confidential (PARTNER). Do not distribute to customers
Reason: FABs available to Internals and Partners only

Applies to:

Sun SPARC Enterprise T5440 Server - Version: Not Applicable to Not Applicable - Release: NA to NA
Information in this document applies to any platform.
__________

SUNBUG 6856773
SUNBUG 6844624
SUNBUG 6925700
__________

Affected Parts:

541-2551-04 - Memory Module, T5440
541-3791-02 - Memory Module, T5440, 800MHz
541-3908-03 - Service Processor Assembly (SP+)
541-2751-09 - Service Processor Assembly (SP)

Symptoms

Host not responding and all applications stop.

- Ping Host (no response)
- Try logging into host (no response)

Service Processor (SP) operations are not affected.

- Ping Service Processor will succeed
- Service Processor login will succeed

Service Processor commands such as the following will give no indication of this type of failure:

sc>showpower
sc>showenvironment
sc>showfmerptlog
sc>showlogs
sc>showfaults

A HOST power cycle is required to get the system back to normal operation.
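
The symptom pattern above can be written down as a simple triage check. The following is a minimal sketch only: the `Observation` fields and the `matches_silent_hang` function are hypothetical names introduced for illustration, and the values are assumed to be filled in by hand from the ping, login, and SP command checks listed above.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Manually gathered results of the checks listed under Symptoms."""
    host_ping_ok: bool      # did the HOST answer a ping?
    host_login_ok: bool     # did a HOST login succeed?
    sp_ping_ok: bool        # did the Service Processor answer a ping?
    sp_login_ok: bool       # did a Service Processor login succeed?
    sp_reports_fault: bool  # did any SP command output show a fault?


def matches_silent_hang(obs: Observation) -> bool:
    """True when the observation fits the silent-hang pattern:
    HOST unreachable, SP reachable, and nothing reported by the SP."""
    host_unresponsive = not obs.host_ping_ok and not obs.host_login_ok
    sp_responsive = obs.sp_ping_ok and obs.sp_login_ok
    return host_unresponsive and sp_responsive and not obs.sp_reports_fault
```

A match only suggests that this FAB may apply; it does not replace normal troubleshooting.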

Impact

Intermittent silent HOST Hardware hangs at the OS level.

These silent hangs are random and cause the HOST to stop functioning; the Service Processor (SP) is not affected. No data can be captured from any ILOM or host log file to indicate a hardware or software problem.

Changes

Contributing Factors

Only SPARC Enterprise T5440 systems with at least one memory module having 12 DIMMs are impacted by this issue.

The two configurations known to date to have experienced this failure mode are listed below.

- SEVPBJF1Z (602-4158-0x): 2 x 1.2GHz, 8 core, 64GB (16 x 2GB) 667 MHz
  FB-DIMMs, 2 x 146GB SAS 2.5" HDD, 4 x 1120W PSU, Slim SATA DVD RW,
  2 Memory Expansion Boards plus 16x SESY2C1Z to get to a fully loaded
  256GB.

- SEVPGSF1U (602-4183-0x): 4 x 1.2GHz, 8 core, 128GB (32 x 4GB) 667 MHz
  FB-DIMMs, 2 x 146GB SAS 2.5" HDD, 4 x 1120W PSU, Slim SATA DVD RW,
  4 Memory Expansion Boards plus 16x SESY2C1Z to get to a fully loaded
  256GB.

Cause

Root Cause

A small number of the memory modules listed in the Affected Parts section above have been seen to generate an intermittent memory module Power OK (POK) fault, which can lead to the HOST system silently hanging.

The root cause of the issue has been isolated to the DC-DC converters (DC208) on the memory module intermittently reporting false POK glitches, which can in turn lead to the system being reset.

Engineering has improved the reporting of, and resilience to, false POK glitches by changing the FPGA code (4.1.7.4) and SysFW (7.2.9.a) firmware that resides on the system's SP module.

New revisions of the two SP modules F541-3908-04 (SP+) and F541-2751-10 (SP) containing the new firmware are now available, although the SysFW for module F541-2751-10 will need to be upgraded manually once this SP is installed.

Engineering has also updated the memory module DC-DC converters. New versions of the memory module are F541-2551-05 and F541-3791-03.
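
The part revisions named in this document can be collected into a small lookup table. This is an illustrative sketch: the pairing of each affected part with its new revision is inferred from the matching base part numbers, and the names `REPLACEMENTS`, `NEEDS_MANUAL_SYSFW_UPGRADE`, and `replacement_for` are hypothetical.

```python
# Affected part -> new revision, paired by shared base part number
# (an inference from this document, not an official cross-reference).
REPLACEMENTS = {
    "541-2551-04": "F541-2551-05",  # Memory Module, T5440
    "541-3791-02": "F541-3791-03",  # Memory Module, T5440, 800MHz
    "541-3908-03": "F541-3908-04",  # Service Processor Assembly (SP+)
    "541-2751-09": "F541-2751-10",  # Service Processor Assembly (SP)
}

# Per this document, the SysFW on this SP must be upgraded manually
# to 7.2.9.a once the module is installed.
NEEDS_MANUAL_SYSFW_UPGRADE = {"F541-2751-10"}


def replacement_for(part_number):
    """Return (new_part, needs_manual_fw_upgrade), or None if the part
    is not one of the affected revisions listed in this document."""
    key = part_number.strip().upper()
    if key.startswith("F"):
        key = key[1:]  # accept both "541-..." and "F541-..." spellings
    new_part = REPLACEMENTS.get(key)
    if new_part is None:
        return None
    return new_part, new_part in NEEDS_MANUAL_SYSFW_UPGRADE
```

For example, `replacement_for("541-2751-09")` yields the SP module F541-2751-10 together with a flag indicating that the manual SysFW upgrade is still needed.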

Solution

Workaround

No workaround available - see Resolution section below.

Resolution

If the HOST is experiencing a hang as described, one of the updated Service Processor modules, part number F541-3908-04 (SP+) or part number F541-2751-10 (SP), is required to perform fault isolation. The latter (part number F541-2751-10) requires a manual update to FW 7.2.9.a.

If a system hang continues to be observed with F541-3908-04 (SP+) installed, or with F541-2751-10 (SP) installed and updated to FW 7.2.9.a, then in addition to normal troubleshooting you should pay close attention to the output of the 'showlogs' command on the SP, for example:

   sc>showlogs
   Mar 22 15:35:25: Chassis |major : "Host has been powered on"
   Mar 22 15:38:59: Chassis |major : "Host is running"
   Mar 23 15:13:32: Chassis |minor : "POK Glitch: /SYS/MB/MEM3"

If the system hang occurred at the same time as the "POK Glitch" message above was reported, then memory module #3 (/SYS/MB/MEM3) should be replaced.

In this particular example the memory module should be replaced with either an F541-2551-05 or an F541-3791-03 module, whichever is required for the specific customer configuration.
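
The check described above, finding "POK Glitch" entries in saved 'showlogs' output and comparing them with the time of the hang, can be sketched as follows. This is a minimal illustration under stated assumptions: the log lines are assumed to follow the format of the example above, the year must be supplied because the log lines do not carry one, the five-minute matching window is an arbitrary choice, month names are assumed to be parsed in an English locale, and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Mar 23 15:13:32: Chassis |minor : "POK Glitch: /SYS/MB/MEM3"
POK_LINE = re.compile(
    r'^\s*(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}):'
    r'.*"POK Glitch:\s*(?P<target>/SYS/MB/MEM\d+)"'
)


def find_pok_glitches(showlogs_text, year):
    """Return a list of (timestamp, target) for each POK Glitch line."""
    hits = []
    for line in showlogs_text.splitlines():
        match = POK_LINE.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            hits.append((stamp, match.group("target")))
    return hits


def modules_near_hang(hits, hang_time, window=timedelta(minutes=5)):
    """Return the memory modules whose glitch was logged within
    `window` of the observed hang time (window size is an assumption)."""
    return sorted(
        {target for stamp, target in hits if abs(stamp - hang_time) <= window}
    )
```

Applied to the example output above, `find_pok_glitches` would report a single entry for /SYS/MB/MEM3; whether that entry coincides with the hang still has to be judged against the time the hang was actually observed.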

If you have any questions regarding implementation of this FAB for a customer under an existing Service Request, please open a CollaborationTask for GL-VSP.

Comments

This issue was fully evaluated and determined not to meet FCO criteria due to the extremely low failure rate that has been experienced to date.

References


   BugID:6856773, 6844624, 6925700
   Escalation ID: IBIS SR 71048600
   Resolution Patches: SysFw 7.2.9.a 139446-11
   Reference Manual: SPARC Enterprise T5440 Server Service Manual 820-3801-11
   ECO:42858, 42844, 42934
   GSAP: 5252, 5253, 5268, 5282
   Related URL(s):

    https://support.us.oracle.com/handbook_internal/Devices/Memory/MEM_SE_T5440_Memory_Module.html

    https://support.us.oracle.com/handbook_internal/Systems/SE_T5440/components.html#SystemServiceProcessor


For information about FAB documents, their release processes, implementation strategies, and billing information, go to the following URL:

* http://tns.central/fab

In addition to the above, you may email:

* FAB-Manager@sun.com


Contacts

Contributor: fernando.bonaventura@oracle.com, donald.palko@oracle.com, joe.carr@oracle.com, dencho.kojucharov@oracle.com, matt.finch@oracle.com

Responsible Engineer: bruce.alford@oracle.com
Responsible Manager: steve.doherty@oracle.com
Business Unit Group: Systems Group-SVS (SPARC Volume Systems, Horizontal Systems; includes T2000/Ontario)


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.