Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1001165.1
Update Date: 2010-09-01
Keywords:

Solution Type  FAB (standard) Sure

Solution  1001165.1 :   System Hangs After MCE (Machine Check Exception) Correctable Memory Errors With 1GB Micron DIMMs In Slots 1 And 2.  


Related Items
  • Sun Ultra 20 Workstation
  • Sun Fire X2100 Server
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

Previously Published As
201559


Product
Sun Fire X2100 Server
Sun Ultra 20 Workstation

Bug Id
<SUNBUG: 6408744>

Part
  • Part No: 370-7944-01
  • Part Description: 2x1GB Micron DIMMs MT18VDDT12872AY-40BD1 DIMM Micron
Part
  • Part No: 540-6465-01
  • Part Description: ASSY,2GB DDR400 ECC: 2 X 1GB
Xoption
  • Xoption Number: X8006A
  • Xoption Description: OPT MEMORY 2 GB (2x1GB)

Impact

When marginal 1GB Micron DIMMs are installed in DIMM slots 1 and 2, correctable ECC memory errors can occur. In extreme cases the system may hang after the memory errors. The cause is DIMMs manufactured at the edge of tolerance levels being placed in slots 1 and 2, which are more susceptible to signal integrity issues because of bus layout and length.

Some examples of errors that can be seen when marginal 1GB Micron DIMMs are placed in slots 1 and 2:

Mar 27 13:36:04 testsystem sshd(pam_unix)[16554]: session opened for user root by (uid=0)
Mar 27 15:20:14 testsystem kernel: CPU 0: Silent Northbridge MCE
Mar 27 15:20:14 testsystem kernel: Northbridge status 946ac001:00000813
Mar 27 15:20:14 testsystem kernel:     Error ecc error
Mar 27 15:20:14 testsystem kernel:     bus error local node origin, request didn't time out
Mar 27 15:20:14 testsystem kernel:     generic read
Mar 27 15:20:14 testsystem kernel:     memory access, level generic
Mar 27 15:20:14 testsystem kernel:     link number 0
Mar 27 15:20:14 testsystem kernel:     err cpu1
Mar 27 15:20:14 testsystem kernel:     corrected ecc error
Mar 27 15:20:14 testsystem kernel:     previous error lost
Mar 27 15:20:14 testsystem kernel:     NB error address 000000004ed1a430
Mar 27 15:31:09 testsystem automount[19280]: lookup(ldap): got answer, but no first entry for
(&(objectclass=nisObject)(cn=budny))

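This second example shows a hang followed by a reboot. Logging stops after the MCE at 03:33:42 and resumes only when syslogd restarts at 08:19:52: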

Mar 28 03:33:42 testsystem kernel: CPU 0: Silent Northbridge MCE
Mar 28 03:33:42 testsystem kernel: Northbridge status 946ac002:00000813
Mar 28 03:33:42 testsystem kernel:     Error ecc error
Mar 28 03:33:42 testsystem kernel:     bus error local node origin, request didn't time out
Mar 28 03:33:42 testsystem kernel:     generic read
Mar 28 03:33:42 testsystem kernel:     memory access, level generic
Mar 28 03:33:42 testsystem kernel:     link number 0
Mar 28 03:33:42 testsystem kernel:     err cpu0
Mar 28 03:33:42 testsystem kernel:     corrected ecc error
Mar 28 03:33:42 testsystem kernel:     previous error lost
Mar 28 03:33:42 testsystem kernel:     NB error address 000000004ed0c630
Mar 28 08:19:52 testsystem syslogd 1.4.1: restart.
Mar 28 08:19:52 testsystem syslog: syslogd startup succeeded
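
The log excerpts above can be searched for automatically. The following is a minimal Python sketch, not a Sun-supplied tool. It assumes the syslog lines have the same layout as the examples in this document; the function name find_nb_mce_events and the field names in the returned dictionaries are hypothetical.

```python
import re

# Patterns are modeled only on the sample log lines in this document
# (assumption: other kernel versions may format these messages differently).
MCE_START = re.compile(
    r"(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+kernel:\s+"
    r"CPU\s+(\d+):\s+Silent Northbridge MCE"
)
NB_STATUS = re.compile(
    r"kernel:\s+Northbridge status\s+([0-9a-fA-F]+:[0-9a-fA-F]+)"
)
NB_ADDRESS = re.compile(r"kernel:\s+NB error address\s+([0-9a-fA-F]+)")


def find_nb_mce_events(lines):
    """Collect one dictionary per 'Silent Northbridge MCE' block in lines."""
    events = []
    current = None
    for line in lines:
        start = MCE_START.search(line)
        if start:
            # A new MCE block begins; record its timestamp, host and CPU.
            current = {
                "time": start.group(1),
                "host": start.group(2),
                "cpu": int(start.group(3)),
                "status": None,
                "address": None,
                "corrected": False,
            }
            events.append(current)
            continue
        if current is None:
            continue
        if " kernel: " not in line:
            # A non-kernel message ends the current MCE block.
            current = None
            continue
        status = NB_STATUS.search(line)
        address = NB_ADDRESS.search(line)
        if status:
            current["status"] = status.group(1)
        elif address:
            current["address"] = address.group(1)
        elif "corrected ecc error" in line:
            current["corrected"] = True
    return events
```

To use it, pass the lines of the system log (for example, the contents of /var/log/messages, a path assumed here) to find_nb_mce_events. Repeated events with "corrected" set to True are the symptom described in this document. The sketch does not identify the DIMM slot; use PcCheck for that.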

The issue can be verified with the PcCheck diagnostics. An example of the error that PcCheck reports:

Failed Microtopology test(uTL)Dimm slot A0 "Last failure
00000000:0AF5CC70 Coupled bits detected, read 0008000AH"

To run the PcCheck diagnostics follow the steps below:

  1. Boot the system with the Supplemental CD (either from an optical drive or via PXE)
  2. At the main menu select "Run Hardware Diagnostics"
  3. At the PcCheck main menu select "Advanced Diagnostic Tests"
  4. At the Advanced Diagnostic Tests menu select "Memory"
  5. Then select "Test System Memory"

Root Cause

Resolution

When 1GB Micron DIMMs installed in slots 1 and 2 exhibit correctable ECC memory errors, move them to slots 3 and 4 and retest before replacing any DIMMs. If the errors continue in slots 3 and 4, assume a DIMM is failing and replace the pair using normal DIMM replacement procedures and policies. If the errors stop after the move, leave the DIMMs in slots 3 and 4 and do not replace them, as they are likely marginal DIMMs.

Note 1: The DIMMs are not defective; they are built at the edge of certain tolerance levels, so when placed at the far end of the memory bus (slots 1 and 2) they can exhibit errors. When moved from slots 1 and 2 to other slots, the DIMMs will function properly.

Note 2: If customers later expand the memory configuration by adding another pair of DIMMs to slots 1 and 2, no further issues will be experienced, even if marginal DIMMs are placed in slots 1 and 2. Adding another pair of DIMMs changes the signal characteristics of the whole memory bus enough that even marginal DIMMs produce no further errors, regardless of which slots they occupy.
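
The move-and-retest procedure in the Resolution can be summarized as a small decision function. This is an illustrative Python sketch only, not part of any Sun tool; the names Action and triage are hypothetical, and the inputs are assumed to come from manual observation of the logs or PcCheck results.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    NO_ACTION = "no correctable ECC errors reported; no action needed"
    MOVE_AND_RETEST = "move the DIMM pair to slots 3 and 4 and retest"
    LEAVE_IN_SLOTS_3_4 = "leave the pair in slots 3 and 4; do not replace"
    REPLACE_PAIR = "replace the DIMM pair using normal procedures"


def triage(errors_in_slots_1_2: bool,
           errors_after_move: Optional[bool] = None) -> Action:
    """Return the next step for 1GB Micron DIMMs installed in slots 1 and 2.

    errors_after_move is None until the pair has been moved to
    slots 3 and 4 and retested.
    """
    if not errors_in_slots_1_2:
        return Action.NO_ACTION
    if errors_after_move is None:
        # Always retest in slots 3 and 4 before replacing anything.
        return Action.MOVE_AND_RETEST
    if errors_after_move:
        # Errors follow the DIMMs to the new slots: treat as a failing DIMM.
        return Action.REPLACE_PAIR
    # Errors stopped after the move: the DIMMs are likely only marginal.
    return Action.LEAVE_IN_SLOTS_3_4
```

For example, triage(True) returns Action.MOVE_AND_RETEST, and triage(True, False) returns Action.LEAVE_IN_SLOTS_3_4.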


Previously Published As
102448
Internal Comments


None.


Internal Contributor/submitter
mick.tabor@sun.com

Internal Eng Business Unit Group
KE Authors

Internal Eng Responsible Engineer
mick.tabor@sun.com

Internal Services Knowledge Engineer
sean.hassall@sun.com

Internal Kasp FAB Legacy ID
102448

Internal Sun Alert & FAB Admin Info
Critical Category:
Significant Change Date:
Avoidance: Service Procedure
Responsible Manager: null
Original Admin Info: null

Product_uuid
28c0502a-fd60-11d9-a8ca-080020a9ed93|Sun Fire X2100 Server
372415be-961d-11d9-9adf-080020a9ed93|Sun Ultra 20 Workstation

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.