Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure
Solution 1285535.1: Sun4v CMT Systems May Experience Storms of Events and May Stop Logging Error Telemetry for Errored Events

Applies to:
Sun SPARC Enterprise T5440 Server - Version: Not Applicable
Sun Netra T6340 Server Module - Version: Not Applicable and later [Release: N/A and later]
Sun Netra T5440 Server - Version: Not Applicable and later [Release: N/A and later]
Sun Blade T6320 Server Module - Version: Not Applicable and later [Release: N/A and later]
Sun SPARC Enterprise T5120 Server - Version: Not Applicable and later [Release: N/A and later]

Description
Sun4v CMT systems may experience the following issue when handling error events: after processing a stream of error events, the Service Processor may stop processing error telemetry and logging it to the host. Diagnosis and FRU isolation are impacted, along with the ability of Solaris to perform operations such as page retire.

Likelihood of Occurrence
This issue can occur on the following platforms:
Multi Socket CPU CMT Systems:

Notes:
1. No other Blade, Enterprise, or Netra systems are affected by this issue.
2. There is no specific set of conditions likely to trigger this issue, nor any method of predicting when or how frequently it may occur. The risk of seeing this issue is regarded as low, but the potential impact is high since it may occur without notice.

To determine the firmware version on the system, run the following command from the ILOM:
-> show HOST

Possible Symptoms
When fault management data is being dropped, diagnosis and FRU isolation are impacted as and when faults occur, along with the ability of Solaris to perform operations such as page retire. The main symptom is that events are not delivered: events that would otherwise be expected are missing from the ILOM logs, the Solaris logs, or both.

The primary issue occurs when FMD running on ILOM core dumps, which can result in event reports not being processed. Given the nature of the fault, it is the absence of events when they would otherwise be expected that highlights the issue.

The secondary issue occurs when FMD on ILOM cannot pass events to FMD on Solaris. This results in a backlog of reports on the SP, which can consume resources and lead to a potential loss of ILOM service. An additional potential side effect is that Solaris will not take action against an underlying event (for example, Memory Page Retirement) because it has no visibility of that event.

If this secondary issue causes ILOM to exhaust resources, the customer may see the following ILOM event:
Out of Memory: executing rebooting thread..... wait 600 secs for the userlevel to complete shutdown

Workaround or Resolution
To resolve this issue, upgrade system firmware to 7.3.0.c (or above), using the appropriate patch listed below:
Multi Socket CPU CMT Systems:

Note: Although the likelihood of experiencing this issue is low, upgrading to firmware 7.3.0.c (or later) is recommended as soon as your schedule allows.

Modification History
Date of Resolved Release: 20-Jan-2011

Internal Comments:
6981373: ILOM: fmd spawning lots of processes
6983799: L2 bank not stored correctly for DSC/DSU scrub errors

CRs 6983799 and 6981373 do not necessarily impact availability.

CR 6983799 addresses an issue whereby there is no storm protection for DSC/DSU events (H/W DRAM scrubber events). As a result, the HOST may bombard the SP with these events and overwhelm it, triggering other issues in the SP S/W stack that cause events to be dropped (see CR 6724341 as an example). CR 6724341 is being worked on; the code is in review and is planned to be addressed in the near future.

CR 6983799 can be detected by looking at the contents of an ILOM snapshot and reviewing the 'fmdump' files for DSC events. Depending on Solaris patches and platform type, it may also be possible to review DSC events on the HOST using the command 'fmdump'.

CR 6981373 addresses an issue whereby, under certain conditions, fmd on the SP may core dump. CR 6981373 also triggers another CR, 7006461, which can cause fmd on the SP to consume ereports without logging them to the HOST or to the SP, thus severely impacting field diagnosis (CR 7006461 is still outstanding).

CR 6981373 can only be detected by looking for a core file in /coredump on the SP, which can be checked in the field by looking at the contents of an ILOM snapshot. CR 7006461 cannot be detected at all.

For more in-depth detail on these issues, please review the CRs referenced above.
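
As a rough illustration of the two snapshot checks described above, the following Python sketch walks an unpacked ILOM snapshot and reports (a) any files found under a directory named coredump and (b) how many lines in files whose names contain "fmdump" mention DSC. This is only a sketch under stated assumptions: the function names are hypothetical, the snapshot is assumed to be already unpacked to a local directory, and both the directory layout and the way DSC events appear in the fmdump files are assumptions that should be checked against a real snapshot.

```python
import re
from pathlib import Path

# Assumption: DSC events appear in the fmdump files as the standalone token
# "DSC" (any case), for example as the last component of an ereport class.
DSC_PATTERN = re.compile(r"\bDSC\b", re.IGNORECASE)


def find_core_files(snapshot_root):
    """Return files located under any directory named 'coredump'.

    Assumption: the snapshot preserves the SP's /coredump directory
    somewhere beneath snapshot_root.
    """
    root = Path(snapshot_root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and "coredump" in p.relative_to(root).parts[:-1]
    )


def count_dsc_events(snapshot_root):
    """Map each fmdump-like file to its number of lines mentioning DSC.

    Assumption: the relevant files have 'fmdump' somewhere in their names.
    Files with no matching lines are left out of the result.
    """
    root = Path(snapshot_root)
    counts = {}
    for p in sorted(root.rglob("*fmdump*")):
        if not p.is_file():
            continue
        text = p.read_text(errors="replace")
        hits = sum(1 for line in text.splitlines() if DSC_PATTERN.search(line))
        if hits:
            counts[p] = hits
    return counts
```

A core file reported by find_core_files would be consistent with CR 6981373, and a large DSC line count would be consistent with CR 6983799. An empty result from either function does not rule out CR 7006461, which the comments above describe as undetectable.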

Internal Contributor/Submitter: Chuck.Forgues@oracle.com, Justin.hatch@oracle.com
Internal Eng Responsible Engineer: Matt.Finch@oracle.com
Internal Services Knowledge Analyst: david.mariotto@oracle.com
Internal Eng Business Unit Group: Systems Group-SVS (SPARC Volume Systems, Horizontal Systems (includes T2000/Ontario))
Internal Escalation ID: 2-8213384

References
<SUNPATCH:145673-02>
<SUNPATCH:145674-02>
<SUNPATCH:145675-02>
<SUNPATCH:145676-02>
<SUNPATCH:145677-02>
<SUNPATCH:145678-02>
<SUNPATCH:145679-02>
<SUNPATCH:145680-02>
<SUNBUG:6981373>
<SUNBUG:6983799>

Attachments
This solution has no attachment