Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Sun Alert Sure Solution

1204074.1 : Sun Storage S7000 Series "Snapshot Destroy" Activity May Induce Sustained Periods of Extremely Poor Performance
Applies to:

Sun Hardware > Storage - Disk > Arrays - 7000-Series (NAS)

Sun Storage 7110 Unified Storage System - Version: Not Applicable and later [Release: NA and later]
Sun Storage 7210 Unified Storage System - Version: Not Applicable and later [Release: NA and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later [Release: NA and later]
Sun Storage 7410 Unified Storage System - Version: Not Applicable and later [Release: NA and later]
Sun SPARC, Sun OS x86

Date of Workaround Release: 13-Sep-2010
Date of Resolved Release: 01-Nov-2010

Description

For Sun Storage appliances S7110/S7210/S7310/S7310C/S7410/S7410C with firmware releases 2009.Q2, 2009.Q3 or 2010.Q1, large amounts of ZFS filesystem activity triggered by "snapshot destroy" can result in severe performance degradation or an apparent hang of the appliance.

Likelihood of Occurrence

This issue can occur on the following platforms:

- Sun Storage S7110, S7210, S7310, S7310C, S7410 and S7410C appliances running firmware release 2009.Q2, 2009.Q3 or 2010.Q1

To determine the version of firmware on these systems, do one of the following.

From any UNIX client (able to do ssh):

  # ssh -l root <appliance IP addr> "script run('configuration version'); print('version: '+get('version'))"

Or, from the BUI, go to Maintenance -> System -> Current Installation and match the reported version against the corresponding 2009 or 2010 release (2009.Q2 <= 2009.04.10).
A snapshot destroy operation can be triggered in one of the following ways:

- As a result of regular snapshot expiry at the end of the specified snapshot retention period
- In response to user deletion or alteration of the snapshot policy (e.g. scheduled start time) via the BUI
- Following snapshot rollback
- As a result of replication, wherein the snapshot which is created prior to the start of data replication is then destroyed upon sync completion

The impact of snapshot destroy activity may go undetected if the appliance is able to complete deletion of configured snapshots quickly enough, when measured against client-side I/O timeouts. However, the extent of filesystem activity triggered by snapshot destroy depends upon the number of data blocks which must be deleted, taken in conjunction with other appliance workloads, and therefore upon the following factors:

- The number of projects/shares/LUNs which have the snapshot feature enabled
- The number of distinct snapshots configured against each project/share/LUN
- The number of data blocks which have changed in the time between snapshot creation and deletion
- Whether snapshot destroy occurs during a time of high/peak appliance I/O load
- Whether many/all snapshots have been configured with the same start time (and therefore the same deletion time)
- Where iSCSI LUNs are in use, the issue is exacerbated when using small block sizes (e.g. 512 bytes, 1 KB)

Possible Symptoms

Symptoms resulting from this scenario typically include much higher I/O latency seen by attached clients, possibly leading to I/O retries, timeouts and lost connectivity. These symptoms typically occur at fixed or regular times which correlate with the snapshot destroy schedule configured on the appliance. In extreme cases, appliance I/O response may ultimately appear to be hung when viewed from a client perspective. In addition, the appliance BUI may appear hung during the snapshot destroy process. Such persistent symptoms will not be cleared by a reboot, although normal performance levels will return once snapshot destroy has completed.

Note: Oracle support will be able to confirm the underlying cause by directly observing the relevant ZFS thread states, using dtrace(1M) from the appliance shell.
Workaround or Resolution

As a temporary workaround for any given project/share/LUN, increasing the snapshot retention policy (measured in days) will delay the point at which snapshot destroy next occurs, provided there is sufficient space available on the appliance. Following consultation with Oracle support, this may provide additional diagnosis/planning time if this issue is suspected as the root cause.

Impact may also be reduced by spreading (staggering) the start times of configured snapshots, so that, for example, they do not all begin at 01:00 or 09:00; see the sketch at the end of this section.

Customers who either already have, or who will have, a dependency on ZFS snapshot usage are strongly advised to upgrade to firmware release 2010.Q1.1.0 (or later). This firmware release improves the performance of the snapshot destroy process over previous releases, and will reduce (but not altogether remove or resolve) the performance impact resulting from large amounts of snapshot destroy activity.

Contract customers who have either recently enabled snapshots, or who have increased the overall degree of snapshot usage on the appliance and are now seeing severe performance degradation, are advised to raise a new Service Request.
This issue is addressed in the following release:

Sun Storage 7000 Software Updates are available for download at:
http://wikis.sun.com/display/FishWorks/Software+Updates

Patches

Firmware 2010.Q1.1.0 (already available) resolves the following contributing issue:

6949730 spurious arc_free() can significantly exacerbate 6948890

Firmware 2010.Q3 will resolve the following contributing problems:

6948890 snapshot deletion can induce pathologically long spa_sync() times
6944388 dsl_dataset_snapshot_reserve_space() causes dp_write_limit=max

Responsible Engineer: frederic.payet@oracle.com

Community: Sun NAS - Storage-Disk

Please send technical questions to the following email: sunalert-tech-questions@sun.com, and copy the Responsible Engineer.

Modification History

Date of Workaround Release: 13-Sep-2010
Date of Resolved Release: 01-Nov-2010 - updated for firmware release

References

SUNBUG 6948890
SUNBUG 6944388

Attachments

This solution has no attachment