Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-77-1204074.1
Update Date:2010-11-01
Keywords:

Solution Type  Sun Alert Sure

Solution  1204074.1 :   Sun Storage S7000 Series "Snapshot Destroy" Activity May Induce Sustained Periods of Extremely Poor Performance  


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
  • Sun Storage 7310 Unified Storage System
Related Categories
  • GCS>Sun Microsystems>Sun Alert>Criteria Category>Availability
  • GCS>Sun Microsystems>Sun Alert>Release Phase>Resolved




In this Document
  Description
  Likelihood of Occurrence
  Possible Symptoms
  Workaround or Resolution
  Patches
  Modification History
  References


Applies to:

Sun Hardware > Storage – Disk > Arrays – 7000-Series (NAS)
Sun Storage 7110 Unified Storage System - Version: Not Applicable and later    [Release: NA and later]
Sun Storage 7210 Unified Storage System - Version: Not Applicable and later    [Release: NA and later]
Sun Storage 7310 Unified Storage System - Version: Not Applicable and later    [Release: NA and later]
Sun Storage 7410 Unified Storage System - Version: Not Applicable and later    [Release: NA and later]
Sun SPARC Sun OS
x86
___________________________________

Date of Workaround Release: 13-Sep-2010

Date of Resolved Release: 01-Nov-2010
___________________________________




Description

For Sun Storage appliances S7110/S7210/S7310/S7310C/S7410/S7410C with firmware releases 2009.Q2, 2009.Q3 or 2010.Q1, large amounts of ZFS filesystem activity triggered by "snapshot destroy" can result in severe performance degradation or an apparent hang of the appliance.

Likelihood of Occurrence

This issue can occur on the following platforms:
  • Sun Storage S7110/S7210/S7310/S7310C/S7410/S7410C
running firmware releases 2009.Q2, 2009.Q3 or 2010.Q1.

To determine the version of firmware on these systems, do the following:

From any UNIX client (able to do ssh):

# ssh -l root <appliance IP addr> "script run('configuration version'); print('version: '+get('version'))"
version: 2010.02.09.2.1,1-1.18


Or from the BUI:

Maintenance -> System -> Current Installation

Then match the reported version against the corresponding 2009 or 2010 release:
2009.Q2 <= 2009.04.10
2009.Q3 <= 2009.09.01
2010.Q1 <= 2010.02.09
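
The lookup above can be sketched in a few lines of Python. This is only an illustration, assuming the reported version string begins with the date shown in the table (as in the sample output above); the function name release_family is hypothetical and is not part of the appliance software.

```python
import re

# Date prefix -> release family, copied from the table above.
RELEASE_BY_PREFIX = {
    "2009.04.10": "2009.Q2",
    "2009.09.01": "2009.Q3",
    "2010.02.09": "2010.Q1",
}

def release_family(version_line):
    """Return the release family for a version string, or None if unknown."""
    # Assumption: the first YYYY.MM.DD group in the string is the date prefix.
    match = re.search(r"(\d{4}\.\d{2}\.\d{2})", version_line)
    if match is None:
        return None
    return RELEASE_BY_PREFIX.get(match.group(1))

print(release_family("version: 2010.02.09.2.1,1-1.18"))
```

The sketch identifies the release family only. It does not distinguish minor releases within a family, so it cannot by itself show whether 2010.Q1.1.0 or later is installed.
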

A snapshot destroy operation can be triggered in one of the following ways:

- As a result of regular snapshot expiry at the end of the specified snapshot retention period
- In response to user deletion or alteration of the snapshot policy (e.g. scheduled start time) via the BUI
- Following snapshot rollback
- As a result of replication, wherein the snapshot which is created prior to start of data replication is then destroyed upon sync completion

The impact of snapshot destroy activity may go undetected if the appliance is able to complete deletion of configured snapshots quickly enough, when measured against client-side I/O timeouts.  However, the extent of filesystem activity triggered by snapshot destroy depends upon the number of data blocks which must be deleted, taken in conjunction with other appliance workloads.  This in turn depends upon the following factors:

- The number of projects/shares/luns which have the snapshot feature enabled.
- The number of distinct snapshots configured against each project/share/lun.
- The number of data blocks which have changed in the time between snapshot creation and deletion.
- Whether snapshot destroy occurs during a time of high/peak appliance I/O load.
- Whether many/all snapshots have been configured with the same start time (and therefore the same deletion time).
- Where iSCSI LUNs are in use, the issue is exacerbated when using small block sizes (e.g. 512 bytes, 1KB)
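
To illustrate why small block sizes matter, the following rough arithmetic sketch counts how many blocks a given amount of changed data occupies at different block sizes. It is a simplification: it ignores compression, metadata and any blocks still shared with other snapshots. The 100 GiB figure and the 8 KB and 128 KB block sizes are hypothetical values chosen for comparison, and the helper name blocks_to_free is not an appliance function.

```python
def blocks_to_free(changed_bytes, block_size_bytes):
    """Rough number of blocks holding changed_bytes (ceiling division)."""
    return -(-changed_bytes // block_size_bytes)

# Hypothetical example: 100 GiB changed between snapshot creation and deletion.
changed = 100 * 1024**3

for size in (512, 1024, 8192, 131072):
    print(f"{size:>7} byte blocks: {blocks_to_free(changed, size):>12,}")
```

Under these assumptions, the same amount of changed data spans 256 times as many blocks at 512 bytes as at 128 KB, which gives a sense of how much more work a snapshot destroy may have to do.
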

Possible Symptoms

Symptoms resulting from this scenario typically include much higher I/O latency seen by attached clients, possibly leading to I/O retries, timeouts and lost connectivity.

These symptoms typically occur at fixed or regular times, which correlate with the snapshot destroy schedule configured on the appliance.

In extreme cases, appliance I/O response may ultimately appear to be hung when viewed from a client perspective.  In addition, the appliance BUI may appear hung during the snapshot destroy process.  Such persistent symptoms will not be cleared by a reboot, although normal performance levels will return once snapshot destroy has completed.

Note: Oracle support will be able to confirm the underlying cause by directly observing the relevant ZFS thread states, using dtrace(1M) from the appliance shell.

Workaround or Resolution

As a temporary workaround for any given project/share/LUN, increasing the snapshot retention policy (measured in days) will delay the point at which snapshot destroy next occurs, provided there is sufficient space available on the appliance.  Following consultation with Oracle support, this may allow additional time for diagnosis and planning if this issue is suspected as the root cause.

Impact may be reduced by spreading (staggering) the start times for configured snapshots (so for example they do not all begin at 01:00 or 09:00).
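
One way to plan staggered start times is sketched below in Python. It spreads a list of snapshot schedules evenly across a time window. The project names, the 01:00 starting point and the four-hour window are all hypothetical, and the function staggered_starts is an illustration only; the resulting times would still need to be entered on the appliance by hand, and whether the appliance accepts arbitrary minute values is not confirmed here.

```python
from datetime import datetime, timedelta

def staggered_starts(names, first_start="01:00", window_minutes=240):
    """Spread start times evenly across a window beginning at first_start."""
    base = datetime.strptime(first_start, "%H:%M")
    # Interval between consecutive schedules; guard against an empty list.
    step = window_minutes / max(len(names), 1)
    return {
        name: (base + timedelta(minutes=round(i * step))).strftime("%H:%M")
        for i, name in enumerate(names)
    }

# Hypothetical project names, for illustration only.
schedule = staggered_starts(["proj-a", "proj-b", "proj-c", "proj-d"])
for name, start in schedule.items():
    print(name, start)
```

With four schedules and a 240-minute window, the sketch assigns start times one hour apart, so that the corresponding snapshot expiries do not all fall at the same moment.
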

Customers who either already have or will have a dependency on ZFS snapshot usage are strongly advised to upgrade to firmware release 2010.Q1.1.0 (or later).  This firmware release provides performance benefits to the snapshot destroy process over previous releases, and will reduce (but not altogether remove or resolve) the performance impact resulting from large amounts of snapshot destroy activity.

Contract Customers who have either recently enabled snapshots, or who have increased the overall degree of snapshot usage on the appliance and are now seeing severe performance degradation, are advised to raise a new Service Request.

This issue is addressed in the following release:
  • Sun Storage 7000 firmware 2010.Q3

Sun Storage 7000 Software Updates are available for download at:
http://wikis.sun.com/display/FishWorks/Software+Updates


Patches

Firmware 2010.Q1.1.0 (already available) resolves the following contributing issue:

6949730  spurious arc_free() can significantly exacerbate 6948890

Firmware 2010.Q3 resolves the following contributing problems:

6948890  snapshot deletion can induce pathologically long spa_sync() times

6944388  dsl_dataset_snapshot_reserve_space() causes dp_write_limit=max

Responsible Engineer: frederic.payet@oracle.com
Community: Sun NAS - Storage-Disk

Please send technical questions to the following email:
sunalert-tech-questions@sun.com
and copy the Responsible Engineer

Modification History

Date of Workaround Release: 13-Sep-2010
Date of Resolved Release: 01-Nov-2010 - updated for firmware release

References

SUNBUG 6948890
SUNBUG 6944388

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.