Sun Storage 7410 recovery procedure for mismatched network device names.

Asset ID:	1-73-1022238.1
Update Date:	2010-06-09

Solution Type FAB (standard) Sure

Solution 1022238.1 : Sun Storage 7410 recovery procedure for mismatched network device names.

Impact

If the network interfaces in the two head nodes of a Sun Storage 7410 cluster have device names that do not match between the heads, cascading failures can occur in the cluster software. Relocating the cards to identical slot locations after their initial discovery may not resolve the mismatch. If the condition is left unresolved, the failure modes can include loss of network configuration ability (CR 6811589) and peer node reboots upon failback (due to the inability to import/export resources related to the mismatched network device names).

Contributing Factors

The network device name mismatch condition will occur if the following incorrect sequence is followed:

1. Power down both Head_A and Head_B.
2. Install a NIC in slot(i) on Head_A where slot(i) is a valid slot for the NIC.
   Example: 10Gb NIC installed into PCIe slot 3.
3. Boot Head_A (the newly installed devices get assigned names, e.g., the 10Gb NIC
   is assigned device names nxge0 and nxge1).
4. Power down Head_A (for example, because the administrator decides to use a
   different slot for the card).
5. Move the NIC from slot(i) to slot(j) where j is a different (but also valid)
   slot for the NIC. The names assigned to the NIC devices on Head_A will now
   change, e.g., they may now be nxge4 and nxge5.
6. Install the same NIC card model into slot(j) of Head_B.
7. Boot both heads.
8. At this point the device names of the network devices do not match between
   the two heads. Example: The newly discovered 10Gb NIC in Head_B is assigned
   device names nxge0 and nxge1, while the relocated card in Head_A retains the
   names nxge4 and nxge5.

The correct procedure for installing NICs into 7410 cluster nodes is:

1. Power down both Head_A and Head_B.
2. Install the NIC in a valid slot (must be the same slot on each head).
3. Boot both Head_A and Head_B (order isn't important).
4. Confirm new network devices show up on each head and that the device names
   match, e.g., nxge0 and nxge1 appear on both heads after installing a 10Gb
   NIC card in PCIe slot 2.
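
The confirmation in step 4 can be scripted once the device names seen on each head have been saved to a file. The following is a minimal sketch, not part of the appliance software: the function name compare_nic_names and the two input files (one device name per line, collected by whatever means is available on each head) are assumptions made for illustration.

```shell
# Hypothetical helper: compare two lists of network device names.
# $1 = file with Head_A device names, $2 = file with Head_B device names
# (one name per line, e.g. nxge0). Prints MATCH or MISMATCH plus the differences.
compare_nic_names() {
    a_sorted=$(mktemp) || return 2
    b_sorted=$(mktemp) || return 2

    # comm requires sorted input; -u also drops duplicate names.
    sort -u "$1" > "$a_sorted"
    sort -u "$2" > "$b_sorted"

    # -23 keeps names found only in the first list, -13 only in the second.
    only_a=$(comm -23 "$a_sorted" "$b_sorted")
    only_b=$(comm -13 "$a_sorted" "$b_sorted")
    rm -f "$a_sorted" "$b_sorted"

    if [ -z "$only_a" ] && [ -z "$only_b" ]; then
        echo "MATCH"
        return 0
    fi

    echo "MISMATCH"
    # The unquoted expansion inside echo joins the names onto one line.
    if [ -n "$only_a" ]; then echo "only on Head_A: $(echo $only_a)"; fi
    if [ -n "$only_b" ]; then echo "only on Head_B: $(echo $only_b)"; fi
    return 1
}
```

With the names from the incorrect sequence above (nxge4 and nxge5 on Head_A, nxge0 and nxge1 on Head_B), the function reports MISMATCH and lists the names found on only one head.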

Symptoms

Cluster nodes have different device names for the corresponding NIC ports across heads. Example: A single 2-port 10Gb NIC is installed in PCIe slot 2 on each head. On one head the devices are named nxge0 and nxge1, while on the other head they are named nxge4 and nxge5. This results from the incorrect sequence of installing, booting, and relocating cards described above. If the mismatch is allowed to persist, cascading symptoms appear in the cluster software. These include BUI exceptions in the network configuration screen, which block the ability to make network changes (CR 6811589), and automatic reboot of either the OWNER or STRIPPED head during failback (due to the cluster's inability to import/export resources).

Root Cause

The cluster software requires that hardware be identical between both head nodes. This requirement extends to the device names, which are assigned at discovery time. An incorrect installation sequence leads to mismatched device names across heads, which the cluster software cannot (currently) tolerate.
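
To see which names were assigned at discovery time, the instance numbers recorded in /etc/devices/path_to_inst can be listed per driver. The sketch below is illustrative only: it assumes the file uses the usual Solaris path_to_inst layout of "physical-path" instance "driver" on each line, and the function name list_driver_instances is hypothetical.

```shell
# Hypothetical helper: print the device names (driver + instance number)
# recorded for one driver in a path_to_inst-format file.
# $1 = driver name (e.g. nxge), $2 = path to the file
list_driver_instances() {
    # The last field is the quoted driver name, the field before it the
    # instance number. Sort numerically, then prefix the driver name.
    awk -v name="$1" '$NF == ("\"" name "\"") { print $(NF-1) }' "$2" |
        sort -n |
        sed "s/^/$1/"
}
```

Running list_driver_instances nxge /etc/devices/path_to_inst on each head and comparing the output shows whether the same instance numbers were assigned on both.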

Corrective Action

Workaround:

Remove the NICs that have the mismatched device names. The cluster can operate in this state (without the offending cards) until a suitable window for the factory reset recovery sequence can be scheduled.

Resolution:

A factory reset plus manual modification of the /etc/devices/path_to_inst file on each head is required to recover from this failure mode. The cluster will lose all configuration (as with any factory reset), but the storage pools (projects, shares, snapshots, etc.) can be preserved. The sequence is as follows:

Note: You must have a serial console connection to each head to proceed.

1. Unconfigure storage for all pools (pools can be imported after the factory reset).
2. Issue the 'factoryreset' command *simultaneously* from the maintenance system
   context in the CLI on both heads.
3. Allow both heads to reboot and come up to the "Press any key to begin configuration"
   stage.
4. Press a key on each console to start the aktty setup phase (this differs from
   the normal cluster install procedure but is required for the manual modification
   of the path_to_inst file in the next step).
5. Hit Esc+9 in both console windows, and enter 'bash' at the aktty# prompt to
   start a shell.
6. On each head, remove any lines from the file /etc/devices/path_to_inst that
   contain the device type of the mismatched devices. If the mismatched devices
   are nxge, delete any lines containing the string 'nxge'; if they are e1000g,
   delete any lines containing the string 'e1000g'. Then confirm that no entries
   matching the offending device type remain.
   For example:

      bash# cd /etc/devices
      bash# cp path_to_inst path_to_inst.old
      bash# grep -v -w e1000g path_to_inst.old > path_to_inst
      bash# grep -w e1000g /etc/devices/path_to_inst
      bash#

7. Reboot each head using 'reboot'. For example:

     bash# reboot

8. The heads should come back to the "Press any key to begin configuration" step.
   Begin normal cluster installation at this point, i.e., begin aktty configuration
   on one head while leaving the other at the "Press any key..." step. If pools
   were previously configured with projects and shares, they can be imported during
   the configure storage step.
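
The edit in step 6 can be wrapped in a small function so that the same commands are run for whichever driver is mismatched. This is a sketch under stated assumptions, not a supported tool: the function name purge_driver_entries is hypothetical, and it simply repeats the cp / grep -v -w / grep -w sequence shown in step 6 against a file given as an argument.

```shell
# Hypothetical helper: remove all lines for one driver from a
# path_to_inst-format file, keeping a backup copy as <file>.old.
# $1 = driver name (e.g. nxge or e1000g), $2 = path to the file
purge_driver_entries() {
    drv=$1
    file=$2
    status=0

    # Keep the original so the change can be reverted by hand.
    cp -p "$file" "$file.old" || return 1

    # grep -v exits 1 when every line matched (nothing left to print);
    # only an exit status above 1 indicates a real error.
    grep -v -w "$drv" "$file.old" > "$file" || status=$?
    if [ "$status" -gt 1 ]; then
        cp -p "$file.old" "$file"
        return 1
    fi

    # Confirm that no entries for the driver remain.
    if grep -w "$drv" "$file" > /dev/null; then
        echo "entries for $drv still present in $file" >&2
        return 1
    fi

    echo "removed $(( $(wc -l < "$file.old") - $(wc -l < "$file") )) line(s) for $drv"
}
```

From the shell started in step 5, this would be called on each head as purge_driver_entries nxge /etc/devices/path_to_inst, substituting the driver name that applies. Running it first against a copy of the file is a reasonable precaution.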

References:

BugID: 6811589
Escalation ID: 70997414

For information about FAB documents, their release processes, implementation strategies and billing information, go to the following URL:

http://tns.central/fab

For Sun Authorized Service Providers go to:

http://tns.central/SPE/fab/

In addition to the above you may email:

FAB-Manager@sun.com

Internal Contributor/submitter
Chris.Nelson@Sun.COM

Internal Eng Responsible Engineer
Will.Harper@Sun.COM
Responsible Manager: Renee.Bennett@Sun.COM

Internal Services Knowledge Engineer
Joe.Davis@Sun.COM

Internal Eng Business Unit Group
NWS (Storage)
