Path failures on ESX4 with HP storage

Since we began upgrading our clusters to ESX4, we have been seeing strange “failed physical path” messages in our vmkernel logs. I don’t normally post unless I know the solution to a problem, but in this case I’ll make an exception. Our deployment has been delayed and plagued by the storage issues that I mentioned in an earlier post. Even though we have fixed our major problems, errors of the following type have persisted.

Our errors look like this:

vmkernel: 19:18:05:07.991 cpu6:4284)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410005101140) to NMP device "naa.6001438005de88b70000a00002250000" failed on physical path "vmhba0:C0:T0:L12" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
vmkernel: 19:18:05:07.991 cpu6:4284)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6001438005de88b70000a00002250000" state in doubt; requested fast path state update…
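
To separate a real problem from yet another failed path message, it can help to tally these entries per physical path. The following is a minimal Python sketch, not a supported tool. It assumes the message layout shown above, with straight quotes and `0x` prefixes as they appear in the raw vmkernel log, and it reads the H, D and P fields as host, device and plugin status, following VMware's documentation on SCSI sense codes. The function names are hypothetical.

```python
import re
from collections import Counter

# Pattern for the nmp_CompleteCommandForPath message shown above.
# Assumption: raw log lines use straight quotes and "0x" hex prefixes.
NMP_FAILURE = re.compile(
    r'Command (?P<opcode>0x[0-9a-fA-F]+) .*?'
    r'NMP device "(?P<device>[^"]+)" failed on physical path "(?P<path>[^"]+)" '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) P:(?P<plugin>0x[0-9a-fA-F]+)'
)


def parse_nmp_failure(line):
    """Return the fields of one failed-path message, or None if the line does not match."""
    match = NMP_FAILURE.search(line)
    return match.groupdict() if match else None


def count_failures_by_path(lines):
    """Count failed-path messages per (physical path, host status) pair."""
    counts = Counter()
    for line in lines:
        failure = parse_nmp_failure(line)
        if failure:
            counts[(failure["path"], failure["host"])] += 1
    return counts


SAMPLE = (
    'vmkernel: 19:18:05:07.991 cpu6:4284)NMP: nmp_CompleteCommandForPath: '
    'Command 0x2a (0x410005101140) to NMP device '
    '"naa.6001438005de88b70000a00002250000" failed on physical path '
    '"vmhba0:C0:T0:L12" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'
)

if __name__ == "__main__":
    for key, total in count_failures_by_path([SAMPLE]).items():
        print(key, total)
```

Feeding the function the lines of a vmkernel log file gives a quick view of whether the failures cluster on one HBA, target or LUN, or are spread across the board.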

After several cases with VMware and HP technical support, we are no closer to resolving the issues.  VMware support, for its part, has done a good job of telling us what ESX is reacting to and seeing.  HP support, on the other hand, has been circling around the problem but has made little progress in diagnosing the issue.  We have had an ongoing case for several months and our primary EVA resource at HP has continually examined the EVAperf information and SSSU output that we have sent to HP for analysis.  Those have turned up nothing, and yet the messages continue from VMware.

The errors in the log make sense to me: we are losing a path to a data disk (sometimes even a boot-from-SAN ESX OS disk!). Why HP cannot see anything in our Brocade switches or within the EVA is beyond me. Our ESX hosts, whether blade or rack-mounted hardware, are seeing the problems across the board. The one cluster we waited to upgrade never saw the issues in ESX3.5, but sees them now in ESX4. Perhaps ESX4 is simply too sensitive in monitoring its storage, but I suspect it’s something else. The messages don’t seem to affect operation on the hosts, but they certainly make investigating problems difficult when trying to determine what is a real problem versus just another failed path message. Anyone else seeing this?


4 Responses to “Path failures on ESX4 with HP storage”

  1. Not sure if this is related, but it’s worth a shot.

    EqualLogic has a knowledge base article dedicated to an MPIO issue and recently updated it with a fix, but the fix is only for ESX 4, not ESXi 4.

    ESX 4, ESXi 4 (including Update1) having issues with dropped “MPIO connections” using native software initiator

    This issue is seen using ESX/ESXi v4.0 and Update 1.

    VMware has a reference number for this issue: PR484220. If you contact VMware support and give them that number, they may be able to provide you with more detailed info on the issue.

    UPDATED INFO: On 4/1/2010 VMware released a patch that addresses this issue. Please see the link below for more details. Contact VMware if there are questions on installation or the issues resolved.

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1019492

    Problem description:

    Dell is prevented by NDA from discussing the exact details of the issue.
    One thing to note is that the MPIO path failures that result from the issue only occur during low IO periods. An alternative MPIO path configuration can be implemented to avoid the issue entirely.

    In our guide (TR1049 Configure vSphere SW iSCSI with PS Series SAN.pdf), the methodology shows connecting all physical GbE vmnic interfaces and the VMkernel ports into ONE vSwitch. This configuration will result in occasional path failures on various paths during periods of little or no IO. Multiple vmnics in the same vSwitch on the same subnet trigger the issue. When the document was written, the issue in VMware was unknown. However, the MPIO path connection does come back later. With that configuration the errors could still be seen, but the multiple paths would prevent the datastore from going offline.

    To lessen the likelihood of seeing these errors, do not create one vSwitch with multiple VMkernel ports and all iSCSI GbE NICs. Instead, create one vSwitch with one VMkernel port and ONE GbE NIC attached. Jumbo frames can still be enabled on the vSwitch and the VMkernel port as described in the technical report. The rest of the configuration steps are the same, and you get all the same features. The only downsides are more objects to manage on each ESX node and a small memory overhead on the ESX server from the additional vSwitches. Multiple vSwitches can be created in this manner to provide redundancy and increased performance, up to the maximum of 8 paths per LUN (that is, 8 vSwitches, each with a single VMkernel port).
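
The one-vSwitch-per-VMkernel-port layout described above can be sketched as a small Python helper that prints the ESX 4 service console commands for each NIC. This is only an illustration under stated assumptions: the vSwitch names, port group names, NIC names and IP addresses are hypothetical, the helper itself is not part of any VMware or Dell tooling, and binding the resulting VMkernel ports to the software iSCSI adapter is left to the technical report. Review the generated commands before running any of them on a host.

```python
def iscsi_vswitch_commands(nics, ips, netmask, first_vswitch=2, mtu=None):
    """Build esxcfg command lines: one vSwitch, one VMkernel port, one NIC each.

    nics and ips are parallel lists (hypothetical example values below).
    mtu is optional; pass 9000 to enable jumbo frames on both the vSwitch
    and the VMkernel port.
    """
    if len(nics) != len(ips):
        raise ValueError("need exactly one IP address per NIC")
    if len(nics) > 8:
        raise ValueError("the text above gives a maximum of 8 paths per LUN")

    commands = []
    for index, (nic, ip) in enumerate(zip(nics, ips)):
        vswitch = f"vSwitch{first_vswitch + index}"   # assumed naming scheme
        portgroup = f"iSCSI{index + 1}"               # assumed naming scheme
        commands.append(f"esxcfg-vswitch -a {vswitch}")              # create vSwitch
        if mtu:
            commands.append(f"esxcfg-vswitch -m {mtu} {vswitch}")    # vSwitch MTU
        commands.append(f"esxcfg-vswitch -L {nic} {vswitch}")        # single uplink
        commands.append(f"esxcfg-vswitch -A {portgroup} {vswitch}")  # port group
        vmknic = f"esxcfg-vmknic -a -i {ip} -n {netmask}"
        if mtu:
            vmknic += f" -m {mtu}"
        commands.append(f"{vmknic} {portgroup}")                     # VMkernel port
    return commands


if __name__ == "__main__":
    for command in iscsi_vswitch_commands(
        ["vmnic2", "vmnic3"], ["10.0.0.11", "10.0.0.12"], "255.255.255.0", mtu=9000
    ):
        print(command)
```

Generating the commands rather than typing them keeps the per-NIC vSwitches consistent, which matters given that this layout means more objects to manage on each node.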

    Symptoms of Issue:
    The symptom is dropped MPIO path connections.

    In the array Events, errors will be seen like the example below:

    42673:37511:NTCGEQL01:MgmtExec: 4-Dec-2009 11:18:34.110648:targetAttr.cc:939:INFO:7.2.15:iSCSI session to target '172.30.200.14:3260, iqn.2001-05.com.equallogic:0-8a0906-954a68004-2650000523b4ab2f-esx-vmfs2' from initiator '172.30.200.160:60985, iqn.1998-01.com.vmware:NTCGESX2-59b5a6f1' was closed.
    iSCSI initiator connection failure.
    Reset received on the connection.

    42674:37512:NTCGEQL01:MgmtExec: 4-Dec-2009 11:19:26.740649:targetAttr.cc:736:INFO:7.2.14:iSCSI login to target '172.30.200.14:3260, iqn.2001-05.com.equallogic:0-8a0906-954a68004-2650000523b4ab2f-esx-vmfs2' from initiator '172.30.200.160:52609, iqn.1998-01.com.vmware:NTCGESX2-59b5a6f1' successful, using Jumbo Frame length.

    The VMware initiator resets the connection, causing it to be closed, then logs back in about a minute later.

    On the ESX server, errors such as these can be seen in /var/log/vmkiscsid.log:
    2009-10-14-14:35:25: iscsid: Nop-out timedout after 10 seconds on connection 7:0 state (3). Dropping session.
    2009-10-14-14:35:27: iscsid: Nop-out timedout after 10 seconds on connection 1:0 state (3). Dropping session.
    2009-10-14-14:35:28: iscsid: Nop-out timedout after 10 seconds on connection 11:0 state (3). Dropping session.
    2009-10-14-14:35:28: iscsid: Nop-out timedout after 10 seconds on connection 2:0 state (3). Dropping session.
    2009-10-14-14:35:28: iscsid: Nop-out timedout after 10 seconds on connection 12:0 state (3). Dropping session.
    2009-10-14-14:35:30: iscsid: Nop-out timedout after 10 seconds on connection 3:0 state (3). Dropping session.
    2009-10-14-14:35:33: iscsid: Nop-out timedout after 10 seconds on connection 4:0 state (3). Dropping session.

    Then the paths will recover:

    2009-10-14-14:37:18: iscsid: connection7:0 is operational after recovery (8 attempts)
    2009-10-14-14:37:18: iscsid: connection5:0 is operational after recovery (8 attempts)
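
To check whether every dropped session actually recovers, the two kinds of vmkiscsid.log entries above can be paired up per connection. The following is a minimal Python sketch, assuming the message wording matches the excerpts above exactly. The function name and the idea of flagging connections with no recovery entry are illustrative, not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns copied from the vmkiscsid.log excerpts above (wording assumed stable).
DROP = re.compile(r"Nop-out timedout after (\d+) seconds on connection (\d+:\d+)")
RECOVER = re.compile(r"connection(\d+:\d+) is operational after recovery \((\d+) attempts\)")


def summarize_iscsi_sessions(lines):
    """Return (drops, recoveries, unrecovered) for the given log lines.

    drops and recoveries are Counters keyed by connection ID such as "7:0".
    unrecovered lists connections that dropped but never logged a recovery.
    """
    drops, recoveries = Counter(), Counter()
    for line in lines:
        dropped = DROP.search(line)
        if dropped:
            drops[dropped.group(2)] += 1
            continue
        recovered = RECOVER.search(line)
        if recovered:
            recoveries[recovered.group(1)] += 1
    unrecovered = sorted(conn for conn in drops if conn not in recoveries)
    return drops, recoveries, unrecovered


SAMPLE_LOG = [
    "2009-10-14-14:35:25: iscsid: Nop-out timedout after 10 seconds on connection 7:0 state (3). Dropping session.",
    "2009-10-14-14:35:27: iscsid: Nop-out timedout after 10 seconds on connection 1:0 state (3). Dropping session.",
    "2009-10-14-14:37:18: iscsid: connection7:0 is operational after recovery (8 attempts)",
]

if __name__ == "__main__":
    drops, recoveries, unrecovered = summarize_iscsi_sessions(SAMPLE_LOG)
    print("drops:", dict(drops))
    print("recoveries:", dict(recoveries))
    print("no recovery logged for:", unrecovered)
```

A connection that shows up as unrecovered in a short excerpt may simply have recovered later, so this is best run over the whole log.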

    When more than one GbE interface is on the same subnet in the same vSwitch, you can see connection drops, especially during low IO times.
    However, MPIO is supported by Dell and VMware with ESX v4.x.
    Customers hitting this issue ultimately have to wait for VMware to release a patch, and we do not have information on when a patch will be released. The issue is not limited to iSCSI, but iSCSI is where you see it very clearly. However, there are two methods we’ve seen work.
    The one being recommended by VMware is to have one vSwitch for every VMkernel port, as described above. Using this method lessens the chance that the issue occurs; it can still occur, but with much less impact. The vSwitch must have only ONE GbE interface assigned to it, with no standby NIC or unused adapters.

    NOTE:
    If the ESX MPIO configuration was done using the initial technical report (TR1049), with multiple vmnics per vSwitch and multiple VMkernel ports, and you want to change to the configuration now recommended by VMware, the ESX server node will need to be placed into maintenance mode, ALL iSCSI-related components (all VMkernel ports, vSwitches, etc.) removed, and the node rebooted. The iSCSI portion will then need to be reconfigured as described above.

    April 9, 2010 at 10:20 pm
    • Philip #

      Brian, thanks for the information… We are looking this over as it may apply in our environment, too.

      April 27, 2010 at 11:47 am
  2. Steve #

    We are also having this problem with ESXi 4.1, Brocade FC switches, and HP storage.

    Any solutions for this?

    October 20, 2011 at 11:10 am
