'Twas the night of New Year's, and all through the house, not a creature was stirring, not even a mouse. The little one had passed out, and we'd put her to bed. We had all celebrated with Carson, Dick Clark and the rest. Mom in her kerchief and I in my cap had just settled in for a long winter's nap. When all of a sudden, I awoke to a clatter; it must be my text paging, I wonder what is the matter? I sprang from my bed and stumbled to the Mac; oh man, my VMware at work has gone all to crap.
That's how my 2009 started… about 13 hours later, I finally left work and resumed my long-interrupted nap. We had what seems to have been a storage meltdown behind our VMware farm yesterday. Our file sharing cluster was also affected, so our few employees who were working on New Year's Day, well, weren't working at all.

The short version of the story goes like this. Our scheduled backup process, using EMC NetWorker, kicked off VCB backups on the ESX 3.5 hosts around 1:30 am. By 2:00 am, the process was trying to create snapshots on VMs, and this triggered some sort of meltdown due to SCSI reservations (we pinned the problem on SCSI reservations after analysis with VMware). It turns out the HP Insight Agents loaded on our VMware hosts were causing the SCSI reservation issues. The agents were polling the disks at a consistent interval, and we had not upgraded them to the latest revision, which is supported with ESX 3.5 – so not VMware's fault – they have a great KB article about this issue (see KB 1005009). As an immediate resolution, one of my co-workers removed the HP agents from our hosts, and we worked our way through rebooting the entire farm, one host at a time, to clear the SCSI reservations. I'm crossing my fingers that the VCB backups work when they kick off in an hour. Had this been the only issue, we would have been fine.
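If you want to check whether your own hosts are seeing the same kind of reservation storm, the vmkernel log on the service console is the place to look. Something along these lines would tally conflicts per hour – this is only a rough sketch, and the log path, message wording, and hourly bucketing are my own assumptions that may differ by ESX build:

```python
#!/usr/bin/env python3
# Rough sketch: tally "reservation conflict" messages per hour from a copy
# of an ESX host's /var/log/vmkernel, to see whether the storm dies down
# after the HP agents are removed. Log path and message wording are
# assumptions and may differ by ESX build.
import re

LOG = "vmkernel.log"   # copy of /var/log/vmkernel pulled off the host
pattern = re.compile(r"reservation conflict", re.IGNORECASE)

counts = {}
with open(LOG, errors="replace") as fh:
    for line in fh:
        if pattern.search(line):
            hour = line[:9]          # syslog prefix "Jan  1 02" = month, day, hour
            counts[hour] = counts.get(hour, 0) + 1

for hour, n in sorted(counts.items()):
    print(f"{hour}:00  {n} conflicts")
```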
Unfortunately, at around 4:30 a.m., unbeknownst to me, our cluster began having trouble, too. And this is where our detective skills have come up short. We have been sleuthing to find the cause of some weirdness in both our file sharing and Exchange clusters for several weeks now. The file share cluster is, dumbly enough, critical in our environment. Without it, our users' home directories are inaccessible and, since those home directories are defined in Active Directory, it seemingly hoses up our employees' workstations. Things that should otherwise be speedy, like opening a program (any program) or browsing to your local hard drive, become unbearably slow. Even running applications sometimes lock up as they attempt to access some unknown part of Windows during normal operation. It brings our entire business operation to a crawl, and that's unacceptable. (BTW, if all this sounds familiar, please leave a comment or send me an email with suggestions.)
So, what actually happened to our Windows file sharing cluster? We have an issue where the network utilization on the file share cluster drops to nothing, but the cluster nodes still respond to ping and other non-storage-related network services – though, as we found out later in the process, not to anything that needs I/O in order to respond. In repeated network sniffs, we saw that traffic would come into the cluster and get acknowledged, but the node would not start sending data. The gap between the request and the data could be as long as 20 or 30 seconds, and that was consistent with our 'outage' periods. So, I decided to fail over the cluster shares from the active node to the node that had been 'solving' the problems in the past. When I attempted to fail over the shares, they locked and never became accessible again. After waiting almost an hour, I rebooted one node to try to clear the locks and let the other node take control. That never happened either. A reboot of the second node only caused it to stall during boot and never present a login screen. Rebooting the other node: same result. And then came a lengthy phone call with HP support after I drove into the office.
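For anyone trying to put numbers on that kind of stall rather than eyeballing it in the sniffer, here's a rough sketch of how you might measure it from a saved capture. This is not the tool we used; the cluster IP, the SMB port, the capture file name, and the 20-second threshold are all placeholders:

```python
#!/usr/bin/env python3
# Rough sketch: for SMB traffic to the cluster, measure how long a node
# sits on a request before sending any data back. Not the tool we actually
# used; the cluster IP, port, capture file name, and threshold are placeholders.
from scapy.all import rdpcap, IP, TCP

CLUSTER_IP = "10.0.0.50"   # hypothetical cluster virtual IP
SMB_PORT = 445
THRESHOLD = 20.0           # seconds of silence that matched our 'outage' windows

pending = {}               # (client ip, client port) -> time of last unanswered request

for pkt in rdpcap("cluster.pcap"):
    if IP not in pkt or TCP not in pkt or not bytes(pkt[TCP].payload):
        continue                                   # skip non-TCP packets and bare ACKs
    ip, tcp = pkt[IP], pkt[TCP]
    if ip.dst == CLUSTER_IP and tcp.dport == SMB_PORT:
        pending[(ip.src, tcp.sport)] = pkt.time    # client request with payload
    elif ip.src == CLUSTER_IP and tcp.sport == SMB_PORT:
        start = pending.pop((ip.dst, tcp.dport), None)
        if start is not None and pkt.time - start >= THRESHOLD:
            print(f"{ip.dst}:{tcp.dport} waited {pkt.time - start:.1f}s for data")
```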
The short version of this is that we are running Windows Server 2003 with SP2. Apparently, therein lie our problems with a) clustering and b) storport.sys. The StorPort driver issues are pretty well documented, and that fix, in combination with several other hotfixes, is what HP recommended to us. The hotfixes were released outside of Microsoft's normal patch schedule because of the large number of customers having issues similar to this. HP's recommendation was to install the list of suggested post-SP2 hotfixes (Microsoft KB 935640). My co-workers successfully completed that on the file share cluster this evening without incident. (Hallelujah!) I got the call with the all-clear around 10:15 pm.
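If you're in the same boat and want a quick way to see which of those hotfixes a node is still missing, something like the sketch below would do it. It's just an idea, not HP's procedure, and you would have to fill in the individual KB numbers from the KB 935640 article yourself:

```python
#!/usr/bin/env python3
# Rough sketch: compare the hotfixes installed on a cluster node against the
# post-SP2 list HP pointed us at (Microsoft KB 935640). REQUIRED is a
# placeholder set -- fill it in from the KB article; this is not HP's procedure.
import subprocess

REQUIRED = {"KB000000", "KB000001"}   # placeholders, not real hotfix numbers

out = subprocess.run(
    ["wmic", "qfe", "get", "HotFixID"],
    capture_output=True, text=True, check=True,
).stdout

installed = {line.strip() for line in out.splitlines() if line.strip().startswith("KB")}
missing = REQUIRED - installed

if missing:
    print("Missing hotfixes:")
    for kb in sorted(missing):
        print(" ", kb)
else:
    print("All listed hotfixes are present.")
```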
The file sharing problem has been plaguing us for several weeks, and we have not been able to deduce the full cause. We have had theories, but as soon as we seem to have it figured out or think we know how it will behave, we're proven wrong. At least until yesterday, we hope. The next week will tell for sure.
I also mentioned issues on our Exchange cluster. We're not actually sure it's having issues. It may only appear to have issues at the same time the file share cluster is having them. We believe the file share issues above are causing lockups on client machines and their programs, so our current thinking is that the perceived Exchange problems are really just our employees' Outlook clients locking up and failing to connect, which looks to us like an Exchange issue. But, then again, we're not 100% sure. We still have some detective work to do here. Where's Sherlock Holmes when you need him?
As for VMware, we still have a punch list of a few additional things to do – installing newer HP agents on the ESX hosts, for one. There are some things we may want to customize here based on feedback from VMware; we may want to disable the storage monitoring agent on those hosts, but more research is required. For now, leaving the agents off altogether is our preferred fix. So, for the next hour, I just need to keep myself awake and hope that the VCB backups go well tonight.