I was called to check some strange behavior of two-node cluster and to see why the one of the nodes crashed unexpectedly.Â The two nodes were HP Integrity servers installed with HP-UX 11.31 Base OE (March 2009). Well the node did not crashed it was just restarted from the ServiceGuard with safety timer expire for some reason. System log was not up to date because /var directory was full at some point and the syslog stopped writing. Console log showed standard messages INIT occurred and safety timer expire. Analyzing the crashdumps revealed that communication with cmcld was not possible and thats why the server was rebooted probably because /var directory was full.
Anyway few days later customer called again and said that the node was restarted again, I expected to see the same reason but this time the reboot reason was "Reboot after panic: Fault when executing in kernel mode". The problem was not in the cluster this time and the reboot reason was talking about some problems in the the kernel.
What is crash anyway ? From HP documentation:
An abnormal system reboot is called a crash. There are many reasons that can cause a system to crash; hardware malfunctions, software panics or even power failures. The crash even type panic refers to crashes initiated by the HP-UX operating system (software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem calling directly the panic() kernel routine upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event as a result of trap interruption which could not be handled by the operating system for example when the kernel accesses a non-valid address.
I analyzed the crash dumps,Â reviewed all the advisories and release notes and was unable to figure out what is the cause of the crash. Finally Level 2 of the support of HPÂ confirmed that this is known issue with the ONCPlus bundle. ONC stands for Open Network Computing (priviously called NFS bundle in 11.23) and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement workaround until the fix is released next month. The workaround was to add -o readdir to the mount options of the NFS share in the fstab. Well it was obvious that the problem is with the NFS component of the ONCPlus bundle.
Few days later (not month) the new product (with fixed bugs) appeared online. It can be seen from the release notes the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.
Yes, the ONCPlus bundle was 11.31.06 and we had mounted NFS share with read/write size of 32768 bytes. Both workaround and the patch seemed to fix the problem and the crash never apeared again. Keep in mind that the installation of the new ONCPlus bundle needs restart and applying the workaround does not, BUT from the support adviced us to reboot the server just to make sure that the corruption is not loaded in the memory. So if you hit this bug consider applying the new bundle.
The latest ONCPlus bundle can be downloaded from HP Software Depot portal.
Just for reference the following stack trace is dumped on the consle when the server crashes:
bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8
Message buffer contents after system crash:
panic: Fault when executing in kernel mode Stack Trace: IP Function Name 0xe000000001f887e0Â bad_kern_reference+0xa0 0xe00000000076a3d0Â $cold_vfault+0x3b0 0xe000000000c45a10Â vm_hndlr+0x510 0xe000000001bd9780Â bubbledown+0x0 0xe000000000d00da1Â vx_iupdat_cluster+0xa1 0xe000000000d14830Â vx_async_iupdat+0x160 0xe000000000d4a530Â vx_iupdat_local+0x2c0 0xe000000000d8c020Â vx_iupdat+0xb0 0xe000000002134ed0Â vx_iflush_list+0x4d0 0xe000000000afa8c0Â vx_iflush+0x1d0 0xe000000000cf2710Â vx_worklist_thread+0x200 0xe000000000e65d70Â kthread_daemon_startup+0x90