Archive

Posts Tagged ‘bug’

Many open files on HP-UX after RAC upgrade to 10.2.0.4 – racgimon file handle leak

July 23rd, 2010

Two months after patching a customer database to 10.2.0.4, I received a call telling me that the database was hanging. Usually this happens when the backup of the archive logs is missed and the database stops, but this time there was enough space available, so that was not the problem. I logged in to the first node and started looking around; weird things were happening, some commands were failing and others were hanging. Then I realized that this was not an ordinary case and started digging deeper. It turned out to be an Oracle bug on HP-UX, for which there is a patch and a workaround.

The customer was running HP-UX 11.23 (September 2006) with patch bundles from September 2008. The database was Oracle RAC Enterprise Edition 10.2.0.2 before the upgrade.

This problem had a very big impact: although the database was running in RAC, it was not accessible and there were a lot of locks. Rebooting the node or killing the processes did the job.

After some reading it turned out that this happens only on HP-UX, only after patching the database to 10.2.0.4, and only on the first node.

Here are some symptoms:


Executing sar -v shows the current size and maximum size of the system file table:

12:00:00   N/A   N/A 328/4200  0  1374/286108 0  41906/65536 0
12:02:00   N/A   N/A 330/4200  0  1376/286108 0  41944/65536 0
12:04:00   N/A   N/A 336/4200  0  1390/286108 0  41999/65536 0
12:06:00   N/A   N/A 331/4200  0  1377/286108 0  41983/65536 0
12:08:00   N/A   N/A 330/4200  0  1376/286108 0  41976/65536 0
12:10:00   N/A   N/A 330/4200  0  1377/286108 0  41935/65536 0
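The file-table column can be pulled out of this output with a small awk filter; a sketch assuming the HP-UX column layout shown above (field 8 is the "used/max" file-table pair):

```shell
# awk program that extracts the file-table "used/max" column from sar -v
# output and prints time, entries in use, and the table maximum
filter='NF >= 8 && $8 ~ /\// { split($8, a, "/"); print $1, a[1], a[2] }'

# live use on the box would be: sar -v 60 5 | awk "$filter"
# demonstrated here on one captured line:
echo "12:00:00   N/A   N/A 328/4200  0  1374/286108 0  41906/65536 0" | awk "$filter"
```

On the line above this prints the timestamp followed by 41906 and 65536, which makes the creeping usage easy to graph or alert on.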


With lsof the following open files are seen:

racgimon   3506 oracle   14u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   28u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   30u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   37u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat


The process holding the open files:

 oracle  3506     1  0  Nov  5  ?        18:16 /oracle/ora10g/bin/racgimon startd baandb


In the log "$ORACLE_HOME/log/<NodeName>/racg/imon_<InstanceName>.log" the following error can be seen every minute:

2009-12-02 12:12:35.454: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
2009-12-02 12:13:35.474: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13


When the file table gets full, weird things start to happen; in the syslog the following can be seen:

Nov  5 08:00:02 db1 vmunix: file: table is full
Nov  5 08:00:03 db1 vmunix: file: table...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:01:13 db1 vmunix: file: table is full
Nov  5 08:11:15 db1  above message repeats 34260 times


Also, in the alert log the following can be seen:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27544: Failed to map memory region for export
ORA-27300: OS system dependent operation:socket failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: sskgxpcre1


Solution:
The base bug is 6931689 (SS10204-HP-PARISC64-080216.080324 HEALTH CHECK FAILED TO CONNECT TO INSTANCE), but it is not public. It is fixed in CRS 10.2.0.4 Bundle Patch #2; the actual CRS bundle is PSU2, Patch# 8705958: TRACKING BUG FOR 10.2.0.4.2 PSU FOR CRS, which is around 41 MB.
Patch# 8705958 should be applied to all Oracle homes: although the bug is in the database, CRS should always be at the same or a higher version.

To apply this patch, the OPatch version must be at least 10.2.0.4.7, which can be downloaded as patch# 6880880. At the time of writing, the latest version was 10.2.0.4.9, about 34 MB. To install it, simply download it and unzip it under ORACLE_HOME.

I didn't go with the patch because I had read some scary stuff on OTN, so, thanks to Ivan Kartik, I put a dirty workaround in place instead. He proposed a very good script which checks whether there are more than 20000 open files and, if so, kills the racgimon process:

13:56:00   N/A   N/A 307/4200  0  1352/286108 0  44102/65536 0
13:58:00   N/A   N/A 307/4200  0  1353/286108 0  44119/65536 0
14:00:01   N/A   N/A 309/4200  0  1355/286108 0  44135/65536 0
14:02:01   N/A   N/A 307/4200  0  1353/286108 0  44153/65536 0
14:04:01   N/A   N/A 301/4200  0  1336/286108 0  2583/65536 0
14:06:01   N/A   N/A 306/4200  0  1347/286108 0  2610/65536 0
14:08:01   N/A   N/A 299/4200  0  1333/286108 0  2583/65536 0
14:10:01   N/A   N/A 300/4200  0  1335/286108 0  2571/65536 0
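The effect of the kill is visible in the sar output above: the file-table count drops from over 44000 back to about 2600. Below is a minimal sketch of such a periodic check (my reconstruction, not Ivan Kartik's original script; the process pattern and the threshold are assumptions), suitable for running from cron:

```shell
#!/bin/sh
# Reconstruction of the workaround idea (not the original script; the
# process pattern and threshold are assumptions): when racgimon holds
# too many open files, kill it before the system file table fills up.
THRESHOLD=20000

racgimon_pid() {
    # match the racgimon daemon; !/awk/ keeps awk itself out of the result
    ps -ef | awk '/racgimon startd/ && !/awk/ { print $2; exit }'
}

open_file_count() {
    pid=`racgimon_pid`
    if [ -z "$pid" ]; then
        echo 0
    else
        # lsof prints one header line plus one line per open descriptor
        lsof -p "$pid" 2>/dev/null | wc -l
    fi
}

count=`open_file_count`
if [ "$count" -gt "$THRESHOLD" ]; then
    kill -9 `racgimon_pid`
fi
```

As the post notes, killing the process does the job, so a blunt check like this is enough to keep the file table from overflowing between runs.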

The workaround fixed the problem. This article was written half a year ago; reading MOS now, they say that this bug is fixed in 10.2.0.5, which was released at the beginning of June.

Regards,
Sve

Categories: hp-ux, oracle

Oracle 11g R2 installer fails on HP-UX 11iv3

May 20th, 2010

Running the installer of any of the Oracle Database 11g Release 2 products (client, grid, database) on HP-UX 11iv3 (Itanium) fails with:
"An internal error occurred within cluster verification framework"

After starting ./runInstaller, the following error window pops up:
[Screenshot: runInstaller error dialog]

Also, in the installAction$DATE.log the following error can be seen:

SEVERE: [FATAL] An internal error occurred within cluster verification framework
Unable to get the current group.

This happens because patch PHCO_40381 is not installed. There is a list of required patches in section 2.3.4, Patch Requirement, of the Database Installation Guide for HP-UX.

The first one is:
PHCO_40381 11.31 Disk Owner Patch

The patch is available from ITRC. It is 205 KB and it fixes the behaviour of the diskowner command. The installation of the patch does not require a reboot of the server.
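Before launching the installer it is worth checking whether the patch is already on the system; a sketch using the standard SD-UX swlist query (on a non-HP-UX box this simply reports the patch as missing):

```shell
#!/bin/sh
# Query the SD-UX database for the patch before running runInstaller.
# swlist is the standard HP-UX software-list tool; anywhere it is absent
# the check falls through to "missing".
if swlist -l patch PHCO_40381 >/dev/null 2>&1; then
    status=installed
else
    status=missing
fi
echo "PHCO_40381: $status"
```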

After the installation of the patch, runInstaller starts successfully.

There is also a MOS Doc ID regarding this problem:
HP-UX: 11gR2 runInstaller Fails with “An internal error occurred within cluster verification framework” [ID 983713.1]

Regards,
Sve

Categories: hp-ux, oracle

Constant cimprovagt daemon crashing and filling the /var directory

November 23rd, 2009

We installed two nodes with HP-UX 11.31 March 2009 BOE in a ServiceGuard environment and started test applications in two packages.

Suddenly the /var directories on both nodes started to grow, and as a result the cluster was crashing and the syslog was never up to date. It turned out that one of the components (cimprovagt) of OnlineDiagnostics was crashing. I reviewed a few advisories and bugs about it, but none of them described the same behaviour.

Executing file on the core dump file shows the following:
core:      ELF-64 core file – IA64 from ‘cimprovagt’ – received SIGABRT

HP analyzed the core dump files and determined that the problem was already known and the fix was already implemented in the September release of DASProvider, which is now part of the DiagProdCollection bundle and can be found here:
http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=DiagProdCollection

After installing the bundle the daemon stopped crashing and the system is stable now.

Categories: hp-ux

HP-UX software bug hidden in cluster behaviour

October 8th, 2009

I was called to check some strange behaviour of a two-node cluster and to see why one of the nodes had crashed unexpectedly. The two nodes were HP Integrity servers installed with HP-UX 11.31 Base OE (March 2009). Actually, the node had not crashed; it had been restarted by ServiceGuard after a safety timer expiry. The system log was not up to date because the /var directory had filled up at some point and syslog had stopped writing. The console log showed the standard messages: INIT occurred and safety timer expired. Analyzing the crash dumps revealed that communication with cmcld had not been possible, which is why the server was rebooted, probably because /var was full.

Anyway, a few days later the customer called again and said that the node had been restarted again. I expected to see the same reason, but this time the reboot reason was "Reboot after panic: Fault when executing in kernel mode". The problem was not in the cluster this time; the reboot reason pointed to a problem in the kernel.

What is a crash anyway? From the HP documentation:
An abnormal system reboot is called a crash. There are many reasons that can cause a system to crash: hardware malfunctions, software panics or even power failures. The crash event type panic refers to crashes initiated by the HP-UX operating system (a software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem directly calling the panic() kernel routine upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event resulting from a trap interruption which could not be handled by the operating system, for example when the kernel accesses an invalid address.

I analyzed the crash dumps, reviewed all the advisories and release notes, and was unable to figure out the cause of the crash. Finally, Level 2 HP support confirmed that this is a known issue with the ONCplus bundle. ONC stands for Open Network Computing (previously called the NFS bundle in 11.23) and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement a workaround until the fix was released the next month. The workaround was to add -o readdir to the mount options of the NFS share in the fstab. It was obvious that the problem was in the NFS component of the ONCplus bundle.
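Applied to the fstab, the workaround looked roughly like this (server name and paths are illustrative, not the customer's):

```
# /etc/fstab entry with the readdir workaround added to the NFS options;
# readdir makes the client use plain READDIR calls instead of READDIRPLUS
nfssrv:/export/data  /data  nfs  rw,readdir  0  0
```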

A few days later (not a month) the new product (with the bug fixed) appeared online. The release notes list the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.

Yes, the ONCplus bundle was B.11.31.06 and we had the NFS share mounted with a read/write size of 32768 bytes. Both the workaround and the patch seemed to fix the problem and the crash never appeared again. Keep in mind that the installation of the new ONCplus bundle needs a restart and applying the workaround does not, BUT support advised us to reboot the server anyway, just to make sure that the corruption was not still loaded in memory. So if you hit this bug, consider applying the new bundle.

The latest ONCplus bundle can be downloaded from here:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=ONCplus

Just for reference, the following stack trace is dumped on the console when the server crashes:

bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8

Message buffer contents after system crash:

panic: Fault when executing in kernel mode
Stack Trace:
IP                  Function Name
0xe000000001f887e0  bad_kern_reference+0xa0
0xe00000000076a3d0  $cold_vfault+0x3b0
0xe000000000c45a10  vm_hndlr+0x510
0xe000000001bd9780  bubbledown+0x0
0xe000000000d00da1  vx_iupdat_cluster+0xa1
0xe000000000d14830  vx_async_iupdat+0x160
0xe000000000d4a530  vx_iupdat_local+0x2c0
0xe000000000d8c020  vx_iupdat+0xb0
0xe000000002134ed0  vx_iflush_list+0x4d0
0xe000000000afa8c0  vx_iflush+0x1d0
0xe000000000cf2710  vx_worklist_thread+0x200
0xe000000000e65d70  kthread_daemon_startup+0x90

Regards,
sve

Categories: hp-ux

Many racgmain(check) processes at HP-UX 11iv3

August 17th, 2009

I was called because some commands for controlling the cluster and Oracle were not working. This was a two-node cluster installed with Oracle 10.2.0.4 RAC on HP-UX 11.31 Data Center OE (December 2008), which had been working for a month already.

Arriving at the customer site, I noticed that there were a lot (around 500) of hanging racgmain(check) processes, which obviously were blocking some of the cluster commands. Errors can also be seen in the log $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2009-04-08 15:22:01.700: [  CRSEVT][90801] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed
out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [  CRSAPP][90801] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [  CRSEVT][90811] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
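Counting the stuck checkers is a quick way to confirm the symptom; a one-liner sketch (the grep pattern is an assumption based on the process name above):

```shell
# Count the hanging racgmain check processes; the [r] in the pattern keeps
# grep from matching its own command line in the ps output.
ps -ef | grep '[r]acgmain.*check' | wc -l
```

On an affected node this climbs into the hundreds; on a healthy one it stays near zero.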

There are a lot of bugs on Metalink, but no documents or suggestions on how to fix this.

Fortunately we found a solution:

1. Stop CRS on all nodes.

2. Make a copy of racgwrap, located under $ORACLE_HOME/bin and $CRS_HOME/bin, on all nodes.

3. Edit the file racgwrap and modify the last 3 lines from:

$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status

to:

exec $ORACLE_HOME/bin/racgmain "$@"

4. Restart CRS and make sure that all the resources start.
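The edit in step 3 can be scripted; below is a sketch of a helper of my own (not from the post) that assumes racgwrap really ends with exactly the three lines quoted above:

```shell
#!/bin/sh
# Helper sketch: replace the last three lines of a racgwrap copy with the
# single exec form, keeping a .orig backup. Assumes the file ends with
# exactly the three lines shown in step 3.
patch_racgwrap() {
    f="$1"
    cp "$f" "$f.orig"
    total=`wc -l < "$f"`
    # keep everything except the final three lines
    head -n `expr $total - 3` "$f" > "$f.new"
    # exec makes racgmain replace the wrapper shell, so no extra child
    # process is left behind for CRS to track
    printf '%s\n' 'exec $ORACLE_HOME/bin/racgmain "$@"' >> "$f.new"
    mv "$f.new" "$f"
}
# usage: patch_racgwrap $ORACLE_HOME/bin/racgwrap   (repeat for $CRS_HOME/bin)
```

Run it once per copy of racgwrap on each node, then restart CRS as in step 4.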

We were lucky that we hit the bug just before the migration, so restarting the instances/servers was easy enough. I don't know if this really solves the problem, but we never hit the bug again.

Categories: hp-ux, oracle