Many open files on HP-UX after RAC upgrade to 10.2.0.4 - racgimon file handle leak
Two months after patching a customer database to 10.2.0.4, I received a call telling me that the database was hanging. Usually this happens when the backup of the archive logs has been missed and the database stops. This time there was enough space available, so that was not the problem. I logged on to the first node and started looking around. Weird things were happening: some commands were failing and others were hanging. I realized that this was not an ordinary case and started looking deeper. It turned out to be an Oracle bug on HP-UX, for which there is both a patch and a workaround.
The customer was running HP-UX 11.23 (September 2006) with patch bundles from September 2008. The database was Oracle RAC Enterprise Edition 10.2.0.2.
This problem had a very big impact: although the database was running in RAC, it was not accessible and there were a lot of locks. Rebooting the node or killing the processes clears the hang.
After some reading I figured out that this happens only on HP-UX, only after patching the database to 10.2.0.4, and only on the first node.
Here are some symptoms:
Executing sar -v shows the current and maximum size of the system file table (the last used/maximum pair on each line):
12:00:00 N/A N/A 328/4200 0 1374/286108 0 41906/65536 0
12:02:00 N/A N/A 330/4200 0 1376/286108 0 41944/65536 0
12:04:00 N/A N/A 336/4200 0 1390/286108 0 41999/65536 0
12:06:00 N/A N/A 331/4200 0 1377/286108 0 41983/65536 0
12:08:00 N/A N/A 330/4200 0 1376/286108 0 41976/65536 0
12:10:00 N/A N/A 330/4200 0 1377/286108 0 41935/65536 0
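To keep an eye on this figure without reading the raw output, the file table usage can be pulled out of each line. The following is only a minimal sketch in Python: it assumes the line layout shown above, where the last used/maximum pair is the file table, and the function names are made up for illustration.

```python
import re

# A "used/maximum" field such as 41906/65536 ("N/A" does not match).
RATIO = re.compile(r"(\d+)/(\d+)")


def file_table_usage(sar_line):
    """Return (used, maximum) of the file table from one `sar -v` data line.

    Assumes the file table is the last used/maximum pair on the line,
    as in the output shown above.
    """
    pairs = [m for m in map(RATIO.fullmatch, sar_line.split()) if m]
    if not pairs:
        raise ValueError("no used/maximum field found in: %r" % sar_line)
    used, maximum = pairs[-1].groups()
    return int(used), int(maximum)


def percent_full(sar_line):
    """Return the file table usage as a percentage of its maximum."""
    used, maximum = file_table_usage(sar_line)
    return 100.0 * used / maximum
```

Feeding it the first line above returns 41906 used out of 65536, which is roughly 64% of the table.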
With lsof the same file can be seen opened again and again by racgimon:
racgimon 3506 oracle 14u REG 64,0x9 1552 29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon 3506 oracle 28u REG 64,0x9 1552 29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon 3506 oracle 30u REG 64,0x9 1552 29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon 3506 oracle 37u REG 64,0x9 1552 29678 /oracle/ora10g/dbs/hc_baandb1.dat
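With tens of thousands of such lines it is easier to count descriptors per process and file than to scroll through them. A minimal sketch in Python, assuming the nine-column lsof layout shown above and file names without spaces; the function name is hypothetical.

```python
from collections import Counter


def descriptors_per_file(lsof_lines):
    """Count open descriptors per (command, pid, file name).

    Assumes lsof lines laid out as above: COMMAND and PID are the first
    two fields and the file name is the last one (no spaces in the name).
    """
    counts = Counter()
    for line in lsof_lines:
        fields = line.split()
        # Skip the header line and anything too short to be a data line.
        if len(fields) < 9 or fields[0] == "COMMAND":
            continue
        counts[(fields[0], fields[1], fields[-1])] += 1
    return counts
```

Calling `descriptors_per_file(lines).most_common(5)` would then list the worst offenders first; on the affected node the hc_baandb1.dat entry for racgimon should stand out.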
The process which is holding the open files:
oracle 3506 1 0 Nov 5 ? 18:16 /oracle/ora10g/bin/racgimon startd baandb
In the log "$ORACLE_HOME/log/{NodeName}/racg/imon_{InstanceName}.log" the following error can be seen every minute:
2009-12-02 12:12:35.454: [ RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
2009-12-02 12:13:35.474: [ RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
When the file table gets full, weird things start to happen. The following can be seen in the syslog:
Nov 5 08:00:02 db1 vmunix: file: table is full
Nov 5 08:00:03 db1 vmunix: file: table...
Nov 5 08:00:03 db1 vmunix: file...
Nov 5 08:00:03 db1 vmunix: file...
Nov 5 08:01:13 db1 vmunix: file: table is full
Nov 5 08:11:15 db1 above message repeats 34260 times
The following can also be seen in the alert log:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27544: Failed to map memory region for export
ORA-27300: OS system dependent operation:socket failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: sskgxpcre1
Solution:
The base bug is 6931689 (SS10204-HP-PARISC64-080216.080324 HEALTH CHECK FAILED TO CONNECT TO INSTANCE), but it's not public. It's fixed in CRS 10.2.0.4 Bundle Patch #2, but the current CRS bundle is PSU2, Patch# 8705958: TRACKING BUG FOR 10.2.0.4.2 PSU FOR CRS, which is around 41 MB.
Patch# 8705958 should be applied to all Oracle homes. Although the bug is in the database, CRS should never be at a lower version than the database.
To apply this patch, the OPatch version must be at least 10.2.0.4.7. OPatch can be downloaded as patch# 6880880. At the time of writing the latest version was 10.2.0.4.9, and it is 34 MB. To install it, simply download it and unzip it under ORACLE_HOME.
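When checking the OPatch requirement from a script, the versions have to be compared number by number, because a plain string comparison would rank 10.2.0.4.10 below 10.2.0.4.7. A minimal sketch in Python; the function names are made up, and the installed version string is assumed to be already at hand.

```python
def version_tuple(version):
    """Turn a dotted version such as '10.2.0.4.7' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def opatch_is_new_enough(installed, minimum="10.2.0.4.7"):
    """Return True if the installed OPatch version meets the minimum.

    The default minimum is the one required for patch# 8705958.
    """
    return version_tuple(installed) >= version_tuple(minimum)
```

With this, 10.2.0.4.9 passes the check while something like 10.2.0.4.2 does not.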
I didn't go with the patch because I read some scary stuff on OTN, and thanks to Ivan Kartik I put a dirty workaround in place instead. He proposed a very good script which checks whether more than 20000 files are open and, if so, kills the racgimon process:
http://ivan.kartik.sk/index.php?show_article=42
The sar -v output below shows the file table usage dropping once the workaround kicked in:
13:56:00 N/A N/A 307/4200 0 1352/286108 0 44102/65536 0
13:58:00 N/A N/A 307/4200 0 1353/286108 0 44119/65536 0
14:00:01 N/A N/A 309/4200 0 1355/286108 0 44135/65536 0
14:02:01 N/A N/A 307/4200 0 1353/286108 0 44153/65536 0
14:04:01 N/A N/A 301/4200 0 1336/286108 0 2583/65536 0
14:06:01 N/A N/A 306/4200 0 1347/286108 0 2610/65536 0
14:08:01 N/A N/A 299/4200 0 1333/286108 0 2583/65536 0
14:10:01 N/A N/A 300/4200 0 1335/286108 0 2571/65536 0
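The idea behind the workaround can be sketched in a few lines. This is not Ivan Kartik's script, just an illustration of the same check in Python: the function names are hypothetical, the choice of SIGTERM is an assumption, and racgimon is assumed to appear in the `ps -ef` output with its full path, as in the listing earlier.

```python
import os
import signal

THRESHOLD = 20000  # open-file limit mentioned for the workaround


def racgimon_pids(ps_lines):
    """Extract the PIDs of racgimon processes from `ps -ef` style lines.

    Only fields that are an absolute path ending in 'racgimon' count,
    so a line such as 'grep racgimon' is not picked up by accident.
    """
    pids = []
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # header line or something unexpected
        if any(f.startswith("/") and os.path.basename(f) == "racgimon"
               for f in fields):
            pids.append(int(fields[1]))
    return pids


def should_kill(open_files, threshold=THRESHOLD):
    """Return True once the open-file count exceeds the threshold."""
    return open_files > threshold


def kill_all(pids, sig=signal.SIGTERM):
    """Send a signal to every PID (SIGTERM is an assumption here)."""
    for pid in pids:
        os.kill(pid, sig)
```

Run from cron every few minutes, such a check would call `kill_all(racgimon_pids(...))` only when `should_kill(...)` returns True for the current open-file count.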
The workaround fixed the problem. This article was written half a year ago, and according to MOS the bug is now fixed in 10.2.0.5, which was released at the beginning of June.
Regards,
Sve