bug - Svetoslav Gyurov technical blog

Grid Infrastructure 12c installation fails because of 255 in the subnet ID

Svetoslav Gyurov — Thu, 25 Aug 2016 09:15:18 GMT

I was doing another GI 12.1.0.2 cluster installation last month when I got a really weird error.

While root.sh was running on the first node I got the following error:

2016/07/01 15:02:10 CLSRSC-343: Successfully started Oracle Clusterware stack
2016/07/01 15:02:23 CLSRSC-180: An error occurred while executing the command '/ocw/grid/bin/oifcfg setif -global eth0/10.118.144.0:public eth1/10.118.255.0:cluster_interconnect' (error code 1)
2016/07/01 15:02:24 CLSRSC-287: FirstNode configuration failed
Died at /ocw/grid/crs/install/crsinstall.pm line 2398.

I was surprised to find the following error in the rootcrs log file:

2016-07-01 15:02:22: Executing cmd: /ocw/grid/bin/oifcfg setif -global eth0/10.118.144.0:public eth1/10.118.255.0:cluster_interconnect
2016-07-01 15:02:23: Command output:
 > PRIF-15: invalid format for subnet
>End Command output

A quick MOS search suggested that my installation failed because I had 255 in the subnet ID:
root.sh fails with CLSRSC-287 due to: PRIF-15: invalid format for subnet (Doc ID 1933472.1)

Indeed, we had 255 in the private network (10.118.255.0). Fortunately this was our private network, which was easy to change, but you will hit the same issue if your public network has 255 in the subnet ID.
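A pre-installation check along these lines might catch the problem before root.sh runs. This is only a sketch: the function name subnet_has_255 is made up for illustration, and it simply flags any dotted-quad subnet that has an octet equal to 255, which is the condition the MOS note describes.

```shell
# Hypothetical helper: succeeds (exit 0) if any octet of the subnet is 255.
subnet_has_255() {
  case ".$1." in
    *.255.*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Check both subnets from the failed oifcfg command.
for subnet in 10.118.144.0 10.118.255.0; do
  if subnet_has_255 "$subnet"; then
    echo "$subnet: contains 255, oifcfg setif may fail with PRIF-15"
  else
    echo "$subnet: ok"
  fi
done
```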

Not able to update Web service process in APEX 4.1

Svetoslav Gyurov — Wed, 21 Mar 2012 14:10:31 GMT

Last month I created a simple APEX application with mobile support enabled and the latest version of jQuery, which integrates with HP Service Manager through web services. The purpose was to give company engineers the option to open and update incidents from a mobile device in a few easy steps.

The first step was to create a form and report using the web service. At this point a web service request process is created, in which the authentication and input parameters are described. The problem appears if you try to change any parameter and update the web service process. This is the error:

Error error updating web service parameters
ORA-01403: no data found

To fix this, one option is to apply patch 12934733 on top of APEX 4.1. The other option is to apply the latest patch set, APEX 4.1.1 (patch number 13331096).

At the time I got the error the patch set hadn't been released yet, so I applied only the one-off patch to fix this issue. Later I decided to update APEX to the latest version, 4.1.1, and I'll briefly review the update process here.

Before upgrading to APEX 4.1.1, make sure to review the release notes. The process is really simple and takes a few minutes.

Before applying the patch, make sure to prevent access to APEX. In my case I'm using Oracle Database 11g Express Edition with the embedded PL/SQL gateway. Then apply the patch using apxpatch.sql and update the images directory. Because I'm using Express Edition, my images are stored in the XML DB repository, and the script apxldimg.sql has to be used to upload the new images into the repository.

Disabling Oracle XML DB HTTP Server:

SQL> SELECT DBMS_XDB.GETHTTPPORT FROM DUAL;

GETHTTPPORT
-----------
 0

SQL> EXEC DBMS_XDB.SETHTTPPORT(0);

PL/SQL procedure successfully completed.

SQL> COMMIT;

Commit complete.

SQL> SELECT DBMS_XDB.GETHTTPPORT FROM DUAL;

GETHTTPPORT
-----------
 0

Run apxpatch.sql to patch the system:

SQL> @apxpatch.sql

.......

timing for: Complete Patch
Elapsed: 00:06:25.48

Updating the Images Directory When Running the Embedded PL/SQL Gateway:

SQL> @apxldimg.sql /tmp/patch

.......

Commit complete.

timing for: Load Images
Elapsed: 00:04:12.56

Directory dropped.

Enabling Oracle XML DB HTTP Server:

SQL> EXEC DBMS_XDB.SETHTTPPORT(8080);

PL/SQL procedure successfully completed.

SQL> COMMIT;

Commit complete.

APEX is now updated to version 4.1.1.

Regards,
Sve

Database 11.2 bug causes huge number of alert log entries

Svetoslav Gyurov — Thu, 22 Dec 2011 11:25:15 GMT

A few days ago I received a call from a customer about a problem with their EM console and messages about a full file system. They run DB 11.2.0.2 on OEL 5.7; that file system held only the binaries installation, and the database itself was using ASM. I quickly logged on and found that the file system really was full, and after looking around I figured out that all the free space had been eaten by the alert and trace diagnostic directories. The trace directory was full of 10MB files and the alert log file was growing quickly with the following messages:

WARNING: failed to read mirror side 1 of virtual extent 2917 logical extent 0 of file 271 in group [1.2242406296] from disk DATA_0000 allocation unit 24394 reason error; if possible,will try another mirror side
Errors in file /oracle/app/oracle/diag/rdbms/baandb/baandb/trace/baandb_ora_17785.trc:
WARNING: Read Failed. group:1 disk:0 AU:24394 offset:1007616 size:8192
WARNING: failed to read mirror side 1 of virtual extent 2917 logical extent 0 of file 271 in group [1.2242406296] from disk DATA_0000 allocation unit 24394 reason error; if possible,will try another mirror side
Errors in file /oracle/app/oracle/diag/rdbms/baandb/baandb/trace/baandb_ora_17785.trc:

At first I thought there was a storage problem, but looking at the ASM views everything seemed to be all right, so these appeared to be false messages. I deleted all the trace files, but a few minutes later the file system was full again. It turned out that more than 60MB of logs were generated per minute, or around 7GB in two hours, and the machine was already loaded because of this huge number of messages.
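To get a feel for how fast the messages pile up, counting the warning in the alert log is a quick check. A minimal sketch: the helper name is hypothetical, and the alert log path in the usage comment is an assumption derived from the trace directory shown in the messages above.

```shell
# Hypothetical helper: count "failed to read mirror side" warnings in a log file.
count_mirror_warnings() {
  grep -c 'failed to read mirror side' "$1"
}

# Usage (assumed path; adjust to your own diagnostic destination):
# count_mirror_warnings /oracle/app/oracle/diag/rdbms/baandb/baandb/trace/alert_baandb.log
```

Running it twice a minute apart and comparing the two numbers gives a rough messages-per-minute rate.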

Then after a quick MOS search I found that this is Bug 10422126: FAILED TO READ MIRROR SIDE 1, and that there is a 70KB patch for 11.2.0.2.

The following MOS notes are also useful:
WARNING: 'Failed To Read Mirror Side 1' continuously reported in the alert log [ID 1289905.1]
Huge number of alert log entries: 'WARNING: IO Failed...' 'WARNING: failed to read mirror side 1 of virtual extent ...' [ID 1274852.1]

After applying the patch everything became normal and no more false messages appeared in the logs. The bug is fixed in 11.2.0.3.

Regards,
Sve

Cannot apply BP10 to Oracle Database 11.2.0.2 on Windows Server 2008 R2

Svetoslav Gyurov — Wed, 09 Nov 2011 14:25:10 GMT

This happened to me when I tried to apply Bundle Patch 10 for Oracle Database 11.2.0.2 on Windows 2008, but I guess it could happen with any 11.x database version. I decided to apply this patch after I hit the bug in which the heap memory is exhausted because of CVU health checks (I described it in an earlier post).

After running opatch apply I was told that the following files were still active:

d:\app\11.2.0\grid\bin\oraclient11.dll
d:\app\11.2.0\grid\bin\orageneric11.dll
d:\app\11.2.0\grid\bin\orapls11.dll
d:\app\11.2.0\grid\bin\oracommon11.dll
d:\app\11.2.0\grid\bin\oci.dll
d:\app\11.2.0\grid\bin\orahasgen11.dll
d:\app\11.2.0\grid\bin\oraocr11.dll
d:\app\11.2.0\grid\bin\oraocrb11.dll
d:\app\11.2.0\grid\bin\oraocrutl11.dll
d:\app\11.2.0\grid\bin\mDNSResponder.exe
d:\app\11.2.0\grid\bin\ocssd.exe
d:\app\11.2.0\grid\bin\cssdagent.exe
d:\app\11.2.0\grid\bin\cssdmonitor.exe
d:\app\11.2.0\grid\bin\evmd.exe
d:\app\11.2.0\grid\bin\evmlogger.exe
d:\app\11.2.0\grid\bin\gipcd.exe
d:\app\11.2.0\grid\bin\gpnpd.exe
d:\app\11.2.0\grid\bin\octssd.exe

It was unlikely that anything was still running, because I had stopped all GI processes. To find out which process was holding the DLLs I again used Process Explorer. It turned out that the process WmiPrvSE.exe had the DLLs open.

Description of WMI:
The wmiprvse.exe file is otherwise known as Windows Management Instrumentation. It is a Microsoft Windows-based component that provides control and information about management in an enterprise environment. Developers use the wmiprvse.exe file in order to develop applications used for monitoring purposes.

For some reason WMI is holding the CRS DLLs. Stop the WMI service or kill the process; this should release the lock on the DLLs and allow opatch to proceed.

Regards,
Sve

Exhaust of Windows 2008 heap memory with Oracle Database 11.2.0.2

Svetoslav Gyurov — Thu, 29 Sep 2011 10:51:34 GMT

Recently I had an interesting setup for one of our customers. Because they had Oracle Standard Edition and Windows Server 2008 R2 Standard Edition licenses, I was asked to create an HA database installation. After looking around I found a few docs about installing Standard Edition with Clusterware and I had some ideas. In the end I installed Grid Infrastructure and the Oracle Database binaries on both servers, then created a single-instance database on the second server and replicated the configuration to the first one. Currently the relocation of the database is done manually, but one could create start/stop/monitor scripts and integrate them with GI. Once the database starts it registers with the SCAN listener, so in theory it's running in HA (just the relocation is manual) :)

So during the weekend I received mail from my colleagues about error messages they received from the database: connect error, Socket read timed out. It wasn't urgent as the database was not yet in production, but go-live was ahead and this was the first task for Monday. The next day I looked around and everything was up and running, except that I wasn't able to log in through the listener and I also wasn't able to stop or relocate it. Looking at the logs I found the following message at some point: TNS-12531: TNS:cannot allocate memory, which explains the previous message.

That was weird: the server on which the error appeared was the first one, and it had only GI and the SCAN listener running. This really looked like a memory leak; it's Windows, so maybe that was obvious. I decided to look at the processes using Resource Monitor and found a lot of cmd.exe processes. To confirm the problem I used Process Explorer, which is a very nice tool for Windows. I had plenty of cmd processes which were spawned but (obviously) not closed after completion.

It turned out that this is a bug in 11.2.0.2 on Windows (64-bit). The Oracle CVU resource (ora.cvu), which by default is started on the first node in the cluster (this makes sense now), runs checks every six hours (CHECK_INTERVAL=21600) and leaves processes open. Because of this the heap memory is exhausted, and that's why the SCAN listener was failing with the error message TNS-12531: TNS:cannot allocate memory.

The following errors could be seen in the Windows event log; once the patch was applied the errors disappeared:

Faulting application lsnrctl.exe, version 11.2.0.2, time stamp 0x4cea8f55, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1eac, application start time 0x01cc6ab588f992c0.

Faulting application cmd.exe, version 6.0.6001.18000, time stamp 0x47918bde, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb733e1, exception code 0xc0000142, fault offset 0x0006f1e7, process id 0x1004, application start time 0x01cc6af0fa982500.

Faulting application sclsspawn.exe, version 0.0.0.0, time stamp 0x4ce622a7, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1ca0, application start time 0x01cc6c0e5efd5380.

This is the bug at MOS:
Bug 12529945: CVU HEALTH CHECKS EXHAUST WINDOWS HEAP MEMORY

The bug should have been fixed in BP8, but I applied the latest one, BP10:
Patch 12849789: ORACLE 11G 11.2.0.2 PATCH 10 BUG FOR WINDOWS (64-BIT AMD64 AND INTEL EM64)

Regards,
Sve

Unable to load Audit Vault console after login

Svetoslav Gyurov — Wed, 07 Sep 2011 13:27:55 GMT

Well, this is a quick note in case someone else runs into this error. I have an Audit Vault server patched up to 10.2.3.2.5, with its repository database at 10.2.0.7. The problem is that I'm able to connect to the console as av_admin, but not as av_auditor. When I try to log in as av_auditor I get redirected to a wrong URL, like this one:

http://192.168.1.100:0/av/console/f?p=7700:100:::::

It's obviously wrong: port 0 does not exist, and I get an Unable to connect error in the browser.

To confirm that this is the problem, check whether the lsnrctl status output has a line like this one:

(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=hostname)(PORT=5707))(Presentation=HTTP)(Session=RAW))

Also use dbms_xdb.gethttpport to get the port on which the console is listening:

SELECT DBMS_XDB.gethttpport() from dual;

DBMS_XDB.GETHTTPPORT()
----------------------
0

These tips are described in the Oracle® Audit Vault Administrator's Guide, in particular section A.3.6.1, Oracle Audit Vault Reports Not Displaying.

The correct HTTP port for Oracle Audit Vault Reports is 5707, and running the above query should return exactly this port. If you get port 0 instead, log in as sysdba and set the correct port:

SQL> EXEC DBMS_XDB.SETHTTPPORT(5707);

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.

Make sure the changes are applied:

SQL> SELECT DBMS_XDB.gethttpport() from dual;

DBMS_XDB.GETHTTPPORT()
----------------------
5707

And finally register the database:

SQL> ALTER SYSTEM REGISTER;

You're now a happy Audit Vault auditor who can log in to the console successfully.
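The listener check mentioned earlier can also be scripted. The following is a small sketch, not an official procedure: the helper name is hypothetical, and it only picks the port out of an endpoint line that carries Presentation=HTTP, in the format shown above.

```shell
# Hypothetical helper: read `lsnrctl status` output on stdin and print the
# port of the first endpoint registered with Presentation=HTTP.
http_port_from_lsnrctl() {
  grep 'Presentation=HTTP' | sed -n 's/.*(PORT=\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Usage, run as the Oracle software owner:
# lsnrctl status | http_port_from_lsnrctl
```

If nothing is printed, no HTTP endpoint is registered with the listener, which matches the port 0 symptom.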

Oracle DB 10.2.0.3 LISTENER (VIP) goes down on HP-UX 11.23 without reason

Svetoslav Gyurov — Wed, 05 Jan 2011 15:57:07 GMT

Happy New Year!

For a long time I've been receiving complaints that the listener on one of the nodes in a two-node RAC goes offline from time to time. Without any obvious reason the VIP of the second node fails, the listener is stopped and the VIP is relocated to the first node. Since the VIP is relocated, there are no problems if all the clients are configured correctly. In this case some of the clients were connecting explicitly to the second node and were unable to connect to the database. The database version is 10.2.0.3 RAC, installed on two nodes running HP-UX 11.23 with the December 2008 bundle patches.

The following can be observed in $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2010-10-25 06:11:12.492: [ CRSAPP][8336] CheckResource error for ora.db2.vip error code = 1
2010-10-25 06:11:12.522: [ CRSRES][8336] In stateChanged, ora.db2.vip target is ONLINE
2010-10-25 06:11:12.522: [ CRSRES][8336] ora.db2.vip on db2 went OFFLINE unexpectedly
2010-10-25 06:11:12.523: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:12.527: [ CRSRES][8336] Attempting to stop `ora.db2.vip` on member `db2`
2010-10-25 06:11:13.182: [ CRSRES][8336] Stop of `ora.db2.vip` on member `db2` succeeded.
2010-10-25 06:11:13.185: [ CRSRES][8336] ora.db2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2010-10-25 06:11:13.188: [ CRSRES][8336] ora.db2.vip failed on db2 relocating.
2010-10-25 06:11:13.231: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:13.235: [ CRSRES][8336] Attempting to stop `ora.db2.LISTENER_DB2.lsnr` on member `db2`
2010-10-25 06:12:31.183: [ CRSRES][8336] Stop of `ora.db2.LISTENER_DB2.lsnr` on member `db2` succeeded.
2010-10-25 06:12:31.211: [ CRSRES][8336] Attempting to start `ora.db2.vip` on member `db1`
2010-10-25 06:12:38.327: [ CRSRES][8336] Start of `ora.db2.vip` on member `db1` succeeded.

The following can be seen in the alert log:
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='oradb2';

There are a couple of bugs logged about this. There is also a MOS note regarding this problem:
HP-UX Itanium: RACGMAIN Received SIGSEGV On CheckResource Causing a Crash of a Resource [ID 763724.1]

The solution is to change the binding mode of the executables which use the shared library from "deferred binding" to "immediate binding" using the following shell script. It has to be applied to both the CRS and DB homes, and all Oracle processes should be stopped:

# Switch the listed binaries to immediate binding in both the DB and CRS homes.
for home in "$ORACLE_HOME" "$CRS_HOME"; do
  cd "$home/bin" || continue
  for i in crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain; do
    chatr -B immediate "$i"
  done
done

In the three months since implementing this solution I haven't seen the problem again!

Regards,
Sve

Many open files on HP-UX after RAC upgrade to 10.2.0.4 - racgimon file handle leak

Svetoslav Gyurov — Fri, 23 Jul 2010 15:17:42 GMT

Two months after patching a customer database to 10.2.0.4 I received a call telling me that the database was hanging. Usually this happens when they miss the backup of the archive logs and the database stops. This time there was enough space available, so this was not the problem. I logged on to the first node and started looking around; weird things were happening, some commands were failing and others were hanging. Then I realized that this was not an ordinary case and started looking deeper. It turned out that this is an Oracle bug on HP-UX, and there is a patch and a workaround too.

The customer had HP-UX 11.23 (September 2006) with patch bundles from September 2008. The database was Oracle RAC Enterprise Edition 10.2.0.2.

This problem had a very big impact, because although the database was running in RAC it was not accessible and there were a lot of locks. Rebooting the node or killing the processes does the job.

After some reading I figured out that this happens only on HP-UX, after patching the database to 10.2.0.4, and only on the first node.

Here are some symptoms:

Executing sar -v shows the current and maximum size of the system file table (the second column from the right):

12:00:00   N/A   N/A 328/4200  0  1374/286108 0  41906/65536 0
12:02:00   N/A   N/A 330/4200  0  1376/286108 0  41944/65536 0
12:04:00   N/A   N/A 336/4200  0  1390/286108 0  41999/65536 0
12:06:00   N/A   N/A 331/4200  0  1377/286108 0  41983/65536 0
12:08:00   N/A   N/A 330/4200  0  1376/286108 0  41976/65536 0
12:10:00   N/A   N/A 330/4200  0  1377/286108 0  41935/65536 0

With lsof the following open files are seen:

racgimon   3506 oracle   14u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   28u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   30u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   37u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
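To see how many descriptors a process holds on the same file, the lsof output can be piped through a small counter. This is a sketch under the assumption that the path is the last column of the lsof output, as in the listing above; the helper name is hypothetical.

```shell
# Hypothetical helper: read lsof output on stdin and print how many open
# descriptors point at the given path (the path is the last column).
count_open_handles() {
  awk -v path="$1" '$NF == path { n++ } END { print n + 0 }'
}

# Usage (PID 3506 is the racgimon process from the listing above):
# lsof -p 3506 | count_open_handles /oracle/ora10g/dbs/hc_baandb1.dat
```

A number that keeps growing between runs points at the handle leak.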

The process which is holding the open files:

oracle  3506     1  0  Nov  5  ?        18:16 /oracle/ora10g/bin/racgimon startd baandb

In the log "$ORACLE_HOME/log/{NodeName}/racg/imon_{InstanceName}.log" the following error can be seen every minute:

2009-12-02 12:12:35.454: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

2009-12-02 12:13:35.474: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

When the file table gets full weird things start to happen. In the syslog the following can be seen:

Nov  5 08:00:02 db1 vmunix: file: table is full
Nov  5 08:00:03 db1 vmunix: file: table...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:01:13 db1 vmunix: file: table is full
Nov  5 08:11:15 db1  above message repeats 34260 times

Also in the alertlog file the following can be seen:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27544: Failed to map memory region for export
ORA-27300: OS system dependent operation:socket failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: sskgxpcre1

Solution:
The base bug is 6931689 (SS10204-HP-PARISC64-080216.080324 HEALTH CHECK FAILED TO CONNECT TO INSTANCE), but it's not public. It's fixed in CRS 10.2.0.4 Bundle Patch #2, but the current CRS bundle is PSU2, Patch# 8705958: TRACKING BUG FOR 10.2.0.4.2 PSU FOR CRS, which is around 41MB in size.
Patch# 8705958 should be applied to all Oracle homes: although the bug is in the database, CRS should never be at a lower version than the database.

To apply this patch the OPatch version must be at least 10.2.0.4.7, which can be downloaded as patch# 6880880. At the time of writing the latest version was 10.2.0.4.9, and it's 34MB. To install it, simply download it and unzip it under ORACLE_HOME.

I didn't go with the patch because I read some scary stuff on OTN, and thanks to Ivan Kartik I put a dirty workaround in place. He proposed a very good script which checks whether there are more than 20000 open files and, if so, kills the racgimon process:

http://ivan.kartik.sk/index.php?show_article=42

13:56:00   N/A   N/A 307/4200  0  1352/286108 0  44102/65536 0
13:58:00   N/A   N/A 307/4200  0  1353/286108 0  44119/65536 0
14:00:01   N/A   N/A 309/4200  0  1355/286108 0  44135/65536 0
14:02:01   N/A   N/A 307/4200  0  1353/286108 0  44153/65536 0
14:04:01   N/A   N/A 301/4200  0  1336/286108 0  2583/65536 0
14:06:01   N/A   N/A 306/4200  0  1347/286108 0  2610/65536 0
14:08:01   N/A   N/A 299/4200  0  1333/286108 0  2583/65536 0
14:10:01   N/A   N/A 300/4200  0  1335/286108 0  2571/65536 0

The workaround fixed the problem. This article was written half a year ago, and according to MOS the bug is now fixed in 10.2.0.5, which was released at the beginning of June.
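The idea behind the workaround can be sketched roughly as follows. This is not Ivan Kartik's script, just an illustration of the threshold check: the helper names are hypothetical, the parsing assumes the sar -v layout shown above (file table usage in the second column from the right), and the kill step is left as a comment.

```shell
# Hypothetical helper: read `sar -v` sample lines on stdin and print the
# file-table usage (the number before the slash in the second column from
# the right) taken from the last sample.
file_table_used() {
  awk '$(NF-1) ~ /^[0-9]+\/[0-9]+$/ { split($(NF-1), a, "/"); used = a[1] }
       END { print used + 0 }'
}

# Threshold from the workaround described above.
LIMIT=20000

# Succeeds when the usage read from stdin is above the limit.
over_limit() {
  [ "$(file_table_used)" -gt "$LIMIT" ]
}

# Usage sketch, for example from cron. The process to kill is an assumption
# based on the ps output shown earlier:
# sar -v 1 1 | over_limit && echo "file table above $LIMIT, kill racgimon"
```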

Regards,
Sve

Oracle 11g R2 installer fails on HP-UX 11iv3

Svetoslav Gyurov — Thu, 20 May 2010 10:10:19 GMT

Running the installer of any of the products (client, grid, database) of Oracle Database 11g Release 2 on HP-UX 11iv3 (Itanium) fails with:
"An internal error occurred within cluster verification framework"

After starting ./runInstaller an error window with this message pops up.

The following error can also be seen in installAction$DATE.log:

SEVERE: [FATAL] An internal error occurred within cluster verification framework
Unable to get the current group.

This happens because patch PHCO_40381 is not installed. There is a list of patches to be installed in section 2.3.4, Patch Requirement, of the Database Installation Guide for HP-UX.

The first one is:
PHCO_40381 11.31 Disk Owner Patch

The patch is available from ITRC. It's 205KB in size and it fixes the behavior of the diskowner command. Installing the patch does not require a reboot of the server.

After the installation of the patch, runInstaller starts successfully.
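Whether the patch is already present can be checked before starting the installer. A minimal sketch, assuming the patch ID shows up in the output of swlist -l patch; the helper name is hypothetical.

```shell
# Hypothetical helper: read `swlist -l patch` output on stdin and succeed
# if the given patch ID appears in it.
has_patch() {
  grep -q -- "$1"
}

# Usage on the HP-UX host before starting runInstaller:
# swlist -l patch | has_patch PHCO_40381 || echo "PHCO_40381 is missing"
```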

There is also a MOS note regarding this problem:
HP-UX: 11gR2 runInstaller Fails with "An internal error occurred within cluster verification framework" [ID 983713.1]

Regards,
Sve

Constant cimprovagt daemon crashing and filling the /var directory

Svetoslav Gyurov — Mon, 23 Nov 2009 14:32:32 GMT

We installed two nodes with HP-UX 11.31 March 2009 BOE in a ServiceGuard environment and started test applications in two packages. Suddenly the /var directories on both nodes started to grow; as a result the cluster was crashing and the syslog was never up to date. It turned out that one of the OnlineDiagnostics components (cimprovagt) was crashing. I reviewed a few advisories and bugs about it, but none of them described the same behaviour.

Executing file on the core dump file shows the following:

core: ELF-64 core file - IA64 from 'cimprovagt' - received SIGABRT

HP analyzed the core dump files and determined that the problem was already known and that the fix was already implemented in the September release of DASProvider, which is now part of the DiagProdCollection bundle and can be downloaded from the HP Software Depot portal.

After installing the bundle the daemon stopped crashing and the system is stable now.

HP-UX software bug hidden in cluster behaviour

Svetoslav Gyurov — Thu, 08 Oct 2009 14:12:57 GMT

I was called to check some strange behavior of a two-node cluster and to see why one of the nodes had crashed unexpectedly. The two nodes were HP Integrity servers installed with HP-UX 11.31 Base OE (March 2009). Well, the node had not crashed; it was restarted by ServiceGuard because the safety timer expired for some reason. The system log was not up to date because the /var directory had filled up at some point and syslog stopped writing. The console log showed the standard messages, INIT occurred and safety timer expire. Analyzing the crash dumps revealed that communication with cmcld was not possible, and that's why the server was rebooted, probably because the /var directory was full.

Anyway, a few days later the customer called again and said that the node had restarted again. I expected to see the same reason, but this time the reboot reason was "Reboot after panic: Fault when executing in kernel mode". The problem was not in the cluster this time; the reboot reason pointed to a problem in the kernel.

What is a crash anyway? From the HP documentation:
An abnormal system reboot is called a crash. There are many reasons that can cause a system to crash; hardware malfunctions, software panics or even power failures. The crash event type panic refers to crashes initiated by the HP-UX operating system (software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem calling directly the panic() kernel routine upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event as a result of trap interruption which could not be handled by the operating system for example when the kernel accesses a non-valid address.

I analyzed the crash dumps, reviewed all the advisories and release notes and was unable to figure out the cause of the crash. Finally HP Level 2 support confirmed that this is a known issue with the ONCPlus bundle. ONC stands for Open Network Computing (previously called the NFS bundle in 11.23) and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement a workaround until the fix was released the next month. The workaround was to add -o readdir to the mount options of the NFS share in the fstab. So it was obvious that the problem was in the NFS component of the ONCPlus bundle.

A few days later (not a month) the new product version (with the bugs fixed) appeared online. The release notes list the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.

Yes, the ONCPlus bundle was 11.31.06 and we had mounted the NFS share with a read/write size of 32768 bytes. Both the workaround and the patch seemed to fix the problem, and the crash never appeared again. Keep in mind that installing the new ONCPlus bundle needs a restart and applying the workaround does not, BUT support advised us to reboot the server anyway, just to make sure that the corruption is not loaded in memory. So if you hit this bug, consider applying the new bundle.
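The condition from the release notes (a read/write size greater than 8192 bytes) is easy to test for a given mount option string. The following is a sketch with a hypothetical helper name; it only looks at explicit rsize and wsize options and says nothing about the defaults used when they are omitted.

```shell
# Hypothetical helper: given a comma-separated NFS mount option string,
# succeed if rsize or wsize is larger than 8192 bytes.
nfs_size_over_8k() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= '
    ($1 == "rsize" || $1 == "wsize") && $2 + 0 > 8192 { found = 1 }
    END { exit (found ? 0 : 1) }'
}

# The options we had in use, per the description above.
if nfs_size_over_8k "rw,rsize=32768,wsize=32768"; then
  echo "read/write size above 8192: affected by the ONCplus defect"
fi
```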

The latest ONCPlus bundle can be downloaded from the HP Software Depot portal.

Just for reference, the following stack trace is dumped on the console when the server crashes:

bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8

Message buffer contents after system crash:

panic: Fault when executing in kernel mode
Stack Trace:
IP                    Function Name
0xe000000001f887e0  bad_kern_reference+0xa0
0xe00000000076a3d0  $cold_vfault+0x3b0
0xe000000000c45a10  vm_hndlr+0x510
0xe000000001bd9780  bubbledown+0x0
0xe000000000d00da1  vx_iupdat_cluster+0xa1
0xe000000000d14830  vx_async_iupdat+0x160
0xe000000000d4a530  vx_iupdat_local+0x2c0
0xe000000000d8c020  vx_iupdat+0xb0
0xe000000002134ed0  vx_iflush_list+0x4d0
0xe000000000afa8c0  vx_iflush+0x1d0
0xe000000000cf2710  vx_worklist_thread+0x200
0xe000000000e65d70  kthread_daemon_startup+0x90

Regards,
Sve

Many racgmain(check) processes at HP-UX 11iv3

Svetoslav Gyurov — Mon, 17 Aug 2009 16:38:37 GMT

I was called because some commands for controlling the cluster and Oracle were not working. This was a two-node cluster with Oracle 10.2.0.4 RAC installed on HP-UX 11.31 Data Center OE (December 2008), which had been running for a month already.

Arriving at the customer site I noticed that there were a lot (around 500) of hanging racgmain(check) processes, which obviously were blocking some of the cluster commands. Errors can also be seen in this log, $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2009-04-08 15:22:01.700: [CRSEVT][90801] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [CRSAPP][90801] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [CRSEVT][90811] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

There are a lot of bugs on Metalink, but no documents or suggestions on how to fix this.

Fortunately we found a solution:

Stop CRS on all nodes.
Make a copy of racgwrap, located under $ORACLE_HOME/bin and $CRS_HOME/bin, on all nodes.

Edit the file racgwrap and modify the last 3 lines from:

 $ORACLE_HOME/bin/racgmain "$@"
 status=$?
 exit $status

to:

 exec $ORACLE_HOME/bin/racgmain "$@"

Restart CRS and make sure that all the resources are started.
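One visible effect of this change is that, with exec, the wrapper shell is replaced by racgmain instead of staying around as its parent, so each check leaves one process fewer. The sketch below illustrates only that shell behaviour, using a plain sh as a stand-in for racgmain; whether this is what cured the hanging processes is not confirmed.

```shell
# Prints two PIDs: the wrapper shell's, then the PID of the command it
# exec's. Because exec replaces the wrapper, both lines are identical.
exec_pids() {
  sh -c 'echo $$; exec sh -c "echo \$\$"'
}

exec_pids
```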

We were lucky to hit the bug just before the migration, when restarting the instances/servers was easy enough. I don't know if this really solves the problem, but we never hit the bug again.