I was called in because some commands for controlling the cluster and the Oracle instances were not working. This was a two-node cluster running Oracle 10.2.0.4 RAC on HP-UX 11.31 Data Center OE (December 2008), which had already been in production for a month.
Arriving at the customer site, I noticed a lot (around 500) of hanging racgmain(check) processes, which were obviously blocking some of the cluster commands. Errors could also be seen in $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:
2009-04-08 15:22:01.700: [CRSEVT] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [CRSAPP] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [CRSEVT] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check) category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
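A quick way to confirm the symptom is to count the leftover check processes before touching anything. This is just a sketch; the exact ps output depends on the platform, and "racgmain check" is how the processes showed up in our case:

```shell
# Count leftover racgmain check processes on this node.
# The [r] trick keeps the grep process itself out of the match.
count=$(ps -ef | grep "[r]acgmain check" | wc -l)
echo "hanging racgmain check processes: $count"
```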
There are a lot of bug reports on Metalink matching these symptoms, but no documents or suggestions on how to fix it.
Fortunately we found a solution:
- Stop CRS on all nodes.
- Make a copy of racgwrap, located under $ORACLE_HOME/bin and $CRS_HOME/bin, on all nodes.
- Edit the file racgwrap and modify the last 3 lines from:

  $ORACLE_HOME/bin/racgmain "$@"
  status=$?
  exit $status

  to:

  exec $ORACLE_HOME/bin/racgmain "$@"

- Restart CRS and make sure that all the resources start.
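The edit works because exec replaces the wrapper shell with racgmain instead of forking a child and waiting for it, so there is no intermediate shell process left behind to hang. A minimal demonstration of the difference, using a generic /bin/sh toy wrapper (not the real racgwrap):

```shell
# Two toy wrappers: one runs a command as a child, one execs it.
# With exec, the wrapper's PID and the command's PID are the same,
# because the shell process is replaced rather than forked.
cat > /tmp/wrap_child.sh <<'EOF'
#!/bin/sh
echo "wrapper pid: $$"
sh -c 'echo "command pid: $$"'
status=$?
exit $status
EOF

cat > /tmp/wrap_exec.sh <<'EOF'
#!/bin/sh
echo "wrapper pid: $$"
exec sh -c 'echo "command pid: $$"'
EOF

chmod +x /tmp/wrap_child.sh /tmp/wrap_exec.sh
/tmp/wrap_child.sh   # two different PIDs
/tmp/wrap_exec.sh    # same PID twice
```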
We were lucky to hit the bug just before the migration, when restarting the instances/servers was easy enough. I don't know if this really solves the underlying problem, but we never hit the bug again.