Many racgmain(check) processes on HP-UX 11i v3

I was called in because some of the commands for controlling the cluster and Oracle were not working. This was a two-node cluster running Oracle 10.2.0.4 RAC on HP-UX 11.31 Data Center OE (December 2008), and it had already been in service for a month.

When I arrived at the customer site I noticed a large number (around 500) of hanging racgmain(check) processes, which were evidently blocking some of the cluster commands. Errors also showed up in the log $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2009-04-08 15:22:01.700: [  CRSEVT][90801] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [  CRSAPP][90801] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [  CRSEVT][90811] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
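
Both symptoms can be checked quickly from the shell. The following is only a sketch: the helper names count_check_procs and count_check_timeouts are made up for this example, and the patterns assume that the process list and the log look like the excerpts above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are made up for this example.

# Count process-list lines that mention racgmain. Reads `ps -ef` output on
# stdin. The [r] bracket keeps the grep process itself out of the count.
count_check_procs() {
    grep -c '[r]acgmain'
}

# Count "check action timed out" messages in the crsd.log given as $1.
count_check_timeouts() {
    grep -c 'racgwrap(check) timed' "$1"
}

# Usage on a cluster node (assumes CRS_HOME and HOSTNAME are set as in the post):
#   ps -ef | count_check_procs
#   count_check_timeouts "$CRS_HOME/log/$HOSTNAME/crsd/crsd.log"
```

A count in the hundreds from the first helper, together with repeated timeouts from the second, would match the situation described here.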

There are many related bugs on Metalink, but no documents or suggestions on how to fix the problem.

Fortunately we found a solution:

1. Stop CRS on all nodes.

2. Make a copy of racgwrap, located under both $ORACLE_HOME/bin and $CRS_HOME/bin, on all nodes.

3. Edit the file racgwrap and change the last three lines from:

$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status

to:

exec $ORACLE_HOME/bin/racgmain "$@"

4. Restart CRS and make sure that all the resources start.
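
Steps 2 and 3 can be scripted roughly as follows. This is a sketch under assumptions, not an Oracle-supplied procedure: patch_racgwrap is a hypothetical helper name, it assumes the wrapper still ends with exactly the three original lines shown in step 3 (no trailing whitespace), and it must only be run while CRS is stopped.

```shell
#!/bin/sh
# Sketch only: patch_racgwrap is a hypothetical helper, not an Oracle tool.
# Assumes CRS is already stopped and that the file given as $1 still ends
# with the original three lines from step 3.

patch_racgwrap() {
    f=$1
    # Step 2: keep a copy of the original next to the wrapper.
    cp -p "$f" "$f.orig" || return 1
    # Step 3: from the racgmain call to the end of the file, prefix the
    # call with exec and drop the two status lines, which can no longer
    # be reached once exec has replaced the shell.
    sed '
/^\$ORACLE_HOME\/bin\/racgmain "\$@"$/,$ {
    s/^\$ORACLE_HOME/exec &/
    /^status=\$?$/d
    /^exit \$status$/d
}' "$f.orig" > "$f"
}

# Usage on each node:
#   patch_racgwrap "$ORACLE_HOME/bin/racgwrap"
#   patch_racgwrap "$CRS_HOME/bin/racgwrap"
#   diff "$CRS_HOME/bin/racgwrap.orig" "$CRS_HOME/bin/racgwrap"
```

Because the result is written back over the existing file, its permissions stay as they were. Stopping and starting CRS (steps 1 and 4) is deliberately left out of the helper; review the diff on every node before starting CRS again.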

We were lucky to hit the bug just before the migration, when restarting the instances and servers was easy enough. I don’t know whether this really solves the problem, but we never hit the bug again.
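
As background on what the change does at the process level (this is general shell behaviour, not something taken from Oracle's notes): without exec the wrapper shell stays alive as the parent of racgmain until it returns, while with exec the shell is replaced by racgmain in the same process. The small demo below illustrates only that difference. The scripts child, plain and with_exec are stand-ins invented for this example and have nothing to do with the real racgwrap beyond its last lines.

```shell
#!/bin/sh
# Demo of plain call vs. exec, using throwaway stand-in scripts.
dir=$(mktemp -d)

# Stand-in for racgmain: prints its own PID.
printf '#!/bin/sh\necho $$\n' > "$dir/child"

# Old-style wrapper: prints its PID, runs the child, passes the status on.
printf '#!/bin/sh\necho $$\nsh %s/child\nstatus=$?\nexit $status\n' "$dir" > "$dir/plain"

# New-style wrapper: prints its PID, then replaces itself with the child.
printf '#!/bin/sh\necho $$\nexec sh %s/child\n' "$dir" > "$dir/with_exec"

plain_out=$(sh "$dir/plain")      # two different PIDs: wrapper and child
exec_out=$(sh "$dir/with_exec")   # the same PID twice: one process only

echo "plain wrapper PIDs:"; echo "$plain_out"
echo "exec wrapper PIDs:";  echo "$exec_out"

rm -rf "$dir"
```

With the plain wrapper two processes exist for the duration of each check; with exec there is only one, and the caller deals with racgmain directly rather than through an intermediate shell.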

Categories: hp-ux, oracle
  • Hi,
    Thanks for the article. I always enjoy reading your posts.
    Have a nice day
    GlenStef

  • sve

    Hi there,

    Thanks for the interest. I check the blog every day and I try to post a few articles every month.

    Regards,
    Sve

  • A.Wahab

    Hello,
    I am getting the same errors in crsd.log. I checked the racgwrap scripts in the Oracle and CRS homes, and they already contain
    exec $ORACLE_HOME/bin/racgmain "$@"
    !!
    Any ideas?

  • sve

    Hi A.Wahab, thanks for asking.
    Two questions:
    – What does ps -ef show? Do you see a lot of these processes?
    – Did you change the script yourself, and did you restart CRS afterwards?

    Regards,
    sve

  • A.Wahab

    Hi sve,
    I see only two racgmain processes in ps -ef.
    I did not change anything, because the script already had the line you mentioned:
    exec $ORACLE_HOME/bin/racgmain "$@"

  • sve

    Hi A.Wahab,
    Well, our problem was that some of the commands for controlling the cluster were not working because this script was hanging. We had around 500 hanging processes and we saw these errors in crsd.log. After fixing the script we never hit the bug again.

    Are you having actual problems, or do you just see these errors in the log?

    Regards,
    sve

  • jasr

    Such a specific fix for such a specific problem. How did you come up with that idea?
    I tried it on my cluster and it didn’t work.

  • Sve

    Hi, thanks for reading. Actually I didn’t; Oracle Support came up with the idea. Did you restart the whole CRS so that it could pick up the change?

    At the time we hit the bug there was only an internal bug for it. Searching Oracle Support now, I am able to find this note:

    Many Orphaned Or Hanging “racgmain” processes Running [ID 732086.1]

    Another solution would be to apply the latest patch set (10.2.0.4) and then one of the CRS bundle patches, from bundle #2 onwards. Either way, if you go with this solution you should make a backup and be very careful.

    Regards,
    Sve

  • Ming

    Thank you for this note. We ran into this problem today with an old 10.1.3 cluster. Luckily, we found your post.
    Regards,
    ming