Web Analytics

Oracle DB 10.2.0.3 LISTENER (VIP) goes down on HP-UX 11.23 without reason

Happy New Year!

For a long time I've been receiving complains that the listener at one of the nodes in two node RAC is going offline from time to time. Without obvious reason the VIP of the second node fails, the listener is stopped and VIP is relocated to the first node. Since the VIP is relocated there are no problems if all the clients are configured correctly. In this case some of the clients were connecting explicitly to the second node and were unable to connect to the database. Database version is 10.2.0.3 RAC installed on two nodes running HP-UX 11.23 with December 2008 bundle patches.

The following can be observed in $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2010-10-25 06:11:12.492: [ CRSAPP][8336] CheckResource error for ora.db2.vip error code = 1
2010-10-25 06:11:12.522: [ CRSRES][8336] In stateChanged, ora.db2.vip target is ONLINE
2010-10-25 06:11:12.522: [ CRSRES][8336] ora.db2.vip on db2 went OFFLINE unexpectedly
2010-10-25 06:11:12.523: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:12.527: [ CRSRES][8336] Attempting to stop `ora.db2.vip` on member `db2`
2010-10-25 06:11:13.182: [ CRSRES][8336] Stop of `ora.db2.vip` on member `db2` succeeded.
2010-10-25 06:11:13.185: [ CRSRES][8336] ora.db2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2010-10-25 06:11:13.188: [ CRSRES][8336] ora.db2.vip failed on db2 relocating.
2010-10-25 06:11:13.231: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:13.235: [ CRSRES][8336] Attempting to stop `ora.db2.LISTENER_DB2.lsnr` on member `db2`
2010-10-25 06:12:31.183: [ CRSRES][8336] Stop of `ora.db2.LISTENER_DB2.lsnr` on member `db2` succeeded.
2010-10-25 06:12:31.211: [ CRSRES][8336] Attempting to start `ora.db2.vip` on member `db1`
2010-10-25 06:12:38.327: [ CRSRES][8336] Start of `ora.db2.vip` on member `db1` succeeded.

At alert log can be seen following:
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='oradb2';

There are couple of bugs logged about that. There is also MOS ID regarding this problem:
HP-UX Itanium: RACGMAIN Received SIGSEGV On CheckResource Causing a Crash of a Resource [ID 763724.1]

The solution is to change the executable mode which uses shared library from "delay binding" to "immediate binding" using following bash script. It has to be applied on both CRS and DB homes, all Oracle processes should be stopped:

cd $ORACLE_HOME/bin/
for i in crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain; do chatr -B immediate $i; done

cd $CRS_HOME/bin/
for i in crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain; do chatr -B immediate $i; done

For three months since implementing this solutions I haven't seen this problem again!

Regards,
Sve