Archive

Posts Tagged ‘crs’

Exhaust of Windows 2008 heap memory with Oracle Database 11.2.0.2

September 29th, 2011 4 comments

Recently I had an interesting setup for one of our customers. Because they had Oracle Standard Edition and Windows Server 2008 R2 Standard Edition licenses, I was asked to create an HA database installation. After looking around I found a few documents about installing Standard Edition with Clusterware, which gave me some ideas. In the end I installed Grid Infrastructure and the Oracle Database binaries on both servers, created a single-instance database on the second server and replicated its configuration to the first one. Currently the database is relocated manually, but one could write start/stop/monitor scripts and integrate them with GI. Once the database starts it registers with the SCAN listener, so in theory it is running in HA (only the relocation is manual) 🙂

During the weekend I received mail from my colleagues about error messages they were getting from the database: connect error, Socket read timed out. There was no rush, as the database was not yet in production, but go-live was approaching, so this became my first task for Monday. When I looked around, everything was up and running, except that I could not log in through the listener, nor could I stop or relocate it. In the logs I eventually found the following message: TNS-12531: TNS:cannot allocate memory, which explains the earlier error.

That was weird: the server on which the error appeared was the first one, and it was running only GI and the SCAN listener. This really looked like a memory leak (it is Windows, so maybe that should have been obvious). I looked at the running processes in Resource Monitor and found a large number of cmd.exe processes. To confirm the problem I used Process Explorer, which is a very nice tool for Windows. It showed plenty of cmd.exe processes that had been spawned but evidently never closed after completion.

It turned out that this is a bug in 11.2.0.2 on Windows (64-bit). The Oracle CVU resource (ora.cvu), which by default is started on the first node of the cluster (this makes sense now), runs its checks every six hours (CHECK_INTERVAL=21600) and leaves the spawned processes open. Because of this the heap memory is exhausted, and that is why the SCAN listener fails with the error message TNS-12531: TNS:cannot allocate memory.
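
To put the interval into perspective, here is a minimal sketch that converts the CHECK_INTERVAL attribute from seconds to hours. It assumes the resource profile is available as NAME=VALUE lines (the format printed by `crsctl status resource ora.cvu -p` in 11.2); the function name is made up for illustration.

```shell
# Read a resource profile (NAME=VALUE lines) on stdin and print the
# CHECK_INTERVAL value converted from seconds to whole hours.
check_interval_hours() {
    awk -F= '$1 == "CHECK_INTERVAL" { print int($2 / 3600) }'
}

# Intended use on a cluster node (assumption: crsctl is in PATH):
#   crsctl status resource ora.cvu -p | check_interval_hours
# Self-contained demonstration with the value quoted above:
printf 'NAME=ora.cvu\nCHECK_INTERVAL=21600\n' | check_interval_hours   # prints 6
```

With one check every six hours, four sets of leftover processes accumulate per day, which fits a failure that only shows up after the node has been running for a while.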

 

The following errors could be seen in the Windows Event Log; once the patch was applied they disappeared:
Faulting application lsnrctl.exe, version 11.2.0.2, time stamp 0x4cea8f55, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1eac, application start time 0x01cc6ab588f992c0.

Faulting application cmd.exe, version 6.0.6001.18000, time stamp 0x47918bde, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb733e1, exception code 0xc0000142, fault offset 0x0006f1e7, process id 0x1004, application start time 0x01cc6af0fa982500.

Faulting application sclsspawn.exe, version 0.0.0.0, time stamp 0x4ce622a7, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1ca0, application start time 0x01cc6c0e5efd5380.
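
All three entries carry the same exception code, 0xc0000142 (the NTSTATUS value STATUS_DLL_INIT_FAILED). As a rough sketch, entries like these can be tallied per executable once they have been saved to a text file, one entry per line as quoted above. The function name and the one-entry-per-line layout are assumptions for illustration.

```shell
# Read Event Log text on stdin and count "Faulting application" entries
# per executable and exception code. Output: <exe> <code> <count>
faulting_summary() {
    sed -n 's/^Faulting application \([^,]*\),.*exception code \(0x[0-9a-fA-F]*\),.*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Example (events.txt is a hypothetical export of the entries above):
#   faulting_summary < events.txt
```

A growing count for cmd.exe or sclsspawn.exe with this exception code would point at the same exhaustion problem.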

This is the bug at MOS:
Bug 12529945: CVU HEALTH CHECKS EXHAUST WINDOWS HEAP MEMORY

The bug should have been fixed in BP8, but I applied the latest one, BP10:
Patch 12849789: ORACLE 11G 11.2.0.2 PATCH 10 BUG FOR WINDOWS (64-BIT AMD64 AND INTEL EM64)

 

Regards,
Sve

Categories: oracle, windows

Change of network interfaces in Oracle 10g RAC

July 12th, 2011 No comments

I was doing a planned downtime on one of our 10.2.0.4 RAC systems, and just before starting the second node I was told that its network interfaces had been aggregated during the downtime. These servers run HP-UX, where the default network interfaces are lan0 for the public network and lan1 for the interconnect. After aggregation they became lan900 and lan901 respectively, so I asked the guys to revert the change, as I knew that Clusterware would suffer from it.

I decided to create a test scenario at the office, but on Linux (it was faster to deploy and test). Except for the interface names, everything else should be the same. I'm using eth0 for public and eth1 for private. For the purpose of the demonstration I'm going to change the public network interface on the second node from eth0 to eth2. This also requires modifying nodeapps, as the VIP runs on this interface.

I installed Oracle 10.2.0.4 RAC on two nodes, oelvm5 and oelvm6, with the orcl database. This is what the cluster configuration looks like before changing the interface:

[oracle@oelvm5 bin]$ ./oifcfg getif
eth0 192.168.143.0 global public
eth1 172.16.143.0 global cluster_interconnect

[oracle@oelvm5 bin]$ srvctl config nodeapps -n oelvm5 -a
VIP exists.: /oelvm5-vip/192.168.143.159/255.255.255.0/eth0

[oracle@oelvm5 bin]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth0
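
As a small sketch, the public interface can be pulled out of the `oifcfg getif` output shown above, which is handy when the same check has to be scripted on several clusters. The function name is hypothetical; the column layout (interface, subnet, scope, type) is taken from the output above.

```shell
# Print the interface name(s) registered as "public" in `oifcfg getif`
# output read from stdin (columns: interface, subnet, scope, type).
public_ifaces() {
    awk '$4 == "public" { print $1 }'
}

# Intended use on a cluster node:
#   oifcfg getif | public_ifaces
# Demonstration with the configuration shown above:
printf 'eth0 192.168.143.0 global public\neth1 172.16.143.0 global cluster_interconnect\n' |
    public_ifaces   # prints eth0
```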

At this point I changed the interface from eth0 to eth2 on the second node and restarted the node. After the change, the listener on the second node is unable to start and the node's VIP is relocated to the first node. I use a very handy script that prints the status of the cluster resources in a formatted way; here is its output after the node booted:

[oracle@oelvm5 bin]$ crsstatus
HA Resource Target State
----------- ------ -----
ora.orcl.db ONLINE ONLINE on oelvm5
ora.orcl.orcl1.inst ONLINE ONLINE on oelvm5
ora.orcl.orcl2.inst ONLINE ONLINE on oelvm6
ora.oelvm5.ASM1.asm ONLINE ONLINE on oelvm5
ora.oelvm5.LISTENER_OELVM5.lsnr ONLINE ONLINE on oelvm5
ora.oelvm5.gsd ONLINE ONLINE on oelvm5
ora.oelvm5.ons ONLINE ONLINE on oelvm5
ora.oelvm5.vip ONLINE ONLINE on oelvm5
ora.oelvm6.ASM2.asm ONLINE ONLINE on oelvm6
ora.oelvm6.LISTENER_OELVM6.lsnr ONLINE OFFLINE
ora.oelvm6.gsd ONLINE ONLINE on oelvm6
ora.oelvm6.ons ONLINE ONLINE on oelvm6
ora.oelvm6.vip ONLINE ONLINE on oelvm5
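
The crsstatus script itself is not shown here. The following is only a guess at how such a formatter could work, assuming the 10g `crs_stat` output format of NAME=, TYPE=, TARGET= and STATE= lines grouped per resource; the function name and column widths are made up.

```shell
# Turn `crs_stat` output (NAME=/TYPE=/TARGET=/STATE= blocks) read from stdin
# into one line per resource: name, target, state.
crs_table() {
    awk -F= '
        BEGIN          { printf "%-40s %-10s %s\n", "HA Resource", "Target", "State" }
        $1 == "NAME"   { name = $2 }
        $1 == "TARGET" { target = $2 }
        $1 == "STATE"  { printf "%-40s %-10s %s\n", name, target, $2 }
    '
}

# Intended use on a cluster node:
#   crs_stat | crs_table
```

The STATE line closes each block in this format, so the row is printed as soon as it is seen.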

The following can also be observed in $ORA_CRS_HOME/log/{HOST}/racg/ora.{HOST}.vip.log:
2011-05-28 16:20:39.157: [ RACG][3909306080] [4865][3909306080][ora.oelvm6.vip]: checkIf: interface eth0 is down
Invalid parameters, or failed to bring up VIP (host=node2)

So now it's obvious: the VIP could not be started on the second node because interface eth0 is down. To change the public network interface, one first has to use oifcfg to delete the current interface and then add the correct one. Then, for the node on which the interface was changed, Clusterware has to be stopped and nodeapps updated from the other node.

If you are running in production and not using services, consider using crs_relocate on the VIP resource. It relocates the VIP address to the other node immediately, so that no client suffers from connection timeouts. In my lab the VIP was easily relocated with just crs_relocate, but in the production environment ASM and the listener were dependent on the VIP and I had to stop them first. I'm not sure, but I think this was because there were two homes, one for ASM and one for the database.

Then change the public interface/subnet on the affected node. While Clusterware is running, delete the interface using oifcfg and then add it back with the correct interface name:

[oracle@oelvm6 ~]$ ./oifcfg delif -global eth0
[oracle@oelvm5 ~]$ oifcfg getif
eth1 172.16.143.0 global cluster_interconnect
[oracle@oelvm6 ~]$ oifcfg setif -global eth2/192.168.143.0:public

Now we have a correct configuration:

[oracle@oelvm5 ~]$ oifcfg getif
eth2 192.168.143.0 global public
eth1 172.16.143.0 global cluster_interconnect

Because the VIP runs on this same interface, nodeapps for this node has to be updated as well. To do this, stop Clusterware on the affected node and execute srvctl from the other node. The other node has to be up and running in order to make the change:

[root@oelvm6 ~]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
May 30 11:33:15.380 | INF | daemon shutting down
Stopping resources. This could take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.

[oracle@oelvm5 ~]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth0
[root@oelvm5 ~]# srvctl modify nodeapps -n oelvm6 -A oelvm6-vip/255.255.255.0/eth2
[oracle@oelvm5 ~]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth2
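
To check the result without reading the output by eye, a sketch like the following extracts the interface from the `srvctl config nodeapps -a` line. The function name is hypothetical, and the parsing assumes the "VIP exists.:" line format shown above, where the interface is the last slash-separated field.

```shell
# Read `srvctl config nodeapps -n <node> -a` output on stdin and print the
# interface the VIP is configured on (last "/"-separated field).
vip_iface() {
    sed -n 's|^VIP exists\.: .*/\([^/]*\)$|\1|p'
}

# Intended use from the surviving node:
#   srvctl config nodeapps -n oelvm6 -a | vip_iface
# Demonstration with the two lines shown above:
echo 'VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth0' | vip_iface   # prints eth0
echo 'VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth2' | vip_iface   # prints eth2
```

The value printed here should match the public interface reported by oifcfg getif.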

Finally, start Clusterware on the second node. It will automatically relocate its VIP address back and start all the resources:

[root@oelvm6 ~]# /etc/init.d/init.crs start
Startup will be queued to init within 30 seconds.

The change is now reflected and the node applications are running fine:

[oracle@oelvm6 racg]$ crsstatus
HA Resource Target State
----------- ------ -----
ora.orcl.db ONLINE ONLINE on oelvm5
ora.orcl.orcl1.inst ONLINE ONLINE on oelvm5
ora.orcl.orcl2.inst ONLINE ONLINE on oelvm6
ora.oelvm5.ASM1.asm ONLINE ONLINE on oelvm5
ora.oelvm5.LISTENER_OELVM5.lsnr ONLINE ONLINE on oelvm5
ora.oelvm5.gsd ONLINE ONLINE on oelvm5
ora.oelvm5.ons ONLINE ONLINE on oelvm5
ora.oelvm5.vip ONLINE ONLINE on oelvm5
ora.oelvm6.ASM2.asm ONLINE ONLINE on oelvm6
ora.oelvm6.LISTENER_OELVM6.lsnr ONLINE ONLINE on oelvm6
ora.oelvm6.gsd ONLINE ONLINE on oelvm6
ora.oelvm6.ons ONLINE ONLINE on oelvm6
ora.oelvm6.vip ONLINE ONLINE on oelvm6
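
For a quick health check, the same table can be filtered mechanically. This is a sketch that assumes the three-column layout printed above (resource, target, state) and the 10g naming scheme ora.<node>.vip; both function names are made up.

```shell
# Print resources whose target is ONLINE but whose state is not.
not_online() {
    awk '$1 ~ /^ora\./ && $2 == "ONLINE" && $3 != "ONLINE" { print $1 }'
}

# Print VIP resources that are online on a node other than their own.
misplaced_vips() {
    awk '$1 ~ /^ora\..*\.vip$/ && $3 == "ONLINE" && $4 == "on" {
             split($1, part, ".")          # part[2] is the owning node
             if ($5 != part[2]) print $1, "on", $5
         }'
}

# Intended use (crsstatus is the script used in this post):
#   crsstatus | not_online
#   crsstatus | misplaced_vips
```

Run against the earlier output taken right after the reboot, the first filter would report the listener on oelvm6 and the second would report ora.oelvm6.vip running on oelvm5; against the final output both stay silent.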

Regards,
Sve

Categories: hp-ux, oracle

Oracle DB 10.2.0.3 LISTENER (VIP) goes down on HP-UX 11.23 without reason

January 5th, 2011 No comments

Happy New Year!

For a long time I had been receiving complaints that the listener on one of the nodes of a two-node RAC goes offline from time to time. Without any obvious reason the VIP of the second node fails, the listener is stopped and the VIP is relocated to the first node. Since the VIP is relocated, there are no problems as long as all clients are configured correctly. In this case, however, some of the clients were connecting explicitly to the second node and were unable to connect to the database. The database is 10.2.0.3 RAC installed on two nodes running HP-UX 11.23 with the December 2008 bundle patches.

The following can be observed in $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:
2010-10-25 06:11:12.492: [ CRSAPP][8336] CheckResource error for ora.db2.vip error code = 1
2010-10-25 06:11:12.522: [ CRSRES][8336] In stateChanged, ora.db2.vip target is ONLINE
2010-10-25 06:11:12.522: [ CRSRES][8336] ora.db2.vip on db2 went OFFLINE unexpectedly
2010-10-25 06:11:12.523: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:12.527: [ CRSRES][8336] Attempting to stop `ora.db2.vip` on member `db2`
2010-10-25 06:11:13.182: [ CRSRES][8336] Stop of `ora.db2.vip` on member `db2` succeeded.
2010-10-25 06:11:13.185: [ CRSRES][8336] ora.db2.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2010-10-25 06:11:13.188: [ CRSRES][8336] ora.db2.vip failed on db2 relocating.
2010-10-25 06:11:13.231: [ CRSRES][8336] StopResource: setting CLI values
2010-10-25 06:11:13.235: [ CRSRES][8336] Attempting to stop `ora.db2.LISTENER_DB2.lsnr` on member `db2`
2010-10-25 06:12:31.183: [ CRSRES][8336] Stop of `ora.db2.LISTENER_DB2.lsnr` on member `db2` succeeded.
2010-10-25 06:12:31.211: [ CRSRES][8336] Attempting to start `ora.db2.vip` on member `db1`
2010-10-25 06:12:38.327: [ CRSRES][8336] Start of `ora.db2.vip` on member `db1` succeeded.
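
To see how often this happens, the crsd.log can be scanned for the "went OFFLINE unexpectedly" message. A minimal sketch, assuming the line layout shown above; the function name is made up.

```shell
# Read crsd.log on stdin and print one line per unexpected failure:
# date, time, resource, node.
unexpected_offline() {
    awk '/went OFFLINE unexpectedly/ {
             for (i = 4; i <= NF; i++)
                 if ($i == "went") {
                     sub(/:$/, "", $2)     # drop the colon after the timestamp
                     print $1, $2, $(i - 3), $(i - 1)
                 }
         }'
}

# Intended use:
#   unexpected_offline < $CRS_HOME/log/$HOSTNAME/crsd/crsd.log
```

For the excerpt above this would print a single line: 2010-10-25 06:11:12.522 ora.db2.vip db2.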

The following can be seen in the alert log:
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='oradb2';

There are a couple of bugs logged about this, and there is also a MOS note regarding the problem:
HP-UX Itanium: RACGMAIN Received SIGSEGV On CheckResource Causing a Crash of a Resource [ID 763724.1]

The solution is to change the binding mode of the executables that use shared libraries from deferred binding to immediate binding, using the following script. It has to be applied to both the CRS and DB homes, and all Oracle processes should be stopped first:

cd $ORACLE_HOME/bin/
for i in crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain; do chatr -B immediate $i; done

cd $CRS_HOME/bin/
for i in crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain; do chatr -B immediate $i; done
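
The two loops can also be folded into one function. The sketch below is a variation on the script above rather than the documented fix: it skips binaries that do not exist in a given home and adds a DRY_RUN switch that only prints the chatr commands. Both behaviours, and the function name, are my assumptions.

```shell
# Binaries named in the original loops.
BINARIES="crs_relocate.bin crs_start.bin crs_stop.bin crsd.bin evmd.bin racgons.bin racgeut racgevtf racgmain"

# Apply immediate binding to every listed binary found in <home>/bin.
# With DRY_RUN=1 the chatr commands are printed instead of executed.
set_immediate_binding() {
    home=$1
    for name in $BINARIES; do
        file="$home/bin/$name"
        [ -f "$file" ] || continue        # not every binary exists in every home
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "chatr -B immediate $file"
        else
            chatr -B immediate "$file"
        fi
    done
}

# Intended use on HP-UX, with all Oracle processes stopped:
#   for home in "$ORACLE_HOME" "$CRS_HOME"; do set_immediate_binding "$home"; done
```

Running it once with DRY_RUN=1 shows exactly which files would be touched before anything is changed.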

In the three months since implementing this solution I haven't seen the problem again!

Regards,
Sve

Categories: hp-ux, oracle

Shared disk support for VirtualBox

August 9th, 2010 2 comments

I’m very happy to announce that VirtualBox now supports shared disks. Finally we can attach one disk to several virtual machines and run Oracle RAC and other clusters. As Oracle promised, this feature is released with the next maintenance patch (thanks!).

There is a new image write mode called shareable, and this option is now available for the createhd and modifyhd commands of VBoxManage. To create a new shared image use the command VBoxManage createhd with type shareable; creating a shared disk from the GUI is not possible. To mark an existing image as shared use the command VBoxManage modifyhd with type shareable.

Importantly, only fixed-size disks are supported. If the disk is dynamic, you will encounter the following error when you try to modify the image:
ERROR: Cannot change type for medium ‘/home/vm/ora11g_shared.vdi’ to ‘Shareable’ since it is a dynamic medium storage unit

There is one other minor issue: if the image is already attached to two virtual machines, the modifyhd command will also fail:
ERROR: Cannot change the type of medium ‘/home/vm/ora11g_shared.vdi’ because it is attached to 2 virtual machines
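
Put together, the two commands look roughly like this. It is a sketch under assumptions: the option spelling is the one I know from the 3.2.8 command line and may differ in other releases, the wrapper functions are made up, and the DRY_RUN switch is there so that the commands can be inspected before they touch anything.

```shell
# Echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Create a new fixed-size image in shareable mode. $1 = path, $2 = size in MB.
create_shared_disk() {
    run VBoxManage createhd --filename "$1" --size "$2" --variant Fixed --type shareable
}

# Switch an existing fixed-size image to shareable mode. $1 = path.
mark_shared() {
    run VBoxManage modifyhd "$1" --type shareable
}

# Example (dry run, so nothing is created):
DRY_RUN=1
create_shared_disk /home/vm/ora11g_shared.vdi 2048
mark_shared /home/vm/ora11g_shared.vdi
```

Given the two errors above, mark_shared has to be run on a fixed-size image, before the image is attached to the second virtual machine.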

And finally, YES it works, I have tested it already!

sve@host:~$ VBoxManage showhdinfo /home/vm/ora11g_shared.vdi
Oracle VM VirtualBox Command Line Management Interface Version 3.2.8
(C) 2005-2010 Oracle Corporation
All rights reserved.

UUID:                     7521f059-1196-4d68-a1a6-cf0082fb446a
Accessible:               yes
Description:          
Logical size:             2048 MBytes
Current size on disk:     2048 MBytes
Type:                     shareable
Storage format:           VDI
In use by VMs:            labs1 (UUID: 25475ff4-70bc-4e2e-aa38-d8fae289273e)
                          labs2 (UUID: e4441f4c-1ef9-42e0-8e54-d2aec2c6cf4f)
Location:                 /home/vm/ora11g_shared.vdi
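
The showhdinfo output can also be checked from a script. A minimal sketch, assuming the field layout printed above; the function name is hypothetical.

```shell
# Read `VBoxManage showhdinfo <image>` output on stdin and print the image
# type together with the number of virtual machines that use the image.
hd_summary() {
    awk '
        /^Type:/   { type = $2 }
        /\(UUID: / { vms++ }              # one "(UUID: ...)" per attached VM
        END        { printf "type=%s vms=%d\n", type, vms }
    '
}

# Intended use:
#   VBoxManage showhdinfo /home/vm/ora11g_shared.vdi | hd_summary
```

For the output above this would report type=shareable vms=2.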

Regards and happy migration 😉
Sve

Categories: oracle, virtualization

Many racgmain(check) processes at HP-UX 11iv3

August 17th, 2009 9 comments

I got a call that some of the commands for controlling the cluster and Oracle were not working. This was a two-node cluster with Oracle 10.2.0.4 RAC on HP-UX 11.31 Data Center OE (December 2008) that had already been running for a month.

Arriving at the customer site I noticed a lot (around 500) of hanging racgmain(check) processes, which were evidently blocking some of the cluster commands. Errors can also be seen in the log $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2009-04-08 15:22:01.700: [  CRSEVT][90801] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [  CRSAPP][90801] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [  CRSEVT][90811] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child
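
A quick way to watch the number of these processes is to count them in the process list. This is a sketch: the function name is made up, and how exactly the processes appear in `ps -ef` on a given system is an assumption, so the pattern only looks for the name racgmain.

```shell
# Count lines of a process listing (stdin) that mention racgmain.
# The bracket in the pattern keeps the grep process itself from matching
# when the listing comes straight from ps.
count_racgmain() {
    grep -c '[r]acgmain'
}

# Intended use:
#   ps -ef | count_racgmain
```

A count that keeps growing between check intervals would be an early sign of the hang, well before it reaches several hundred.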

There are a lot of bugs on Metalink, but no documents or suggestions on how to fix this.

Fortunately we found a solution:

1. Stop CRS on all nodes.

2. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin on all nodes.

3. Edit the file racgwrap and modify the last 3 lines from:

$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status

to:

exec $ORACLE_HOME/bin/racgmain "$@"

4. Restart CRS and make sure that all the resources start.

We were lucky to hit the bug just before the migration, when restarting the instances/servers was easy enough. I don't know whether this really solves the problem, but we never hit the bug again.
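
What the change in step 3 does at the process level can be shown with plain sh, without any Oracle software. The sketch below mimics the two variants of the wrapper's last lines: without exec the wrapper shell stays alive as a separate process and waits for its child, while with exec the child replaces the wrapper and keeps its PID. Whether that extra shell process is what made the check actions hang is my guess, not something confirmed here. The function names are made up.

```shell
# Original style: the wrapper starts the program as a child and waits for it.
run_without_exec() {
    sh -c 'echo "wrapper=$$"; sh -c "echo child=\$\$"; status=$?; exit $status'
}

# Modified style: the wrapper replaces itself with the program.
run_with_exec() {
    sh -c 'echo "wrapper=$$"; exec sh -c "echo child=\$\$"'
}

run_without_exec   # prints two different PIDs
run_with_exec      # prints the same PID twice
```

With exec there is one process per check instead of two, and whatever signals the caller sends on a timeout go directly to the program rather than to an intermediate shell.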

Categories: hp-ux, oracle