Changing network interfaces in Oracle 10g RAC
I was doing planned downtime on one of our 10.2.0.4 RAC systems and, just before starting the second node, I was told that its network interfaces had been aggregated during the downtime. These servers run HP-UX, where the default network interfaces are lan0 for the public network and lan1 for the interconnect. After the aggregation they became lan900 and lan901 respectively, so I asked the guys to revert the change, as I knew the Clusterware would suffer from it.
I decided to reproduce the scenario at the office, but on Linux (it was faster to deploy and test). Apart from the interface names, everything else is the same: eth0 is used for the public network and eth1 for the private one. For the purpose of the demonstration I'm going to change the public network interface on the second node from eth0 to eth2. This also requires modifying nodeapps, as the VIP runs on this interface.
I installed Oracle 10.2.0.4 RAC on two nodes, oelvm5 and oelvm6, with a database named orcl. This is how the cluster configuration looks before changing the interface:
[oracle@oelvm5 bin]$ ./oifcfg getif
eth0 192.168.143.0 global public
eth1 172.16.143.0 global cluster_interconnect
[oracle@oelvm5 bin]$ srvctl config nodeapps -n oelvm5 -a
VIP exists.: /oelvm5-vip/192.168.143.159/255.255.255.0/eth0
[oracle@oelvm5 bin]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth0
At this point I changed the interface from eth0 to eth2 on the second node and restarted it. After the change the listener is unable to start and the VIP is relocated to the first node. I'm using a very handy script, crsstatus, that prints the cluster resource status as a formatted table (a sketch of it follows the output below); here is its output after the node boot:
[oracle@oelvm5 bin]$ crsstatus
HA Resource Target State
ora.orcl.db ONLINE ONLINE on oelvm5
ora.orcl.orcl1.inst ONLINE ONLINE on oelvm5
ora.orcl.orcl2.inst ONLINE ONLINE on oelvm6
ora.oelvm5.ASM1.asm ONLINE ONLINE on oelvm5
ora.oelvm5.LISTENER_OELVM5.lsnr ONLINE ONLINE on oelvm5
ora.oelvm5.gsd ONLINE ONLINE on oelvm5
ora.oelvm5.ons ONLINE ONLINE on oelvm5
ora.oelvm5.vip ONLINE ONLINE on oelvm5
ora.oelvm6.ASM2.asm ONLINE ONLINE on oelvm6
ora.oelvm6.LISTENER_OELVM6.lsnr ONLINE OFFLINE
ora.oelvm6.gsd ONLINE ONLINE on oelvm6
ora.oelvm6.ons ONLINE ONLINE on oelvm6
ora.oelvm6.vip ONLINE ONLINE on oelvm5
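For completeness, crsstatus is nothing more than a small wrapper that reformats crs_stat output into the table above. A minimal sketch of it follows; my actual script is a bit longer, but the idea is the same:

#!/bin/bash
# crsstatus - print CRS resources as "HA Resource / Target / State"
printf "%-40s %-10s %-25s\n" "HA Resource" "Target" "State"
$ORA_CRS_HOME/bin/crs_stat | awk -F= '
  $1 == "NAME"   { name   = $2 }
  $1 == "TARGET" { target = $2 }
  $1 == "STATE"  { printf "%-40s %-10s %-25s\n", name, target, $2 }'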
The following can also be observed in $ORA_CRS_HOME/log/{HOST}/racg/ora.{HOST}.vip.log:
2011-05-28 16:20:39.157: [ RACG][3909306080] [4865][3909306080][ora.oelvm6.vip]: checkIf: interface eth0 is down
Invalid parameters, or failed to bring up VIP (host=node2)
So it is now obvious that the VIP could not be started on the second node, because interface eth0 is down. To change the public network interface, one has to use oifcfg to first delete the current interface and then add the correct one. Then, for the node whose interface is changed, Clusterware has to be stopped and nodeapps updated from the other node.
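For reference, this is the relevant oifcfg syntax (quoted from memory from oifcfg -help, so verify it on your own system):

oifcfg iflist                                        # interfaces and subnets visible to the cluster
oifcfg getif  [-node <nodename> | -global]
oifcfg setif  {-node <nodename> | -global} <if_name>/<subnet>:<if_type>
oifcfg delif  [{-node <nodename> | -global} [<if_name>[/<subnet>]]]

where <if_type> is public or cluster_interconnect.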
If you are running in production and not using services, consider running crs_relocate on the VIP resource. It will immediately relocate the VIP address to the other node, so no client will suffer a connection timeout. In my lab the VIP was easily relocated with just crs_relocate, but in the production environment ASM and the listener were dependent on the VIP and I had to stop them first. I'm not sure, but I think this was because there were two homes, one for ASM and one for the database.
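A minimal example of what I mean, using the resource names from my lab (adjust them to your environment):

# run as the Clusterware software owner
$ORA_CRS_HOME/bin/crs_relocate ora.oelvm6.vip

# if the listener (or ASM) depends on the VIP, stop it first, e.g.:
# $ORA_CRS_HOME/bin/crs_stop ora.oelvm6.LISTENER_OELVM6.lsnr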
Then change the public interface/subnet definition. While Clusterware is running, delete the old interface using oifcfg and then add it back with the correct interface name:
[oracle@oelvm6 ~]$ ./oifcfg delif -global eth0
[oracle@oelvm5 ~]$ oifcfg getif
eth1 172.16.143.0 global cluster_interconnect
[oracle@oelvm6 ~]$ oifcfg setif -global eth2/192.168.143.0:public
Now we have a correct configuration:
[oracle@oelvm5 ~]$ oifcfg getif
eth2 192.168.143.0 global public
eth1 172.16.143.0 global cluster_interconnect
Because this is the same interface the VIP runs on, nodeapps for this node has to be updated as well. To do that, stop Clusterware on the affected node and execute srvctl from the other node; the other node has to be up and running in order to make the change:
[root@oelvm6 ~]# /etc/init.d/init.crs stop
Shutting down Oracle Cluster Ready Services (CRS):
May 30 11:33:15.380 | INF | daemon shutting down
Stopping resources. This could take several minutes.
Successfully stopped CRS resources.
Stopping CSSD.
Shutting down CSS daemon.
Shutdown request successfully issued.
Shutdown has begun. The daemons should exit soon.
[oracle@oelvm5 ~]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth0
[root@oelvm5 ~]# srvctl modify nodeapps -n oelvm6 -A oelvm6-vip/255.255.255.0/eth2
[oracle@oelvm5 ~]$ srvctl config nodeapps -n oelvm6 -a
VIP exists.: /oelvm6-vip/192.168.143.160/255.255.255.0/eth2
Finally, start Clusterware on the second node. It will automatically relocate its VIP address back and start all the resources:
[root@oelvm6 ~]# /etc/init.d/init.crs start
Startup will be queued to init within 30 seconds.
It can be seen that the change is reflected and the node applications are now running fine:
[oracle@oelvm6 racg]$ crsstatus
HA Resource Target State
ora.orcl.db ONLINE ONLINE on oelvm5
ora.orcl.orcl1.inst ONLINE ONLINE on oelvm5
ora.orcl.orcl2.inst ONLINE ONLINE on oelvm6
ora.oelvm5.ASM1.asm ONLINE ONLINE on oelvm5
ora.oelvm5.LISTENER_OELVM5.lsnr ONLINE ONLINE on oelvm5
ora.oelvm5.gsd ONLINE ONLINE on oelvm5
ora.oelvm5.ons ONLINE ONLINE on oelvm5
ora.oelvm5.vip ONLINE ONLINE on oelvm5
ora.oelvm6.ASM2.asm ONLINE ONLINE on oelvm6
ora.oelvm6.LISTENER_OELVM6.lsnr ONLINE ONLINE on oelvm6
ora.oelvm6.gsd ONLINE ONLINE on oelvm6
ora.oelvm6.ons ONLINE ONLINE on oelvm6
ora.oelvm6.vip ONLINE ONLINE on oelvm6
Regards,
Sve