Unable to perform initial elastic configuration on Exadata X6
I had the pleasure of deploying another Exadata in the first week of 2017 and ran into my first issue of the year.
As we know, starting with Exadata X5 Oracle introduced the concept of Elastic Configuration. Apart from allowing you to mix and match the number of compute nodes and storage cells, they also changed how the IP addresses are assigned on the admin (eth0) interface. Prior to X5, Exadata had default IP addresses set at the factory in the range 192.168.1.1 to 192.168.1.203, but since this could collide with the customer's network they changed the way those IPs are assigned. In short, the IP address on eth0 on the compute nodes and storage cells is assigned within the 172.16.2.1 to 172.16.7.254 range. The first time a node boots it assigns its hostname and IP address based on the IB ports it's connected to.
Now to the real problem. I was doing the usual stuff (changing ILOMs, setting up the Cisco and IB switches) and was about to perform the initial elastic configuration (applyElasticConfig.sh), so I had to upload all the files I needed for the deployment to the first compute node. I changed my laptop address to an IP within the same range and was surprised when I got a connection timeout when I tried to ssh to the first compute node (172.16.2.44). I thought this was an unfortunate coincidence, since I had rebooted the IB switches at almost the same time I powered on the compute nodes, but I was wrong. For some reason, none of the servers got their eth0 IP addresses assigned, hence they were not accessible.
I was puzzled about what was causing this issue and spent the afternoon troubleshooting it. I thought Oracle had changed the way they assign the IP addresses, but the scripts haven't been changed for a long time. It didn't take long before I found out what was causing it. Three lines in the /sbin/ifup script were the reason the eth0 interface wasn't up with a 172.16.2.X IP address:
if ip link show ${DEVICE} | grep -q "UP"; then
    exit 0
fi
These lines check whether the interface is already UP and, if it is, exit before doing anything else. The catch is that the eth0 interface has already been brought UP by the elastic configuration script in order to check whether there is a link on the interface. So at the end of that script, when ifup is invoked to bring the interface up with its IP address, it exits immediately because the interface is already UP.
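To see why the check fires, here is a minimal sketch that applies the same grep test to a captured line of `ip link show` output rather than to a live interface. The function name would_ifup_exit_early is made up for illustration and is not part of initscripts. Note that the plain "UP" pattern matches the administrative UP flag (and also LOWER_UP or "state UP"), so an interface that was merely brought up for a link check, without any IP address, already satisfies it.

```shell
#!/bin/sh
# Hypothetical helper (not part of initscripts): returns 0 when the
# early-exit test quoted from /sbin/ifup would match the given
# `ip link show` output, and 1 otherwise.
would_ifup_exit_early() {
    # Same test as in the affected script: any occurrence of "UP" matches.
    if printf '%s\n' "$1" | grep -q "UP"; then
        return 0
    fi
    return 1
}

# On a live node one could feed it the real output, for example:
#   would_ifup_exit_early "$(ip link show eth0)" && echo "ifup would exit early"
```

An interface reported as `<NO-CARRIER,BROADCAST,MULTICAST,UP>` with `state DOWN` still matches, because the UP flag alone is enough for the grep.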
The solution is really simple: comment out the three lines (lines 73-75) in the /sbin/ifup script and reboot each node.
This wasn't the first X6 I had deployed and I never had this problem before, so I did some further investigation. The /sbin/ifup script is part of the initscripts package. It turns out that the check for the interface being UP was introduced in one minor version of the package and then removed in the latest package. Unfortunately, the last entry in the changelog is from Apr 12 2016, so that's not very helpful, but here's a summary:
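The workaround can be sketched as a small shell function. This is only an illustration, not an Oracle-supplied procedure: the name comment_out_up_check is hypothetical, and the file path and line range are passed as arguments because the line numbers may differ between initscripts builds. It refuses to touch the file unless the first line of the range really is the check shown above, and it keeps a copy of the original next to it with an .orig suffix.

```shell
#!/bin/sh
# Hypothetical helper: comment out the "already UP" check in an ifup script.
# Arguments: <file> <first line> <last line>
comment_out_up_check() {
    file=$1
    first=$2
    last=$3

    # Inspect the first line of the range before changing anything.
    line=$(sed -n "${first}p" "$file")
    case $line in
        \#*)
            echo "line $first of $file is already commented out" >&2
            return 1 ;;
        *'ip link show ${DEVICE}'*)
            ;;  # this is the check we expect, carry on
        *)
            echo "line $first of $file is not the expected check" >&2
            return 1 ;;
    esac

    # Keep a backup, then prefix every line in the range with '#'.
    cp -p "$file" "$file.orig" || return 1
    sed "${first},${last}s/^/#/" "$file.orig" > "$file"
}
```

Run as root on each node, the call would look like `comment_out_up_check /sbin/ifup 73 75`, followed by a reboot. It is worth checking the three lines by eye first (for example with `sed -n '73,75p' /sbin/ifup`).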
initscripts-9.03.53-1.0.1.el6.x86_64.rpm 11-May-2016 19:49 947.9 K <-- not affected
initscripts-9.03.53-1.0.1.el6_8.1.x86_64.rpm 12-Jul-2016 16:42 948.0 K <-- affected
initscripts-9.03.53-1.0.2.el6_8.1.x86_64.rpm 13-Jul-2016 08:26 948.1 K <-- affected
initscripts-9.03.53-1.0.3.el6_8.2.x86_64.rpm 23-Nov-2016 05:06 948.3 K <-- latest version, not affected
I have had this problem on three Exadata machines so far. So, if you are deploying a new Exadata in the next few days or weeks, it's very likely that you will be affected, unless your Exadata was factory deployed after 23 Nov 2016. That's the day the latest initscripts package was released.
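To check a node quickly, the installed package name can be compared against the list above. The sketch below assumes the name comes from `rpm -q initscripts`. The function initscripts_status is a made-up helper that only knows the four builds listed in this post and reports anything else as unknown.

```shell
#!/bin/sh
# Hypothetical helper: classify an initscripts package name using only the
# four builds listed in the post. Anything else is reported as "unknown".
initscripts_status() {
    case $1 in
        initscripts-9.03.53-1.0.1.el6.*|initscripts-9.03.53-1.0.3.el6_8.2*)
            echo "not affected" ;;
        initscripts-9.03.53-1.0.1.el6_8.1*|initscripts-9.03.53-1.0.2.el6_8.1*)
            echo "affected" ;;
        *)
            echo "unknown" ;;
    esac
}

# On a node, the installed version would be checked like this:
#   initscripts_status "$(rpm -q initscripts)"
```

For a build that is not in the list, looking at /sbin/ifup itself for the three lines is the more reliable test.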
Update 24.01.2017:
This problem has been fixed in 12.1.2.3.3.161208:
25143049 - ADD NEW INITSCRIPTS RPM TO EXADATA REPOSITORY TO FIX IFUP ISSUE
The latest package (initscripts-9.03.53-1.0.3.el6_8.2.x86_64.rpm) has been added to that patch.