Exadata patch pre-check fails due to wrong server model

I've been mostly busy with OCI-C and OCI for the past three years, but occasionally I still have to patch Exadata systems. That's what I've been doing every quarter for a Norwegian customer: patching their prod and non-prod Exadata systems over two consecutive weekends. The Exadata configuration was an X5-2 Quarter Rack extended with an X7-2 Quarter Rack, i.e. four compute nodes and six storage cells.

Oracle has greatly improved the automation and the patching process, and it generally goes smoothly. Gone are the days when patching the cells would take three hours. I usually stage the patches and re-run the prechecks before the patching window. In fact, I spend more time on preparation than on the actual execution: finding the required patches, checking the release notes, checking for conflicts and reviewing the known issues, since we were applying not the latest patch but the one from the previous quarter.

This time, however, I didn't run all the prechecks, as they usually come back clean; funny, given that they take only a few minutes to complete and confirm the system is OK. Of course, the one time I skipped the prechecks before patching, I ran into issues. Upon running the cell patching prechecks I got the following errors:

[root@exa01db01-c patch_18.1.30.0.0.200713.1]# ./patchmgr -cells cell_group -patch_check_prereq

2020-09-18 14:46:04 +0000        :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2020-09-18 14:46:05 +0000        :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2020-09-18 14:46:09 +0000        :Working: DO: Initialize files. Up to 1 minute ...
2020-09-18 14:46:10 +0000        :Working: DO: Setup work directory
2020-09-18 14:46:33 +0000        :SUCCESS: DONE: Setup work directory
2020-09-18 14:46:35 +0000        :SUCCESS: DONE: Initialize files.
2020-09-18 14:46:35 +0000        :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2020-09-18 14:46:48 +0000        :INFO   : Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.
2020-09-18 14:46:49 +0000        :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2020-09-18 14:46:49 +0000        :Working: DO: Check space and state of cell services. Up to 20 minutes ...
FAILED for following cells

exa01cel04:  exa01cel04 172.16.0.92 2020-09-18 14:47:21 +0000: Directory /boot either not mounted properly or does not have enough space
exa01cel05:  exa01cel05 172.16.0.93 2020-09-18 14:47:21 +0000: Directory /boot either not mounted properly or does not have enough space
exa01cel06:  exa01cel06 172.16.0.94 2020-09-18 14:47:22 +0000: Directory /boot either not mounted properly or does not have enough space
FAILED With Exception for following cells

exa01cel02:  exa01cel02 172.16.0.90 2020-09-18 14:47:24 +0000: Errors were encountered when verifying state of CELLBOOT USB.
exa01cel03:  exa01cel03 172.16.0.91 2020-09-18 14:47:25 +0000: Errors were encountered when verifying state of CELLBOOT USB.
2020-09-18 14:47:29 +0000        :FAILED : For details, check the following files in the /u01/stage/2020Q3/patch_18.1.30.0.0.200713.1:
2020-09-18 14:47:29 +0000        :FAILED :  - <cell_name>.log
2020-09-18 14:47:29 +0000        :FAILED :  - patchmgr.stdout
2020-09-18 14:47:29 +0000        :FAILED :  - patchmgr.stderr
2020-09-18 14:47:29 +0000        :FAILED :  - patchmgr.log
2020-09-18 14:47:29 +0000        :FAILED : DONE: Check space and state of cell services.

OK, I've seen storage cell prechecks fail before, but nothing like this. As you can guess, there was nothing wrong with the boot volumes of these storage cells.

I always check the image status and the server profile before and after patching to make sure the patching has completed successfully. However, running the check again I noticed something was wrong, and the issue spanned the compute nodes as well. Hand on heart, they were fine during the last patching; how they got corrupted since then is still unknown to me. I decided to check the hardware profile again across all servers, and here's what I got:

[root@exa01db01-c ~]# dcli -g all_group -l root /opt/oracle.SupportTools/CheckHWnFWProfile
exa01cel01: Current server model ORACLE_SERVER_X5-2L_ORACLE_SERVER_X5-2L is not found in supported server list
exa01cel02: Current server model ORACLE_SERVER_X5-2L_ORACLE_SERVER_X5-2L is not found in supported server list
exa01cel03: Current server model ORACLE_SERVER_X5-2L_ORACLE_SERVER_X5-2L is not found in supported server list
exa01cel04: Current server model ORACLE_SERVER_X7-2L_ORACLE_SERVER_X7-2L is not found in supported server list
exa01cel05: Current server model ORACLE_SERVER_X7-2L_ORACLE_SERVER_X7-2L is not found in supported server list
exa01cel06: Current server model ORACLE_SERVER_X7-2L_ORACLE_SERVER_X7-2L is not found in supported server list
exa01db01-c: Current server model ORACLE_SERVER_X5-2_ORACLE_SERVER_X5-2 is not found in supported server list
exa01db02-c: [SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X5-2
exa01db03-c: [SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X7-2
exa01db04-c: [SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X7-2

The servers were obviously being reported as unsupported, but I couldn't understand where this information was coming from. Running another script, which normally executes during system boot, suggested this was non-production hardware:

[root@exa01db01-c ~]# /opt/oracle.cellos/validations/init.d/checkdeveachboot
2020-09-18 13:58:15 +0000
2020-09-18 13:58:15 +0000 /opt/oracle.cellos/validations/init.d/checkdeveachboot started at 2020-09-18 13:58:15 +0000
2020-09-18 13:58:15 +0000 #^#^# [WARNING] [COMMON] 5 This is non production hardware
2020-09-18 13:58:15 +0000 34:0:0:0

MOS didn't suggest anything except some old bugs. Following the scripts and functions, I got to the exadata.img.hw.cache file:

[root@exa01db01-c ~]# cat /etc/exadata/config/hardware/exadata.img.hw.cache
System Model=ORACLE SERVER X5-2
System Manufacturer=Oracle Corporation
System Serial Number=1509NM10H3
Chassis Serial Number=1509NM10H3
BIOS Vendor=American Megatrends Inc.
BIOS Version=30300200
BIOS Release Date=07/10/2019
System Model=ORACLE SERVER X5-2
ORACLE SERVER X5-2
System Manufacturer=Oracle Corporation
Oracle Corporation
System Architecture=x86_64
Cellboot USB Serial=

You can't miss the redundant lines below the System Model entry. The file is generated during system boot and is used by CheckHWnFWProfile to confirm the firmware matches the hardware, by the exadata.img.hw command to determine the model of the cell or database server, and by other scripts. How the cache got corrupted is still a puzzle. Fortunately the fix was really simple: removing the redundant lines from exadata.img.hw.cache resolved the issue and I was able to carry on with the patching.
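
For reference, here is a minimal sketch of the cleanup, with a backup taken first (the backup file name is my own choice, not an Oracle-documented step):

# back up the cache before touching it
cp -p /etc/exadata/config/hardware/exadata.img.hw.cache \
      /etc/exadata/config/hardware/exadata.img.hw.cache.bak
# remove the duplicated System Model/System Manufacturer entries and the bare
# value lines, leaving a single entry for each field
vi /etc/exadata/config/hardware/exadata.img.hw.cache

Re-running CheckHWnFWProfile afterwards confirmed the profile was fine again: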

[root@exa01db01-c ~]# /opt/oracle.SupportTools/CheckHWnFWProfile
[SUCCESS] The hardware and firmware matches supported profile for server=ORACLE_SERVER_X5-2

It's an easy fix. I'm posting this so that anyone who runs into this issue can quickly resolve the problem instead of stressing or spending too much time on troubleshooting. Fortunately this was pre-prod, so I had enough time to troubleshoot the issue.

(Technical) I decided to dig further and understand the sequence:

  • the script cellFirstboot.sh is invoked during system boot
  • it calls the function cellFirstboot_checkAndUpdateExadataImgHwCache()
  • the function then calls exadata.img.hw --update
  • exadata.img.hw, in turn, calls dmidecode and ipmitool to gather information such as the system model, manufacturer, serial numbers and BIOS details (see the sketch below)
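
To make the sequence concrete, below is a hedged set of generic queries for the fields that end up in exadata.img.hw.cache. These are standard dmidecode/ipmitool invocations; the exact calls inside the Oracle script are an assumption on my part.

# Generic SMBIOS/FRU queries for the fields cached in exadata.img.hw.cache.
# The exact invocations used by exadata.img.hw may differ.
dmidecode -s system-product-name      # System Model
dmidecode -s system-manufacturer      # System Manufacturer
dmidecode -s system-serial-number     # System Serial Number
dmidecode -s chassis-serial-number    # Chassis Serial Number
dmidecode -s bios-vendor              # BIOS Vendor
dmidecode -s bios-version             # BIOS Version
dmidecode -s bios-release-date        # BIOS Release Date
ipmitool fru print 0                  # model/serial as seen from the service processor side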

So it looks like dmidecode or ipmitool is to blame. Unfortunately, I no longer have access to this system. I suspect it could run into the same issue again upon the next patching. The system might need a power off, as one of Oracle's notes suggests:

The dmidecode command queries the SMBIOS tables, dumping the various data contained therein, such as hardware components, serial numbers and BIOS version. A missing value for the system product name is an indication that the tables do not contain the correct information or that there's an issue in retrieving the data. This can happen if the BIOS flashing has not completed or initialized successfully. It typically follows a recent replacement of the system board.
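
If you want to inspect the SMBIOS data directly on an affected node, a hedged way to do it with standard dmidecode calls (not taken from the Oracle note) is:

# If SMBIOS is healthy this returns the proper model string, e.g. ORACLE SERVER X5-2;
# an empty or garbled value points to the BIOS/system board issue described above.
dmidecode -s system-product-name
dmidecode -t system     # full System Information section, including serial numbers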

Hope this helps someone.