onecommand fails to change storage cell name
It's been a busy month: five Exadata deployments in the past three weeks and a new personal best, 2x Exadata X6-2 Eighth Racks with CoD and storage upgrade deployed in only 6 hours!
An issue I encountered with the first deployment was that onecommand wouldn't change the storage cell names. The default cell names (not hostnames!) are based on where the cells are mounted within the rack, and they are assigned by the elastic configuration script. The first cell name is ru02 (rack unit 02), the second cell is ru04, the third is ru06 and so on.
If you are familiar with cell disks and grid disks, you know that their names are based on the cell name. In other words, my cell, grid and ASM disks all ended up with the wrong names. Exachk would report the following failures for every grid disk:
Grid Disk name DATA01_CD_00_ru02 does not have cell name (exa01cel01) suffix
Naming convention not used. Cannot proceed further with
automating checks and repair for bug 12433293
Apart from exachk complaining, I wouldn't feel comfortable with names like these on my Exadata.
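To see which disks would trip that check before running exachk, something along these lines could be used. It is only a minimal sketch of the naming rule quoted above (the disk name must end with the cell name); the function names has_cell_suffix and list_misnamed_disks are my own, and this is not how exachk itself implements the check.

```shell
# Hypothetical helper: succeeds only if the disk name ends with "_<cell name>".
has_cell_suffix() {
    hcs_disk=$1
    hcs_cell=$2
    case "$hcs_disk" in
        *_"$hcs_cell") return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: reads disk listings on stdin (name in the first column)
# and prints every name that lacks the expected cell name suffix.
list_misnamed_disks() {
    lmd_cell=$1
    while read -r lmd_name lmd_rest; do
        [ -n "$lmd_name" ] || continue
        has_cell_suffix "$lmd_name" "$lmd_cell" || printf '%s\n' "$lmd_name"
    done
}

# Intended usage on a storage cell (not executed here):
#   cellcli -e list griddisk | list_misnamed_disks exa01cel01
```

An empty result would mean every listed disk already carries the cell name as its suffix.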
Fortunately cell, grid and ASM disk names can be changed and here is how to do it:
Stop the cluster and CRS on each compute node:
/u01/app/12.1.0.2/grid/bin/crsctl stop cluster -all
/u01/app/12.1.0.2/grid/bin/crsctl stop crs
Log in to each storage server and rename the cell, the cell disks and the grid disks. Use the following to build the alter commands:
You don't need to shut down the cell services, but the grid disks must not be in use, i.e. make sure to stop the cluster first!
cellcli -e alter cell name=exa01cel01
for i in `cellcli -e list celldisk | awk '{print $1}'`; do echo "cellcli -e alter celldisk $i name=$i"; done | sed -e "s/ru02/exa01cel01/2"
for i in `cellcli -e list griddisk | awk '{print $1}'`; do echo "cellcli -e alter griddisk $i name=$i"; done | sed -e "s/ru02/exa01cel01/2"
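Since the same two loops have to be repeated on every cell with a different old and new name, they can be folded into one small function. The following is a sketch only, doing the same job as the loops above; the function name gen_rename_cmds is made up for illustration, and like the loops it only prints the commands so that you can review them before running anything.

```shell
# Hypothetical helper: reads a cellcli listing on stdin (name in the first
# column) and prints one rename command for every name that ends with the
# old cell name. Nothing is executed, the commands are only printed.
gen_rename_cmds() {
    grc_kind=$1   # celldisk or griddisk
    grc_old=$2    # current cell name, e.g. ru02 (used as a regex by awk)
    grc_new=$3    # desired cell name, e.g. exa01cel01
    awk -v kind="$grc_kind" -v old="$grc_old" -v repl="$grc_new" '
        $1 ~ (old "$") {
            renamed = $1
            sub(old "$", repl, renamed)   # replace only the trailing cell name
            printf "cellcli -e alter %s %s name=%s\n", kind, $1, renamed
        }'
}

# Intended usage on a storage cell (not executed here):
#   cellcli -e list celldisk | gen_rename_cmds celldisk ru02 exa01cel01
#   cellcli -e list griddisk | gen_rename_cmds griddisk ru02 exa01cel01
```

Anchoring the match at the end of the name means disks that do not carry the old cell name are simply skipped.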
If you get the following error restart the cell services and try again:
GridDisk DATA01_CD_00_ru02 alter failed for reason: CELL-02548: Grid disk is in use.
Start the cluster on each compute node:
/u01/app/12.1.0.2/grid/bin/crsctl start crs
We've got all cell and grid disks fixed, now we need to rename the ASM disks. To rename an ASM disk you need to mount the diskgroup in restricted mode, i.e. mounted on one node only with no one using it. If the diskgroup is not in restricted mode you'll get:
ORA-31020: The operation is not allowed, Reason: disk group is NOT mounted in RESTRICTED state.
Stop the second compute node, the default dbm01 database and the MGMTDB database:
srvctl stop database -d dbm01
srvctl stop mgmtdb
Mount diskgroups in restricted mode:
If you are running 12.1.2.3.0+ with a high redundancy DATA diskgroup, it is VERY likely that the voting disks are in the DATA diskgroup. Because of that, you won't be able to dismount the diskgroup. The only way around that I found was to force stop ASM and start it manually in restricted mode:
srvctl stop asm -n exa01db01 -f
sqlplus / as sysasm
startup mount restricted
alter diskgroup all dismount;
alter diskgroup data01 mount restricted;
alter diskgroup reco01 mount restricted;
alter diskgroup dbfs_dg mount restricted;
Rename the ASM disks. Use the following to build the alter commands:
select 'alter diskgroup ' || g.name || ' rename disk ''' || d.name || ''' to ''' || REPLACE(d.name,'RU02','exa01cel01')  || ''';' from v$asm_disk d, v$asm_diskgroup g where d.group_number=g.group_number and d.name like '%RU02%';
select 'alter diskgroup ' || g.name || ' rename disk ''' || d.name || ''' to ''' || REPLACE(d.name,'RU04','exa01cel02') || ''';' from v$asm_disk d, v$asm_diskgroup g where d.group_number=g.group_number and d.name like '%RU04%';
select 'alter diskgroup ' || g.name || ' rename disk ''' || d.name || ''' to ''' || REPLACE(d.name,'RU06','exa01cel03') || ''';' from v$asm_disk d, v$asm_diskgroup g where d.group_number=g.group_number and d.name like '%RU06%';
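The three queries differ only in the old and new cell name, so they can also be generated from a list of name pairs, which avoids copy and paste slips. This is a sketch under the assumption that the cells map in rack order (ru02 is the first cell, ru04 the second, ru06 the third); the function name gen_asm_rename_query is hypothetical, and it only prints the query text to paste into sqlplus.

```shell
# Hypothetical helper: prints the generator query for one cell.
# $1 = old cell name as it appears in the ASM disk names (upper case)
# $2 = new cell name
gen_asm_rename_query() {
    printf "select 'alter diskgroup ' || g.name || ' rename disk ''' || d.name || ''' to ''' || REPLACE(d.name,'%s','%s') || ''';' from v\$asm_disk d, v\$asm_diskgroup g where d.group_number=g.group_number and d.name like '%%%s%%';\n" "$1" "$2" "$1"
}

# Assumed mapping, following the rack order described earlier in the post.
for pair in RU02:exa01cel01 RU04:exa01cel02 RU06:exa01cel03; do
    gen_asm_rename_query "${pair%%:*}" "${pair#*:}"
done
```

Each printed query, when run as sysasm, returns the alter diskgroup ... rename disk statements for one cell.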
Finally stop and start CRS on both nodes.
Just when I thought everything was OK, I discovered one more reference to those pesky names: the fail group names, which again are based on the storage cell name. The following makes it clearer:
select group_number,failgroup,mode_status,count(*) from v$asm_disk where group_number > 0 group by group_number,failgroup,mode_status;
GROUP_NUMBER FAILGROUP                      MODE_ST   COUNT(*)
------------ ------------------------------ ------- ----------
           1 RU02                           ONLINE          12
           1 RU04                           ONLINE          12
           1 RU06                           ONLINE          12
           1 EXA01DB01                      ONLINE           1
           1 EXA01DB02                      ONLINE           1
           2 RU02                           ONLINE          10
           2 RU04                           ONLINE          10
           2 RU06                           ONLINE          10
           3 RU02                           ONLINE          12
           3 RU04                           ONLINE          12
           3 RU06                           ONLINE          12
For each diskgroup we've got three fail groups (three storage cells). The other two fail groups EXA01DB01 and EXA01DB02 are the quorum disks.
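If you want to spot the leftover rack unit names without reading the whole listing, a small filter over the saved query output does it. This is a rough sketch that assumes the default names always look like RU followed by two digits, as they do in the output above; the function name list_ru_failgroups is made up.

```shell
# Hypothetical helper: reads the query output on stdin and prints each
# distinct fail group name (second column) that still looks like a rack
# unit name, i.e. "RU" followed by two digits.
list_ru_failgroups() {
    awk '$2 ~ /^RU[0-9][0-9]$/ && !seen[$2]++ { print $2 }'
}

# Intended usage (not executed here), with the query output spooled to a
# file of your choice:
#   list_ru_failgroups < failgroups.txt
```

The quorum fail groups (EXA01DB01 and EXA01DB02) do not match the pattern and are left out.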
Unfortunately, you cannot rename failgroups in ASM. My immediate thought was to drop each failgroup and add it back, hoping that would resolve the problem. However, since this was a quarter rack I couldn't do it. Here's an excerpt from the documentation:
If a disk group is configured as high redundancy, then you can do this procedure on a Half Rack or greater. You will not be able to do this procedure on a Quarter Rack or smaller with high redundancy disk groups because ASM will not allow you to drop a failure group such that only one copy of the data remains (you'll get an ORA-15067 error).
The last option was to recreate the diskgroups. I've done this many times before, when the compatible.rdbms parameter was set too high and I had to install an earlier 11.2 version. However, since Oracle decided to move the voting disks to DATA, this has become a bit harder. I couldn't drop DBFS_DG because that's where the MGMTDB was created, and I couldn't drop DATA01 either because of the voting disks and some parameter files. I could have renamed the RECO01 diskgroup but decided to keep it "consistently wrong" across all three diskgroups.
Fortunately, this behaviour might change with the January 2017 release of OEDA. The following bug fix suggests that DBFS_DG will always be configured as high redundancy and host the voting disks:
24329542: oeda should make dbfs_dg as high redundancy and locate ocr/vote into dbfs_dg
There is also a feature request to support failgroup rename but it's not very popular, to be honest. Until we get this feature, exachk will report the following failure:
failgroup name (RU02) for grid disk DATA01_CD_00_exa01cel01 is not cell name
Naming convention not used. Cannot proceed further with
automating checks and repair for bug 12433293
I've deployed five Exadata X6-2 machines so far and had this issue on all of them.
This issue seems to be caused by a bug in OEDA. The storage cell names should have been changed as part of step "Create Cell Disks" of onecommand. I keep the logs from some older deployments, where it's very clear that each cell was renamed as part of this step:
Initializing cells...
EXEC## |cellcli -e alter cell name = exa01cel01|exa01cel01.local.net|root|
I couldn't find that command in the logs of the deployments I did. Obviously, the solution for now is to manually rename the cell before you run step "Create Cell Disks" of onecommand.
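A quick way to tell whether a given deployment was affected is to search the onecommand log for that alter cell command. The sketch below assumes the log line looks like the excerpt above; the function name log_has_cell_rename is my own, and the log file location is left for you to fill in.

```shell
# Hypothetical helper: succeeds if the given log (file arguments, or stdin
# when no file is given) contains a "cellcli -e alter cell name = <name>"
# line like the one onecommand used to log.
log_has_cell_rename() {
    grep -Eq 'cellcli -e alter cell name *= *[^| ]+' "$@"
}

# Intended usage (not executed here), replacing the placeholder with the
# actual log file of the "Create Cell Disks" step:
#   log_has_cell_rename <onecommand log file> || echo "cell was never renamed"
```

If the function fails on the log of the "Create Cell Disks" step, the cells most likely still have their default ruNN names.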
Update 04.02.2017:
This problem has been logged by someone else a month earlier under the following bug:
Bug 25317550 : OEDA FAILS TO SET CELL NAME RESULTING IN GRID DISK NAMES NOT HAVING RIGHT SUFFIX