Grid Infrastructure 12c installation fails because of 255 in the subnet ID

August 25th, 2016

I was doing another GI 12.1.0.2 cluster installation last month when I got a really weird error.

While root.sh was running on the first node I got the following error:

2016/07/01 15:02:10 CLSRSC-343: Successfully started Oracle Clusterware stack
2016/07/01 15:02:23 CLSRSC-180: An error occurred while executing the command '/ocw/grid/bin/oifcfg setif -global eth0/10.118.144.0:public eth1/10.118.255.0:cluster_interconnect' (error code 1)
2016/07/01 15:02:24 CLSRSC-287: FirstNode configuration failed
Died at /ocw/grid/crs/install/crsinstall.pm line 2398.

I was surprised to find the following error in the rootcrs log file:

2016-07-01 15:02:22: Executing cmd: /ocw/grid/bin/oifcfg setif -global eth0/10.118.144.0:public eth1/10.118.255.0:cluster_interconnect
2016-07-01 15:02:23: Command output:
> PRIF-15: invalid format for subnet
>End Command output

A quick MOS search suggested that my installation failed because I had 255 in the subnet ID:
root.sh fails with CLSRSC-287 due to: PRIF-15: invalid format for subnet (Doc ID 1933472.1)

Indeed, we had 255 in the private network subnet (10.118.255.0). Fortunately this was our private network, which was easy to change, but you will hit the same issue if your public network has 255 in the subnet ID.
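
If you want to catch this before root.sh runs, you can list the interfaces and their subnet IDs up front with oifcfg from the Grid home. A quick sketch based on the networks above (host prompt and output are illustrative):

[root@node1 ~]# /ocw/grid/bin/oifcfg iflist
eth0  10.118.144.0
eth1  10.118.255.0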


How to resolve missing dependency on exadata-sun-computenode-minimum

August 18th, 2016

I've been really busy the last few months – apart from spending a lot of time on the M25 I've been doing a lot of Exadata installations and consolidations. I haven't posted for some time now, but the good news is that I've got many drafts and presentation ideas.

This is a quick post about an issue I had recently. I had to integrate AD authentication over Kerberos on the compute nodes (blog post to follow) but had to do a compute node upgrade first. This was an Exadata X5-2 QR running 12.1.2.1.1 which had to be upgraded to 12.1.2.3.1, and I was surprised when dbnodeupdate failed with a 'Minimum' dependency check failure. You'll also notice the following in the logs:

exa01db01a: Exadata capabilities missing (capabilities required but not supplied by any package)
exa01db01a NOTE: Unexpected configuration - Contact Oracle Support

Starting with 11.2.3.3.0 the exadata-*computenode-exact and exadata-*computenode-minimum rpms were introduced. An update to 11.2.3.3.0 or later assumes by default that the 'exact' rpm will be used by yum to update to, hence before running the upgrade dbnodeupdate will check whether there are missing packages/dependencies.
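
If you want to see which of these meta-rpms your node carries, a simple rpm query will do – on this node the minimum rpm for the 12.1.2.1.1 image was present:

[root@exa01db01a ~]# rpm -qa | grep computenode
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64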

The best way to check what is missing is to run yum check:

[root@exa01db01a ~]# yum check
Loaded plugins: downloadonly
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of elfutils-libelf-devel >= ('0', '0.158', '3.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of elfutils-libelf-devel(x86-64) >= ('0', '0.158', '3.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of glibc-devel(x86-32) >= ('0', '2.12', '1.149.el6_6.5')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libsepol(x86-32) >= ('0', '2.0.41', '4.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libselinux(x86-32) >= ('0', '2.0.94', '5.8.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of elfutils-libelf(x86-32) >= ('0', '0.158', '3.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libcom_err(x86-32) >= ('0', '1.42.8', '1.0.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of e2fsprogs-libs(x86-32) >= ('0', '1.42.8', '1.0.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libaio(x86-32) >= ('0', '0.3.107', '10.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libaio-devel(x86-32) >= ('0', '0.3.107', '10.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libstdc++-devel(x86-32) >= ('0', '4.4.7', '11.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of compat-libstdc++-33(x86-32) >= ('0', '3.2.3', '69.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of zlib(x86-32) >= ('0', '1.2.3', '29.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libxml2(x86-32) >= ('0', '2.7.6', '17.0.1.el6_6.1')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of elfutils >= ('0', '0.158', '3.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of elfutils(x86-64) >= ('0', '0.158', '3.2.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of ntsysv >= ('0', '1.3.49.3', '2.el6_4.1')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of ntsysv(x86-64) >= ('0', '1.3.49.3', '2.el6_4.1')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of glibc(x86-32) >= ('0', '2.12', '1.149.el6_6.5')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of nss-softokn-freebl(x86-32) >= ('0', '3.14.3', '18.el6_6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libgcc(x86-32) >= ('0', '4.4.7', '11.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of libstdc++(x86-32) >= ('0', '4.4.7', '11.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of compat-libstdc++-296 >= ('0', '2.96', '144.el6')
exadata-sun-computenode-minimum-12.1.2.1.1.150316.2-1.x86_64 has missing requires of compat-libstdc++-296(x86-32) >= ('0', '2.96', '144.el6')
Error: check all

Somehow all x86-32 packages and three x86-64 packages had been removed. The x86-32 packages would be removed as part of the upgrade anyway – they were not present after the upgrade. I didn't spend too much time trying to understand why or how the packages were removed. I was told additional packages had been installed before and then removed; perhaps one of them had a few dependencies and everything got messed up when it was removed.

Anyway, to solve this you need to download the patch for the same version (12.1.2.1.1). The p20746761_121211_Linux-x86-64.zip patch is still available from MOS note 888828.1. After that you unzip it, mount the ISO, test-install all the packages to make sure nothing is missing and there are no conflicts, and then finally install the packages:
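
For reference, a minimal sketch of preparing the packages – the ISO file name and mount point here are illustrative:

[root@exa01db01a ~]# unzip p20746761_121211_Linux-x86-64.zip
[root@exa01db01a ~]# mount -o loop 121211.iso /mnt/iso    # iso name illustrative
[root@exa01db01a ~]# cd /mnt/iso/x86_64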

[root@exa01db01a x86_64]# rpm -ivh --test zlib-1.2.3-29.el6.i686.rpm glibc-2.12-1.149.el6_6.5.i686.rpm nss-softokn-freebl-3.14.3-18.el6_6.i686.rpm libaio-devel-0.3.107-10.el6.i686.rpm libaio-0.3.107-10.el6.i686.rpm e2fsprogs-libs-1.42.8-1.0.2.el6.i686.rpm libgcc-4.4.7-11.el6.i686.rpm libcom_err-1.42.8-1.0.2.el6.i686.rpm elfutils-libelf-0.158-3.2.el6.i686.rpm libselinux-2.0.94-5.8.el6.i686.rpm libsepol-2.0.41-4.el6.i686.rpm glibc-devel-2.12-1.149.el6_6.5.i686.rpm elfutils-libelf-devel-0.158-3.2.el6.x86_64.rpm libstdc++-devel-4.4.7-11.el6.i686.rpm libstdc++-4.4.7-11.el6.i686.rpm compat-libstdc++-296-2.96-144.el6.i686.rpm compat-libstdc++-33-3.2.3-69.el6.i686.rpm libxml2-2.7.6-17.0.1.el6_6.1.i686.rpm elfutils-0.158-3.2.el6.x86_64.rpm ntsysv-1.3.49.3-2.el6_4.1.x86_64.rpm
Preparing...                ########################################### [100%]

[root@exa01db01a x86_64]# rpm -ivh zlib-1.2.3-29.el6.i686.rpm glibc-2.12-1.149.el6_6.5.i686.rpm nss-softokn-freebl-3.14.3-18.el6_6.i686.rpm libaio-devel-0.3.107-10.el6.i686.rpm libaio-0.3.107-10.el6.i686.rpm e2fsprogs-libs-1.42.8-1.0.2.el6.i686.rpm libgcc-4.4.7-11.el6.i686.rpm libcom_err-1.42.8-1.0.2.el6.i686.rpm elfutils-libelf-0.158-3.2.el6.i686.rpm libselinux-2.0.94-5.8.el6.i686.rpm libsepol-2.0.41-4.el6.i686.rpm glibc-devel-2.12-1.149.el6_6.5.i686.rpm elfutils-libelf-devel-0.158-3.2.el6.x86_64.rpm libstdc++-devel-4.4.7-11.el6.i686.rpm libstdc++-4.4.7-11.el6.i686.rpm compat-libstdc++-296-2.96-144.el6.i686.rpm compat-libstdc++-33-3.2.3-69.el6.i686.rpm libxml2-2.7.6-17.0.1.el6_6.1.i686.rpm elfutils-0.158-3.2.el6.x86_64.rpm ntsysv-1.3.49.3-2.el6_4.1.x86_64.rpm
Preparing...              ########################################### [100%]
1:libgcc                  ########################################### [  5%]
2:elfutils-libelf-devel   ########################################### [ 10%]
3:nss-softokn-freebl      ########################################### [ 15%]
4:glibc                   ########################################### [ 20%]
5:glibc-devel             ########################################### [ 25%]
6:elfutils                ########################################### [ 30%]
7:zlib                    ########################################### [ 35%]
8:libaio                  ########################################### [ 40%]
9:libcom_err              ########################################### [ 45%]
10:libsepol               ########################################### [ 50%]
11:libstdc++              ########################################### [ 55%]
12:libstdc++-devel        ########################################### [ 60%]
13:libaio-devel           ########################################### [ 65%]
14:libselinux             ########################################### [ 70%]
15:e2fsprogs-libs         ########################################### [ 75%]
16:libxml2                ########################################### [ 80%]
17:elfutils-libelf        ########################################### [ 85%]
18:compat-libstdc++-296   ########################################### [ 90%]
19:compat-libstdc++-33    ########################################### [ 95%]
20:ntsysv                 ########################################### [100%]

[root@exa01db01a x86_64]# yum check
Loaded plugins: downloadonly
check all

After that the dbnodeupdate check completed successfully and I upgraded the node to 12.1.2.3.1 in no time.

With Exadata you are allowed to install packages on the compute nodes as long as they don't break any dependencies, but you cannot install anything on the storage cells. Here's Oracle's official statement:
Is it acceptable / supported to install additional or 3rd party software on Exadata machines and how to check for conflicts? (Doc ID 1541428.1)

Update 23.08.2016:
You might also get errors for two more packages in case you have upgraded from OEL5 to OEL6 and are now trying to patch the compute node:

fuse-2.8.3-4.0.2.el6.x86_64 has missing requires of kernel >= ('0', '2.6.14', None)
2:irqbalance-1.0.7-5.0.1.el6.x86_64 has missing requires of kernel >= ('0', '2.6.32', '358.2.1')

Refer to the following note for more information and how to fix it:


Dead Connection Detection in Oracle Database 12c

April 7th, 2016

In my earlier post I discussed what Dead Connection Detection is and why you should use it – read more here: Oracle TNS-12535 and Dead Connection Detection.

The pre-12c implementation of DCD used TNS packets to “ping” the client and relied on the underlying TCP stack, which sometimes may take longer. In 12c this has changed and DCD probes are now implemented by the TCP stack itself: they use the TCP KEEPALIVE socket option to check whether the connection is still usable.

To use the new implementation, set SQLNET.EXPIRE_TIME in sqlnet.ora to the number of minutes between probes. If the operating system supports TCP keep-alive tuning then Oracle Net automatically uses the new method. The new mechanism is supported on all platforms except Solaris.

The following parameters are associated with the TCP keep-alive probes:
TCP_KEEPIDLE  – how long a connection may stay idle before the first probe is sent; takes its value from SQLNET.EXPIRE_TIME.
TCP_KEEPCNT   – the number of keep-alive probes to send before declaring the connection dead; always set to 10.
TCP_KEEPINTVL – the delay between probes when a keep-alive packet gets no acknowledgment; always set to 6.
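
Putting it together, a minimal sqlnet.ora enabling the TCP-based probes looks like this (10 minutes is just an example value):

SQLNET.EXPIRE_TIME=10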

If you need to revert to the pre-12c DCD mechanism (10-byte TNS probe packets), add the following parameter to sqlnet.ora:
USE_NS_PROBES_FOR_DCD=true



Oracle Exadata X6 released

April 5th, 2016

Oracle has just announced the next generation of Exadata Database Machine – X6-2 and X6-8.

Here are the changes for Exadata X6-2:
1) X6-2 Database Server: As always the hardware has been updated and the 2-socket database servers are now equipped with the latest twenty-two-core Intel Xeon E5-2699 v4 “Broadwell” processors, compared to the eighteen-core Intel Xeon E5-2699 v3 processors in X5. The memory is still DDR4; the default configuration comes with 256 GB and can be expanded to 768 GB. The local storage can now be upgraded from the default of 4 drives to 8, to allow more local storage in case of a consolidation with Oracle OVM.
2) X6-2 Storage Server HC: The storage server gets the new-generation CPUs as well – the ten-core Intel Xeon E5-2630 v4 processor (it was the eight-core Intel Xeon E5-2630 v3 in X5). The flash cards are upgraded too, to the 3.2 TB Sun Accelerator Flash F320 NVMe PCIe card, for a total of 12.8 TB of flash cache (2x the capacity of X5, where we had 1.6 TB F160 cards).
2.1) X6-2 Storage Server EF: Similarly to the High Capacity storage server, this one gets the CPU and flash cards upgraded. The NVMe PCIe flash drives go from 1.6 TB to 3.2 TB, which gives you a total raw capacity of 25.6 TB per server.

This time Oracle released Exadata X6-8 together with the X6-2. The changes aren't many – the X6-8 compute node looks exactly the same as the X5-8 in terms of specs, so I guess Exadata X6-8 actually consists of X5-8 compute nodes with X6-2 storage servers. Oracle's vision for these big monsters is that they are specifically optimized for Database as a Service (DBaaS) and database in-memory. Indeed, with 12 TB of memory we can host hundreds of databases or load a whole database into memory.

By the looks of it, Exadata X6-2 and Exadata X6-8 will require the latest Exadata 12.1.2.3.0 software. This software has been around for some time now and brings some new features:
1) Performance Improvements for Software Upgrades – I can confirm that: in a recent upgrade to 12.1.2.3.0 the cell upgrade took a bit more than an hour.
2) VLAN tagging support in OEDA – not a fundamentally new or exciting feature, since VLAN tagging was available before, but now it can be done through OEDA and hence be part of the deployment.
3) Quorum disks on database servers to enable high redundancy on quarter and eighth racks – you can now use the database servers to deploy quorum disks and enable placement of the voting disks on high redundancy disk groups on smaller (quarter and eighth) racks. Here is more information – Managing Quorum Disks Using the Quorum Disk Manager Utility.
4) Storage Index preservation during rebalance – the feature enables Storage Indexes to be moved along with the data when a disk hits a predictive failure or a true failure.
5) ASM Disk Size Checked When Reducing Grid Disk Size – this is a check on the storage server to make sure you cannot shrink a grid disk before decreasing the size of the corresponding ASM disk.

Capacity-On-Demand Licensing:
1) For Exadata X6-2 a minimum of 14 cores must be enabled per server.
2) For Exadata X6-8 a minimum of 56 cores must be enabled per server.

Here’s something interesting:
OPTIONAL CUSTOMER SUPPLIED ETHERNET SWITCH INSTALLATION IN EXADATA DATABASE MACHINE X6-2
Each Exadata Database Machine X6-2 rack has 2U available at the top of the rack that can be used by customers to optionally install their own client network Ethernet switches in the Exadata rack instead of in a separate rack. Some space, power, and cooling restrictions apply.


Oracle TNS-12535 and Dead Connection Detection

March 31st, 2016

These days everything goes to the cloud or is colocated somewhere in a shared infrastructure. In this post I'll talk about sessions being disconnected from your databases, firewalls and Dead Connection Detection.

Changes

We moved a number of 11g databases from one data centre to another.

Symptoms

Probably many of you have seen the following error in your database alert log – “TNS-12535: TNS:operation timed out” – and if you haven't, you definitely will some day.

Consider the following error from database alert log:

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 12-MAR-2015 10:28:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Thu Mar 12 10:28:09 2015

This error indicates timing issues between the server and the client. It's important to mention that these errors are RESULTANT – they are informational and not the actual cause of the disconnect. Although the error might happen for a number of reasons, it is commonly associated with firewalls or slow networks.

Troubleshooting

The best way to understand what's happening is to build a histogram of the duration of the sessions. In particular, we want to understand whether the disconnects are sporadic and random or whether they follow a specific pattern.

To do so you need to parse the listener log and locate the following line from the above example:

(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))

Since the port is random you might not find the same record again, or if you do, it might be days apart.
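
Since the listener log can be huge, a quick grep for the client host and port (log file location varies with version and configuration) pulls out the matching establish record:

[oracle@dbsrv ~]$ grep "HOST=192.168.0.10)(PORT=49831" listener.log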

Here's what I found in the listener log:

12-MAR-2015 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0

In other words, at 08:16 the user scott established a connection from host 192.168.0.10.

Now if you compare both records you’ll get the duration of the session:

Established: 12-MAR-2015 08:16:52
Disconnected: Thu Mar 12 10:28:09 2015

Here are a couple of other examples:
alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620))
Thu Mar 12 10:31:20 2015 

listener.log:

12-MAR-2015 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157))
Thu Mar 12 10:37:51 2015 

listener.log:

12-MAR-2015 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618))
Tue Mar 10 19:09:09 2015 

listener.log

10-MAR-2015 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0 

As you may have noticed, the errors follow a very strict pattern – each session gets disconnected exactly 2 hrs 11 mins after it was established.

Cause

Given the repetitive behaviour of the issue, and that it happened for multiple databases and application servers, we can conclude it's definitely a firewall issue.

The firewall recognizes the TCP protocol and keeps a record of established connections, and it also recognizes TCP connection closure packets (TCP FIN packets). However, sometimes a client may abruptly end communication without properly closing the endpoints by sending a FIN packet, in which case the firewall does not know that the endpoints will no longer use the open channel. To resolve this problem the firewall imposes a BLACKOUT on connections that stay idle for a predefined amount of time.

The only issue with a BLACKOUT is that neither of the sides is notified.

In our case the firewall disconnects IDLE sessions after around 2 hrs of inactivity.

Solution

The solution on the database server side is to use the Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session so PMON can release the resources associated with it.

DCD sets a timer when a session is initiated, and when the timer expires SQL*Net on the server sends a small 10-byte probe packet to the client to make sure the connection is still active. If the client has terminated unexpectedly the server gets an error, the connection is closed and the associated resources are released. If the connection is still active then the probe packet is discarded and the timer is reset.

To enable DCD you need to set SQLNET.EXPIRE_TIME in the sqlnet.ora of your RDBMS home:

cat >> $ORACLE_HOME/network/admin/sqlnet.ora
SQLNET.EXPIRE_TIME=10 

This sets the timer to 10 minutes. Remember that sessions need to reconnect for the change to take effect – it won't work for existing connections.

Firewalls are becoming smarter and they can now inspect packets even deeper. Make sure the following features are also disabled:
– SQLNet fixup protocol
– Deep Packet Inspection (DPI)
– SQLNet packet inspection
– SQL Fixup

I've had a similar issue with Data Guard already – read more here: Smart Firewalls.

How to test Dead Connection Detection

You might want to test or make sure that DCD really works. You've got multiple options here – SQL*Net client trace, SQL*Net server trace, sniffing the network with a packet analyzer, or using strace on the server process. I used strace since I had access to the database server and it's non-intrusive.

1. Establish a connection to the database through SQL*Net

2. Find the process number (SPID) for your session:

SQL>  select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE');

SPID
------------------------
62761 

3. Trace the process

[oracle@dbsrv ~]$ strace -tt -f -p 62761
Process 62761 attached - interrupt to quit
11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:36:58.158485 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
....
11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:46:58.240211 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
...
11:46:58.331063 write(20, "\0\n\0\0\6\20\0\0\0\0", 10) = 10
... 

What I did was attach to the process, simulate some activity at 11:36 and then leave the session IDLE. Ten minutes later the server process sent the 10-byte probe packet to the client to check whether the connection was still alive.

Conclusion

The errors in the alert log disappeared after I enabled DCD.

Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.

References
How to Check if Dead Connection Detection (DCD) is Enabled in 9i ,10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)


Exadata onecommand fails at cell disk creation

February 3rd, 2016

I was installing another Exadata last month when I got an error at the create cell disks step. I'd seen the same error before, when extending a two-rack Exadata configuration to three racks, but thought it was a one-off.

The cell disk creation failed as below:

[root@exa01db01 linux-x64]# ./install.sh -cf Customer-exa01.xml -s 8

 Initializing
 Executing Create Cell Disks
 Checking physical disks for errors before creating celldisks.........................
 Restarting cell services....................................................
 ERROR:

 Stopping the RS, CELLSRV, and MS services...
 The SHUTDOWN of services was successful.
 Starting the RS, CELLSRV, and MS services...
 Getting the state of RS services...  running
 Starting CELLSRV services...
 The STARTUP of CELLSRV services was not successful.
 CELL-01533: Unable to validate the IP addresses from the cellinit.ora file because the IP addresses may be down or misconfigured.
 Starting MS services...
 The STARTUP of MS services was successful.
 ERROR:

Going through the cell configuration it's obvious why the process failed. The cell still had its default name, and the IP addresses that the cell services should use were also still the defaults:

CellCLI> list cell detail
         name:                   ru02
         ipaddress1:             192.168.10.1/24
         ipaddress2:             192.168.10.2/24
         cellsrvStatus:          stopped
         msStatus:               running
         rsStatus:               running

In short, when you see an error like the one below, your ipaddress1 and/or ipaddress2 fields are most probably wrong:

         2       2015-12-15T17:57:03+00:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [IP addresses in cellinit.ora not operational], [], [], [], [], [], [], [], [], []"

The solution is simple – you need to alter the cell name and IP addresses manually:

CellCLI> alter cell name=exa01cel02a,ipaddress1='192.168.10.13/22',ipaddress2='192.168.10.14/22'
Network configuration altered. Please issue the following commands as root to restart the network and open IB stack:
service openibd restart
service network restart
A restart of all services is required to put new network configuration into effect. MS-CELLSRV communication may be hampered until restart.
Cell exa01cel02a successfully altered

CellCLI> alter cell restart services all
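
A quick way to confirm that every cell now has the correct name and IP addresses is to run cellcli across all cells with dcli (the group file name is illustrative):

[root@exa01db01 ~]# dcli -g cell_group -l root "cellcli -e list cell attributes name,ipaddress1,ipaddress2"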

Make sure all cells are fixed, then re-run the onecommand step; this time it will succeed:

 Successfully completed execution of step Create Cell Disks [elapsed Time [Elapsed = 128338 mS [2.0 minutes] Thu Dec 17 14:26:59 GMT 2015]]

I've checked some older deployments and it's the same step that should change the cell name and restart the cell services. For some reason this didn't happen for me. For both deployments I used OEDA v15.300 (Oct 2015), so this might be a bug in that version.


Issues with Oracle Direct NFS

January 28th, 2016

This is a quick post to highlight two issues I had with Oracle dNFS. Both relate to wrong entries in the oranfstab file.

One might encounter an ORA-00600 during database creation:

DBCA_PROGRESS : 7%
DBCA_PROGRESS : 8%
ORA-01501: CREATE DATABASE failed
ORA-00600: internal error code, arguments: [KSSRMP1], [], [], [], [], [], [], [], [], [], [], []
ORA-17502: ksfdcre:3 Failed to create file /oracle/ORCL/database/ORCL/controlfile/o1_mf_%u_.ctl

This was caused by a wrong entry in oranfstab – there was a difference between fstab and oranfstab for the same record:

server: zfs01
path: 192.168.10.100
export: /export/OTHERDB/database mount: /oracle/ORCL/database

The second issue was that the database wasn't using dNFS at all. A simple query of v$dnfs_servers returned no rows and there were no errors in the alert log. However, looking around the trace files one can easily spot the following repetitive error in all of them:

KGNFS_NFSPROC3_MNT FAIL 13

This was caused by trying to mount a share we didn't have access to, or a share that didn't exist on the NFS server:

server: zfs01
path: 192.168.10.100
export: /export/NON_EXIST/database mount: /oracle/ORCL/database

The issue was fixed after correcting the typos in the oranfstab file and restarting the database.
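
After the restart it's worth confirming that dNFS is actually in use – the view that was empty before should now return a row per NFS server. A minimal check:

SQL> select svrname, dirname from v$dnfs_servers;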

The bottom line is: make sure that fstab and oranfstab match and contain correct entries.


Come and hear me speak at UKOUG Tech 15

December 4th, 2015

It's that time of the year again when one of the biggest Oracle user group conferences – and the last one for the year – is being held: the UK Oracle User Group Conference.

I'm very grateful and proud that I'll be speaking at this great conference; here are my talks:

Presentation Title: Oracle Exadata Meets Elastic Configurations
Description: With the release of Exadata X5, Oracle announced elastic configurations to allow a mixed number of database and cell nodes. This session will go through the implementation process of an X5 with two database nodes and four cells.
Date: Monday 7th December
Time: 14:10 – 15:00
Hall: Media Suite B

Presentation Title: Oracle DataGuard Fast-Start Failover: Live Demo
Description: Come and see a live demo of Oracle Fast-Start Failover and learn why a private bank moved from RAC to FSFO.
Date: Tuesday 8th December
Time: 16:30 – 17:20
Hall: Hall 11A

See you there!


How to rename ASM disk groups in Exadata

November 25th, 2015

Deployment of Exadata requires you to generate a configuration using the Oracle Exadata Deployment Assistant (OEDA). Within it the default ASM disk group names are DBFS_DG, RECOC1 and DATAC1. I usually change those to RECO01 and DATA01, as the defaults don't make sense to me and the only place I ever see them is on Exadata.

I had an incident last year where the Exadata deployment failed halfway through and the names were left at the defaults, so I had to delete the configuration and start from scratch.

To my big surprise, I recently got a request where a customer wanted to change RECO01 and DATA01 to RECOC1 and DATAC1! This was a pre-prod system, already deployed and with a few databases running. The Exadata was an X5-2 running ESS 12.1.2.1.2 and GI 12.1.0.2.

If this ever happens to you, here is what you need to do:

  1. Rename grid disks.
  2. Rename ASM disk groups and ASM disk names.
  3. Modify all databases to point to the new disk groups.

Rename grid disks

Since grid disk names contain the disk group name, they need to be changed too. Although this is not mandatory, I strongly recommend it to avoid any confusion in the future.

The grid disks can be renamed very easily using cellcli, but they must NOT be in use by GI at the time. Thus Grid Infrastructure has to be stopped on all servers; stop it as root:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop cluster -all

Then run the following magic command to get the list of all grid disks and generate rename commands with the disk group names replaced by the new ones:

[root@exa01db01 ~]# for i in `dcli -g cell_group -l root cellcli -e list griddisk | awk -F":" '{print $2}' | awk '{print $1}'`; do echo "cellcli -e alter griddisk $i name=$i"; done | grep -v DBFS | sed -e "s/RECO01/RECOC1/2" -e "s/DATA01/DATAC1/2"

You'll get a long list of cellcli commands – 12 for each cell – which you need to run locally on the corresponding cell.
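
Each generated command is a plain rename of one grid disk, for example (the disk name is illustrative):

cellcli -e alter griddisk DATA01_CD_00_exa01cel01 name=DATAC1_CD_00_exa01cel01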

Once it's done, start GI again and make sure all disk groups are mounted successfully:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster


Rename ASM disk groups and ASM disk names

Next is to rename the disk groups. To do so they must be dismounted on ALL cluster nodes before running renamedg on a disk group. Connect to each ASM instance and dismount the disk groups:

SQL> alter diskgroup datac1 dismount;

Diskgroup altered.

SQL> alter diskgroup recoc1 dismount;

Diskgroup altered.

At this point you can run renamedg to rename the disk groups; here is an example renaming DATA01 to DATAC1:

[oracle@exa01db01 ~]$ renamedg -dgname DATA01 -newdgname DATAC1

Parsing parameters..
renamedg operation: -dgname DATA01 -newdgname DATAC1
Executing phase 1
Discovering the group
Checking for hearbeat...
Re-discovering the group
Generating configuration file..
Completed phase 1
Executing phase 2
Completed phase 2

Do the same for RECO01, and after that make sure that both disk groups can be mounted on all database nodes successfully, then dismount them again so you can rename the ASM disks. In general there is a command to rename all the disks (ALTER DISKGROUP XX RENAME DISKS ALL), but it renames the disks to names of the form diskgroupname_####, where #### is the disk number. However, ASM disks have different names on Exadata (e.g. RECO01_CD_01_EXA01CEL01) and that's why we need to rename them manually.

To rename the disks, the disk group has to be mounted in restricted mode (so only one node in the cluster can mount it). Then run the two statements below to generate the new ASM disk names:

SQL> alter diskgroup datac1 mount restricted;

Diskgroup altered.

SQL> select 'alter diskgroup datac1 rename disk ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$asm_disk where name like 'DATA%';

SQL> select 'alter diskgroup recoc1 rename disk ''' || name || ''' to ''' || REPLACE(name,'RECO01','RECOC1') || ''';' from v$asm_disk where name like 'RECO%';
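
The generated statements are plain renames, one per disk, for example (the disk name is illustrative):

alter diskgroup datac1 rename disk 'DATA01_CD_00_EXA01CEL01' to 'DATAC1_CD_00_EXA01CEL01';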

Execute the alter statements generated by the two queries above and mount both disk groups on all database nodes again.

There is no command to add the disk groups back to Grid Infrastructure – they will be added automatically the first time they are mounted. However, you need to remove the old disk group resources:

[oracle@exa01db01 ~]$ srvctl remove diskgroup -g DATA01
[oracle@exa01db01 ~]$ srvctl remove diskgroup -g RECO01


Modify all databases to point to the new disk groups

The last step is to change the datafiles/tempfiles/redo log files of all databases to point to the new disk groups. Make sure you disable block change tracking and flashback, as the database might not open since the location of the block change tracking file has changed:

SQL> alter database disable block change tracking;
SQL> alter database flashback off;

Next, create a pfile from the spfile and substitute all occurrences of RECO01 and DATA01, or modify the spfile just before you shut the database down. Assuming you have created a pfile, make sure all the parameters refer to the new disk group names; here are the default ones you need to check:

*.control_files
*.db_create_file_dest
*.db_create_online_log_dest_1
*.db_create_online_log_dest_2
*.db_recovery_file_dest

Start the database in mount state and generate all the alter statements for the datafiles, tempfiles and redo logs:

[oracle@exa01db01 ~]$ sqlplus -s / as sysdba
set heading off
set echo off
set pagesize 140
set linesize 140
spool /tmp/rename.sql

select 'alter database rename file ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$datafile;
select 'alter database rename file ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$tempfile;
select 'alter database rename file ''' || member || ''' to ''' || REPLACE(member,'DATA01','DATAC1')|| ''';' from v$logfile where member like '%DATA%';
select 'alter database rename file ''' || member || ''' to ''' || REPLACE(member,'RECO01','RECOC1')|| ''';' from v$logfile where member like '%RECO%';
exit
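
For reference, the spooled statements will look something like this (the file name is illustrative):

alter database rename file '+DATA01/DBM01/DATAFILE/users.264.897654321' to '+DATAC1/DBM01/DATAFILE/users.264.897654321';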

Start another SQL*Plus session and run the file spooled by the operation above (rename.sql). At this point you can open the database (alter database open;). Once the database is open, make sure you re-enable block change tracking and flashback:

SQL> alter database enable block change tracking;
SQL> alter database flashback on;

Finally, change the database dependencies and the spfile location:

For 12c databases:

[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -nodiskgroup
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -diskgroup "DATAC1,RECOC1"
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -spfile +DATAC1/DBM01/spfiledbm01.ora

For 11g databases:

[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -z
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -x "DATAC1,RECOC1"
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -p +DATAC1/DBM01/spfiledbm01.ora

How to move OEM12c management agent to new location

October 29th, 2015

While working on another Exadata project recently, I found that the OEM12c agents on the compute nodes were installed in different locations on each of the three Exadatas. On one they were under /home/oracle/agent, on another under /opt/oracle/agent, and on the third under /oracle/agent. Obviously this was not the standard, and the agents had to be moved under /u01/app/oracle/agent. The only problem was that the three Exadatas were already discovered, along with some database targets. Fortunately this wasn't production yet, but fixing it would normally require all the agents to be reinstalled and all targets rediscovered.

Fortunately there is an easier way to move the OEM management agents to a new location without all the hassle of reinstalling the agents and rediscovering the targets. In the following example the agent was installed in /home/oracle/agent/ and I had to move it to /u01/app/oracle/agent/.

First you need to download the ConvertToStandalone.pl utility from MOS note 2021782.1 and upload it to the server under /home/oracle.

You need to create a list of plugins, otherwise the move process will fail:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/agent/core/12.1.0.5.0/sysman/install/create_plugin_list.pl -instancehome /home/oracle/agent/core/12.1.0.5.0

This will create a file /home/oracle/agent/plugins.txt which is used by the perl script later.

Export the following variables:

export OLD_AGENT_HOME=/home/oracle/agent/core/12.1.0.5.0
export ORACLE_HOME=/u01/app/oracle/agent/core/12.1.0.5.0

Another thing you need to do is modify SBIN_MODIFIED_VERSION from 12.1.0.4.0 to 12.1.0.5.0 in /home/oracle/agent/agentimage.properties, otherwise the process will fail.

Then run the perl script, which will migrate the agent home to the new location:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/ConvertToStandalone.pl -instanceHome /home/oracle/agent/agent_inst -newAgentBaseDir /u01/app/oracle/agent

Note that the script accepts two arguments: instanceHome is the agent instance home directory (e.g. /home/oracle/agent/agent_inst) and newAgentBaseDir is the new agent base directory (/u01/app/oracle/agent).

After the command completes you need to run root.sh as root:

[root@exa01db01 ~]# /u01/app/oracle/agent/core/12.1.0.5.0/root.sh
Finished product-specific root actions.
/etc exists

Deinstall the old agent:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/agent/core/12.1.0.5.0/sysman/install/AgentDeinstall.pl -agentHome /home/oracle/agent/core/12.1.0.5.0

Finally, remove the old agent directory, where a log file from the deinstall process is left behind:

[oracle@exa01db01 ~]$ rm -rf /home/oracle/agent

The beauty of this process is that the script creates a node-level blackout, AGT_CNT_BLK_OUT, and then stops the agent. It then migrates the agent to the new home, starts the agent and finally removes the blackout. The whole process takes less than five minutes.
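
Once it completes, a quick status check from the new location confirms the agent is up and uploading to the OMS:

[oracle@exa01db01 ~]$ /u01/app/oracle/agent/agent_inst/bin/emctl status agent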
