Archive

Archive for the ‘oracle’ Category

Oracle TNS-12535 and Dead Connection Detection

March 31st, 2016 2 comments

These days everything either goes to the cloud or is co-located somewhere in a shared infrastructure. In this post I’ll talk about sessions being disconnected from your databases, firewalls and Dead Connection Detection.

Changes

We moved a number of 11g databases from one data centre to another.

Symptoms

Many of you have probably seen the following error in your database alert log: “TNS-12535: TNS:operation timed out”. If you haven’t, you will definitely see it some day.

Consider the following error from the database alert log:

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 12-MAR-2015 10:28:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Thu Mar 12 10:28:09 2015

This error indicates timing issues between the server and the client. It’s important to mention that these errors are a RESULT: they are informational and not the actual cause of the disconnect. Although this error might happen for a number of reasons, it is commonly associated with firewalls or slow networks.

Troubleshooting

The best way to understand what’s happening is to build a histogram of the duration of the sessions. In particular we want to understand whether the disconnects are sporadic and random or whether they follow a specific pattern.

To do so you need to parse the listener log and locate the following line from the above example:

(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))

Since the client port is random you might not find a matching record, or if you do it might be from a connection made days apart.

Here’s what I found in the listener:

12-MAR-2015 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0

In other words, at 8:16 the user scott established a connection from host 192.168.0.10.

Now if you compare both records you’ll get the duration of the session:

Established: 12-MAR-2015 08:16:52
Disconnected: Thu Mar 12 10:28:09 2015

Here are a couple of other examples:
alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620))
Thu Mar 12 10:31:20 2015 

listener.log:

12-MAR-2015 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157))
Thu Mar 12 10:37:51 2015 

listener.log:

12-MAR-2015 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618))
Tue Mar 10 19:09:09 2015 

listener.log:

10-MAR-2015 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0 

As you may have noticed, the errors follow a very strict pattern: each session gets disconnected almost exactly 2 hours 11 minutes after it was established.
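To check the pattern without doing the subtraction by hand, here is a minimal Python sketch that computes the duration of each of the four sessions above. The timestamps are copied from the listener.log and alert log excerpts; the function and variable names are just for illustration, and the month and day parsing assumes an English locale.

```python
from datetime import datetime

# (listener.log "establish" time, alert log error time) pairs from the examples above.
SESSIONS = [
    ("12-MAR-2015 08:16:52", "Thu Mar 12 10:28:09 2015"),
    ("12-MAR-2015 08:20:04", "Thu Mar 12 10:31:20 2015"),
    ("12-MAR-2015 08:26:36", "Thu Mar 12 10:37:51 2015"),
    ("10-MAR-2015 16:57:54", "Tue Mar 10 19:09:09 2015"),
]

LISTENER_FMT = "%d-%b-%Y %H:%M:%S"   # e.g. 12-MAR-2015 08:16:52
ALERT_FMT = "%a %b %d %H:%M:%S %Y"   # e.g. Thu Mar 12 10:28:09 2015


def session_duration(established, disconnected):
    """Return the time between the listener record and the alert log record."""
    start = datetime.strptime(established, LISTENER_FMT)
    end = datetime.strptime(disconnected, ALERT_FMT)
    return end - start


durations = [session_duration(e, d) for e, d in SESSIONS]
for (established, _), duration in zip(SESSIONS, durations):
    print(established, "->", duration)
# prints 2:11:17, 2:11:16, 2:11:15 and 2:11:15 for the four sessions
```

With more sessions, the same durations could be bucketed into the histogram mentioned earlier.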

Cause

Given the repetitive behaviour of the issue, and that it happened for multiple databases and application servers, we can conclude that it’s a firewall issue.

The firewall understands the TCP protocol and keeps a record of established connections. It also recognizes TCP connection closure packets (TCP FIN packets). However, sometimes the client may end communication abruptly without closing the endpoints properly by sending a FIN packet, in which case the firewall will not know that the endpoints will no longer use the open channel. To resolve this problem the firewall imposes a BLACKOUT on connections that stay idle for a predefined amount of time.

The only issue with a BLACKOUT is that neither side is notified.

In our case the firewall disconnects IDLE sessions after around 2 hours of inactivity.
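As a side note, the 2 hours 11 minutes seen above is consistent with the Linux default TCP keepalive settings, and "nt secondary err code: 110" matches ETIMEDOUT on Linux. The sketch below only does the arithmetic. It assumes that keepalive is enabled on the server sockets and that the kernel defaults are in effect; neither is confirmed for the servers in this post.

```python
from datetime import timedelta

# Linux kernel defaults (assumed here, not read from the servers in this post).
TCP_KEEPALIVE_TIME = 7200    # seconds of idle time before the first keepalive probe
TCP_KEEPALIVE_INTVL = 75     # seconds between unanswered probes
TCP_KEEPALIVE_PROBES = 9     # unanswered probes before the connection is declared dead

# An idle connection whose probes are silently dropped is reported dead after:
seconds_until_timeout = TCP_KEEPALIVE_TIME + TCP_KEEPALIVE_INTVL * TCP_KEEPALIVE_PROBES
print(seconds_until_timeout, timedelta(seconds=seconds_until_timeout))
# prints: 7875 2:11:15
```

If that is what happened here, the firewall dropped the idle connections at some point within the first two hours, and the server only found out once its own keepalive probes went unanswered.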

Solution

The solution on the database server is to use the Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session so PMON can release the resources associated with it.

DCD sets a timer when a session is initiated, and when the timer expires SQL*Net on the server sends a small 10-byte probe packet to the client to make sure the connection is still active. If the client has terminated unexpectedly, the server gets an error, the connection is closed and the associated resources are released. If the connection is still active, the probe packet is discarded and the timer is reset. As a side effect, the probe traffic stops the connection from looking idle to the firewall, provided the timer is shorter than the firewall’s idle timeout.

To enable DCD you need to set SQLNET.EXPIRE_TIME in the sqlnet.ora of your RDBMS home!

echo "SQLNET.EXPIRE_TIME=10" >> $ORACLE_HOME/network/admin/sqlnet.ora

This sets the timer to 10 minutes. Remember that sessions need to reconnect for the change to take effect; it won’t apply to existing connections.
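If you look after many homes, it can help to verify the setting from a script. The following is a minimal Python sketch that reads the value back from the text of a sqlnet.ora file. The function name and the sample file content are made up for illustration, and the parser only handles simple one-line NAME=VALUE entries.

```python
import re


def expire_time_minutes(sqlnet_ora_text):
    """Return SQLNET.EXPIRE_TIME in minutes, or None if it is not set."""
    for line in sqlnet_ora_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments
        match = re.match(r"\s*SQLNET\.EXPIRE_TIME\s*=\s*(\d+)", line, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return None


# Hypothetical sqlnet.ora content.
sample = """NAMES.DIRECTORY_PATH=(TNSNAMES, EZCONNECT)
SQLNET.EXPIRE_TIME=10
"""
print(expire_time_minutes(sample))
# prints: 10
```

To check a real home, pass in the content of $ORACLE_HOME/network/admin/sqlnet.ora instead of the sample string.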

Firewalls have become smarter and can now inspect packets even deeper. Make sure the following settings are also disabled:
– SQLNet fixup protocol
– Deep Packet Inspection (DPI)
– SQLNet packet inspection
– SQL Fixup

I had a similar issue with Data Guard before, read more here – Smart Firewalls

How to test Dead Connection Detection

You might want to test that DCD really works. You’ve got multiple options here: an Oracle SQL*Net client trace, an Oracle SQL*Net server trace, sniffing the network with a packet analyzer, or using strace to trace the server process. I used strace since I had access to the database server and it was non-intrusive.

1. Establish a connection to the database through SQL*Net

2. Find the process ID for your session:

SQL>  select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE');

SPID
------------------------
62761 

3. Trace the process

[oracle@dbsrv ~]$ strace -tt -f -p 62761
Process 62761 attached - interrupt to quit
11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:36:58.158485 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
....
11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:46:58.240211 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
...
11:46:58.331063 write(20, "\0\n\0\0\6\20\0\0\0\0", 10) = 10
... 

What I did was to attach to the process, simulate some activity at 11:36 and then leave the session IDLE. Then 10 minutes later the server process sent an empty packet to the client to check if the connection is still alive.
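To see why the packet counts as empty, here is a minimal Python sketch that unpacks the 10 bytes from the write() call above. The field layout (length, packet checksum, type, flags, header checksum) is an assumption based on how the TNS header is commonly described, for example by network protocol analyzers. It is not taken from Oracle documentation.

```python
import struct

# The 10 bytes written by the server process in the strace output above.
# strace prints octal escapes, which Python bytes literals understand as well.
probe = b"\0\n\0\0\6\20\0\0\0\0"

# Assumed TNS header layout: 2-byte length, 2-byte packet checksum,
# 1-byte packet type, 1-byte flags, 2-byte header checksum (big-endian).
length, packet_checksum, packet_type, flags, header_checksum = struct.unpack(">HHBBH", probe[:8])
payload = probe[8:]

print(len(probe), length, packet_type, payload.hex())
# prints: 10 10 6 0000
```

The length field matches the 10 bytes written, type 6 is usually described as a TNS data packet, and the two bytes after the header are zero, so there is nothing for the client to process.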

Conclusion

The errors in the alert log disappeared after I enabled DCD.

Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.

References
How to Check if Dead Connection Detection (DCD) is Enabled in 9i, 10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)

Categories: oracle Tags:

Exadata onecommand fails at cell disk creation

February 3rd, 2016 No comments

I was installing another Exadata last month when I got an error at the Create Cell Disks step. I had seen the same error before, when I was extending a two-rack Exadata configuration to three racks, but thought it was a one-off.

The cell disk creation failed as below:

[root@exa01db01 linux-x64]# ./install.sh -cf Customer-exa01.xml -s 8

 Initializing
 Executing Create Cell Disks
 Checking physical disks for errors before creating celldisks.........................
 Restarting cell services....................................................
 ERROR:

 Stopping the RS, CELLSRV, and MS services...
 The SHUTDOWN of services was successful.
 Starting the RS, CELLSRV, and MS services...
 Getting the state of RS services...  running
 Starting CELLSRV services...
 The STARTUP of CELLSRV services was not successful.
 CELL-01533: Unable to validate the IP addresses from the cellinit.ora file because the IP addresses may be down or misconfigured.
 Starting MS services...
 The STARTUP of MS services was successful.
 ERROR:

Going through the cell configuration, it is obvious why the process failed. The cell still had the default name, and the IP addresses that the cell services should use were still the default ones:

CellCLI> list cell detail
         name:                   ru02
         ipaddress1:             192.168.10.1/24
         ipaddress2:             192.168.10.2/24
         cellsrvStatus:          stopped
         msStatus:               running
         rsStatus:               running

In short when you see an error like the one below then your ipaddress1 and/or ipaddress2 fields are most probably wrong:

         2       2015-12-15T17:57:03+00:00       critical        "ORA-00700: soft internal error, arguments: [main_6a], [3], [IP addresses in cellinit.ora not operational], [], [], [], [], [], [], [], [], []"

The solution to that is simple. You need to alter the cell name and IP addresses manually:

CellCLI> alter cell name=exa01cel02a,ipaddress1='192.168.10.13/22',ipaddress2='192.168.10.14/22'
Network configuration altered. Please issue the following commands as root to restart the network and open IB stack:
service openibd restart
service network restart
A restart of all services is required to put new network configuration into effect. MS-CELLSRV communication may be hampered until restart.
Cell exa01cel02a successfully altered

CellCLI> alter cell restart services all

Make sure all cells are fixed and re-run the onecommand step. This time it will succeed:

 Successfully completed execution of step Create Cell Disks [elapsed Time [Elapsed = 128338 mS [2.0 minutes] Thu Dec 17 14:26:59 GMT 2015]]

I’ve checked some older deployments and it’s the same step which should change the cell name and restart the cell services. For some reason this didn’t happen for me. For both deployments I used OEDA v15.300 (Oct 2015), so this might be a bug in this version.

Categories: oracle Tags:

Issues with Oracle Direct NFS

January 28th, 2016 No comments

This is a quick post to highlight two issues I had with Oracle dNFS. Both relate to wrong entries in the oranfstab file.

One might encounter ORA-00600 during database creation:

DBCA_PROGRESS : 7%
DBCA_PROGRESS : 8%
ORA-01501: CREATE DATABASE failed
ORA-00600: internal error code, arguments: [KSSRMP1], [], [], [], [], [], [], [], [], [], [], []
ORA-17502: ksfdcre:3 Failed to create file /oracle/ORCL/database/ORCL/controlfile/o1_mf_%u_.ctl

This was caused by a wrong entry in oranfstab: there was a difference between fstab and oranfstab for the same record:

server: zfs01
path: 192.168.10.100
export: /export/OTHERDB/database mount: /oracle/ORCL/database

The second issue was that the database wasn’t using dNFS. A simple query of v$dnfs_servers returned no rows and there were no errors in the alert log. However, looking through the trace files one can easily spot the following error repeated in all of them:

KGNFS_NFSPROC3_MNT FAIL 13

This was caused by trying to mount a share from the NFS server that either doesn’t exist or that we don’t have access to:

server: zfs01
path: 192.168.10.100
export: /export/NON_EXIST/database mount: /oracle/ORCL/database

The issue was fixed after correcting the typos in the oranfstab file and restarting the database.

The bottom line is make sure that fstab and oranfstab match and have correct entries.
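Here is a minimal Python sketch of such a check. It compares the export/mount pairs in oranfstab with the NFS entries in fstab and reports any mount point whose export differs. The function names are hypothetical and so is the fstab line, since the real one isn’t shown above. The parser only handles the single-line "export: ... mount: ..." form used in this post.

```python
import re


def parse_oranfstab(text):
    """Return {mount point: export path} from 'export: ... mount: ...' lines."""
    pairs = re.findall(r"export:\s*(\S+)\s+mount:\s*(\S+)", text)
    return {mount: export for export, mount in pairs}


def parse_fstab(text):
    """Return {mount point: export path} for the NFS entries in fstab-style text."""
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        device, mount_point, fs_type = fields[:3]
        if fs_type.startswith("nfs") and ":" in device:
            mounts[mount_point] = device.split(":", 1)[1]
    return mounts


def find_mismatches(oranfstab_text, fstab_text):
    """List (mount, oranfstab export, fstab export) where the two files disagree."""
    fstab = parse_fstab(fstab_text)
    return [
        (mount, export, fstab.get(mount))
        for mount, export in parse_oranfstab(oranfstab_text).items()
        if fstab.get(mount) != export
    ]


ORANFSTAB = """server: zfs01
path: 192.168.10.100
export: /export/OTHERDB/database mount: /oracle/ORCL/database
"""
# Hypothetical fstab entry for the same mount point.
FSTAB = "192.168.10.100:/export/ORCL/database /oracle/ORCL/database nfs rw,hard 0 0\n"

print(find_mismatches(ORANFSTAB, FSTAB))
```

An empty list means every oranfstab entry has a matching NFS entry in fstab. It does not prove that the export exists on the NFS server, so the second issue above still needs a look at the trace files.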

Categories: oracle Tags:

Come and hear me speak at UKOUG Tech 15

December 4th, 2015 No comments

It’s that time of the year again when one of the biggest Oracle user group events, and the last one of the year, takes place: the UK Oracle User Group Conference.

I’m very grateful and proud that I’ll be speaking at this great conference. Here are my talks:

Presentation Title: Oracle Exadata Meets Elastic Configurations
Description: With the release of Exadata X5, Oracle announced Elastic Configurations, which allow a mixed number of db and cell nodes. This session will go through the implementation process of an X5 with two db nodes and four cells.
Date: Monday 7th December
Time: 14:10 – 15:00
Hall: Media Suite B

Presentation Title: Oracle DataGuard Fast-Start Failover: Live Demo
Description: Come and see a live demo of Oracle Fast-start failover and why a private bank moved from RAC to FSFO.
Date: Tuesday 8th December
Time: 16:30 – 17:20
Hall: Hall 11A

See you there!

Categories: oracle Tags:

How to rename ASM disk groups in Exadata

November 25th, 2015 No comments

Deploying Exadata requires you to generate a configuration using the Oracle Exadata Deployment Assistant (OEDA). In OEDA the default ASM disk group names are DBFS_DG, RECOC1 and DATAC1. I usually change those to RECO01 and DATA01, as the defaults don’t make sense to me and the only place I ever see them is on Exadata.

I had an incident last year where an Exadata was deployed half way through with the default names left in place, so I had to delete the configuration and start from scratch.

To my big surprise, I recently got a request from a customer who wanted to change RECO01 and DATA01 to RECOC1 and DATAC1! This was a pre-prod system, already deployed and with a few databases running. The Exadata was an X5-2 running ESS 12.1.2.1.2 and GI 12.1.0.2.

If this ever happens to you, here is what you need to do:

  1. Rename grid disks.
  2. Rename ASM disk groups and ASM disk names.
  3. Modify all databases to point to the new disk groups.

Rename grid disks

Since grid disk names include the disk group name, they need to be changed too. Although this is not mandatory, I strongly recommend it to avoid any confusion in the future.

The grid disks can be renamed very easily using cellcli, but they must NOT be in use by GI at the time. Grid Infrastructure therefore has to be stopped on all servers. Stop GI as root:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/bin/crsctl stop cluster -all

Then run the following magic command to get the list of all grid disks and replace the disk group names with the new ones:

[root@exa01db01 ~]# for i in $(dcli -g cell_group -l root cellcli -e list griddisk | awk -F":" '{print $2}' | awk '{print $1}'); do echo "cellcli -e alter griddisk $i name=$i"; done | grep -v DBFS | sed -e "s/RECO01/RECOC1/2" -e "s/DATA01/DATAC1/2"

You’ll get a long list of cellcli commands, 12 for each cell, which you need to run locally on the corresponding cell.
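If the shell one-liner is hard to read, the same idea can be sketched in a few lines of Python: take the grid disk names, skip anything that doesn’t start with one of the old disk group names, and build one alter griddisk command per disk. The function name and the sample disk names are hypothetical. The names only follow the pattern of prefixing the grid disk with the disk group name.

```python
def rename_commands(grid_disks, renames):
    """Build one 'alter griddisk' command for each grid disk whose prefix changes."""
    commands = []
    for disk in grid_disks:
        for old, new in renames.items():
            if disk.startswith(old + "_"):
                new_name = new + disk[len(old):]
                commands.append(f"cellcli -e alter griddisk {disk} name={new_name}")
    return commands


# Hypothetical grid disk names. The DBFS_DG disk is left alone,
# which is what 'grep -v DBFS' does in the shell version.
disks = [
    "DATA01_CD_00_exa01cel01",
    "RECO01_CD_00_exa01cel01",
    "DBFS_DG_CD_02_exa01cel01",
]

for command in rename_commands(disks, {"DATA01": "DATAC1", "RECO01": "RECOC1"}):
    print(command)
# prints:
# cellcli -e alter griddisk DATA01_CD_00_exa01cel01 name=DATAC1_CD_00_exa01cel01
# cellcli -e alter griddisk RECO01_CD_00_exa01cel01 name=RECOC1_CD_00_exa01cel01
```

Matching on the prefix only means a cell host name that happens to contain DATA01 or RECO01 is never touched.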

Once it’s done start the GI again and make sure all disk groups are mounted successfully:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/bin/crsctl start cluster -all

 

Rename ASM disk groups and ASM disk names

Next is to rename the disk groups. To do so they must be dismounted on ALL cluster nodes before running renamedg on a disk group. Connect to each ASM instance and dismount the disk groups:

SQL> alter diskgroup data01 dismount;

Diskgroup altered.

SQL> alter diskgroup reco01 dismount;

Diskgroup altered.

Diskgroup altered.

At this point you can run renamedg to rename the disk groups. Here is an example renaming DATA01 to DATAC1:

[oracle@exa01db01 ~]$ renamedg -dgname DATA01 -newdgname DATAC1

Parsing parameters..
renamedg operation: -dgname DATA01 -newdgname DATAC1
Executing phase 1
Discovering the group
Checking for hearbeat...
Re-discovering the group
Generating configuration file..
Completed phase 1
Executing phase 2
Completed phase 2

Do the same for RECO01, and after that make sure that both disk groups can be mounted on all database nodes successfully. Then dismount them again so you can rename the ASM disks. There is a command to rename all the disks at once (ALTER DISKGROUP XX RENAME DISKS ALL), but it renames the disks to names of the form diskgroupname_####, where #### is the disk number. ASM disks follow a different naming convention on Exadata (for example RECO01_CD_01_EXA01CEL01), which is why we need to rename them manually.

To rename the disks, the disk group has to be mounted in restricted mode (so only one node in the cluster can mount the disk group). Then run the two statements below to generate the new ASM disk names:

SQL> alter diskgroup datac1 mount restricted;

Diskgroup altered.

SQL> select 'alter diskgroup datac1 rename disk ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$asm_disk where name like 'DATA%';

SQL> select 'alter diskgroup recoc1 rename disk ''' || name || ''' to ''' || REPLACE(name,'RECO01','RECOC1') || ''';' from v$asm_disk where name like 'RECO%';

Execute the alter statements generated by the two queries above and mount both disk groups on all database nodes again.

There is no command to add the disk groups back to Oracle Restart. They will be added automatically the first time they are mounted. However, you need to remove the old disk group resources:

[oracle@exa01db01 ~]$ srvctl remove diskgroup -g DATA01
[oracle@exa01db01 ~]$ srvctl remove diskgroup -g RECO01

 

Modify all databases to point to the new disk groups

The last step is to change the datafile, tempfile and redo log file names in all databases to point to the new disk groups. Make sure you disable block change tracking and flashback first, as the database might not open otherwise since the location of the BCT file has changed:

SQL> alter database disable block change tracking;
SQL> alter database flashback off;

Next, create a pfile from the spfile and substitute all occurrences of RECO01 and DATA01, or modify the spfile just before you shut the database down. Let’s assume you have created a pfile. Make sure all the parameters refer to the new disk group names. Here are the default ones that you need to check:

*.control_files
*.db_create_file_dest
*.db_create_online_log_dest_1
*.db_create_online_log_dest_2
*.db_recovery_file_dest
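
A quick way to catch a parameter you missed is to scan the edited pfile for the old names. Below is a minimal Python sketch. The function name and the pfile excerpt are hypothetical, and it simply looks for +DATA01 or +RECO01 anywhere on a line, ignoring case.

```python
OLD_DISK_GROUPS = ("DATA01", "RECO01")


def stale_parameters(pfile_text, old_names=OLD_DISK_GROUPS):
    """Return the pfile lines that still mention one of the old disk group names."""
    stale = []
    for line in pfile_text.splitlines():
        if any(f"+{name}" in line.upper() for name in old_names):
            stale.append(line.strip())
    return stale


# Hypothetical pfile excerpt that has only been half edited.
PFILE = """*.control_files='+DATAC1/DBM01/control01.ctl','+RECO01/DBM01/control02.ctl'
*.db_create_file_dest='+DATAC1'
*.db_create_online_log_dest_1='+DATAC1'
*.db_recovery_file_dest='+reco01'
"""

for line in stale_parameters(PFILE):
    print("still points to an old disk group:", line)
```

With this sample it flags the control_files and db_recovery_file_dest lines. An empty result means none of the parameters reference the old disk groups.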

Start the database in mount state and generate all the alter statements for datafiles/tempfiles and redologs:

[oracle@exa01db01 ~]$ sqlplus -s / as sysdba
set heading off
set echo off
set pagesize 140
set linesize 140
spool /tmp/rename.sql

select 'alter database rename file ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$datafile;
select 'alter database rename file ''' || name || ''' to ''' || REPLACE(name,'DATA01','DATAC1') || ''';' from v$tempfile;
select 'alter database rename file ''' || member || ''' to ''' || REPLACE(member,'DATA01','DATAC1')|| ''';' from v$logfile where member like '%DATA%';
select 'alter database rename file ''' || member || ''' to ''' || REPLACE(member,'RECO01','RECOC1')|| ''';' from v$logfile where member like '%RECO%';
exit

Start another sqlplus and run the spool file from the above operation (rename.sql). At this point you can open the database (alter database open;). Once the database is open make sure you enable block change tracking and flashback:

SQL> alter database enable block change tracking;
SQL> alter database flashback on;

Finally change the database dependencies and spfile location:

For 12c databases:

[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -nodiskgroup
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -diskgroup "DATAC1,RECOC1"
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -spfile +DATAC1/DBM01/spfiledbm01.ora

For 11g databases:

[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -z
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -x "DATAC1,RECOC1"
[oracle@exa01db01 dbs]$ srvctl modify database -d dbm01 -p +DATAC1/DBM01/spfiledbm01.ora

Categories: oracle Tags:

How to move OEM12c management agent to new location

October 29th, 2015 3 comments

While working on another Exadata project recently I found that the OEM12c agents on the compute nodes were installed in different locations on each of the three Exadatas. On one they were under /home/oracle/agent, on another under /opt/oracle/agent and on the third under /oracle/agent. Obviously this was not the standard and the agents had to be moved under /u01/app/oracle/agent. The only problem was that the three Exadatas were already discovered along with some database targets. Fortunately this wasn’t production yet, but it would still require all the agents to be reinstalled and all targets rediscovered.

Fortunately there is an easier way to move the OEM management agents to a new location without all the hassle of reinstalling agents and rediscovering targets. In the following example the agent was installed in /home/oracle/agent/ and I had to move it to /u01/app/oracle/agent/.

First you need to download the ConvertToStandalone.pl utility from Doc ID 2021782.1 and then upload it to the server under /home/oracle.

You need to create a list of plugins, otherwise the move process will fail:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/agent/core/12.1.0.5.0/sysman/install/create_plugin_list.pl -instancehome /home/oracle/agent/core/12.1.0.5.0

This will create a file /home/oracle/agent/plugins.txt which is used by the perl script later.

Export the following variables:

export OLD_AGENT_HOME=/home/oracle/agent/core/12.1.0.5.0
export ORACLE_HOME=/u01/app/oracle/agent/core/12.1.0.5.0

Another thing you need to do is change SBIN_MODIFIED_VERSION from 12.1.0.4.0 to 12.1.0.5.0 in /home/oracle/agent/agentimage.properties, otherwise the process will fail.

Then run the perl script which will migrate the agent home to the new location:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/ConvertToStandalone.pl -instanceHome /home/oracle/agent/agent_inst -newAgentBaseDir /u01/app/oracle/agent

Note that the script accepts two arguments: instanceHome is the agent instance home directory, e.g. /home/oracle/agent/agent_inst/, and newAgentBaseDir is the new base directory for the agent, /u01/app/oracle/agent/.

After the command completes you need to run root.sh as root:

[root@exa01db01 ~]# /u01/app/oracle/agent/core/12.1.0.5.0/root.sh
Finished product-specific root actions.
/etc exists

Deinstall the old agent:

[oracle@exa01db01 ~]$ /home/oracle/agent/core/12.1.0.5.0/perl/bin/perl /home/oracle/agent/core/12.1.0.5.0/sysman/install/AgentDeinstall.pl -agentHome /home/oracle/agent/core/12.1.0.5.0

Finally remove the old agent directory where a log file from the deinstall process is left:

[oracle@exa01db01 ~]$ rm -rf /home/oracle/agent

The beauty of this process is that the script will create a blackout AGT_CNT_BLK_OUT on a node level and then stop the agent. It will then migrate the agent to the new home, start the agent and finally remove the blackout. The whole process takes less than five minutes.

Categories: oracle Tags:

Introducing Oracle ASM Filter Driver

October 27th, 2015 3 comments

The Oracle ASM Filter Driver (ASMFD) was introduced in Oracle Database 12.1.0.2 and at the moment it is available on Linux systems only.

Oracle ASM Filter Driver is a kernel module, very much like ASMLIB, that resides in the I/O path of the Oracle ASM disks. It provides an interface between the Oracle binaries and the underlying operating environment.

Here are some of the features of ASMFD:

  • Reject non-Oracle I/O

The ASM filter driver will reject write I/O operation issued by non-Oracle commands. This prevents non-Oracle applications from writing to ASM disks and protects ASM from accidental corruption.

  • Device name persistence

Similarly to ASMLIB you don’t have to configure the device name persistence using UDEV.

  • Faster node recovery

According to the documentation, ASMFD allows Oracle Clusterware to perform node-level fencing without a reboot. So if CSS is not running or a node is fenced, the Oracle stack is restarted instead of the node being rebooted. This greatly reduces recovery time, as some enterprise servers can take up to 10 minutes to boot.

  • Reduce OS resource usage

ASMFD exposes a portal device that can be used for all I/O on a particular host, thus decreasing the number of open file descriptors. Without it, each ASM process needs to have an open descriptor to each ASM disk. I’m not sure how much this will save you, but it might be useful if you have hundreds of ASM disks.

  • Thin Provisioning & Data Integrity

This is another new and cool feature, and one that is very popular in the virtualization world. When enabled, disk space that is no longer in use can be returned to the array (thin provisioning). This attribute can be set only if the ASM compatibility is greater than or equal to 12.1.0.0 and requires you to use ASMFD!

In a way ASMFD is a replacement of ASMLIB as it includes base-ASMLIB features. However ASMFD takes it one step further by protecting the ASM disks from non-oracle write I/O operations to prevent accidental damage. Unlike ASMLIB the ASMFD is installed with the Oracle Grid Infrastructure installation.

 

Brief history of ASM and the need of ASM Filter Driver

To understand ASMFD better we need to understand where the need comes from. It’s important to say that this is very specific to Linux, as other platforms have other methods to fulfil the requirements. Because that’s not the purpose of this post and it’s too long, I decided to keep it at the end of the post.

In Linux, as on any other platform, there is user separation, which implies access restrictions. In Linux we usually install Oracle Database under the oracle user, and to do so we need write access to the directories we plan to use. By default that would be /home/oracle/, and as you can imagine that’s not very handy; you might also want to install the database in a separate partition or file system. For this reason the root user creates the required directories, usually under /u01 or /opt, and changes their ownership to oracle.

That would work if you want to store your database files in a file system. However, traditional file systems were not designed for database files: they need a file system check on a regular basis and sometimes they get corrupted. For that reason, and for performance, many people moved to RAW devices in the past. Another case would be if you want to run RAC: you’ll need either a cluster file system or RAW devices.

Historically, with 9i and 10g, we used to create RAW devices, which are a one-to-one mapping between a device file and a logical name. For example, you would create a partition on each device (/dev/sda1, /dev/sdb1) and then map those to /dev/raw/raw1, /dev/raw/raw2 and so on. Additionally, because in Linux the device files are rebuilt each time the system reboots, you need to make sure the permissions and ownership persist across reboots. This was achieved by adding rules to your last boot script (often rc.local). On other platforms, HP-UX for example, one had to buy an additional license (HP Serviceguard Extension for RAC) which gave you the ability to share LVM volume groups across two or more servers.

However, the support and maintenance of raw devices was really difficult, and Oracle came up with the idea of creating their own volume manager to simplify database administration and eliminate the need to manage thousands of database files: Automatic Storage Management, ASM for short. A simple description is that ASM is a very sophisticated volume manager for Oracle data. ASM can also be used if you deploy RAC, hence you don’t need cluster file systems or RAW devices anymore. Additionally it provides redundancy, so if you have JBOD you can use ASM to mirror the data. Another important feature is that you don’t need persistent device naming anymore. Upon start ASM reads all the disk drives specified by asm_diskstring and uses the ones on which an ASM header is found. Although ASM was released in 10.1, people were still using raw devices at the time because ASM was too new and unknown to many DBAs.

So ASM logically groups all the disks (LUNs presented from the storage) into what are called ASM disk groups, and because it uses Oracle Managed Files you don’t really care anymore where your files are and what their names are. ASM is just another abstraction layer in the database file storage. ASM is available on all platforms, so in a way it standardizes the administration of database files. Often the DBAs will also administer ASM, but it could be the storage team managing it. You still had to make sure the device files had the correct permissions before ASM could use them, otherwise no disk group would be available and hence the database could not start.

At the same time, back in 2004, Oracle released another product, ASMLib, whose only purpose was to persist the device naming and preserve the device file permissions. I don’t want to go into details about ASMLib here, but there is an old and very good post on ASMLib from Wim Coekaerts (HERE). Just to mention that ASMLib is also available under RHEL; more can be found HERE.

In recent years many people, myself included, have used UDEV to persist the permissions and ownership of the device files used by ASM. I really like to have a one-to-one match between device files and ASM disk names for better understanding and to ease any future troubleshooting.

ASM Filter Driver takes this one step further by introducing the features above. I can see people starting to use ASMFD to take advantage of thin provisioning or to make sure no one overwrites the ASM device files by mistake. Yes, this happens, and it happened to me recently.

Categories: oracle Tags:

Database system target in pending status for standby database in OEM 12c

October 6th, 2015 No comments

This is not really a problem but an annoying issue I had with OEM 12c. Once a standby database is promoted, its database system shows a metric collection error or Status Pending.

The standby database doesn’t need its own system since it will join the primary database system. The solution is to associate the standby database with the primary system and then remove the standby database system.

For example – we’ve got primary and standby databases – TESTDB_LON, TESTDB_RDG. Once promoted the following targets are also created in OEM – TESTDB_LON_sys and TESTDB_RDG_sys.

The second one will always have status Pending:
Status Pending (Post Blackout)

The way to resolve that is to associate the standby database with the primary database system. I usually rename the primary database system as well to omit the location (LON and RDG):
– Go to the Targets -> Systems and choose the system you want to edit
– Then go to Database System -> Target Setup -> Edit system
– Rename the system name from TESTDB_LON_sys to TESTDB_sys
– Save changes
– Go to Database System again, Target Setup -> Edit system
– Click next to go to Step 2
– Add the standby database to the Standby Database Associations table
– Save changes

At this point we’ve got one system TESTDB_sys with two database members TESTDB_LON and TESTDB_RDG.

Next step is to remove the database system for the standby using emcli:

[oracle@oem12c ~]$ /opt/app/oracle/em12cr4/middleware/oms/bin/emcli login -username=sysman
Enter password :
Login successful

[oracle@oem12c ~]$ /opt/app/oracle/em12cr4/middleware/oms/bin/emcli delete_target -name="TESTDB_RDG_sys" -type="oracle_dbsys"
Target "TESTDB_RDG_sys:oracle_dbsys" deleted successfully

Now it’s all sorted and hopefully all targets are “green”.

Categories: oracle Tags:

Exadata X5 PDU – CLI already in use

September 18th, 2015 No comments

Exadata X5-2 and X4-8B racks are delivered with “Enhanced” PDU metering units connected via the Cisco switch. Although the documentation says they should have static addresses, they don’t. You need to configure them manually using a serial console connection; this is described in my earlier post here.

However if you forget to exit the serial console connection to the PDU and then try to login using SSH later you’ll get the following message:

login as: admin
admin@192.168.1.10's password:

CLI already in use!!!
Please try again later .....

Then someone has to go all the way to the data centre and reset the PDU or exit from the serial console.

Categories: oracle Tags:

Start of ‘ora.crf’ failed after update to 12.1.0.2 DBBP7

July 25th, 2015 2 comments

This happened to me a month ago right after I applied DBBP7 on 12.1.0.2. For some reason the ora.crf resource didn’t start automatically:

CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/12.1.0.2/grid/bin/osysmond" for action "start": details at "(:CLSN00008:)" in "/u01/app/oracle/diag/crs/exa01db01/crs/trace/ohasd_orarootagent_root.trc"
CRS-2674: Start of 'ora.crf' on 'exa01db01' failed

Checking the trace file for more details you can immediately spot where the problem is:

2015-06-04 10:35:51.156513 :CLSDYNAM:3286230784: [ ora.crf]{0:0:8275} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-06-04 10:35:51.156520 :CLSDYNAM:3286230784: [ ora.crf]{0:0:8275} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/u01/app/12.1.0.2/grid/bin/osysmond]

Indeed, osysmond is owned by the oracle user whereas it should be owned by root:

[root@exa01db01 ~]# ls -al /u01/app/12.1.0.2/grid/bin/osysmond
-rwxr-x--- 1 oracle oinstall 9441 Jun  4 10:42 /u01/app/12.1.0.2/grid/bin/osysmond

The fix for that is simple – you need to unlock and lock the GI:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -unlock
[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -patch

osysmond has the correct permissions now and the ora.crf resource starts successfully:

[root@exa01db01 ~]# ls -al /u01/app/12.1.0.2/grid/bin/osysmond
-rwxr-x--- 1 root oinstall 9533 Jun  4 10:48 /u01/app/12.1.0.2/grid/bin/osysmond

 

For reference:

Categories: oracle Tags: