Archive for the ‘oracle’ Category

Smart firewalls

April 15th, 2015 No comments

It’s been a while since my last post but I was really busy working on a number of projects.

The purpose of this post is to highlight an issue I had while building a standby database. The environment we had – three databases at host A (primary) and same were restored from backup on another host B (standby), both hosts were running Linux. It’s important to mention that both hosts were located in different Data Centers.

Once a standby database was mounted we would start shipping archive log files from the primary without adding it to the DataGuard Broker config as of that moment. We wanted to touch the production as little as possible and would add the database to the broker config just before doing the switchover. In the meanwhile we would manually recover the standby database to reduce the apply lag once the database is being added to the broker config. This approach worked fine for two of the databases but we got this error for the third one:

Fri Mar 13 13:33:43 2015
RFS[33]: Assigned to RFS process 29043
RFS[33]: Opened log for thread 1 sequence 29200 dbid -707326650 branch 806518278
CORRUPTION DETECTED: In redo blocks starting at block 20481count 2048 for thread 1 sequence 29200
Deleted Oracle managed file +RECO01/testdb/archivelog/2015_03_13/thread_1_seq_29200.8481.874244023
RFS[33]: Possible network disconnect with primary database
Fri Mar 13 13:42:45 2015
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb/trace/testdb_rfs_31033.trc:

Running through the trace file the first thing which I noticed was:

Corrupt redo block 5964 detected: BAD CHECKSUM

We already had two databases running from host A to host B so we rulled out the firewall issue. Then tried couple of other things – manually recovered the standby with incremental backup, recreated the standby, cleared all the redo/standby log groups but nothing helped. I found only one note in MOS with similar symptom for Streams in 10.2.

At the end the network admins were asked to check the config of the firewalls one more time. There were two firewalls – one where host A was located and another one where host B was located.

It turned out that the firewall at host A location had SQLnet class inspection enabled which was causing the corruption. The logs were successfully shipped from the primary database once this firewall feature was disabled. The strange thing was that we haven’t had any issues with the other two databases running on the same hosts, well what can I say – smart firewalls.


Categories: oracle Tags:

RHEL6 udev and EMC PowerPath

January 26th, 2015 No comments

I’m working on Oracle database migration project where customer have chosen commodity x86 hardware with RHEL6 and EMC storage.

I’ve done many similar installations in the past and I always used the native MPIO in Linux (DM-Multipath) to load balance and failover I/O paths. This time however I’ve got EMC PowerPath doing the load balance and failover and got the native MPIO disabled. From my point of view it’s the same, whether I’ll be using /dev/emcpower* or /dev/mapper/* it’s the same. Obviously PowerPath has some advantages over the native MPIO which I really can’t tell yet. That’s a good paper from EMC giving a comparison between the native MPIO in different operating systems.

As mentioned before the aggregated logical names (pseudo names) with EMC PowerPath could be found under /dev/emcpowerX. I partitioned the disks with GPT tables and aligned the first partition to match the storage sector size. Also added to following line to udev rules to make sure my devices will get the proper permissions:

ACTION=="add", KERNEL=="emcpowerr1", OWNER:="oracle", GROUP:="dba", MODE="0600"

I restarted the server and then later udev to make sure ownership and permissions were picked up correctly. Upon running asmca to create ASM with the first disk group I got the following errors:

Configuring ASM failed with the following message:
One or more disk group(s) creation failed as below:
Disk Group DATA01 creation failed with the following message:
ORA-15018: diskgroup cannot be created
ORA-15031: disk specification '/dev/emcpowerr1' matches no disks
ORA-15025: could not open disk "/dev/emcpowerr1"
ORA-15056: additional error message

Well that’s strange, I’m sure the file had to correct permissions. However listing the file proved that it didn’t have the correct permissions. I repeated the process several times and always got the same result, you can use simple touch command to get the same result:

[root@testdb ~]# ls -al /dev/emcpowerr1
brw-rw---- 1 oracle dba 120, 241 Jan 23 12:35 /dev/emcpowerr1
[root@testdb ~]# touch /dev/emcpowerr1
[root@testdb ~]# ls -al /dev/emcpowerr1
brw-rw---- 1 root root 120, 241 Jan 23 12:35 /dev/emcpowerr1

Something was changing the ownership of the file and I didn’t know what. Well you’ll be no less surprised than I was to find that linux has a similar auditing framework as the Oracle database.

Auditctl will allow you to audit any file for any syscall run against it. In my case I would like to know which process is changing the ownership of my device file. Another helpful command is ausyscall whic allows you to map syscall names and numbers. In other words I would like to know what is the chmod syscall number on a 64bit platform (it does matter):

[root@testdb ~]# ausyscall x86_64 chmod --exact

Then I would like to set up auditing for all chmod calls against my device file:

[root@testdb ~]# auditctl -a exit,always -F path=/dev/emcpowerr1 -F arch=b64 -S chmod
[root@testdb ~]# touch /dev/emcpowerr1
[root@testdb ~]# tail -f /var/log/audit/audit.log
type=SYSCALL msg=audit(1422016631.416:4208): arch=c000003e syscall=90 success=yes exit=0 a0=7f3cfbd36960 a1=61b0 a2=7fff5c59b830 a3=0 items=1 ppid=60056 pid=63212 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="udevd" exe="/sbin/udevd" key=(null)
type=CWD msg=audit(1422016631.416:4208):  cwd="/"
type=PATH msg=audit(1422016631.416:4208): item=0 name="/dev/emcpowerr1" inode=28418 dev=00:05 mode=060660 ouid=54321 ogid=54322 rdev=78:f1 nametype=NORMAL
[root@testdb ~]# auditctl -D
No rules

Gotcha! So it was udev changing the permissions but why ?

I spent half day going through logs and tracing udev but couldn’t find anything.

At the end of the day I found an article by RHEL on which they had exactly the same problem. The solution was to have “add|change” into the ACTION directive instead of only “add”.

So here is the rule you need to have in order for UDEV to set a persistent ownership/permission on EMC PowerPath device files in RHEL 6:

[root@testdb ~]# cat /etc/udev/rules.d/99-oracle-asm.rules
ACTION=="add|change", KERNEL=="emcpowerr1", OWNER:="oracle", GROUP:="dba", MODE="0600"

Hope it helps and you don’t have to spent half day as I did.


Categories: linux, oracle Tags:

runInstaller fails at CreateOUIProcess with permission denied

January 16th, 2015 No comments

Just a short post on a problem I encountered recently.

I had to install 11.2 GI and right after running the installer I got a message saying permission denied. Below is the exact error:

[oracle@testdb grid]$ ./runInstaller -silent -showProgress -waitforcompletion -responseFile /u01/software/grid/response/grid_install_20140114.rsp
Starting Oracle Universal Installer...

Checking Temp space: must be greater than 120 MB.   Actual 7507 MB    Passed
Checking swap space: must be greater than 150 MB.   Actual 8191 MB    Passed
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2015-01-15_12-12-20PM. Please wait ...Error in CreateOUIProcess(): 13
: Permission denied

Quickly tracing the process I can see that it fails to execute the java installer:

27316 execve("/tmp/OraInstall2015-01-15_12-05-40PM/jdk/jre/bin/java", ["/tmp/OraInstall2015-01-15_12-05-"..., "-Doracle.installer.library_loc=/"..., "-Doracle.installer.oui_loc=/tmp/"..., "-Doracle.installer.bootstrap=TRU"..., "-Doracle.installer.startup_locat"..., "-Doracle.installer.jre_loc=/tmp/"..., "-Doracle.installer.nlsEnabled=\"T"..., "-Doracle.installer.prereqConfigL"..., "-Doracle.installer.unixVersion=2"..., "-mx150m", "-cp", "/tmp/OraInstall2015-01-15_12-05-"..., ""..., "-scratchPath", "/tmp/OraInstall2015-01-15_12-05-"..., "-sourceLoc", ...], [/* 22 vars */]) = -1 EACCES (Permission denied)

I never had this problem before, see similar behaviour with having selinux enabled but that wasn’t the case.

Then why I remembered that while formatting a partition for u01 and adding to fstab I saw that tmp didn’t have the default mount options:

/dev/mapper/vglocal00-tmp00 /tmp                    ext4    defaults,noexec 1 2

Indeed, the noexec option will not let you execute binaries that are on that partition. This server was built by a hosting provider and I guess this was part of thir default deployment process.

After removing the option and remounting /tmp (mount -o remount /tmp), installer was able to run successfully.

Categories: linux, oracle Tags:

2014 in review

January 12th, 2015 No comments

Happy New Year!

So many things happened in the past six months, I really can’t tell how quickly the time passed. As people say the life is what happens to you while you are making plans for the future.

I wish I had the time to blog more in the past year and I plan to change this in the New Year!

2014 was really successful for me. I worked on some really interesting projects, configured my first Exadata, migrated a few more databases to Exadata and I faced some challenging problems. This year is about to be no different and I’ve already started another interesting and challenging migration.

Same year I presented at Oracle Open World for which I would like to thank Jason Arneil for the joint presentation and E-DBA for making this happen! At the same time e-DBA have been awarded the Oracle Excellence Award Specialised Global Partner of year for Oracle Engineered Systems.

Last but not least I was honoured with Employee of the Year award last month, again thank you E-DBA team!

Employee of the Year


Categories: oracle, personal Tags:

Speaking at Oracle Open World 2014

September 25th, 2014 No comments

I’m more than happy that I will be speaking at this year’s Oracle Open World. The first and only time I attended was back in 2010 and now I’m not only attending but speaking as well!

Both with Jason Arneil will talk about what we’ve learned on our Exadata implementations with two of the biggest UK retailers so please join us:
Session ID: CON2224
Session Title: Oracle Exadata Migrations: Lessons Learned from Retail
Venue / Room: Moscone South – 310
Date and Time: 9/30/14, 15:45 – 16:30

I would like to thank E-DBA and especially Jason for making this happen!

I’m also planning to attend Oaktable World 2014 and Oracle OpenWorld 2014 – Bloggers Meetup for the best part of OOW – really technical sessions and networking!

See you there!



Categories: oracle Tags:

Speaking at BGOUG 2014 Spring conference

May 30th, 2014 No comments

I’ll be speaking at the spring conference of the BGOUG held between 13th and 15th of June. I was a regular attendee of the conference for eight years in a row but since I moved to UK I had to skip the last two conferences. My session is about Oracle GoldenGate – it will cover the basics, components, usage scenarios, installation and configuration, trail files and GG records and many more.

See you there in two weeks.



Categories: oracle Tags: ,

EMDIAG Repvfy 12c kit – troubleshooting part 1

March 26th, 2014 No comments

The following blog post continue the EMDIAG repvfy kit series and will focus on how to troubleshoot and solve the problems reported by the kit.

The repository verification kit reports number of problems with our repository which we are about to troubleshoot and solve one by one. It’s important to notice that some of the problem are related so solving one problem could also solve another one.

Here is the output I’ve got for my OEM repository:

-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114     Repository:     23-Jan-2014 13:35:41 --
-- Module:                                          Test:   0, Level: 2 --
-- --------------------------------------------------------------------- --


1004. Stuck PING jobs: 10
1002. Disabled response metrics (16570376): 2
1002. Unregistered ECM metadata tables: 2
1001. Undefined verification modules: 1
2001. Exadata plugin version mismatches: 5
2001. System jobs running for more than 24hr: 1
1002. Stuck PING jobs: 10
1003. Plugin metadata versions out of sync: 13
1021. Composite metric calculation with inconsistent dependant metadata versions: 3
2004. Targets without an ORACLE_HOME association: 2
2007. Targets with unpromoted ORACLE_HOME target: 2

I usually follow this sequence of actions when troubleshoot repository problems:
1. Verify the module with detail option. Increasing the level also might show more problems or problems related to the current one.
2. Dump the module and check for any unusual activity.
3. Check repository database alert log for any errors.
4. Check emagent logs for any errors.
5. Check OMS logs for any errors.

Troubleshoot Stuck PING jobs

Looking on the first problem reported for verifyAGENTS – Stuck PING jobs we can easily spot the relation between verifyAGENTS, verifyJOBS and verifyOMS modules where the same problem is occurring. For some reason there are ten ping jobs which are stuck and running for more than 24hrs.

The best approach would be running verify against any of these modules with the –detail option. This will show more information and eventually help analyze the problem. Running detail report for AGENTS and OMS didn’t helped and didn’t show much information related to the stuck pings jobs. However running detailed report for the JOBS we were able to identify the job_id, job_name and when the job was started:

[oracle@oem bin]$ ./repvfy verify jobs –detail

JOB_ID                           EXECUTION_ID                     JOB_NAME                                 START_TIME
-------------------------------- -------------------------------- ---------------------------------------- --------------------
ECA6DE1A67B43914E0432084800AB548 ECA6DE1A67B63914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB 03-DEC-2013 19:02:29

So we can see that the stuck job was started on 19:02 at 3rd of December and the time of check was 23rd of January.

Now we can say that there is a problem with the jobs rather than agents or oms, the problems at these two modules appeared as a result of the stuck job and we should be focusing on the JOBS module.

Running analyze against the job will show the same thing as verify with detail option, it’s usage would be appropriate if we got multiple jobs issues and want to see the details for particular one.

Dumping the job will show a lot of info from MGMT_ tables that’s useful, of particular interest are the details of the execution:

[oracle@oem bin]$ ./repvfy dump job -guid ECA6DE1A67B43914E0432084800AB548

[----- MGMT_JOB_EXEC_SUMMARY ------------------------------------------------]

EXECUTION_ID                     STATUS                           TLI QUEUE_ID                         TIMEZONE_REGION                SCHEDULED_TIME       EXPECTED_START_TIME  START_TIME           END_TIME                RETRIED
-------------------------------- ------------------------- ---------- -------------------------------- ------------------------------ -------------------- -------------------- -------------------- -------------------- ----------
ECA6DE1A67B63914E0432084800AB548 02-Running                         1                                  +00:00                         03-DEC-2013 18:59:25 03-DEC-2013 18:59:25 03-DEC-2013 19:02:29                               0

Again we can confirm that the job is still running and the next step would be to dump the execution which will show us on which step the job is waiting/hanging. That’s just an example because in my case I didn’t have any steps in my job execution:

[oracle@oem bin]$ ./repvfy dump execution -guid ECA6DE1A67B43914E0432084800AB548
[oracle@oem bin]$ ./repvfy dump step -id 739148

Checking job system health could also be useful by showing some job history, scheduled jobs and some performance metrics:

[oracle@oem bin]$ ./repvfy dump job_health

Back to our problem we may query MGMT_JOB to get the job name and confirm that’s system job run by SYSMAN:


JOB_ID                           JOB_NAME                                                     JOB_OWNER  JOB_DESCRIPTION                                              JOB_TYPE        SYSTEM_JOB JOB_STATUS
-------------------------------- ------------------------------------------------------------ ---------- ------------------------------------------------------------ --------------- ---------- ----------
ECA6DE1A67B43914E0432084800AB548 PINGCFMJOB_ECA6DE1A67B33914E0432084800AB548                  SYSMAN     This is a Confirm EMD Down test job                          ConfirmEMDDown            2          0

We may try to stop the job using emcli and job name:

[oracle@oem bin]$ emcli stop_job -name=PINGCFMJOB_ECA6DE1A67B33914E0432084800AB
Error: The job/execution is invalid (or non-existent)

If that doesn’t work then use emdiag kit to cleanup the repository part:

./repvfy verify jobs -test 1998 -fix

Please enter the SYSMAN password:

-- --------------------------------------------------------------------- --
-- REPVFY: 2014.0114     Repository:     27-Jan-2014 18:18:36 --
-- Module: JOBS                                     Test: 1998, Level: 2 --
-- --------------------------------------------------------------------- --
-- -- -- - Running in FIX mode: Data updated for all fixed tests - -- -- --
-- --------------------------------------------------------------------- --

The repository is now ok but it will not remove the stuck thread at the OMS level. In order for the OMS to get healthy again it needs to be restarted:

cd $OMS_HOME/bin
emctl stop oms
emctl start oms

After OMS was restarted there were no stuck jobs anymore!

I’ve still wanted to know why that happened. Although there were few bugs at MOS they were no very applicable and didn’t found any of the symptoms in my case. After checking repository database alertlog I found few disturbing messages:

  Tns error struct:
Time: 03-DEC-2013 19:04:01
TNS-12637: Packet receive failed
ns secondary err code: 12532
opiodr aborting process unknown ospid (15301) as a result of ORA-609
 opiodr aborting process unknown ospid (15303) as a result of ORA-609
 opiodr aborting process unknown ospid (15299) as a result of ORA-609
2013-12-03 19:07:58.156000 +00:00

I also found a lot of similar message on the target databases:

  Time: 03-DEC-2013 19:05:08
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

That pretty much matches the time when the job got stuck – 19:02:29. So I assume there was some network glitch at that time causing the ping job to stuck. The solution was simply run the repvfy with fix option and then restart the OMS service.

In case after restart the job is stuck again consider increasing the oms property oracle.sysman.core.conn.maxConnForJobWorkers.  Consider the following note if that’s the case:

EMDIAG repvfy blog series:


Categories: oracle Tags: , ,

Troubleshooting Oracle DBFS mount issues

March 13th, 2014 No comments

On Exadata the local drives on the compute nodes are not big enough to allow larger exports and often dbfs is configured. In my case I had a 1.2 TB dbfs file system mounted under /dbfs_direct/.

While I was doing some exports yesterday I found that my dbfs wasn’t mounted, running quick crsctl command to bring it online failed:

[oracle@exadb01 ~]$ crsctl start resource dbfs_mount -n exadb01
 CRS-2672: Attempting to start 'dbfs_mount' on 'exadb01'
 CRS-2674: Start of 'dbfs_mount' on 'exadb01' failed
 CRS-2679: Attempting to clean 'dbfs_mount' on 'exadb01'
 CRS-2681: Clean of 'dbfs_mount' on 'exadb01' succeeded
 CRS-4000: Command Start failed, or completed with errors.

It doesn’t give you any error messages or reason why it’s failing, neither the other database and grid infrastructure logs does. The only useful solution is to enable tracing for dbfs client and see what’s happening. To enable tracing edit the mount script and insert the following MOUNT_OPTIONS:

vi $GI_HOME/crs/script/

Now start the resource one more time to get the log file generated. You can get this working with the client as well from the command line:

[oracle@exadb01 ~]$ dbfs_client dbfs_user@ -o allow_other,direct_io,trace_level=1,trace_file=/tmp/dbfs_client_trace.$$.log /dbfs_direct
Fail to connect to database server.


After checking the log file it’s clear now why dbfs was failing to mount, the dbfs database user has expired:

tail /tmp/dbfs_client_trace.100641.log.0
 [43b6c940 03/12/14 11:15:01.577723 LcdfDBPool.cpp:189         ] ERROR: Failed to create session pool ret:-1
 [43b6c940 03/12/14 11:15:01.577753 LcdfDBPool.cpp:399         ] ERROR: ERROR 28001 - ORA-28001: the password has expired

[43b6c940 03/12/14 11:15:01.577766 LcdfDBPool.cpp:251         ] DEBUG: Clean up OCI session pool...
 [43b6c940 03/12/14 11:15:01.577805 LcdfDBPool.cpp:399         ] ERROR: ERROR 24416 - ORA-24416: Invalid session Poolname was specified.

[43b6c940 03/12/14 11:15:01.577844 LcdfDBPool.cpp:444         ] CRIT : Fail to set up database connection.


The account had a default profile which had the default PASSWORD_LIFE_TIME of 180 days:

SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      EXPIRED                          03-03-14 14:56:12 DEFAULT

Elapsed: 00:00:00.02
SQL> select password from sys.user$ where name= 'DBFS_USER';


Elapsed: 00:00:00.00
SQL> alter user DBFS_USER identified by values 'A4BC1A17F4AAA278';

User altered.

Elapsed: 00:00:00.03
SQL> select username, account_status, expiry_date, profile from dba_users where username='DBFS_USER';

USERNAME                       ACCOUNT_STATUS                   EXPIRY_DATE       PROFILE
------------------------------ -------------------------------- ----------------- ------------------------------
DBFS_USER                      OPEN                             09-09-14 11:09:43 DEFAULT

SQL> select * from dba_profiles where resource_name = 'PASSWORD_LIFE_TIME';

PROFILE                        RESOURCE_NAME                    RESOURCE LIMIT
------------------------------ -------------------------------- -------- ----------------------------------------
DEFAULT                        PASSWORD_LIFE_TIME               PASSWORD 180


After resetting database user password dbfs successfully mounted!

If you are using dedicated database for dbfs make sure you have set the password_life_time to unlimited to avoid similar issues.



Categories: linux, oracle Tags: , ,

OEM 12c installation fails if parallel_max_servers too high

February 21st, 2014 No comments

Just a quick post regarding OEM 12c installation where recently I had to install OEM 12c and during the repository configuration step the installation fails with error:

ORA-12801: error signaled in parallel query server P151

This was caused by a known bug which requires decreasing the number of parallel queries of the repository databases and start over the installation. The database had cpu_count set to 64 and parallel_max_servers to 270. After setting the parallel_max_servers to lower value the installation completed successfully.

For more information refer to:
EM 12c: Enterprise Manager Cloud Control 12c Installation Fails At Repository Configuration With Error: ORA-12805: parallel query server died unexpectedly (Doc ID 1539444.1)


Categories: linux, oracle Tags:

Consider database timezone when created during DST

February 13th, 2014 No comments

Not long time ago customer asked me to create new a database and refresh it from production. Nothing special here database was quickly created and then refreshed using network import for few schemas. Few weeks later I’ve been told that the database has a timestamp problem. The date and time were correct but the time zone was different from the production:



Looking back I tried to find why that happened and I quickly found the answer in the documentation:
If you do not specify the SET TIME_ZONE clause, then the database uses the operating system time zone of the server.

Of course, by that time the time zone of the server was +1 (Daylight saving time) and the database inherited that time zone. The next logical thing was simply to change the time zone to correct one (UTC):

ERROR at line 1:
ORA-30079: cannot alter database timezone when database has TIMESTAMP WITH LOCAL TIME ZONE columns

Right, that won’t work for if there are tables with columns of type TIMESTAMP WITH LOCAL TIME ZONE and there is data within these tables. Unfortunately the only solution for that is to export the database, drop the users and then import back the data. Also for the change to take effect database must be restarted.

You can simply list the columns of that type and export just these tables, I had a lot of them and decided to export/import the whole database as it was small and used for testing anyway:

SQL> select owner, table_name, column_name, data_type from all_tab_columns where data_type like '%WITH LOCAL TIME ZONE' and owner='MY_USER';
MY_USER                         INVENTORY            DSTAMP    TIMESTAMP(6) WITH LOCAL TIME ZONE
Categories: oracle Tags: