Start of ‘ora.crf’ failed after update to 12.1.0.2 DBBP7

July 25th, 2015 No comments

This happened to me a month ago right after I applied DBBP7 on 12.1.0.2. For some reason the ora.crf resource didn’t start automatically:

CRS-5013: Agent "ORAROOTAGENT" failed to start process "/u01/app/12.1.0.2/grid/bin/osysmond" for action "start": details at "(:CLSN00008:)" in "/u01/app/oracle/diag/crs/exa01db01/crs/trace/ohasd_orarootagent_root.trc"
CRS-2674: Start of 'ora.crf' on 'exa01db01' failed

Checking the trace file for more details you can immediately spot where the problem is:

2015-06-04 10:35:51.156513 :CLSDYNAM:3286230784: [ ora.crf]{0:0:8275} [start] (:CLSN00008:)Utils:execCmd scls_process_spawn() failed 1
2015-06-04 10:35:51.156520 :CLSDYNAM:3286230784: [ ora.crf]{0:0:8275} [start] (:CLSN00008:) category: -1, operation: fail, loc: canexec2, OS error: 0, other: no exe permission, file [/u01/app/12.1.0.2/grid/bin/osysmond]

Indeed the osysmond is owned by the oracle user where it should be owned by root:

[root@exa01db01 ~]# ls -al /u01/app/12.1.0.2/grid/bin/osysmond
-rwxr-x--- 1 oracle oinstall 9441 Jun  4 10:42 /u01/app/12.1.0.2/grid/bin/osysmond

The fix for that is simple – you need to unlock and lock the GI:

[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -unlock
[root@exa01db01 ~]# /u01/app/12.1.0.2/grid/crs/install/rootcrs.pl -patch

The osysmond has the correct permissions now and the resource ora.crf starts sucessfully:

[root@exa01db01 ~]# ls -al /u01/app/12.1.0.2/grid/bin/osysmond
-rwxr-x--- 1 root oinstall 9533 Jun  4 10:48 /u01/app/12.1.0.2/grid/bin/osysmond

 

For reference:

Categories: oracle Tags:

Exadata’s onecommand fails to validate NTP servers on storage servers

July 6th, 2015 No comments

This will be simple and short post on an issue I had recently. I got the following error while running the first step of onecommand – Validate Configuration File:

2015-07-01 12:31:03,712 [INFO  ][    main][     ValidationUtils:761] SUCCESS: NTP servers on machine exa01db02.local.net verified successfully
2015-07-01 12:31:03,713 [INFO  ][    main][     ValidationUtils:761] SUCCESS: NTP servers on machine exa01db01.local.net verified successfully
2015-07-01 12:31:03,714 [INFO  ][    main][     ValidationUtils:778] Following errors were found...
2015-07-01 12:31:03,714 [INFO  ][    main][     ValidationUtils:783] ERROR: Encountered error while running NTP validation error on host: exa01cel03.local.net
2015-07-01 12:31:03,714 [INFO  ][    main][     ValidationUtils:783] ERROR: Encountered error while running NTP validation error on host: exa01cel02.local.net
2015-07-01 12:31:03,714 [INFO  ][    main][     ValidationUtils:783] ERROR: Encountered error while running NTP validation error on host: exa01cel01.local.net

Right, so my NTP servers were accessible from the db nodes but not from the cells. When I queried the NTP server from the cells I got the following error:

# ntpdate -dv ntpserver1
1 Jul 09:00:09 ntpdate[22116]: ntpdate 4.2.6p5@1.2349-o Fri Feb 27 14:50:33 UTC 2015 (1)
Looking for host ntpserver1 and service ntp
host found : ntpserver1.local.net
transmit(172.16.1.100)
transmit(172.16.1.100)
transmit(172.16.1.100)
transmit(172.16.1.100)
transmit(172.16.1.100)
172.16.1.100: Server dropped: no data
server 172.16.1.100, port 123

Perhaps I should have mentioned that the cells have their own firewall (cellwall) which will only allow certain inbound/outbound traffic. During boot the script will build all the rules dynamically and apply them. Now the above error occurred because of two reasons:

A) The NTP servers were specified using hostname instead of IP addresses in OEDA
B) The management network was NOT available after the initial config (applyElasticConfig) was applied

Because of that cellwall was not able to resolve the NTP servers IP addresses and thus they were omitted from the firewall configuration. You can safely proceed with the deployment but if you want to get rid of the annoying message the solution is simply to restart the cell firewall – /etc/init.d/cellwall restart

Categories: oracle, Uncategorized Tags:

MGMTDB not automatically created on Exadata X5 and GI 12.1.0.2

July 1st, 2015 No comments

While deploying an X5 Full Rack recently it happened that the Grid Infrastructure Management Repository was not created by onecommand. The GIMR database was optional in 12.1.0.1 and became mandatory in 12.1.0.2 and should be automatically installed with Oracle Grid Infrastructure 12c release 1 (12.1.0.2). For unknown reason to me that didn’t happen and I had to create it manually. I’ve checked all the log files but couldn’t find any errors.  For reference the OEDA version used was Feb 2015 v15.050, image version on the Exadata was 12.1.2.1.0.141206.1.

To create the database login as the grid user and create file holding the following variables:

cat > /tmp/cfgrsp.properties
oracle.assistants.asm|S_ASMPASSWORD=[your ASM password]
oracle.assistants.asm|S_ASMMONITORPASSWORD=[your ASM password]

and run the following command:

GRID_HOME=/u01/app/12.1.0.2/grid
[oracle@exa01 ~]$ $GRID_HOME/cfgtoollogs/configToolAllCommands RESPONSE_FILE=/tmp/cfgrsp.properties

For reference, here is similar bug I found on MOS:
-MGMTDB Not Created When Using EM12c Provisioning (Doc ID 1983885.1)

Categories: oracle Tags: , ,

dbnodeupdate.sh post upgrade step fails on Exadata storage software 12.1.2.1.1

June 23rd, 2015 No comments

I’ve done several Exadata deployments in the past two months and had to upgrade the Exadata storage software on half of them. Reason for that was because units shipped before May had their Exadata storage software version of 12.1.2.1.0.

The upgrade process of the database nodes ran fine but when I ran dbnodeupdate.sh -c for completing post upgrade steps I got an error that the system wasn’t on the expected Exadata release or kernel:

(*) 2015-06-01 14:21:21: Verifying GI and DB's are shutdown
(*) 2015-06-01 14:21:22: Verifying firmware updates/validations. Maximum wait time: 60 minutes.
(*) 2015-06-01 14:21:22: If the node reboots during this firmware update/validation, re-run './dbnodeupdate.sh -c' after the node restarts..
(*) 2015-06-01 14:21:23: Collecting console history for diag purposes

ERROR: System not on expected Exadata release or kernel, exiting


ERROR: Correct error, or to override run: ./dbnodeupdate.sh -c -q -t 12.1.2.1.1.150316.2

Indeed, the database node was running the new Exadata software but still using the old kernel (2.6.39-400.243) and dbnodeupdate was expecting me to run the new 2.6.39-400.248 kernel:

imageinfo:
Kernel version: 2.6.39-400.243.1.el6uek.x86_64 #1 SMP Wed Nov 26 09:15:35 PST 2014 x86_64
Image version: 12.1.2.1.1.150316.2
Image activated: 2015-06-01 12:27:57 +0100
Image status: success
System partition on device: /dev/mapper/VGExaDb-LVDbSys1

The reason for that was that the previous run of dbnodeupdate installed the new kernel package but failed to update grub.conf. The solution is to manually add the missing kernel entry to grub.conf and reboot the server to pick up the new kernel, here is a note for more information which by the time I had this problem was still internal:


Bug 20708183 – DOMU:GRUB.CONF KERNEL NOT ALWAYS UPDATED GOING TO 121211, NEW KERNEL NOT BOOTED

 

 

Categories: oracle Tags: ,

How do I change DNS servers on Exadata storage servers

June 19th, 2015 No comments

This is just a quick post to highlight a problem I had recently on another Exadata deployment.

For the most customers the management network on Exadata is routable and the DNS servers are accessible. However in a recent deployment for a financial organization this wasn’t the case and the storage servers were NOT able to reach the DNS servers. The customer provided a different set of DNS servers within the management network which were still able to resolve all the Exadata hostnames. If you encounter similar problem stop all cell services and run ipconf on each storage server to update the DNS servers.

On each storage server there is a service called cellwall (/etc/init.d/cellwall) which actually will run many checks and apply a lot of iptables rules. Here are couple of comments from the script to give you an idea:

# general lockdown from everything external, (then selectively permit)
  # general permissiveness (localhost: if you are in, you are in)
  # allow all udp traffic only from rdbms hosts on IPoIB only
      # allow DNS to work on all interfaces
      # open sport=53 only for DNS servers (mitigate remote-offlabel-port exploit)

and many more but you can check the script and see what it does OR run iptables -L -n to get all the iptables rules.

Here is some more information on how to change IP addresses on Exadata:

UPDATE: Thanks to Jason Arneil for pointing out that proper way to update the configuration of the cell.

Categories: oracle Tags:

How to configure Power Distribution Units on Exadata X5

June 18th, 2015 No comments

I’ve done several Exadata deployments recently and I have to say of all the components PDUs were hardest to configure. Important to notice that unlike earlier generations of Exadata the PDUs in X5 are Ehnanced PDUs and not Standard.

Reading the public documentation (Configuring the Power Distribution Units) it says that on PDUs with three power input leads you need to connect the middle power lead to the power source. Well I’ve done that many times and it didn’t NOT worked, the documentation says that the PDU should be accessible on 192.168.0.1. I believe the reason for that is because DHCP has been enabled by default and this can be easily confirmed by checking the LCD screen of the PDU. I even tried setting up DHCP server myself to make the PDU acquire IP but that didn’t worked either.

To configure the PDU you need to connect through serial management port. Nowadays there are no more laptops with serial ports so you will need USB to RS-232 DB9 Serial Adapter, I bought mine from Amazon. You will also need DB9 to RJ45 cable – these are quite popular and I’m sure you’ve seen before the blue Cisco console cable.

You need to connect the cable to SET MGT port of PDU and then establish terminal connection (you can use putty too) with the following settings:
9600 baud, 8 bit, 1 stop bit, no parity bit, no flow control

The username is admin and the password is adm1n.

Here are the commands you need to configure the PDU. Each network change requires reboot of the PDU:

Welcome to Oracle PDU

pducli->username: admin
pducli->password: *****
Login OK - Admin rights!
pducli->

set pdu_name=exa01pdu01
set systime_manual_date=2015-06-18
set systime_manual_time=12:45:00
set systime_ntp_server_enable=On
set systime_ntp_server=192.168.1.2
set systime_dst_enable=On
set net_ipv4_dhcp=Off
reset=yes

set net_ipv4_ipaddr=192.168.1.10
set net_ipv4_subnet=255.255.255.0
set net_ipv4_gateway=192.168.1.1
set net_ipv4_dns1=192.168.1.3
set net_ipv4_dns2=192.168.1.4
reset=yes

Regarding the network connectivity – the documentation says you need two additional cables from your management network. However if you have half or quarter rack you can plug-in the PDU network connections to the management/cisco switch. Make a note that if you ever plan to upgrade to full rack you will have to provide the two additional cables from your management network and disconnect PDUs from the management switch.

 

IMPORTANT: Make sure you don’t leave any active CLI sessions, otherwise you won’t be able to login remotely and will require data centre visit to reboot the PDU.

Categories: oracle Tags:

opatch 12.1.0.1.7 fails with System Configuration Collection Failed

June 1st, 2015 No comments

I was recently upgrading an Exadata 12.1.0.2 DBBP6 to DBBP7 and as usual I went for the latest opatch version which was 12.1.0.1.7 (Apr 2015) as of that time.

After running the opatchauto apply or opatchauto apply -analyze I got the following error:

System Configuration Collection failed: oracle.osysmodel.driver.sdk.productdriver.ProductDriverException: java.lang.NullPointerException
Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Stream closed
        at oracle.opatchauto.gi.GILogger.writeWithoutTimeStamp(GILogger.java:450)
        at oracle.opatchauto.gi.GILogger.printStackTrace(GILogger.java:465)
        at oracle.opatchauto.gi.OPatchauto.main(OPatchauto.java:97)
Caused by: java.io.IOException: Stream closed
        at java.io.BufferedWriter.ensureOpen(BufferedWriter.java:98)
        at java.io.BufferedWriter.write(BufferedWriter.java:203)
        at java.io.Writer.write(Writer.java:140)
        at oracle.opatchauto.gi.GILogger.writeWithoutTimeStamp(GILogger.java:444)
        ... 2 more

opatchauto failed with error code 1.

This was a known problem caused by these bugs which are fixed in 12.1.0.1.8:
Bug 20892488 : OPATCHAUTO ANALYZE FAILING WITH GIPATCHINGHELPER::CREATESYSTEMINSTANCE FAILED
BUG 20857919 – LNX64-121023GIPSU:APPLY GIPSU FAILED WITH SYSTEM CONFIGURATION COLLECTION FAILED

As 12.1.0.1.8 is not yet available the workaround is to use lower version of opatch 12.1.0.1.6 which can be downloaded from this note:
Opatchauto Gives “System Configuration Collection Failed” Message (Doc ID 2001933.1)

You might run into the same problem if you are applying 12.1.0.2.3 PSU.

Categories: oracle Tags: , ,

Speaking at UKOUG Systems Event and BGOUG

May 19th, 2015 No comments

I’m pleased to say that I will be speaking at the UKOUG Systems Event 2015, held at Cavendish Conference Center in London, 20 May 2015. My session “Oracle Exadata Meets Elastic Configurations” starts at 10:15 in Portland Suite. Here is the agenda of the UKOUG Systems Event.

In a month time I’ll be also speaking at the Spring Conference of the Bulgarian Oracle User Group. The conference will be held from 12th to 14th June, 2015 in hotel Novotel in Plovdiv, Bulgaria. I’ve got the conference opening slot at 11:00 in hall Moskva, my session topic is “Oracle Data Guard Fast-Start Failover: Live demo”. Here is the agenda of the conference.

I would like to thank EDBA for making this happen!

Categories: oracle Tags: ,

applyElasticConfig.sh fails with Unable to locate any IB switches

May 15th, 2015 No comments

With the release of Exadata X5 Oracle introduced elastic configurations and changed the process on how the initial configuration is performed. Back before you had to run applyconfig.sh which would go across the nodes and change all the settings according to your config. This script has now evolved and it’s called applyElasticConfig.sh which is part of OEDA (onecommand). During one of the recent deployments I ran into the below problem:

[root@node8 linux-x64]# ./applyElasticConfig.sh -cf Customer-exa01.xml

Applying Elastic Config...
Applying Elastic configuration...
Searching Subnet 172.16.2.x..........
5 live IPs in 172.16.2.x.............
Exadata node found 172.16.2.46.
Collecting diagnostics...
Errors occurred. Send /opt/oracle.SupportTools/onecommand/linux-x64/WorkDir/Diag-150512_160716.zip to Oracle to receive assistance.
Exception in thread "main" java.lang.NullPointerException
at oracle.onecommand.commandexec.utils.CommonUtils.getStackFromException(CommonUtils.java:1579)
at oracle.onecommand.deploy.cliXml.ApplyElasticConfig.doDaApply(ApplyElasticConfig.java:105)
at oracle.onecommand.deploy.cliXml.ApplyElasticConfig.main(ApplyElasticConfig.java:48)

Going through the logs we can see the following message:

2015-05-12 16:07:16,404 [FINE ][ main][ OcmdException:139] OcmdException from node node8.my.company.com return code = 2 output string: Unable to locate any IB switches... stack trace = java.lang.Throwable

The problem was caused because of IB switch names in my OEDA XML file were different to the one’s actually physically in the rack, actually the IB switch hostnames were missing from the hosts file. So if you ever run into this problem make sure your IB switch hosts file (/etc/hosts) has the correct hostname in the proper format:

#IP                 FQDN                      ALIAS
192.168.1.100       exa01ib01.local.net       exa01ib01

Also make sure to reboot the IB switch after any change of the hosts file.

Categories: oracle Tags:

How to configure Link Aggregation Control Protocol on Exadata

May 13th, 2015 2 comments

During a recent X5 installation I had to configure Link Aggregation Control Protocol (LACP) on the client network of the compute nodes. Although the ports were running at 10Gbits and default configuration of Active/Passive works perfectly fine the customer wanted even distribution of traffic and workload across their core switches.

Link Aggregation Control Protocol (LACP), also known as 802.3ad is a methods of combining multiple physical network connections into one logical connection to increase throughput and provide redundancy in case one of the links should fail. The protocol requires both – the server and the switch(es) to have the same settings to allow LACP to work properly.

To configure LACP on Exadata you need to change the bondeth0 parameters.

On each of the compute nodes open the following file:

/etc/sysconfig/network-scripts/ifcfg-bondeth0

and replace the line saying BONDING_OPTS with this one:

BONDING_OPTS="mode=802.3ad xmit_hash_policy=layer3+4 miimon=100 downdelay=200 updelay=5000 num_grat_arp=100"

and then restart the network interface:

ifdown bondeth0
ifup bondeth0
Determining if ip address 192.168.1.10 is already in use for device bondeth0...

You can check the status of the interface by query the proc filesystem. Make sure both interfaces are up and running at the same speed. The esential part to make sure the LACP is working is shown below:

cat /proc/net/bonding/bondeth0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 33
Partner Key: 34627
Partner Mac Address: 00:23:04:ee:be:c8
I had a problem with the network where the client network did NOT come up after server reboot. This was happening because during system boot the 10Gbit interfaces goes through multiple resets causing very fast link change. Here is the status of the bond as of that time:
cat /proc/net/bonding/bondeth0

802.3ad info
LACP rate: slow
Aggregator selection policy (ad_select): stable
bond bondeth0 has no active aggregator
The solution for that was to decrease the down_delay to 200. The issue is described in this note:
Bonding Mode 802.3ad Using 10Gbps Network – Slave NICs Fail to Come Up Consistently after Reboot (Doc ID 1621754.1)

 

Categories: linux, oracle Tags: