Oracle GI 12.1 error when using NFS

January 16th, 2014

I recently had quite an interesting case where I had to build a stretch cluster for a customer using Oracle GI 12.1, placing the quorum voting disk on NFS. There is a document on OTN about stretch clusters and using NFS as a third location for the voting disk, but at the moment it covers 11.2 only. Assuming there was no difference in the NFS parameters, I used the Linux mount parameters from that document and mounted the NFS share on the cluster nodes.

Later on, when I tried to add the third voting disk to the ASM disk group, I got this strange error:

SQL> ALTER DISKGROUP OCRVOTE ADD  QUORUM DISK '/vote_nfs/vote_3rd' SIZE 10000M /* ASMCA */
Thu Nov 14 11:33:55 2013
NOTE: GroupBlock outside rolling migration privileged region
Thu Nov 14 11:33:55 2013
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_26408.trc:
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
Thu Nov 14 11:33:55 2013
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_33427.trc:
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
NOTE: Assigning number (1,3) to disk (/vote_nfs/vote_3rd)
NOTE: requesting all-instance membership refresh for group=1
Thu Nov 14 11:33:55 2013
ORA-15025: could not open disk "/vote_nfs/vote_3rd"
ORA-17503: ksfdopn:3 Failed to open file /vote_nfs/vote_3rd
ORA-17500: ODM err:Operation not permitted
WARNING: Read Failed. group:1 disk:3 AU:0 offset:0 size:4096
path:Unknown disk
incarnation:0xeada1488 asynchronous result:'I/O error'
subsys:Unknown library krq:0x7f715f012d50 bufp:0x7f715e95d600 osderr1:0x0 osderr2:0x0
IO elapsed time: 0 usec Time waited on I/O: 0 usec
NOTE: Disk OCRVOTE_0003 in mode 0x7f marked for de-assignment
Errors in file /install/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ora_33427.trc  (incident=83441):
ORA-00600: internal error code, arguments: [kfgscRevalidate_1], [1], [0], [], [], [], [], [], [], [], [], []
ORA-15080: synchronous I/O operation failed to read block 0 of disk 3 in disk group OCRVOTE

This happens because 12c uses Direct NFS by default, and Direct NFS initiates its connections from ports above 1024. The NFS server, on the other hand, has a default export option called secure, which only accepts incoming connections from ports below 1024. From the exports man page:
secure This option requires that requests originate on an internet port less than IPPORT_RESERVED (1024). This option is on by default. To turn it off, specify insecure.

The solution is to add the insecure option to the export on the NFS server, remount the NFS share and then retry the operation above.
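To double-check the export before retrying, something like the following could be used. It is only a minimal sketch: the function name, the /vote_nfs export path and the node1/node2 client names are made up for illustration, and it understands only the simple "path client(options)" form of an exports line, not default options given with a leading dash.

```python
import re

def clients_requiring_reserved_ports(exports_line):
    """Return the client specs in one exports line that do not set 'insecure'.

    Hypothetical helper: 'secure' is the default, so any client whose
    option list lacks 'insecure' still requires source ports below 1024.
    """
    # Drop trailing comments and surrounding whitespace.
    line = exports_line.split("#", 1)[0].strip()
    if not line:
        return []
    specs = line.split()[1:]
    if not specs:
        # A bare path is exported to everyone with default options.
        return ["*"]
    strict = []
    for spec in specs:
        match = re.fullmatch(r"([^()]*)\(([^()]*)\)", spec)
        if match:
            client, options = match.group(1), match.group(2).split(",")
        else:
            # Client listed without an option list: defaults apply.
            client, options = spec, []
        if "insecure" not in options:
            strict.append(client or "*")
    return strict

# Example with made-up client names: node1 would still reject Direct NFS.
print(clients_requiring_reserved_ports(
    "/vote_nfs node1(rw,sync) node2(rw,sync,insecure)"))
```

An empty list means every listed client already has insecure set, so connections from ports above 1024 should be accepted once the share is re-exported and remounted.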

For more information refer to:
12c GI Installation with ASM on NFS Disks Fails with ORA-15018 ORA-15072 ORA-15080 (Doc ID 1555356.1)

 

Categories: linux, oracle

HP-UX software bug hidden in cluster behaviour

October 8th, 2009

I was called to check some strange behavior of a two-node cluster and to find out why one of the nodes had crashed unexpectedly. The two nodes were HP Integrity servers running HP-UX 11.31 Base OE (March 2009). Well, the node had not actually crashed; it had been restarted by ServiceGuard because the safety timer expired for some reason. The system log was not up to date because the /var directory had filled up at some point and syslog had stopped writing. The console log showed the standard messages: INIT occurred and safety timer expired. Analyzing the crash dumps revealed that communication with cmcld had not been possible, and that is why the server was rebooted, probably because the /var directory was full.

Anyway, a few days later the customer called again and said that the node had been restarted again. I expected to see the same cause, but this time the reboot reason was “Reboot after panic: Fault when executing in kernel mode”. The problem was not in the cluster this time; the reboot reason pointed to a problem in the kernel.

What is a crash anyway? From the HP documentation:
An abnormal system reboot is called a crash. There are many reasons that can cause a system to crash: hardware malfunctions, software panics or even power failures. The crash event type panic refers to crashes initiated by the HP-UX operating system (software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem calling directly the panic() kernel routine upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event as a result of trap interruption which could not be handled by the operating system, for example when the kernel accesses a non-valid address.

I analyzed the crash dumps, reviewed all the advisories and release notes, and was unable to figure out the cause of the crash. Finally, HP Level 2 support confirmed that this was a known issue with the ONCplus bundle. ONC stands for Open Network Computing (previously called the NFS bundle in 11.23) and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement a workaround until the fix was released the following month. The workaround was to add -o readdir to the mount options of the NFS share in fstab. So it was clear that the problem was in the NFS component of the ONCplus bundle.

A few days later (not a month) the new version of the product, with the bug fixed, appeared online. The release notes list the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.

Yes, the ONCplus bundle was 11.31.06 and we had mounted the NFS share with a read/write size of 32768 bytes. Both the workaround and the patch seemed to fix the problem, and the crash never appeared again. Keep in mind that installing the new ONCplus bundle requires a restart while applying the workaround does not, BUT HP support advised us to reboot the server anyway, just to make sure that no corruption was still loaded in memory. So if you hit this bug, consider applying the new bundle.
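If you want to see quickly whether a box matches the condition in that defect description, a rough check of the fstab entries could look like the sketch below. The function name, the server name and the mount points are made up for illustration, and it assumes the usual whitespace-separated fstab layout (device, directory, type, options). It only looks at rsize/wsize values written explicitly in the options; a mount that relies on the default sizes is not detected, and the effective default may also be above 8192.

```python
def nfs_mounts_over_8k(fstab_text, limit=8192):
    """List NFS mounts whose explicit rsize or wsize exceeds the limit.

    Hypothetical helper: returns (mount_point, sizes) tuples, where sizes
    holds the rsize/wsize values found in the options field.
    """
    risky = []
    for raw in fstab_text.splitlines():
        # Ignore comments, then split into whitespace-separated fields.
        fields = raw.split("#", 1)[0].split()
        if len(fields) < 4 or not fields[2].startswith("nfs"):
            continue
        mount_point, options = fields[1], fields[3].split(",")
        sizes = {}
        for opt in options:
            key, _, value = opt.partition("=")
            if key in ("rsize", "wsize") and value.isdigit():
                sizes[key] = int(value)
        if any(size > limit for size in sizes.values()):
            risky.append((mount_point, sizes))
    return risky

# Made-up fstab content for illustration only.
sample = """\
nfssrv:/export/data /data nfs rw,rsize=32768,wsize=32768 0 0
/dev/vg00/lvol3 / vxfs delaylog 0 1
nfssrv:/export/home /home nfs rw,rsize=8192,wsize=8192 0 0
"""
print(nfs_mounts_over_8k(sample))
```

With the sample above only /data is reported, since it is the only NFS entry with a read/write size above 8192 bytes.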

The latest ONCplus bundle can be downloaded from here:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=ONCplus

Just for reference, the following stack trace is dumped on the console when the server crashes:

bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8

Message buffer contents after system crash:

panic: Fault when executing in kernel mode
Stack Trace:
IP                  Function Name
0xe000000001f887e0  bad_kern_reference+0xa0
0xe00000000076a3d0  $cold_vfault+0x3b0
0xe000000000c45a10  vm_hndlr+0x510
0xe000000001bd9780  bubbledown+0x0
0xe000000000d00da1  vx_iupdat_cluster+0xa1
0xe000000000d14830  vx_async_iupdat+0x160
0xe000000000d4a530  vx_iupdat_local+0x2c0
0xe000000000d8c020  vx_iupdat+0xb0
0xe000000002134ed0  vx_iflush_list+0x4d0
0xe000000000afa8c0  vx_iflush+0x1d0
0xe000000000cf2710  vx_worklist_thread+0x200
0xe000000000e65d70  kthread_daemon_startup+0x90

Regards,
sve

Categories: hp-ux