Archive

Archive for the ‘hp-ux’ Category

Presentation about Oracle on HP-UX and Linux

January 16th, 2010 No comments

At the last BGOUG I talked about some of the differences between HP-UX and Linux, although a direct comparison is difficult because they run on different platforms. I tried to trace how Linux has penetrated the enterprise OS market in recent years, what is still missing, and which features that HP-UX has had for a long time I would like to see in Linux. I also covered memory management, best practices in multipathing and networking, storage options, asmlib tips and tricks, and a few words about backup and recovery.

The presentation can be found here

Categories: hp-ux, linux, oracle Tags:

Constant cimprovagt daemon crashing and filling the /var directory

November 23rd, 2009 No comments

We installed two nodes with HP-UX 11.31 March 2009 BOE in a ServiceGuard environment and started test applications in two packages.

Suddenly the /var directories on both nodes started to grow, the cluster was crashing because of that, and the syslog was never up to date. It turned out that a component (cimprovagt) of OnlineDiagnostics was crashing. I reviewed a few advisories and bugs about it, but none of them described the same behaviour.

Running the file command on the core dump shows the following:
core:      ELF-64 core file – IA64 from ‘cimprovagt’ – received SIGABRT

HP analyzed the core dump files and determined that the problem was already known; the fix is implemented in the September release of DASProvider, which is now part of the DiagProdCollection bundle and can be found here:
http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=DiagProdCollection

After installing the bundle the daemon stopped crashing and the system is stable now.

Categories: hp-ux Tags:

HP-UX software bug hidden in cluster behaviour

October 8th, 2009 1 comment

I was called to check some strange behaviour of a two-node cluster and to see why one of the nodes had crashed unexpectedly. The two nodes were HP Integrity servers installed with HP-UX 11.31 Base OE (March 2009). Well, the node did not crash; it was restarted by ServiceGuard after a safety timer expired for some reason. The system log was not up to date because the /var directory had filled up at some point and syslog had stopped writing. The console log showed the standard messages: INIT occurred and safety timer expired. Analyzing the crash dumps revealed that communication with cmcld was not possible, and that is why the server was rebooted, probably because the /var directory was full.

Anyway, a few days later the customer called again and said that the node had been restarted again. I expected to see the same reason, but this time the reboot reason was “Reboot after panic: Fault when executing in kernel mode”. The problem was not in the cluster this time; the reboot reason pointed to a problem in the kernel.

What is a crash anyway? From the HP documentation:
An abnormal system reboot is called a crash. Many things can cause a system to crash: hardware malfunctions, software panics or even power failures. The crash event type panic refers to crashes initiated by the HP-UX operating system (a software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem calling the panic() kernel routine directly upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event resulting from a trap interruption which could not be handled by the operating system, for example when the kernel accesses an invalid address.

I analyzed the crash dumps, reviewed all the advisories and release notes, and was unable to figure out the cause of the crash. Finally, HP Level 2 support confirmed that this is a known issue with the ONCPlus bundle. ONC stands for Open Network Computing (previously called the NFS bundle in 11.23) and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement a workaround until the fix was released the following month. The workaround was to add -o readdir to the mount options of the NFS share in the fstab. It was obvious that the problem was in the NFS component of the ONCPlus bundle.
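As a sketch of the workaround, an fstab entry for such an NFS mount might look like the line below. The server name, export path and mount point are placeholders; the rsize/wsize values match the 32768-byte configuration described in this post:

```
# /etc/fstab entry with the readdir workaround appended to the options
nfssrv:/export/data  /data  nfs  rsize=32768,wsize=32768,readdir  0  0
```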

A few days later (not a month) the new product (with the bugs fixed) appeared online. The release notes list the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.

Yes, our ONCPlus bundle was B.11.31.06 and we had mounted the NFS share with a read/write size of 32768 bytes. Both the workaround and the patch seemed to fix the problem, and the crash never appeared again. Keep in mind that installing the new ONCPlus bundle requires a restart and applying the workaround does not, BUT support advised us to reboot the server anyway, just to make sure no corruption was still loaded in memory. So if you hit this bug, consider applying the new bundle.

The latest ONCPlus bundle can be downloaded from here:
https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=ONCplus

Just for reference, the following stack trace is dumped on the console when the server crashes:

bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8

Message buffer contents after system crash:

panic: Fault when executing in kernel mode
Stack Trace:
IP                  Function Name
0xe000000001f887e0  bad_kern_reference+0xa0
0xe00000000076a3d0  $cold_vfault+0x3b0
0xe000000000c45a10  vm_hndlr+0x510
0xe000000001bd9780  bubbledown+0x0
0xe000000000d00da1  vx_iupdat_cluster+0xa1
0xe000000000d14830  vx_async_iupdat+0x160
0xe000000000d4a530  vx_iupdat_local+0x2c0
0xe000000000d8c020  vx_iupdat+0xb0
0xe000000002134ed0  vx_iflush_list+0x4d0
0xe000000000afa8c0  vx_iflush+0x1d0
0xe000000000cf2710  vx_worklist_thread+0x200
0xe000000000e65d70  kthread_daemon_startup+0x90

Regards,
sve

Categories: hp-ux Tags: ,

vxfs extendfs: Invocation of the fsck program terminated abnormally

September 20th, 2009 2 comments

You will see this message if you interrupt the extendfs command. In my case I passed a block device as an argument to the extendfs command and waited for several minutes (usually it takes one or two); then I remembered that the command expects a character device. I read some opinions on the Internet and some of them were really horrifying, like: “Well, when you pass a block device to extendfs you have to format it and restore the filesystem from backup“.
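For reference, on HP-UX the character (raw) device of an LVM logical volume is the block device path with an r prefixed to the volume name. A tiny sketch of deriving one from the other (using the volume from this post):

```shell
# Derive the character device path from the block device path:
#   block:     /dev/vg00/lvora
#   character: /dev/vg00/rlvora
lv=/dev/vg00/lvora
raw=$(dirname "$lv")/r$(basename "$lv")
echo "$raw"
```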

Yes, it was stupid to press Ctrl-C, but when people panic they don’t think clearly.
root@host:/# extendfs -F vxfs /dev/vg00/lvora

vxfs extendfs: Invocation of the fsck program terminated abnormally.
The file system is marked bad.  Run full fsck manually.
(e.g. fsck -F vxfs /dev/vg00/lvora)

Next step was to make a normal filesystem check:
root@host:/# fsck /dev/vg00/lvora
log replay in progress
log replay failed to clean file system
file system is not clean, full fsck required
full file system check required, exiting …

Mount also says that the filesystem is corrupted:
root@host:/# mount /oracle

vxfs mount: /dev/vg00/lvora is corrupted. needs checking

After reading some manuals and opinions I found that in this version of HP-UX it doesn’t matter which device is passed, because extendfs accepts both. Luckily, fsck fixes the problem:
root@host:/root# fsck -y -F vxfs /dev/vgora/lvtest
log replay in progress
log replay failed to clean file system
file system is not clean, full fsck required
pass0 – checking structural files
pass1 – checking inode sanity and blocks
pass2 – checking directory linkage
pass3 – checking reference counts
pass4 – checking resource maps
au 0 summary incorrect – fix? (ynq)y
au 1 summary incorrect – fix? (ynq)y
au 2 summary incorrect – fix? (ynq)y
au 3 summary incorrect – fix? (ynq)y
au 4 summary incorrect – fix? (ynq)y
au 5 summary incorrect – fix? (ynq)y
………………….
au 3500 summary incorrect – fix? (ynq)y
au 3501 summary incorrect – fix? (ynq)y
au 3502 summary incorrect – fix? (ynq)y
au 3503 summary incorrect – fix? (ynq)y
au 3504 summary incorrect – fix? (ynq)y
au 3505 emap incorrect – fix? (ynq)y
au 3505 summary incorrect – fix? (ynq)y
au 3506 emap incorrect – fix? (ynq)y
au 3506 summary incorrect – fix? (ynq)y
au 3507 emap incorrect – fix? (ynq)y
au 3507 summary incorrect – fix? (ynq)y
…………………
au 4653 summary incorrect – fix? (ynq)y
au 4654 emap incorrect – fix? (ynq)y
au 4654 summary incorrect – fix? (ynq)y
au 4655 emap incorrect – fix? (ynq)y
au 4655 summary incorrect – fix? (ynq)y
free block count incorrect 1292071477 expected 39009935 fix? (ynq)y
free extent vector incorrect fix? (ynq)y
OK to clear log? (ynq)y
set state to CLEAN? (ynq)y

After fsck finishes, the filesystem is extended and can be mounted:
root@host:/root# mount /dev/vgora/lvtest /mnt
root@host:/root#

Categories: hp-ux Tags:

Cannot start HP Integrity Virtual Machines Manager

September 19th, 2009 No comments

System: HP-UX 11.31 DCOE March 2009, with HP-SIM and HPVM installed

This error is shown on the System Management Homepage when I try to open HP Integrity Virtual Machines Manager:

An error occurred collecting data query failed.
An error occurred communicating with WBEM: CIM_ERR_FAILED CIM_ERR_FAILED: @1:An internal error has occurred.[hpvm_get_rsrc_controller:115:scheduler failure]

This looked as if the HP Integrity Virtual Machines were not running, and when I tried to start them I got the following:

root@itan2:/# /sbin/init.d/hpvm start
NOTE:   HPSIM-HP-UX is incompatible with Integrity VM software and should be removed.
ERROR:   Integrity VM software cannot be started when hyperthreading is enabled
(getconf SC_HT_ENABLED). Use /usr/sbin/setboot -m off and reboot to enable
this system as an Integrity VM host.
root@itan2:/# getconf SC_HT_ENABLED
1

On pages 24 and 26 of the HP Integrity Virtual Machines Version 4.1 Installation, Configuration, and Administration manual there are a few requirements. In this particular case the two items blocking Integrity VM Version 4.1 from starting are the HP System Insight Manager (HP SIM) Server bundle and hyperthreading. First check for the installed HPSIM product with the following command:

swlist | grep HPSIM-HP-UX

Remove it with swremove if necessary.
Then check whether hyperthreading is enabled. On page 26 of the manual there is a note regarding hyperthreading:

NOTE: Integrity VM Version 4.1 does not support hyperthreading. Specify the following command to turn off hyperthreading; otherwise, Integrity VM will not start:

/usr/sbin/setboot -m off

Reboot the system; HP Integrity Virtual Machines will then start normally, and HP Integrity Virtual Machines Manager will be available in the HP SMH.

Categories: hp-ux Tags:

Many racgmain(check) processes at HP-UX 11iv3

August 17th, 2009 9 comments

I was called because some commands for controlling the cluster and Oracle were not working. This was a two-node cluster installed with Oracle 10.2.0.4 RAC on HP-UX 11.31 Data Center OE (December 2008), already running for a month.

Arriving at the customer site I noticed a lot (around 500) of hanging racgmain(check) processes, which were obviously blocking some of the cluster commands. Errors can also be seen in this log: $CRS_HOME/log/$HOSTNAME/crsd/crsd.log:

2009-04-08 15:22:01.700: [  CRSEVT][90801] CAAMonitorHandler :: 0:Action Script /oracle/ora10g/bin/racgwrap(check) timed
out for ora.ORCL.ORCL1.inst! (timeout=600)
2009-04-08 15:22:01.700: [  CRSAPP][90801] CheckResource error for ora.ORCL.ORCL1.inst error code = -2
2009-04-08 15:25:42.180: [  CRSEVT][90811] CAAMonitorHandler :: 0:Could not join /oracle/ora10g/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0, other: Abnormal termination of the child

There are a lot of bugs on Metalink, but no documents or suggestions on how to fix this.

Fortunately we found a solution:

1. Stop CRS on all nodes.

2. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin on all nodes

3. Edit the racgwrap file and change the last three lines from:

$ORACLE_HOME/bin/racgmain “$@”
status=$?
exit $status

to:

exec $ORACLE_HOME/bin/racgmain “$@”

4. Restart CRS and make sure that all the resources start.
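The point of the exec change is that the wrapper shell is replaced by racgmain instead of forking it and waiting on it, so no extra racgwrap process lingers. A small portable illustration (plain sh, nothing Oracle-specific) of exec keeping the same process:

```shell
# With exec, the shell *becomes* the new program: the PID does not change.
# The outer shell prints its PID, then execs another shell that prints its
# own PID -- the two values come out identical.
sh -c 'echo $$; exec sh -c "echo \$\$"'
```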

We were lucky to hit the bug just before the migration, when restarting the instances/servers was easy enough. I don’t know if this really solves the problem, but we never hit the bug again.

Categories: hp-ux, oracle Tags: ,

Changing physical path of ASM disk group

August 11th, 2009 6 comments

The purpose of this document is to show that changing the physical path of ASM disk group MEMBERS is possible and carries no risk.

For the purpose of the test, we create one logical volume called lvora and we grant ownership of this file to oracle:
root@node1:/# lvcreate -n lvora -L 1024 vg00
root@node1:/# chown oracle:dba /dev/vg00/rlvora

Start DBCA and create ASM instance:
– set sys password
– set data group name to DATA
– set redundancy to External
– set Disk Discovery Path to /dev/vg00/rlv*

At this stage only /dev/vg00/rlvora is a CANDIDATE disk for the disk group, with a size of 1 GB.
Select the disk and create the disk group. Now we have one mounted disk group called DATA with external redundancy, using /dev/vg00/rlvora as a MEMBER of the disk group.

To simulate a change (or failure) of the physical disk, or even moving data from one physical disk to another, we use dd to copy the raw data from /dev/vg00/rlvora to /dev/rdsk/c0t2d0 and then delete the logical volume.

We shut down the ASM instance and copy the contents of the logical volume to the raw physical disk using dd:

oracle@node1:/home/oracle$ export ORACLE_HOME=/oracle/ora10g
oracle@node1:/home/oracle$ export ORACLE_SID=+ASM
oracle@node1:/home/oracle$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 13 01:50:38 2007

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to:
Oracle Database 10g Release 10.2.0.1.0 - 64bit Production
With the Real Application Clusters option

SQL> select GROUP_NUMBER, NAME, STATE, TYPE from v$asm_diskgroup;

GROUP_NUMBER NAME                           STATE       TYPE
------------ ------------------------------ ----------- ------
1 DATA                           MOUNTED     EXTERN

SQL> select GROUP_NUMBER, DISK_NUMBER, MODE_STATUS, STATE, NAME, PATH from v$asm_disk;

GROUP_NUMBER DISK_NUMBER MODE_ST STATE    NAME      PATH
------------ ----------- ------- -------- --------- ----------------
1             0          ONLINE  NORMAL   DATA_0000 /dev/vg00/rlvora

SQL> shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL>  exit

oracle@node1:/home/oracle$ exit

root@node1:/root# chown oracle:dba /dev/rdsk/c0t2d0

root@node1:/root# dd if=/dev/vg00/rlvora of=/dev/rdsk/c0t2d0 bs=1024k
1024+0 records in
1024+0 records out
root@node1:/root#  lvremove /dev/vg00/lvora
The logical volume "/dev/vg00/lvora" is not empty;
do you really want to delete the logical volume (y/n) : y
Logical volume "/dev/vg00/lvora" has been successfully removed.
Volume Group configuration for /dev/vg00 has been saved in /etc/lvmconf/vg00.conf

We have moved the data to /dev/rdsk/c0t2d0 and removed the logical volume.
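The dd step above can be rehearsed safely on ordinary files before touching real devices. This sketch, with temporary files standing in for the devices, just confirms that dd moves the bytes verbatim, which is all ASM needs:

```shell
# dd copies the raw bytes unchanged, so any metadata in them survives.
src=$(mktemp); dst=$(mktemp)
dd if=/dev/urandom of="$src" bs=1024 count=64 2>/dev/null   # fake "disk"
dd if="$src" of="$dst" bs=1024 2>/dev/null                  # the copy step
cmp -s "$src" "$dst" && echo "copies are identical"
rm -f "$src" "$dst"
```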

Now if you try to mount the disk group or start the instance you will get the following error:

oracle@node1:/home/oracle$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 13 02:05:48 2007

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ASM instance started

Total System Global Area  130023424 bytes
Fixed Size                  1991968 bytes
Variable Size             102865632 bytes
ASM Cache                  25165824 bytes
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"

SQL> select GROUP_NUMBER, NAME, STATE, TYPE from v$asm_diskgroup;

no rows selected

SQL> select GROUP_NUMBER, DISK_NUMBER, MODE_STATUS, STATE, NAME, PATH from v$asm_disk;

no rows selected

SQL> show parameter diskstring

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/vg00/rlv*

As you can see, the discovery path still points to /dev/vg00/rlv*. Now we will change the disk discovery path by pointing the asm_diskstring parameter to the new location of the disk, and mount the disk group:

SQL> alter system set asm_diskstring='/dev/rdsk/*' scope=both;

System altered.

SQL> select GROUP_NUMBER, DISK_NUMBER, MODE_STATUS, STATE, NAME, PATH from v$asm_disk;

GROUP_NUMBER DISK_NUMBER MODE_ST STATE    NAME      PATH
------------ ----------- ------- -------- --------- ----------------
0            0           ONLINE  NORMAL             /dev/rdsk/c0t2d0

SQL> alter diskgroup data mount;

Diskgroup altered.

SQL> select GROUP_NUMBER, DISK_NUMBER, MODE_STATUS, STATE, NAME, PATH from v$asm_disk;

GROUP_NUMBER DISK_NUMBER MODE_ST STATE    NAME      PATH
------------ ----------- ------- -------- --------- ----------------
1            0           ONLINE  NORMAL   DATA_0000 /dev/rdsk/c0t2d0

SQL> select GROUP_NUMBER, NAME, STATE, TYPE from v$asm_diskgroup;

GROUP_NUMBER NAME                           STATE       TYPE
------------ ------------------------------ ----------- ------
1 DATA                           MOUNTED     EXTERN

SQL> show parameter diskstring;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
asm_diskstring                       string      /dev/rdsk/*

Final test to show that the changes are applied:

SQL> shutdown immediate
ASM diskgroups dismounted
ASM instance shutdown
SQL> startup
ASM instance started

Total System Global Area  130023424 bytes
Fixed Size                  1991968 bytes
Variable Size             102865632 bytes
ASM Cache                  25165824 bytes
ASM diskgroups mounted
SQL> exit
Disconnected from Oracle Database 10g Release 10.2.0.1.0 - 64bit Production
With the Real Application Clusters option
oracle@node1:/home/oracle$

Conclusion
ASM does not keep track of the physical disks of its disk groups. Put another way, the path and the minor/major numbers of the physical disks do not matter, because the metadata is kept on the disk itself and nothing is stored in a dictionary. When the ASM instance starts, it scans the disks matching the asm_diskstring parameter and reads the header information of the discovered disks.
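A toy model of that discovery step, with a made-up marker check and ordinary files in place of raw devices (Oracle's real on-disk header format is not reproduced here):

```shell
# Scan every path matching a "diskstring" glob and keep the ones whose
# header carries a member marker -- the same idea as asm_diskstring.
dir=$(mktemp -d)
printf 'ORCLDISK....' > "$dir/disk1"   # pretend member disk
printf 'whatever....' > "$dir/disk2"   # not a member
found=""
for d in "$dir"/disk*; do
  head -c 8 "$d" | grep -q ORCLDISK && found="$found$(basename "$d") "
done
echo "found: $found"
rm -rf "$dir"
```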

Categories: hp-ux, oracle Tags: , ,

Migration of HP-UX raw devices to Oracle ASM

August 10th, 2009 No comments

This is an article I wrote about migrating Oracle datafiles from LVM raw devices to Oracle ASM.

Categories: hp-ux, oracle Tags: ,