Shared disk support for VirtualBox

August 9th, 2010 Sve 1 comment

I’m very happy to announce that VirtualBox now supports shared disks. Finally we can attach one disk to several virtual machines and run Oracle RAC and other clusters. As Oracle promised, this feature is released with the next maintenance patch (thanks!).

There is a new image write mode called shareable, and this option is now available for the createhd and modifyhd commands of VBoxManage. To create a new shared image, use the command VBoxManage createhd with type shareable; creating a shared disk from the GUI is not possible. To mark an existing image as shared, use the command VBoxManage modifyhd with type shareable.

Note that only fixed-size disks are supported. If the disk is dynamic, trying to modify the image fails with the following error:
ERROR: Cannot change type for medium ‘/home/vm/ora11g_shared.vdi’ to ‘Shareable’ since it is a dynamic medium storage unit

There is another minor issue: if the image is already attached to two virtual machines, the modifyhd command will also fail:
ERROR: Cannot change the type of medium ‘/home/vm/ora11g_shared.vdi’ because it is attached to 2 virtual machines
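The two steps can be sketched as a small shell function. This is only a sketch, assuming the VBoxManage 3.2.x option syntax (createhd with --filename, --size, --format and --variant, then modifyhd with --type). The function name and the VBOXMANAGE variable are illustrative and not part of VirtualBox. The image is created as fixed size first and only then switched to shareable, which avoids the dynamic-medium error above.

```shell
# Sketch only: create_shared_disk and VBOXMANAGE are illustrative names.
# Set VBOXMANAGE=echo to print the commands instead of running them.
VBOXMANAGE="${VBOXMANAGE:-VBoxManage}"

create_shared_disk() {
    image="$1"      # path of the new VDI file
    size_mb="$2"    # size in megabytes

    # Only fixed-size images can be shareable, so allocate the full size up front.
    "$VBOXMANAGE" createhd --filename "$image" --size "$size_mb" \
        --format VDI --variant Fixed || return 1

    # Switch the write mode before the image is attached to the cluster VMs.
    "$VBOXMANAGE" modifyhd "$image" --type shareable
}
```

For example, create_shared_disk /home/vm/ora11g_shared.vdi 2048 would produce an image like the one shown further down.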

And finally, YES it works, I have tested it already!

sve@host:~$ VBoxManage showhdinfo /home/vm/ora11g_shared.vdi
Oracle VM VirtualBox Command Line Management Interface Version 3.2.8
(C) 2005-2010 Oracle Corporation
All rights reserved.

UUID:                     7521f059-1196-4d68-a1a6-cf0082fb446a
Accessible:               yes
Description:          
Logical size:             2048 MBytes
Current size on disk:     2048 MBytes
Type:                     shareable
Storage format:           VDI
In use by VMs:            labs1 (UUID: 25475ff4-70bc-4e2e-aa38-d8fae289273e)
                          labs2 (UUID: e4441f4c-1ef9-42e0-8e54-d2aec2c6cf4f)
Location:                 /home/vm/ora11g_shared.vdi
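If you want to check this from a script, the output above is easy to parse. A minimal sketch, assuming the field labels stay as printed by version 3.2.8 ("Type:" and "In use by VMs:"); the helper names are made up for this example.

```shell
# Sketch only: both helpers read `VBoxManage showhdinfo` output on stdin.
# Field labels are taken from the 3.2.8 output shown above.

# Print the write mode of the image, e.g. "shareable".
hd_type() {
    awk -F': *' '$1 == "Type" { print $2 }'
}

# Print how many VMs the image is attached to.
hd_vm_count() {
    awk '
        /^In use by VMs:/ { in_list = 1; n++; next }   # first VM is on the label line
        in_list && /^ +[^ ]/ && /UUID:/ { n++; next }  # continuation lines are indented
        { in_list = 0 }
        END { print n + 0 }
    '
}
```

Usage: VBoxManage showhdinfo /home/vm/ora11g_shared.vdi | hd_type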

Regards and happy migration ;)
Sve

Categories: oracle, virtualization

Patch Set 10.2.0.5 for Oracle Database Server re-released on Linux x86

August 9th, 2010 Sve No comments

A week ago Oracle re-released patch set 10.2.0.5 for Oracle Database on Linux x86 (32-bit). It seems that some additional bug fixes were added to the patch set, but I was unable to find out exactly which ones. The patch set is available for download from My Oracle Support under the same number, 8202632. There is also an alert with MOS ID 1156958.1 regarding the re-release of the patch set.

Regards,
Sve

Categories: linux, oracle

Many open files on HP-UX after RAC upgrade to 10.2.0.4 – racgimon file handle leak

July 23rd, 2010 Sve No comments

Two months after patching a customer database to 10.2.0.4 I received a call telling me that the database was hanging. Usually this happens when they miss the backup of the archive logs and the database stops. This time there was enough space available, so that was not the problem. I logged in to the first node and started looking around. Weird things were happening: some commands were failing and others were hanging. Then I realized that this was not an ordinary case and started looking deeper. It turned out that this is an Oracle bug on HP-UX, and there is both a patch and a workaround.

The customer was running HP-UX 11.23 (September 2006) with patch bundles from September 2008. The database was Oracle RAC Enterprise Edition 10.2.0.2.

This problem had a very big impact because, although the database was running in RAC, it was not accessible and there were a lot of locks. Rebooting the node or killing the processes does the job.

After some reading I figured out that this happens only on HP-UX, only after patching the database to 10.2.0.4, and only on the first node.

Here are some symptoms:


Executing sar -v shows the current and maximum size of the system file table (the last used/max pair on each line):

12:00:00   N/A   N/A 328/4200  0  1374/286108 0  41906/65536 0
12:02:00   N/A   N/A 330/4200  0  1376/286108 0  41944/65536 0
12:04:00   N/A   N/A 336/4200  0  1390/286108 0  41999/65536 0
12:06:00   N/A   N/A 331/4200  0  1377/286108 0  41983/65536 0
12:08:00   N/A   N/A 330/4200  0  1376/286108 0  41976/65536 0
12:10:00   N/A   N/A 330/4200  0  1377/286108 0  41935/65536 0


With lsof the following open files are seen:

racgimon   3506 oracle   14u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   28u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   30u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
racgimon   3506 oracle   37u   REG             64,0x9        1552   29678 /oracle/ora10g/dbs/hc_baandb1.dat
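To see how bad the leak is, the lsof output can be summarized per file. A rough sketch, assuming the column layout shown above (PID in the second field, file name in the last one) and file names without spaces; count_duplicate_handles is a name invented for this example.

```shell
# Sketch only: reads lsof output on stdin and prints "<count> <file>" for every
# file that the same process has open more than once.
count_duplicate_handles() {
    awk '
        { key = $2 " " $NF; seen[key]++ }     # $2 = PID, $NF = file name
        END {
            for (k in seen)
                if (seen[k] > 1) {
                    split(k, part, " ")
                    print seen[k], part[2]
                }
        }
    '
}
```

Usage: lsof -p 3506 | count_duplicate_handles. For the four lines above it prints "4 /oracle/ora10g/dbs/hc_baandb1.dat".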


The process holding the open files:

 oracle  3506     1  0  Nov  5  ?        18:16 /oracle/ora10g/bin/racgimon startd baandb


The following error appears every minute in the log “$ORACLE_HOME/log/<NodeName>/racg/imon_<InstanceName>.log”:

2009-12-02 12:12:35.454: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13
2009-12-02 12:13:35.474: [    RACG][73] [3506][73][ora.baandb.baandb1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13


When the file table gets full, weird things start to happen. The following can be seen in the syslog:

Nov  5 08:00:02 db1 vmunix: file: table is full
Nov  5 08:00:03 db1 vmunix: file: table...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:00:03 db1 vmunix: file...
Nov  5 08:01:13 db1 vmunix: file: table is full
Nov  5 08:11:15 db1  above message repeats 34260 times


The following can also be seen in the alert log:

ORA-00603: ORACLE server session terminated by fatal error
ORA-27544: Failed to map memory region for export
ORA-27300: OS system dependent operation:socket failed with status: 23
ORA-27301: OS failure message: File table overflow
ORA-27302: failure occurred at: sskgxpcre1


Solution:
The base bug is 6931689 (SS10204-HP-PARISC64-080216.080324 HEALTH CHECK FAILED TO CONNECT TO INSTANCE), but it’s not public. It’s fixed in CRS 10.2.0.4 Bundle Patch #2, but the current CRS bundle is PSU2 with Patch# 8705958: TRACKING BUG FOR 10.2.0.4.2 PSU FOR CRS, which is around 41 MB.
Patch# 8705958 should be applied to all Oracle homes because, although the bug is in the database, CRS should always be at the same or a higher version.

To apply this patch, the OPatch version must be at least 10.2.0.4.7, which can be downloaded as patch# 6880880. At the time of writing the latest version was 10.2.0.4.9, and it is 34 MB. To install it, simply download it and unzip it under ORACLE_HOME.

I didn’t go with the patch because I read some scary stuff on OTN, and thanks to Ivan Kartik I put in place a dirty workaround. He proposed a very good script which checks whether there are more than 20000 open files and, if so, kills the racgimon process. The sar -v output shows the file table usage dropping once the process is killed:

13:56:00   N/A   N/A 307/4200  0  1352/286108 0  44102/65536 0
13:58:00   N/A   N/A 307/4200  0  1353/286108 0  44119/65536 0
14:00:01   N/A   N/A 309/4200  0  1355/286108 0  44135/65536 0
14:02:01   N/A   N/A 307/4200  0  1353/286108 0  44153/65536 0
14:04:01   N/A   N/A 301/4200  0  1336/286108 0  2583/65536 0
14:06:01   N/A   N/A 306/4200  0  1347/286108 0  2610/65536 0
14:08:01   N/A   N/A 299/4200  0  1333/286108 0  2583/65536 0
14:10:01   N/A   N/A 300/4200  0  1335/286108 0  2571/65536 0
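The idea behind the workaround can be sketched like this. It is not the original script, only my own reconstruction of the approach. It assumes the HP-UX sar -v layout shown above (the eighth field is the file table as used/max), the 20000 threshold mentioned in the text, and the "racgimon startd" command line seen in the ps output. All function names are invented for this example.

```shell
# Sketch only, not the original script. Assumes the HP-UX `sar -v` layout shown
# above, where the eighth field is file-sz as "used/max".
THRESHOLD="${THRESHOLD:-20000}"

# Read sar -v output on stdin, print the used count from the last data line.
file_table_used() {
    awk '$8 ~ /^[0-9]+\/[0-9]+$/ { split($8, sz, "/"); used = sz[1] }
         END { if (used != "") print used }'
}

# Succeed when the given count is above the threshold.
over_threshold() {
    [ "$1" -gt "$THRESHOLD" ]
}

# Print the PIDs of running racgimon daemons from `ps -ef` output on stdin.
racgimon_pids() {
    awk '$0 ~ /racgimon startd/ && $0 !~ /awk/ { print $2 }'
}

# Take one sample and kill racgimon if the file table is too full.
check_and_kill() {
    used=$(sar -v 1 1 | file_table_used)
    if [ -n "$used" ] && over_threshold "$used"; then
        for pid in $(ps -ef | racgimon_pids); do
            kill "$pid"
        done
    fi
}
```

Calling check_and_kill from cron every few minutes would be one way to run it. Test it on a non-production node first, since it kills a clusterware process.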

The workaround fixed the problem. This article was written half a year ago, and according to MOS the bug is now fixed in 10.2.0.5, which was released at the beginning of June.

Regards,
Sve

Categories: hp-ux, oracle

Oracle will bring back VirtualBox shared disk capability

July 1st, 2010 Sve No comments

During the questions section of the last webinar, Introducing Oracle VM VirtualBox 3.2, Oracle said that they had received complaints from a lot of customers using VirtualBox regarding the installation of Oracle RAC. RAC requires a shared disk drive that the nodes (VMs) of the cluster can access simultaneously, and this cannot be achieved directly. There is a workaround using iSCSI, but that is not the point.

Achim Hasenmueller from the VirtualBox engineering team said that they plan to deliver this capability very soon, with the next maintenance release, rather than wait for the next major update. I was surprised to hear that they used to have this feature working but lost it during one of the major changes to the storage stack. I was not able to find this in the changelogs, but by accident I found the announcement of this limitation in a Debian bug report log:

From: "VirtualBox" <trac@virtualbox.org>
Cc: vbox-trac@virtualbox.org
Subject: Re: [VirtualBox] #1188: Please support to share a disk image
 between two guests
Date: Wed, 08 Apr 2009 15:24:49 -0000

#1188: Please support to share a disk image between two guests
-----------------------------+----------------------------------------------
Reporter:  bzed              |        Owner:
    Type:  enhancement       |       Status:  closed
Priority:  minor             |    Component:  VM control
 Version:  VirtualBox 1.5.4  |   Resolution:  wontfix
Keywords:                    |        Guest:  other
    Host:  other             |
-----------------------------+----------------------------------------------
Changes (by frank):

  * status:  new => closed
  * resolution:  => wontfix

Comment:

 Starting with 2.1.0, a disk image can be attached to two VMs at the same
 time, but only one of these two VMs can be powered on at the same time.
 Klaus already explained why we wouldn't implement sharing an image between
 running VMs. Closing

I’ve been using VirtualBox for a year now, but only recently decided to install Oracle RAC. Like most ex-VMware users, I just created a new disk and added it to two virtual machines. The first one started normally, but when I tried to start the second one I got the following error:

Result Code: VBOX_E_INVALID_OBJECT_STATE (0x80BB0007)
Component: Machine
Interface: IMachine {6d9212cb-a5c0-48b7-bbc1-3fa2ba2ee6d2}

It turns out that VirtualBox will not allow more than one running VM to use a VDI file. The solution I found most useful is to set up a third server (or VM) with Openfiler as an iSCSI host. VirtualBox can then transparently present an iSCSI disk to a virtual machine as a virtual hard disk. The guest operating system will not see any difference between a virtual disk image (VDI file) and an iSCSI target. To achieve this, VirtualBox has an integrated iSCSI initiator.

Regards,
Sve

Categories: oracle, virtualization

Patch Set 10.2.0.5 for Oracle Database Server

June 9th, 2010 Sve No comments

Just to mention that a few days ago patch set 10.2.0.5 was released for HP-UX Itanium and IBM AIX systems. The patch set is available for download from My Oracle Support under number 8202632.

Regards,
Sve

Categories: hp-ux, oracle

Oracle 11g R2 installer fails on HP-UX 11iv3

May 20th, 2010 Sve 4 comments

Running the installer of any of the products (client, grid, database) of Oracle Database 11g Release 2 on HP-UX 11iv3 (Itanium) fails with:
“An internal error occurred within cluster verification framework”

After starting ./runInstaller the following error window pops up:
(screenshot: runInstaller error)

The following error can also be seen in installAction$DATE.log:

SEVERE: [FATAL] An internal error occurred within cluster verification framework
Unable to get the current group.

This happens because patch PHCO_40381 is not installed. The list of patches to be installed is in section 2.3.4, Patch Requirement, of the Database Installation Guide for HP-UX.

The first one is:
PHCO_40381 11.31 Disk Owner Patch

The patch is available from ITRC. It is 205 KB in size and fixes the behavior of the diskowner command. Installing the patch does not require a reboot of the server.
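Before running the installer you can check whether the patch is already there. A minimal sketch, assuming the patch shows up by its ID in the output of swlist -l patch; has_patch is a helper name invented for this example.

```shell
# Sketch only: has_patch is an illustrative helper. It reads `swlist -l patch`
# output on stdin and succeeds if the given patch ID is listed, either as a
# product (PHCO_40381) or as one of its filesets (PHCO_40381.<fileset>).
has_patch() {
    awk -v id="$1" '
        { for (i = 1; i <= NF; i++)
              if ($i == id || index($i, id ".") == 1) found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Usage: swlist -l patch | has_patch PHCO_40381 || echo "PHCO_40381 is missing"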

After the installation of the patch, runInstaller starts successfully.

There is also a MOS document regarding this problem:
HP-UX: 11gR2 runInstaller Fails with “An internal error occurred within cluster verification framework” [ID 983713.1]


Regards,
Sve

Categories: hp-ux, oracle

Visiting BGOUG

April 22nd, 2010 Sve No comments

I’ll be visiting the spring conference of the BGOUG this weekend. It will be very interesting since there are some topics related to Sun technologies. Once again we will have a strong foreign presence.

Categories: Uncategorized, oracle

Presentation about Oracle on HP-UX and Linux

January 16th, 2010 Sve No comments

At the last BGOUG I talked about some of the differences between HP-UX and Linux, although they cannot really be compared because they run on different platforms. I tried to figure out how Linux has penetrated the enterprise OS market in recent years, what is still missing, and which features that HP-UX has had for a long time I would like to see in Linux. I also covered memory, best practices in multipathing and networking, storage options, asmlib tips and tricks, and said a few words about backup and recovery.

The presentation can be found here

Categories: hp-ux, linux, oracle

Constant cimprovagt daemon crashing and filling the /var directory

November 23rd, 2009 Sve No comments

We installed two nodes with HP-UX 11.31 March 2009 BOE in a ServiceGuard environment and started test applications in two packages.

Suddenly the /var directories on both nodes started to grow. As a result the cluster kept crashing and the syslog was never up to date. It turned out that one of the components of OnlineDiagnostics (cimprovagt) was crashing. I reviewed a few advisories and bugs about it, but none of them described the same behaviour.

Executing file on the core dump file shows the following:
core:      ELF-64 core file – IA64 from ‘cimprovagt’ – received SIGABRT

HP analyzed the core dump files and determined that the problem is already known and that the fix is included in the September release of DASProvider, which is now part of the DiagProdCollection bundle and can be found here:

http://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=DiagProdCollection

After installing the bundle the daemon stopped crashing and the system is stable now.

Categories: hp-ux

HP-UX software bug hidden in cluster behaviour

October 8th, 2009 Sve 1 comment

I was called to check some strange behavior of a two-node cluster and to find out why one of the nodes had crashed unexpectedly. The two nodes were HP Integrity servers installed with HP-UX 11.31 Base OE (March 2009). Well, the node had not actually crashed; it had been restarted by ServiceGuard because the safety timer expired for some reason. The system log was not up to date because the /var directory had filled up at some point and syslog had stopped writing. The console log showed the standard messages: INIT occurred and safety timer expired. Analyzing the crash dumps revealed that communication with cmcld was not possible, and that is why the server was rebooted, probably because the /var directory was full.

Anyway, a few days later the customer called again and said that the node had been restarted again. I expected to see the same reason, but this time the reboot reason was “Reboot after panic: Fault when executing in kernel mode”. The problem was not in the cluster this time; the reboot reason pointed to a problem in the kernel.

What is a crash anyway? From the HP documentation:
An abnormal system reboot is called a crash. There are many reasons that can cause a system to crash: hardware malfunctions, software panics or even power failures. The crash event type panic refers to crashes initiated by the HP-UX operating system (software crash event). There are two types of panics: direct and indirect. A direct panic refers to a subsystem calling directly the panic() kernel routine upon detection of an unrecoverable inconsistency. An indirect panic refers to a crash event as a result of trap interruption which could not be handled by the operating system, for example when the kernel accesses a non-valid address.

I analyzed the crash dumps and reviewed all the advisories and release notes, but was unable to figure out the cause of the crash. Finally, HP Level 2 support confirmed that this is a known issue with the ONCPlus bundle. ONC stands for Open Network Computing (previously called the NFS bundle in 11.23), and it consists of the following components: Network File System, AutoFS, CacheFS, and Network Information Service. We were told to implement a workaround until the fix was released the following month. The workaround was to add -o readdir to the mount options of the NFS share in the fstab. So it was obvious that the problem was in the NFS component of the ONCPlus bundle.

A few days later (not a month) the new version of the product, with the bug fixed, appeared online. The release notes list the following defect fix:
Directory related operations on NFS client with ONCplus B.11.31.06 or B.11.31.07 installed and with file system mounted with read/write size greater than 8192 bytes, may result in system panic or data corruption.

Yes, our ONCPlus bundle was 11.31.06 and we had mounted the NFS share with a read/write size of 32768 bytes. Both the workaround and the patch seemed to fix the problem, and the crash never appeared again. Keep in mind that installing the new ONCPlus bundle requires a restart and applying the workaround does not, BUT HP support advised us to reboot the server anyway, just to make sure that no corruption remained loaded in memory. So if you hit this bug, consider applying the new bundle.
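To find out quickly whether a server is exposed, you can look for NFS entries in the fstab with a read or write size above 8192 bytes. A rough sketch, assuming the usual fstab field order (device, mount point, type, options) and that the sizes are set with rsize= and wsize=; large_nfs_mounts is a name invented for this example.

```shell
# Sketch only: reads fstab-style lines on stdin and prints the mount point of
# every NFS entry whose rsize or wsize is above 8192 bytes.
# Assumes the usual field order: device, mount point, type, options, ...
large_nfs_mounts() {
    awk '
        /^[ \t]*#/ || NF < 4 { next }          # skip comments and short lines
        $3 != "nfs" { next }
        {
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++)
                if (opt[i] ~ /^[rw]size=[0-9]+$/) {
                    split(opt[i], kv, "=")
                    if (kv[2] + 0 > 8192) { print $2; break }
                }
        }
    '
}
```

Usage: large_nfs_mounts < /etc/fstab. Mounts that rely on default sizes are not reported, so an empty result is not proof that the server is safe.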

The latest ONCPlus bundle can be downloaded from here:

https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=ONCplus

Just for reference, the following stack trace is dumped on the console when the server crashes:

bad_kern_reference: 0xffff31.0x2c20486f6d65634f, fault = 0x8

Message buffer contents after system crash:

panic: Fault when executing in kernel mode
Stack Trace:
IP                  Function Name
0xe000000001f887e0  bad_kern_reference+0xa0
0xe00000000076a3d0  $cold_vfault+0x3b0
0xe000000000c45a10  vm_hndlr+0x510
0xe000000001bd9780  bubbledown+0x0
0xe000000000d00da1  vx_iupdat_cluster+0xa1
0xe000000000d14830  vx_async_iupdat+0x160
0xe000000000d4a530  vx_iupdat_local+0x2c0
0xe000000000d8c020  vx_iupdat+0xb0
0xe000000002134ed0  vx_iflush_list+0x4d0
0xe000000000afa8c0  vx_iflush+0x1d0
0xe000000000cf2710  vx_worklist_thread+0x200
0xe000000000e65d70  kthread_daemon_startup+0x90

Regards,
sve

Categories: hp-ux
