Exhaust of Windows 2008 heap memory with Oracle Database 188.8.131.52
Recently I had an interesting setup for one of our customers. Because they got Oracle Standard Edition and Windows 2008 Server R2 Standard Edition licenses I was asked to create HA database installation. After looking around I found few docs about installing Standard Edition with Clusterware and I had some ideas. Finally I installed Grid Infrastructure on both servers and Oracle Database binaries. Then created single instance database on the second server and replicated the configuration to the first one. Currently the relocation of the database is done manually, but one could create a start/stop/monitor scripts and integrate these with GI. Once the database starts it's registering at the scan listener so in theory it's running in HA (just the relocation is manual) :)
So during the weekend I received mail from my colleagues above error messages they received from the database: connect error, Socket read timed out. It wasn't a rush as the database is not yet in production, but it's ahead and this was the first task for the Monday. Next day I looked around and everything was up and running, except that I wasn't able to login through the listener and I also wasn't able to stop or relocate it. Looking at the logs I found at some point the following message: TNS-12531: TNS:cannot allocate memory which explains the previous message.
That was weird, the server on which error appeared was the first one and had only GI running and SCAN LISTENER. This really looked like a memory leak, it's a Windows so maybe that was obvious. I decided to look around the processes using the Resource Monitor when I found a lot of many cmd.exe processes. To confirm the problem I used Process Explorer which is a very nice tool for Windows. As could be seen below I've got plenty of cmd processes which were spawned, but not (obviously) closed after completion:
It turned out that this is a bug for 184.108.40.206 and Windows (64 bit). The Oracle CVU resource (ora.cvu), which by default is started on the first node in the cluster (this makes sense now) it's doing checks on every six hours (CHECK_INTERVAL=21600) and leaves process open. Because of this the heap memory is exhausted and that's the reason why the SCAN LISTENER is failing and giving the error message TNS-12531: TNS:cannot allocate memory
The following errors could be seen in Windows Eventlog, once the patch is applied the errors disappeared:
Faulting application lsnrctl.exe, version 220.127.116.11, time stamp 0x4cea8f55, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1eac, application start time 0x01cc6ab588f992c0. Faulting application cmd.exe, version 6.0.6001.18000, time stamp 0x47918bde, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb733e1, exception code 0xc0000142, fault offset 0x0006f1e7, process id 0x1004, application start time 0x01cc6af0fa982500. Faulting application sclsspawn.exe, version 0.0.0.0, time stamp 0x4ce622a7, faulting module kernel32.dll, version 6.0.6001.18538, time stamp 0x4cb73957, exception code 0xc0000142, fault offset 0x00000000000b1b48, process id 0x1ca0, application start time 0x01cc6c0e5efd5380.
This is the bug at MOS:
Bug 12529945: CVU HEALTH CHECKS EXHAUST WINDOWS HEAP MEMORY
The bug should have been fixed in BP8, but I applied the latest one BP10:
Patch 12849789: ORACLE 11G 18.104.22.168 PATCH 10 BUG FOR WINDOWS (64-BIT AMD64 AND INTEL EM64)