Archive

Posts Tagged ‘dcd’

Dead Connection Detection in Oracle Database 12c

April 7th, 2016 No comments

In my earlier post I discussed what Dead Connection Detection is and why you should use it – read more here Oracle TNS-12535 and Dead Connection Detection

The pre-12c implementation of DCD used TNS packages to “ping” the client and relied on the underlying TCP stack which sometimes may take longer. Now in 12c this has changed and DCD probes are implemented by TCP Stack. The DCD probes will now use the TCP KEEPALIVE socket option to check if the connection is still usable.

To use the new implementation set the SQLNET.EXPIRE_TIME in sqlnet.ora to the amount of time between the probes in minutes. If the operating system supports TCP keep-alive tuning then Oracle Net automatically uses the new method. The new mechanism is supported on all platforms except on Solaris.

The following parameters are associated with the TCP keep-alive probes:
TCP_KEEPIDLE  – specifies the timeout of no activity until the probe is sent. The parameter takes value from SQLNET.EXPIRE_TIME.
TCP_KEEPCNT   – number of keep-alive probes to be sent, it is always set to 10.
TCP_KEEPINTVL – specifies the delay between probes if a keep-alive packets are sent and no acknowledgment is received, it is always set to 6.

If you need to revert to the pre-12c DCD mechanism (10 bytes TNS data) add the following parameters in sqlnet.ora:
USE_NS_PROBES_FOR_DCD=true

 

Categories: oracle Tags: ,

Oracle TNS-12535 and Dead Connection Detection

March 31st, 2016 2 comments

These days everything goes to the cloud or it has been collocated somewhere in a shared infrastructure. In this post I’ll talk about sessions being disconnected from your databases, firewalls and dead connection detection.

Changes

We moved number of 11g databases from one data centre to another.

Symptoms

Now probably many of you have seen the following error in your database alertlog “TNS-12535: TNS:operation timed out” or if you haven’t you will definitely see it some day.

Consider the following error from database alert log:

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 12-MAR-2015 10:28:08
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505

TNS-00505: Operation timed out
nt secondary err code: 110
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))
Thu Mar 12 10:28:09 2015

Now this error indicate timing issues between the server and the client. It’s important to mention that those errors are RESULTANT, they are informational and not the actual cause of the disconnect. Although this error might happen for number of reasons it is commonly associated with firewalls or slow networks.

Troubleshooting

The best way to understand what’s happening is to build a histogram of the duration of the sessions. In particular we want to understand whether disconnects are sporadic and random or they follow a specific pattern.

To do so you need to parse the listener log and locate the following line from the above example:

(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831))

Since the port is random you might not get same record or if you do it might be days apart.

Here’s what I found in the listener:

12-MAR-2015 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0

In other words – at 8:16 the user scott established connection from host 192.168.0.10.

Now if you compare both records you’ll get the duration of the session:

Established: 12-MAR-2015 08:16:52
Disconnected: Thu Mar 12 10:28:09 2015

Here are couple of other examples:
alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620))
Thu Mar 12 10:31:20 2015 

listener.log:

12-MAR-2015 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157))
Thu Mar 12 10:37:51 2015 

listener.log:

12-MAR-2015 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0 

alertlog:

Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618))
Tue Mar 10 19:09:09 2015 

listener.log

10-MAR-2015 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0 

As you may have noticed the errors follow very strict pattern – each one gets disconnect exactly 2hrs 11mins after it has been established.

Cause

Given the repetitive behaviour of the issue and that it happened for multiple databases and application servers we can conclude that’s definitely a firewall issue.

The firewall recognizes the TCP protocol and keeps a record of established connections and it also recognizes TCP connection closure packets (TCP FIN type packet). However sometimes the client may abruptly end communication without closing the end points properly by sending FIN packet in which case the firewall will not know that the end-points will no longer use the opened channel. To resolve this problem firewall imposes a BLACKOUT on those connections that stay idle for a predefined amount of time.

The only issues with BLACKOUT is that neither or the sides will be notified.

In our case the firewall will disconnect IDLE sessions after around 2hrs of inactivity.

Solution

The solution for database server is to use Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session so PMON can release the resources associated with it.

DCD sets a timer when a session is initiated and when the timer expires SQL*Net on the server sends a small 10 bytes probe packet to the client to make sure connection is still active. If the client has terminated unexpectedly the server will get an error and the connection will be closed and the associated resources will be released. If the connection is still active then the probe packet is discarded and the timer is reset.

To enable DCD you need to set SQLNET.EXPIRE_TIME in sqlnet.ora of you RDBMS home!

cat >> $ORACLE_HOME/network/admin/sqlnet.ora
SQLNET.EXPIRE_TIME=10 

This will set the timer to 10 minutes. Remember that sessions need to reconnect for the change to take place it won’t work for existing connections.

Firewalls become smarter and they can now inspect packages even deeper. Make sure the following settings are also disabled:
– SQLNet fixup protocol
– Deep Packet Inspection (DPI)
– SQLNet packet inspection
– SQL Fixup

I had similar issue with Dataguard already, read more here – Smart Firewalls

How to test Dead Connection Detection

You might want to test or make sure DCD really works. You’ve got multiple options here – Oracle SQL client trace, Oracle SQL Server Trace, Sniff the network with packet analyzer OR using strace to trace the server process. I used strace since I had access to the database server and it was non intrusive.

1. Establish a connection to the database through SQL*Net

2. Find the processes number for your session:

SQL>  select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE');

SPID
------------------------
62761 

3. Trace the process

[oracle@dbsrv ~]$ strace -tt -f -p 62761
Process 62761 attached - interrupt to quit
11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:36:58.158485 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
....
11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) ---
11:46:58.240211 rt_sigprocmask(SIG_BLOCK, [], NULL, 8) = 0
...
11:46:58.331063 write(20, "\0\n\0\0\6\20\0\0\0\0", 10) = 10
... 

What I did was to attach to the process, simulate some activity at 11:36 and then leave the session IDLE. Then 10 minutes later the server process sent an empty packet to the client to check if the connection is still alive.

Conclusion

Errors in alertlog disappeared after I enabled the DCD.

Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.

References
How to Check if Dead Connection Detection (DCD) is Enabled in 9i ,10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)

Categories: oracle Tags: