These days everything goes to the cloud or it has been collocated somewhere in a shared infrastructure. In this post I'll talk about sessions being disconnected from your databases, firewalls and dead connection detection.
We moved number of 11g databases from one data centre to another.
Now probably many of you have seen the following error in your database alertlog "TNS-12535: TNS:operation timed out" or if you haven't you will definitely see it some day.
Consider the following error from database alert log:
Fatal NI connect error 12170. VERSION INFORMATION: TNS for Linux: Version 126.96.36.199.0 - Production Oracle Bequeath NT Protocol Adapter for Linux: Version 188.8.131.52.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 184.108.40.206.0 - Production Time: 12-MAR-2015 10:28:08 Tracing not turned on. Tns error struct: ns main err code: 12535 TNS-12535: TNS:operation timed out ns secondary err code: 12560 nt main err code: 505 TNS-00505: Operation timed out nt secondary err code: 110 nt OS err code: 0 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) Thu Mar 12 10:28:09 2015
Now this error indicate timing issues between the server and the client. It's important to mention that those errors are RESULTANT, they are informational and not the actual cause of the disconnect. Although this error might happen for number of reasons it is commonly associated with firewalls or slow networks.
The best way to understand what's happening is to build a histogram of the duration of the sessions. In particular we want to understand whether disconnects are sporadic and random or they follow a specific pattern.
To do so you need to parse the listener log and locate the following line from the above example:
Since the port is random you might not get same record or if you do it might be days apart.
Here's what I found in the listener:
12-MAR-2015 08:16:52 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=49831)) * establish * ORCL * 0
In other words - at 8:16 the user scott established connection from host 192.168.0.10.
Now if you compare both records you'll get the duration of the session:
Established: 12-MAR-2015 08:16:52
Disconnected: Thu Mar 12 10:28:09 2015
Here are couple of other examples:
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) Thu Mar 12 10:31:20 2015
12-MAR-2015 08:20:04 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=20620)) * establish * ORCL * 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) Thu Mar 12 10:37:51 2015
12-MAR-2015 08:26:36 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=ORCL)(CID=(PROGRAM=app)(HOST=apps01)(USER=scott))) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.10)(PORT=48157)) * establish * ORCL * 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) Tue Mar 10 19:09:09 2015
10-MAR-2015 16:57:54 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=root))(SERVICE_NAME=ORCL1)(SERVER=DEDICATED)) * (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.0.11)(PORT=42618)) * establish * ORCL1 * 0
As you may have noticed the errors follow very strict pattern - each one gets disconnect exactly 2hrs 11mins after it has been established.
Given the repetitive behaviour of the issue and that it happened for multiple databases and application servers we can conclude that's definitely a firewall issue.
The firewall recognizes the TCP protocol and keeps a record of established connections and it also recognizes TCP connection closure packets (TCP FIN type packet). However sometimes the client may abruptly end communication without closing the end points properly by sending FIN packet in which case the firewall will not know that the end-points will no longer use the opened channel. To resolve this problem firewall imposes a BLACKOUT on those connections that stay idle for a predefined amount of time.
The only issues with BLACKOUT is that neither or the sides will be notified.
In our case the firewall will disconnect IDLE sessions after around 2hrs of inactivity.
The solution for database server is to use Dead Connection Detection (DCD) feature. DCD detects when a connection has terminated unexpectedly and flags the dead session so PMON can release the resources associated with it.
DCD sets a timer when a session is initiated and when the timer expires SQL*Net on the server sends a small 10 bytes probe packet to the client to make sure connection is still active. If the client has terminated unexpectedly the server will get an error and the connection will be closed and the associated resources will be released. If the connection is still active then the probe packet is discarded and the timer is reset.
To enable DCD you need to set SQLNET.EXPIRE_TIME in sqlnet.ora of you RDBMS home!
cat >> $ORACLE_HOME/network/admin/sqlnet.ora
This will set the timer to 10 minutes. Remember that sessions need to reconnect for the change to take place it won't work for existing connections.
Firewalls become smarter and they can now inspect packages even deeper. Make sure the following settings are also disabled:
- SQLNet fixup protocol
- Deep Packet Inspection (DPI)
- SQLNet packet inspection
- SQL Fixup
I had similar issue with Dataguard already, read more here:
How to test Dead Connection Detection
You might want to test or make sure DCD really works. You've got multiple options here - Oracle SQL client trace, Oracle SQL Server Trace, Sniff the network with packet analyzer OR using strace to trace the server process. I used strace since I had access to the database server and it was non intrusive.
Establish a connection to the database through SQL*Net
Find the processes number for your session:
SQL> select SPID from v$process where ADDR in (select PADDR from v$session where username='SVE'); SPID ------------------------ 62761
- Trace the process:
[oracle@dbsrv ~]$ strace -tt -f -p 62761 Process 62761 attached - interrupt to quit 11:36:58.158348 --- SIGALRM (Alarm clock) @ 0 (0) --- 11:36:58.158485 rt_sigprocmask(SIG_BLOCK, , NULL, 8) = 0 .... 11:46:58.240065 --- SIGALRM (Alarm clock) @ 0 (0) --- 11:46:58.240211 rt_sigprocmask(SIG_BLOCK, , NULL, 8) = 0 ... 11:46:58.331063 write(20, "\0\n\0\0\6\20\0\0\0\0", 10) = 10 ...
What I did was to attach to the process, simulate some activity at 11:36 and then leave the session IDLE. Then 10 minutes later the server process sent an empty packet to the client to check if the connection is still alive.
Errors in alertlog disappeared after I enabled the DCD.
Make sure to enable DCD if you host your databases in a shared infrastructure or there are firewalls between your database and application servers.
How to Check if Dead Connection Detection (DCD) is Enabled in 9i ,10g and 11g (Doc ID 395505.1)
Alert Log Errors: 12170 TNS-12535/TNS-00505: Operation Timed Out (Doc ID 1628949.1)
Resolving Problems with Connection Idle Timeout With Firewall (Doc ID 257650.1)
Dead Connection Detection (DCD) Explained (Doc ID 151972.1)