
2.02.2011

Thread race condition using Tomcat 4.1 and Oracle 10g

This case study describes the complete steps, from root cause analysis to resolution, of a problem affecting an older system running Tomcat 4.1 and an Oracle 10g database. We will also revisit the importance of JDBC Connection pooling and demonstrate how the lack of a proper JDBC implementation can bring down even a great Servlet container such as Tomcat.

Environment specifications

·         Java EE server: Tomcat 4.1
·         JDK: Sun JDK 1.5.0.04
·         RDBMS: Oracle Database 10g Enterprise Edition Release 10.2.0.2.0
·         Platform type: Web Service Provider (XML via HTTP exposed via plain old Servlet 2.3)
  
Problem overview

        Problem type: Major thread slowdown and contention observed in a Tomcat 4.1 environment

A major slowdown was observed in a Tomcat 4.1 production environment under heavy load. The problem was first reported by upstream applications (HTTP timeouts). A support team task force was immediately assigned to investigate and perform a deep dive analysis of the problem.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying any corrective measure, the facts below were verified in order to conclude on the root cause:

·         Recent change of the affected platform? No
·         Any recent traffic increase to the affected platform? Yes, it was determined that a recent deployment of one of its upstream applications did increase the peak volume by 20%
·         Any upstream and/or downstream system problem reported? No problem was reported for any downstream application, including the internal Oracle 10g database
·         Did a restart of the Tomcat server resolve the problem? No, restarting the Tomcat server only alleviated the problem temporarily, for a few minutes

Thread Dump analysis

Several Thread Dump snapshots were captured during the slowdown condition and were analysed. Additional findings and facts were derived from thread dump analysis:

·         Total Tomcat Thread Pool current count was: 300
·         Thread pool free count was: 0
·         299 threads were in a BLOCKED state, waiting to acquire a lock on object <32142310>

"http-7001-Processor752" daemon prio=10 tid=026cbc40 nid=782 lwp_id=1050722 waiting for monitor entry [2f77d000..2f77c678]
                at java.sql.DriverManager.registerDriver(DriverManager.java:259)
                - waiting to lock <32142310> (a java.lang.Class)
                at <App Servlet>.doGet…….
                at <App Servlet>.doPost…..
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
                at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Unknown Source)
                at org.apache.catalina.core.ApplicationFilterChain.doFilter(Unknown Source)
                at org.apache.catalina.core.StandardWrapperValve.invoke(Unknown Source)
                at org.apache.catalina.core.StandardContextValve.invoke(Unknown Source)
                at org.apache.catalina.core.StandardHostValve.invoke(Unknown Source)
                at org.apache.catalina.valves.ErrorReportValve.invoke(Unknown Source)
                at org.apache.catalina.core.StandardEngineValve.invoke(Unknown Source)
                at org.apache.catalina.connector.CoyoteAdapter.service(Unknown Source)
                at org.apache.coyote.http11.Http11Processor.process(Unknown Source)
                at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Unknown Source)
                at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Unknown Source)
                at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Unknown Source)
                at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(Unknown Source)
                at java.lang.Thread.run(Thread.java:595)


·         1 Thread (the culprit Thread) was blocked in an IO call, waiting for a response from the Oracle database during the JDBC Connection creation (OracleDriver.connect()) process.

"http-7001-Processor751" daemon prio=10 tid=026cba90 nid=781 lwp_id=1050721 runnable [29a44000..29a43578]
                at java.net.SocketInputStream.socketRead0(Native Method)
                at java.net.SocketInputStream.read(SocketInputStream.java:134)
                at oracle.net.ns.Packet.receive(Unknown Source)
                at oracle.net.ns.NSProtocol.connect(Unknown Source)
                at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:843)
                at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:267)
                at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:419)
                at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:164)
                at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:34)
                at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:752)
                at java.sql.DriverManager.getConnection(DriverManager.java:525)
                - locked <32142310> (a java.lang.Class)
                at java.sql.DriverManager.getConnection(DriverManager.java:171)
                - locked <32142310> (a java.lang.Class)
                at <App Servlet>.doGet…….
                at <App Servlet>.doPost…..
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:709)
                at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
                ...

Thread Dump analysis, facts and conclusions

Thread dump analysis confirmed the additional facts below:

·         All Threads, except one, were blocked, trying to acquire a lock on the SQL DriverManager class
·         Thread #751 was the culprit, waiting for a response from the Oracle 10g database to complete the connection process

·         Conclusion #1: Slow response time from the Oracle listener caused a full domino effect since all Threads were racing to acquire an exclusive lock on the SQL DriverManager.
·         Conclusion #2: Such a Thread race condition revealed that the application lacks proper JDBC Connection Pooling, causing a new physical database connection to be created for each incoming Servlet request, as illustrated in the sketch below.
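
For illustration, here is a minimal sketch of the kind of anti-pattern revealed by the thread dump; the class name and connection details are hypothetical, but the key point is that each Servlet request goes through java.sql.DriverManager, which synchronizes on a class-level lock and creates a brand new physical connection every time:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class NoPoolingSketch {

    // Placeholder connection details, not taken from the affected application
    private static final String URL  = "jdbc:oracle:thin:@dbhost:1521:ORCL";
    private static final String USER = "appuser";
    private static final String PWD  = "secret";

    public void processRequest() throws SQLException {
        // Every request pays the full cost of a new physical connection
        // and contends on the DriverManager class lock (object <32142310>)
        Connection conn = DriverManager.getConnection(URL, USER, PWD);
        try {
            // ... execute the SQL required by the request ...
        } finally {
            conn.close(); // physical connection is torn down, never reused
        }
    }
}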

Oracle TNSPING analysis

Based on the Thread Dump analysis conclusions, the team quickly moved on to an analysis of the Oracle listener response time, running diagnostics using the TNSPING command.

The issue appeared to center around the creation of new connections through the Oracle listener. The listener typically handles 30 new connections per second, each normally taking 10ms to 50ms to process. On the day of the incident, the system continued to process roughly 30 new connections per second; however, the new connections were taking roughly 5000ms to 7000ms to complete. A few peaks of 50 new connections per second were observed. Tests using TNSPING reported response times of up to 15000ms. It is worth noting that TNSPING does not actually speak to the database; it only opens a physical network connection to the listener. Manual test connections to the database, both locally and remotely, appeared slow to connect but, once connected, were extremely quick to process.
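
As a simple illustration of what such a listener test measures, the sketch below (host, port and timeout values are assumptions) times a plain TCP connection to the listener port, without executing any SQL:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ListenerConnectTimeSketch {

    public static void main(String[] args) throws IOException {
        String host = "dbhost"; // listener host (placeholder)
        int port = 1521;        // default Oracle listener port

        long start = System.currentTimeMillis();
        Socket socket = new Socket();
        try {
            // Only a TCP connect is attempted; no SQL is executed
            socket.connect(new InetSocketAddress(host, port), 20000);
        } finally {
            socket.close();
        }
        System.out.println("Listener TCP connect time: "
                + (System.currentTimeMillis() - start) + " ms");
    }
}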

The network layer was also analysed and the team ruled out any issue at that level.

A ticket (SR) was opened with the Oracle database vendor. The Oracle engineer noted that this is one of the busiest listeners they have worked with recently; on a regular basis it handles 30 new connections per second, a volume compared to that of a site such as Amazon.com. For this reason, Oracle suggested that we use multiple listeners to split the load and multiply our processing capacity for new connections. By deploying multiple listeners, we could also isolate traffic by application, which would prevent the various systems that currently share the database listener from interfering with each other.

Root cause

Thread dump analysis combined with the TNSPING snapshots and the discussion with Oracle confirmed the following root cause:

·         The lack of JDBC Connection Pooling in our older application was causing too many concurrent connection attempts (up to 50 per second) against the single Oracle listener process, leading to slower response times and a domino effect on the Tomcat 4.1 server
·         The load increase introduced by the upstream application was the actual trigger of this problem (breaking point)

Solution and tuning

Two areas were looked at during the resolution phase:

·         Implement JDBC Connection Pooling. This was easily achieved in Tomcat 4.1 with the minimum and maximum pool size both set to 50; see the sketch after this list.
·         Scale up the number of Oracle listeners. It was decided to hold off on any vertical scaling solution on the Oracle side until the implementation of JDBC Connection Pooling.
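
As a reference, here is a minimal sketch of the pooled approach; the JNDI name jdbc/AppDS is an assumption, but the pattern is the same: connections are borrowed from and returned to a pool managed by the container instead of being physically created on each request:

import java.sql.Connection;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

public class PooledConnectionSketch {

    public void processRequest() throws NamingException, SQLException {
        // The pool is configured and created by Tomcat; the application only
        // looks up the DataSource bound in JNDI (name is hypothetical)
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("java:comp/env/jdbc/AppDS");

        Connection conn = ds.getConnection(); // borrowed from the pool
        try {
            // ... execute the SQL required by the request ...
        } finally {
            conn.close(); // returns the connection to the pool for reuse
        }
    }
}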

Results: the implementation of JDBC Connection Pooling provided great relief. The number of concurrent connection attempts against the Oracle listener dropped close to 0 since all connections are now created on the Tomcat server at start-up time. No Thread race has been observed at runtime since the JDBC Pooling implementation, and there has been no re-occurrence of the problem since then.

Conclusion

The JDBC Connection Pooling approach is definitely not new, but I hope this case study has helped you understand and appreciate its importance in today’s Java EE production environments.


2 comments:

Interesting article, though it's worth noting that a Java Thread cannot be restarted once dead, and this is something I never understood :)

Thanks
Javin
How to implement Thread in JAVA

Thanks Javin for your comment.

I actually get this question quite often from my clients and work colleagues during Thread troubleshooting episodes, e.g. why not stop the hanging Threads and resume/restart them quickly to resolve the problem (via a simple button on the application server console)?

Old Sun Java specifications used to provide such ways of dealing with Threads:

Thread.stop() / Thread.suspend() / Thread.resume() / Thread.runFinalizersOnExit()

These methods were deprecated because of the risks involved. The main risk is that stopping a Thread releases all of the Object monitor locks it holds, potentially causing damaging scenarios such as data corruption, depending on the Thread's current operation(s).

With the current Java specifications, your best strategy is to prevent such Threads from hanging in the first place: proper timeout management for blocking IO calls, proper use of synchronized{} blocks, sound coding practices, etc.
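
For example, a minimal timeout sketch relying only on the standard JDBC timeout APIs (the timeout values are illustrative) could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class TimeoutSketch {

    public void runQuery(String url, String user, String pwd) throws SQLException {
        // Bound how long a new connection attempt may block
        DriverManager.setLoginTimeout(10); // seconds

        Connection conn = DriverManager.getConnection(url, user, pwd);
        try {
            Statement stmt = conn.createStatement();
            try {
                stmt.setQueryTimeout(30); // seconds; longer queries are aborted
                stmt.executeQuery("SELECT 1 FROM DUAL");
            } finally {
                stmt.close();
            }
        } finally {
            conn.close();
        }
    }
}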

Regards,
P-H
