Tuesday, February 16, 2010

Why server monitoring is imperative for web applications

We had an issue at work last week where we had to kill and restart both of our web servers because we ran out of database connections.  To be more specific, we use a database connection pool, and the pool was exhausted because requests were arriving faster than connections were being freed: database queries weren't completing quickly enough to return connections to the pool.

The solution is NOT simply to add more connections - that can actually exacerbate the problem.  One option is actually to decrease the number of connections.  The trade-off is that requests then spend time waiting for a connection to free up - some of the lag you see on web pages can be due to a poorly tuned database API.
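To make the trade-off concrete, here is a minimal sketch of a bounded connection pool built on a `Semaphore`. This is not our actual pool implementation, just an illustration of the behavior described above: with a fixed pool size, a caller either gets a connection or waits, rather than the database being swamped with ever more open connections.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Minimal sketch of a bounded connection pool: a fixed number of permits
// stands in for a fixed number of database connections.
public class BoundedPool {
    private final Semaphore permits;

    public BoundedPool(int size) {
        permits = new Semaphore(size);
    }

    // Try to check out a connection, waiting up to timeoutMs.
    // Returns false if the pool stayed exhausted for the whole wait.
    public boolean tryAcquire(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Return a connection to the pool.
    public void release() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPool pool = new BoundedPool(2);
        System.out.println(pool.tryAcquire(100)); // true
        System.out.println(pool.tryAcquire(100)); // true
        System.out.println(pool.tryAcquire(100)); // false: pool exhausted, caller waited then gave up
        pool.release();                           // a query finishes, freeing a connection
        System.out.println(pool.tryAcquire(100)); // true again
    }
}
```

The page lag users see corresponds to the time a request spends blocked inside `tryAcquire` before a connection comes free.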

We are fairly confident, however, that our database API and the database itself are tuned appropriately, and we have been using both for years.  The one thing we may need to look at is whether our usage has grown enough that the database needs performance tuning to account for the added load.

Today, we were expecting a rather large number of users at a given time, so I spent that window watching our server logs, monitoring server statuses, and looking in the database.  I also used our Cacti monitoring service to follow server health.  While looking over the statistics for one of our load-balanced web servers, I came across this amber gem:

The graph spans a six-week period, and the black line is the thread count for the Tomcat instance running on one of our web servers.  The sheer cliff on the far right is when we had to kill and restart the servers.  To anyone who has done software engineering, this is a leak: we are spawning threads that aren't dying or cleaning up after themselves.
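The number that Cacti is graphing is available from inside the JVM itself. A quick sketch, using the standard `ThreadMXBean` from `java.lang.management`, of reading the live thread count that the graph tracks:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountCheck {
    // Current number of live threads in this JVM - the same figure
    // a monitoring graph like the one above plots over time.
    public static int liveThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.getThreadCount();
    }

    public static void main(String[] args) {
        // A value that climbs steadily across samples, without ever
        // coming back down, is the telltale shape of a thread leak.
        System.out.println("live threads: " + liveThreads());
    }
}
```

Sampling this on a schedule and shipping it to a grapher is essentially what our Cacti setup does via SNMP/JMX.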

Our current problem is that we only added this monitoring at the beginning of the year, so we can't go back to the start of the climb and point to the specific code change (or changes) that introduced the leak.  The solution for us is to start taking thread dumps and comparing them over time to see which threads are hanging around.  Fortunately for us, Java makes this rather easy to do.
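As a sketch of what "comparing thread dumps over time" can look like programmatically (the alternative is `kill -QUIT`/`jstack` dumps diffed by hand), `Thread.getAllStackTraces()` gives you a snapshot of all live threads, and diffing two snapshots shows what appeared in between. The thread name `leaky-worker` below is a made-up example:

```java
import java.util.HashSet;
import java.util.Set;

public class ThreadDumpDiff {
    // Snapshot the names of all live threads in the JVM.
    public static Set<String> snapshot() {
        Set<String> names = new HashSet<String>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        return names;
    }

    // Threads present in 'later' but absent from 'earlier':
    // if the same names keep accumulating, they are leak candidates.
    public static Set<String> newThreads(Set<String> earlier, Set<String> later) {
        Set<String> diff = new HashSet<String>(later);
        diff.removeAll(earlier);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> before = snapshot();

        // Simulate a leaked thread that never finishes its work.
        Thread worker = new Thread(new Runnable() {
            public void run() {
                try { Thread.sleep(60000); } catch (InterruptedException e) { /* exit */ }
            }
        }, "leaky-worker");
        worker.start();

        Set<String> after = snapshot();
        System.out.println(newThreads(before, after).contains("leaky-worker")); // true
        worker.interrupt();
    }
}
```

In production you'd take these snapshots hours apart and look for threads (or whole families of similarly named threads) that only ever grow in number.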

Regardless of all this, the bottom line is that if we didn't have server monitoring in place, we might never have known about this problem.  And given that Cacti is free and open-source software, there's really no excuse not to use it or something similar.  If you do decide to use it, consider supporting the project.