Tuesday, February 16, 2010

Why server monitoring is imperative for web applications

We had an issue at work last week where we had to kill and restart both of our web servers because we ran out of database connections.  To be more specific, we use a database connection pool, and the pool filled up: queries weren't completing fast enough, so requests came in faster than connections were freed.

The solution is NOT to simply add more connections - this can actually exacerbate the problem.  Rather, one potential solution is to decrease the number of connections.  The trade-off is that people may end up waiting for a connection - that lag you see on web pages could be due to a poorly tuned database API.
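To make the "fewer connections, but people wait" trade-off concrete, here is a minimal sketch of a capped pool where callers wait a bounded amount of time for a slot instead of piling up forever.  This is not our actual pool; the class name BoundedPoolSketch and its methods are made up for illustration, and a real pool would hand out actual connections rather than just permits.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: models only the "limited slots" part of a connection pool.
public class BoundedPoolSketch {
    private final Semaphore permits;

    public BoundedPoolSketch(int maxConnections) {
        // Fair semaphore: waiting callers are served in arrival order.
        this.permits = new Semaphore(maxConnections, true);
    }

    /** Waits up to timeoutMillis for a free slot; returns false if none frees up in time. */
    public boolean tryAcquire(long timeoutMillis) {
        try {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Returns a slot to the pool; call only after a successful tryAcquire. */
    public void release() {
        permits.release();
    }

    /** Number of slots currently free. */
    public int available() {
        return permits.availablePermits();
    }
}
```

With a timeout in place, a slow database shows up as failed or delayed requests that you can count and graph, rather than as a web server that has to be killed.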

We are fairly confident, however, that our database API and the database itself are tuned appropriately, given that we've been using them for years.  The only thing we might need to concern ourselves with is that our usage may have grown to the point where we need to look at performance tuning the database to account for it.

Today, we were expecting a rather large number of users at a particular time, so I spent that window watching our server logs, monitoring server statuses, and looking in the database.  I also used our Cacti monitoring service to follow server health.  While looking over the statistics for one of our load-balanced web servers, I came across this amber gem:


The graph spans a 6-week period and that black line is the thread count line for the Tomcat server running on one of our web servers.  The sheer cliff on the far right is when we had to kill the servers and restart.  To those of you who have done software engineering, this is a leak.  We are spawning threads that aren't dying and/or cleaning up properly.

Our current problem is that we only added this monitoring at the beginning of the year, so we can't go back and look for the start of the climb to point to the specific code change (or changes) that caused the leak.  The solution for us is to start taking thread dumps and comparing them over time to see which threads are hanging around.  Fortunately for us, Java makes this rather easy to do.
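As a rough illustration of what "comparing over time" can look like, here is a small sketch that counts live threads by name from inside the JVM and reports which groups grew between two snapshots.  The class ThreadSnapshotSketch and its methods are hypothetical names of my own, and grouping threads by stripping a trailing number from the name is just an assumption about how pool threads tend to be named.  In practice you would more likely take full dumps from outside the process with the JDK's jstack tool and diff those.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: in-process thread counting, not a full thread dump.
public class ThreadSnapshotSketch {

    /** Counts live threads per name group, where a trailing number (e.g. "-12") is stripped. */
    public static Map<String, Integer> snapshot() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            String group = t.getName().replaceAll("[-#]?\\d+$", "");
            counts.merge(group, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the groups whose thread count increased from 'before' to 'after', with the increase. */
    public static Map<String, Integer> growth(Map<String, Integer> before,
                                              Map<String, Integer> after) {
        Map<String, Integer> grown = new TreeMap<>();
        for (Map.Entry<String, Integer> e : after.entrySet()) {
            int delta = e.getValue() - before.getOrDefault(e.getKey(), 0);
            if (delta > 0) {
                grown.put(e.getKey(), delta);
            }
        }
        return grown;
    }
}
```

A group that keeps turning up in growth() across snapshots taken hours or days apart is a good first suspect for the leak.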

Regardless of all this, the bottom line is that if we didn't have server monitoring in place, we might never have known about this problem.  And given that Cacti is free and open-source software, there's not really any excuse not to use it or something similar.  If you do decide to use it, consider supporting them.

Wednesday, November 18, 2009

Zen and the Art of Patience

Love him or hate him, President Obama can teach you about patience. I cannot recall a single instance, even going back to the early days of his 2008 presidential campaign, where he rushed to judgment or hurried through something in an effort to either appease the masses or make it look like he was reacting. President Obama picks his battles and he picks them carefully.

Now, I'm not saying that in some cases this lack of reaction isn't a liability or that the careful choosing of battles is not politically motivated - on the contrary, I believe that plays a part. Take for instance two major topics in the news right now: the health care debate and the investigation into the Ft. Hood shootings.

The White House has taken very careful steps to let Congress handle most of the details of the health care debate, and one wonders whether things would be different if, perhaps, it had taken a much larger and more active role. On one hand, the fate of his presidency would be more closely tied to the health care bill, and perhaps he learned some lessons from then-President Clinton's health care effort in 1993. On the other hand, however, it is not the "job" of the White House to craft legislation, the Bush administration notwithstanding. His hands-off approach has returned Congress to its rightful place in the three branches of government - the capacity to create laws. Whether or not you happen to like that is open for discussion, but it is his choice and were you to ask him, I'm sure he'd mention the previous administration's way of doing things as an example of how NOT to do things.

I also mentioned the Ft. Hood shootings. Certain Senators are clamoring for their own Congressional investigation. The White House has asked them to wait until the military et al. complete their own investigations. Without proof, I suspect that this is because the White House is fearful of unwarranted retribution and xenophobia from, let's say, the lesser educated. Doubt me? Fine, but I'm betting you can't sit there and tell me with a straight face that ignorant people, just like the rest of us, aren't at some point subject to overreaction. On the flip side, though I've not seen anything yet to indicate this, I suspect that the public is demanding answers. There's nothing inherently wrong with this except that the politicians to whom these people are complaining are compelled to give them answers. And the problem stems from said politicians giving out either half answers or just plain wrong answers.

So, we then have politicians, under the guise of "doing the right thing," speaking to anyone who will listen about the travesty of not acting fast enough and "we need to protect the troops now." It seems that they have learned nothing from the previous 8 years about what happens when we rush to judgment. It also seems that they care more about seeing themselves on TV or in print than they do about actually solving the problem at hand.

Why "Quick and Dirty" is bad for software projects

Anyone who's ever worked on a software engineering project has at one point or another either said or done (or both) "let's just implement something quick and dirty for now and we'll fix it later." While I've been living through the nightmare of this statement for the last 2 years at work, it just hit me today as to why this is so bad.

A co-worker just asked me if I knew what the c-rt.tld file was about and why it was only referenced in one JSP page. It struck me almost immediately what that file was; even though I'd seen it before, it never clicked until now. We opened it up and, as expected, it was a JSTL tag library definition file, in this case for the core library. Now, we also have a c.tld which also defines the JSTL tags for the core library. Diffing the files showed that the JSTL versions were slightly off, but that other than ordering, the files were the same. Looking at the SVN history made it all clear.

First, a brief background: in late spring 2006 my company hired a non-American subcontractor to implement a rather large and important piece of functionality for our web application because at the time, we simply didn't have the resources to do it in house. They initially branched our code and went at it while we did our own thing in our existing code branch. According to the check-in comment for this c-rt.tld file, the file was added as a result of the merging of these branches some 6 months (!) later.

My guess as to what happened is this: the developer for this check-in ran into conflicts the likes of which even god has never seen and, instead of trying to understand the conflicts and why they were occurring, just created different versions of the file and "fixed" the references in the @taglib directives in the JSP pages. The result is that instead of only having one TLD for the core JSTL library, we have four (!). Likewise, we have four for the fmt library and three for each of the sql (which you should NEVER use, by the way, and we don't) and xml JSTL libraries.

I realize that this example alone doesn't directly state why "Quick and Dirty" is bad, but consider this: We fired said subcontractor in December 2007 and we've been cleaning up, rewriting, and removing their code ever since. This means that we've been working on fixing issues with their code for longer than they were even contracting for us.

So, the next time someone on your project says "let's just do it the quick and dirty way for now," push back, put your foot down, and say "no." For better effect, replace "no" with "over my dead body." If you're overruled, and this is bound to happen, file a P1/Severe bug against the code fully documenting where in the code the egregiousness lies, why the decision was made to not do it correctly the first time, and the right way to fix it. Trust me, you think you won't forget, but you will - and the bug documentation is all you will have.