I recently ran into a problem on one of my Linux servers. A web application on the server had become very slow to respond: pages that usually load in less than 4 seconds were taking 25-30 seconds. Something was obviously wrong.
Looking to Splunk for Answers
I turned to my trusty logging software, Splunk. The more I use this software, the more I love the insight it gives me into my systems. I started with the Operational Monitoring dashboard that I have built up over time. This dashboard gives me a number of graphical views of the key server health indicators. I scrolled down the page to my CPU utilization view and saw the following:
My web server, shown here as the red line, was clearly running at higher-than-usual CPU utilization, which explained the slow page loads. But it didn’t yet tell me why this was occurring.
I started a new search in Splunk for the period around when the CPU utilization started increasing. My Splunk environment captures a lot of data, so at first glance there was a lot to look at.
As I was sifting through the data, I remembered that the Splunk Add-On for *nix periodically captures the output of the Linux ps command. This command-line utility reports which processes are running on the server at the moment it is invoked and, in particular, how much CPU each one is using. Knowing this, I crafted the following search and visualization:
host=HOSTNAME source=ps | timechart span=1h sum(pctCPU) by COMMAND
This search gave me a graphical view of each program that was running: for every one-hour span, it sums the pctCPU value of each ps record for that program. I chose sum rather than an average because summing magnifies a subtle increase in CPU usage, which makes it easier to spot on the chart.
The green line in this graph clearly showed the program whose CPU usage began climbing at the same time that the earlier graph showed higher overall utilization.
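To make the sum-versus-average choice concrete, here is a minimal Python sketch of the same hourly bucketing. It is not how Splunk implements timechart; the sample readings, the command names, and the hourly helper are all made up for illustration. It assumes each ps sample arrives as an (hour, command, pctCPU) record.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ps samples: (hour, command, pctCPU).
# These numbers are invented for illustration, not taken from my server.
samples = [
    (0, "httpd", 2.0), (0, "httpd", 2.0), (0, "httpd", 2.0),
    (1, "httpd", 2.0), (1, "httpd", 3.0), (1, "httpd", 4.0),
    (0, "sshd", 0.5), (1, "sshd", 0.5),
]


def hourly(records, agg):
    """Group pctCPU readings by (hour, command) and reduce each group with agg."""
    buckets = defaultdict(list)
    for hour, command, pct in records:
        buckets[(hour, command)].append(pct)
    return {key: agg(values) for key, values in buckets.items()}


hourly_sum = hourly(samples, sum)   # rough analogue of sum(pctCPU) by COMMAND
hourly_avg = hourly(samples, mean)  # rough analogue of avg(pctCPU) by COMMAND

for key in sorted(hourly_sum):
    print(key, "sum:", hourly_sum[key], "avg:", hourly_avg[key])
```

With the same number of samples in each hour, the sum is simply the average multiplied by the sample count, so the curve has the same shape but its rise spans more of the y-axis. In this made-up data the hypothetical httpd command climbs by 3 points per hour when summed and by only 1 point when averaged.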
Now that I knew which application was causing the problem, I started doing some research. I found other users reporting similar CPU utilization issues with the application, and I felt better knowing it wasn’t an issue specific to my environment. I am still learning about this issue, and I’m sure an overall fix will soon follow.