Intelligence in 404 Errors

I recently found myself in a conversation about Splunk. During the conversation, I was asked which types of logs I found easiest and most useful to ingest into the Splunk environment. Without giving it much thought, I immediately responded that web access logs are very easy to ingest and contain a lot of useful data if you know where to look. Well, of course I set myself up for the next question . . . “can you give me an example?”

Understanding Access Logs

First, for those that aren’t familiar with web access logs, let’s take a moment to look at one. Below is a sample log entry from an Apache web server log:

103.249.31.189 - - [15/May/2016:21:43:02 -0400] "GET /wpfoot.php HTTP/1.1" 404 14975 "http://www.googlebot.com/bot.html" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

At first glance, this can look a little intimidating. But let’s break down the various parts (the Apache configuration that produces this format is sketched just after the list):

  • 103.249.31.189 – This is the IP address of the computer that is making the request to your web site.
  • 15/May/2016:21:43:02 -0400 – This is the date, time and timezone of when the request was made.
  • GET – When communicating with a web site, there are a number of different actions (HTTP methods) that can be requested. For most standard web traffic, those requests are either a GET or a POST. A GET can be thought of as a request to retrieve data from a site; when you are simply clicking around a web site, you are most likely issuing GET requests. A POST is used when you are submitting data to a site; when you are filling out a contact form or logging into a site, you are most likely sending a POST request with that data.
  • /wpfoot.php – This is the page on your site that the user was trying to access.
  • HTTP/1.1 – This is simply telling you what HTTP protocol was used by the client when requesting the page.
  • 404 – We now come to the status code for this particular request. In this example, we see that the server responded with a status code of 404. This will be important to our conversation because a status code of 404 means that the server could not find the page that was requested.
  • 14975 – This number gives you the size of the response in bytes that was sent back to the requestor.
  • http://www.googlebot.com/bot.html – This is the referrer, the address of the web page that linked to the requested resource. In this case, it tells us that someone tried to get to /wpfoot.php from that bot.html page.
  • Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) – Finally, we see the user agent string, which identifies the browser (or bot) that made the request.
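
For reference, the entry above is in Apache’s “combined” log format. Below is a sketch of the stock Apache directives that produce it; the log path is only an example and will differ between distributions.

# The "combined" format: client IP, identity, user, timestamp, request line,
# status, response size, Referer header, and User-Agent header.
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog "/var/log/apache2/access.log" combined

If your server uses a custom LogFormat, the fields may appear in a different order, but the same ideas apply.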

That’s A Lot Of Data!

In just that one request, we identified 9 pieces of information. Imagine trying to review a web site’s access log with hundreds, thousands, even millions of log entries! How will you ever find anything useful in this data?

This is where one of my favorite pieces of software comes into the picture – Splunk. Splunk does a phenomenal job of ingesting all types of log and data sources and giving you simple yet powerful tools to analyze that data. Start sending your web server logs to your Splunk server and let’s begin to analyze.
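
Exactly how you get the logs into Splunk depends on your environment, but a minimal sketch of a Universal Forwarder inputs.conf stanza might look like the one below. The log path and index name are assumptions; adjust them for your servers.

# inputs.conf - monitor the Apache access log and tag it with the
# access_combined sourcetype so Splunk applies its standard field extractions
[monitor:///var/log/apache2/access.log]
sourcetype = access_combined
index = web
disabled = false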

A Simple Query

Once Splunk has started reading your data, we can begin to develop some searches against that data. For this topic, I decided to talk about gaining intelligence based on 404 status codes. Our Splunk search is very simple:

sourcetype=access_combined status=404 | top limit=10 uri

Simple, right? This search says, “Grab all of my access_combined log data (Apache access logs) and look for any record with a status of 404. Then show me the top 10 most requested web pages that received that status code.”
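
If you prefer to build the table yourself, top is roughly equivalent to the stats search below (minus the percent column), which is a handy starting point when you want to add your own calculations later:

sourcetype=access_combined status=404 | stats count by uri | sort -count | head 10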

When you run this search, you will see something that might look like the following:

[Screenshot: Splunk 404 search results]

You will see the top 10 web pages requested, the number of times each was requested, and the percentage of the total 404 requests that each page accounts for.

Now that you have this list . . . what intelligence can we gather from it? Let me give you two scenarios to consider.

Scenario #1 – Web Coding Issue

Most people are familiar with the concept of broken links. This occurs when a web site directs you to a page that doesn’t exist. Nothing is more frustrating than trying to find a resource on the web to answer a question, only to be met with that “Page Cannot Be Found” message. If your site has any broken links, a search like this will surface them quickly so you can begin to find and correct them.

For example, I recently came across a result that looked something like:

/www.domain.com/page.html

At first glance, this looks like a perfectly legitimate URL. But remember that what you are seeing in these logs is the part of the URL that comes after the web site’s address. Therefore, this was actually a link for:

http://www.domain.com/www.domain.com/page.html

We quickly found the pages in our site that were coded incorrectly.
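
To track down which pages actually contain the bad link, you can pull the referring page into the same search. Assuming the standard access_combined field extractions (the field names uri and referer may differ in your environment), something like this works:

sourcetype=access_combined status=404 | stats count by uri, referer | sort -count

The referer column then points you straight at the pages that need to be fixed.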

Scenario #2 – Hackers Knocking At Your Front Door

Our second scenario is arguably the more important of the two. Analyzing your 404 errors can give you a huge amount of insight into the activity of hackers on the Internet.

There are a large number of sites on the Internet that publish vulnerabilities found in software (see the CVE reference below). The intent of these sites is to make you aware of a vulnerability and urge you to upgrade the affected software, or to provide workarounds until a patch is released. They are a great tool for admins to monitor issues reported in the software they administer. But admins aren’t the only ones using these sites. The hackers know about them too!

Oftentimes, you will see patterns in your logs where hackers are probing your site to see what software you have installed. Maybe they are looking for a particular application; maybe they are looking for components installed within that application. Regardless, it is a trial-and-error effort, and the good news is that we can see it in our logs.
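
One way to spot this trial-and-error behavior is to look for a single source generating 404s against many different pages. A sketch of that search, again assuming the standard clientip field extraction, might look like this (the threshold of 20 is arbitrary; tune it to your site’s normal traffic):

sourcetype=access_combined status=404 | stats dc(uri) as distinct_pages, count by clientip | where distinct_pages > 20 | sort -distinct_pages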

An Example

I recently came across this exact URL in one of my searches:

/magazine/js/mage/cookies.js

This struck me as odd because there is nothing in any of my web sites about a magazine. So my suspicion level was already pretty high. I grabbed this URL and pasted it into a Google search. It didn’t take me long to discover that this is a component of the open source Magento e-commerce system. I took this knowledge and looked to see if there were any recent vulnerabilities discovered in the software. Sure enough, there is a bulletin on the Magento site asking users to upgrade because of vulnerabilities recently found in their software.
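
A quick follow-up search shows whether that probe came from a single source or from many different scanners. Again assuming the uri and clientip field names, a sketch might be:

sourcetype=access_combined status=404 uri="*/mage/cookies.js" | stats count by clientip | sort -count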

Safe, Right?

Luckily for me, I wasn’t running the Magento software on my system, so I was safe from being hacked. Or was I?

Let me give you something to think about. You are home with your family and a stranger comes to your front door. They jiggle the door knob to see if it’s unlocked and they find that it isn’t. So they walk away. The next night, you notice this same person come to your house and try to open the front window. Again, it was locked, so they leave. The third night, you find them snooping around your back door. Lucky for you, that was locked too. How many times are you going to let this happen before you take action?

This example is no different. You have concrete evidence of someone “jiggling the door knob” and “opening your front window” on your web site. Obviously, this wasn’t going to work because you don’t use the software. But they tried anyway and left evidence of doing something they shouldn’t be doing. What if other hackers attempt the same thing? You now know what to look for, so you can park that 100 lb. German Shepherd at the window and door to keep them away.

This is valuable information that you should now use to protect your network. If a hacker was willing to probe for an exploit this way, you can be sure they will try other ways as well. As soon as we have a good indication that someone is up to no good, we should block them immediately from any further access to our sites.

Fail2Ban

One good resource that I personally work with is the open source Fail2Ban project. This is an extremely simple and yet very powerful piece of software. One of the many things this software can do is look for patterns in a web log and then alter the firewall of the server in real time to block further attacks from the source IP address. I created a new filter rule:

[Definition]
failregex = ^<HOST> - .*\/magazine\/js\/mage\/cookies\.js
ignoreregex =

With this rule, I can now monitor for future attempts against this specific URL and block the offender from making any other attempts against our systems.
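
The filter by itself doesn’t ban anything; it has to be referenced from a jail. Below is a minimal sketch of a jail.local stanza that could use it. The jail and filter names, log path, and ban time are assumptions to adjust for your own setup.

# /etc/fail2ban/jail.local - hypothetical jail using the filter above,
# which would be saved as /etc/fail2ban/filter.d/apache-magento-probe.conf
[apache-magento-probe]
enabled  = true
port     = http,https
filter   = apache-magento-probe
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400

Before enabling the jail, you can test the regex against your real log with fail2ban-regex /var/log/apache2/access.log /etc/fail2ban/filter.d/apache-magento-probe.conf and confirm it matches the entries you expect.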

Conclusion

This is just one of the countless ways that Splunk can bring valuable intelligence into your environment with very little effort. Once you start identifying sources for this data and building out the searches to aggregate it, you will find that the data mining options are endless.

Resources