Adam Prescott

Verifying access log requests

verify-ip is a command line tool that filters access log request lines by doing a reverse DNS lookup on each IP address and matching the resulting hostname against a specified domain pattern.

When you need to filter access logs down to only, say, Googlebot web crawler requests, you can’t rely on the user agent string alone, because it can easily be spoofed to match one of the official Googlebot strings. Luckily, all genuine Googlebot requests come from an IP associated with a googlebot.com domain, so you can use a reverse DNS lookup to confirm you’re dealing with a valid request:

$ host 66.249.76.91
91.76.249.66.in-addr.arpa domain name pointer crawl-66-249-76-91.googlebot.com.

Because someone could just buy evildomain.com and then create googlebot.com.evildomain.com, you have to take a bit more care and ensure that googlebot.com. is a true suffix, not just a substring match. Someone could also register myevilgooglebot.com, so those need to be rejected by any automated script as well.
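The suffix rule is easy to get wrong with a plain substring match. A minimal sketch of the check in shell (the `check` function here is hypothetical, not part of verify-ip):

```shell
# Hypothetical suffix check: accept googlebot.com. itself or any
# subdomain of it, and reject look-alike registrations.
check() {
  case "$1" in
    googlebot.com. | *.googlebot.com.) echo valid ;;
    *) echo invalid ;;
  esac
}
check crawl-66-249-76-91.googlebot.com.   # valid
check googlebot.com.evildomain.com.       # invalid
check myevilgooglebot.com.                # invalid
```

The glob `*.googlebot.com.` only matches when the character before `googlebot.com.` is a literal dot, which is exactly what rules out both kinds of fake.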

verify-ip does this procedure for a general domain that you specify, and outputs any valid lines.

$ cat access.log | verify-ip --domain 'googlebot\.com'

With this command, (anything.)googlebot.com will match, but neither of the fake domains above will pass as valid.

It comes with a --google switch which does two things for convenience:

  1. Uses the list of known Googlebot user agent strings as a pre-filter on the access lines.
  2. Sets --domain 'googlebot\.com' automatically.

verify-ip makes an initial pass through the input lines to construct a set of unique IP addresses, and the addresses that verify are then used on a second pass. When using --google, the pre-filtered requests from (1) leave a much smaller second list that can be checked very quickly against the valid IPs; I found that calling host is more expensive than grep -f when there are many duplicate IP addresses.
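That two-pass flow can be sketched in plain shell (this is an assumed structure, not verify-ip’s actual source; `resolve` and `valid_ips` are made-up names):

```shell
# Assumed two-pass structure (not verify-ip's actual code).
# resolve wraps `host` so the lookup can be swapped out for testing.
resolve() { host "$1"; }

# Pass 1: collect unique client IPs from a log on stdin, resolve each
# one exactly once, and print only the IPs whose reverse lookup ends
# in the wanted domain (preceded by a dot or a space, so look-alike
# domains don't slip through).
valid_ips() {
  awk '{ print $1 }' | sort -u | while read -r ip; do
    resolve "$ip" | grep -Eq '[ .]'"$1"'\.$' && echo "$ip"
  done
}

# Pass 2: a cheap fixed-string grep against the small verified set.
# valid_ips 'googlebot\.com' < access.log > valid-ips.txt
# grep -F -f valid-ips.txt access.log
```

Deduplicating first means `host` runs once per IP rather than once per log line, which is where the grep -f versus host cost difference comes from.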

One caveat is that --google does not include requests made by Feedfetcher, since those are a little different:

Google has several other user-agents, including Feedfetcher (user-agent Feedfetcher-Google). Since Feedfetcher requests come from explicit action by human users who have added the feeds to their Google home page or to Google Reader, and not from automated crawlers, Feedfetcher does not follow robots.txt guidelines.

Feedfetcher isn’t a crawler and doesn’t use Googlebot in its user agent strings, and its IP addresses reverse-resolve to google.com, not googlebot.com. But filtering for Feedfetcher requests to get subscriber details is easy enough:

$ cat access.log |
    grep -E 'Feedfetcher-Google[^"]+"$' |
    verify-ip --domain 'google\.com' |
    # ... further processing to compute subscriber counts
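As one illustration of that last step, Feedfetcher’s user agent reports the feed’s reader count as “N subscribers”, so a rough count can be pulled straight out of the matched lines. The exact user agent format and the `subscriber_count` helper here are assumptions, not part of verify-ip:

```shell
# Hypothetical final stage: Feedfetcher user agents include a phrase
# like "3 subscribers"; report the largest count seen in the log.
subscriber_count() {
  grep -Eo '[0-9]+ subscribers' |
    awk '{ n = $1 + 0; if (n > max) max = n } END { print max + 0 }'
}
```

Taking the maximum rather than a sum avoids double-counting when the same feed is fetched many times over the log’s time span.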

If there are other automated bots that can be verified by user agent string and reverse DNS domain, this should work fine for them too. Just don’t trust the user agent string alone if it’s important.