verify-ip is a command-line tool that filters access log lines by doing a reverse DNS lookup on each IP address and keeping only the lines whose domain matches a specified pattern.
When you need to filter access logs down to only, say, Googlebot web crawler requests, you can’t rely on the user agent string alone, because it is easily spoofed to match one of the official Googlebot strings. Luckily, all Googlebot requests come from an IP address associated with a googlebot.com domain, so you can use a reverse DNS lookup to confirm you’re dealing with a valid request:
$ host 66.249.76.91
91.76.249.66.in-addr.arpa domain name pointer crawl-66-249-76-91.googlebot.com.
Because someone could just buy evildomain.com and then create googlebot.com.evildomain.com, you have to take a bit more care and ensure that googlebot.com. is a suffix of the name, not just a match anywhere in it. Someone could also register myevilgooglebot.com, so the suffix must also begin at a label boundary (the start of the name or right after a dot). Any automated script needs to reject both kinds of fake domain.
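As a rough illustration of the suffix rule, the check can be written as a small POSIX shell function. The name is_subdomain_of is hypothetical and is not part of verify-ip; this is a sketch of the idea rather than the tool's actual implementation, and it assumes both names are already lowercase.

```shell
# Hypothetical helper (not part of verify-ip): succeed only if hostname $1
# is the domain $2 itself or a subdomain of it.
is_subdomain_of() {
    host_name="${1%.}"   # drop the trailing dot that host prints
    domain="${2%.}"
    case "$host_name" in
        # Exact match, or a suffix that starts right after a dot.
        "$domain" | *".$domain") return 0 ;;
        *) return 1 ;;
    esac
}
```

With this sketch, crawl-66-249-76-91.googlebot.com. is accepted for googlebot.com, while googlebot.com.evildomain.com. and myevilgooglebot.com. are both rejected.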
verify-ip does this procedure for a general domain that you specify, and outputs any valid lines.
$ cat access.log | verify-ip --domain 'googlebot\.com'
For this command, (anything.)googlebot.com will match, but none of the fake domains will come out as valid.
It comes with a
- Uses the list of known Googlebot user agent strings as a pre-filter on the access lines.
verify-ip does an initial pass through the input lines to first construct a set of unique IP addresses, and the valid IP addresses are used on a second pass. When using
host is more expensive than
grep -f when there are many duplicate IP addresses.
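The two-pass idea might look roughly like the following sketch. It is not verify-ip's actual code: lookup_ptr and filter_verified are hypothetical names, the client IP is assumed to be the first whitespace-separated field of each log line, and an awk lookup stands in for grep -f on the second pass so that only the IP field is compared.

```shell
# Hypothetical helper: print the PTR name for IP $1, or nothing if there is none.
lookup_ptr() {
    host "$1" 2>/dev/null | awk '/domain name pointer/ { print $NF; exit }'
}

# Hypothetical helper: print the lines of log file $1 whose client IP
# reverse-resolves to domain $2 or one of its subdomains.
filter_verified() {
    log="$1"
    domain="${2%.}"
    valid=$(mktemp) || return 1

    # Pass 1: resolve each distinct IP exactly once and record the valid ones.
    awk '{ print $1 }' "$log" | sort -u | while read -r ip; do
        name=$(lookup_ptr "$ip")
        name="${name%.}"            # drop the trailing dot that host prints
        case "$name" in
            "$domain" | *".$domain") printf '%s\n' "$ip" ;;
        esac
    done > "$valid"

    # Pass 2: keep only the lines whose first field is a verified IP.
    awk 'NR == FNR { ok[$1]; next } $1 in ok' "$valid" "$log"
    rm -f "$valid"
}
```

Calling filter_verified access.log googlebot.com would then run one DNS lookup per distinct address, however many times that address appears in the log.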
One caveat is that
Google has several other user-agents, including Feedfetcher (user-agent Feedfetcher-Google). Since Feedfetcher requests come from explicit action by human users who have added the feeds to their Google home page or to Google Reader, and not from automated crawlers, Feedfetcher does not follow robots.txt guidelines.
Feedfetcher isn’t a crawler and doesn’t use Googlebot in its user agent strings, and its IP addresses reverse-lookup to googlebot.com. But filtering out Feedfetcher requests to get subscriber details is easy enough:
$ cat access.log |
    grep -E 'Feedfetcher-Google[^"]+"$' |
    verify-ip --domain 'google\.com' |
    # ... further processing to compute subscriber counts
If there are other automated bots that use user agent strings and domains for verification, this should work fine for those too. Just don’t trust only the user agent string if it’s important.