Mining Website Logs For Robot Crawlers
We recently worked on a project where we found that a client’s entire site had been screen-scraped and republished at a different location without their permission. This, of course, led to a drop in search engine traffic once the copied site was flagged as duplicate content.
The client receives a ton of traffic each day, and none of the traditional visitor analytics tools (Google Analytics, Piwik) were providing any real insight into what or who was crawling their site.
We decided to bring out our old tool, Perl, to help data mine for the robots.
First, let me state that this will by no means stop or detect every robot that is nefariously crawling your website. A really crafty developer would mask their robot in its code, reporting it as, say, a Chrome browser running on Windows 7, and would take their time, slowly/carefully/randomly downloading pages to grab the content.
The purpose of the exercise below was to see which self-reported robots were mining our site for links and/or possible competitive link ranking.
Our first step was tuning our Perl script to parse through the web server log file. This Perl code will be different for every type of web server. For Apache, we usually use the following line, which matches most default installs of Apache on Debian/Ubuntu, to parse a single line of the log file.
my ($host,$date,$url_with_method,$status,$size,$referrer,$agent) = $line =~ m/^(\S+) - - \[(\S+ [\-|\+]\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$/;
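For context, here is a minimal sketch of how that line fits into a full parsing loop. The file name "access.log" and the loop structure are our own illustration; point it at your actual log file.

#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch: read a combined-format Apache log line by line and
# pull the fields out with the regex above.
open(my $fh, '<', 'access.log') or die "Cannot open log: $!";

while (my $line = <$fh>) {
    chomp $line;
    my ($host, $date, $url_with_method, $status, $size, $referrer, $agent) =
        $line =~ m/^(\S+) - - \[(\S+ [\-|\+]\d{4})\] "(\S+ \S+ [^"]+)" (\d{3}) (\d+|-) "(.*?)" "([^"]+)"$/;
    next unless defined $agent;    # skip any line that does not match
    # ... tally or inspect $agent here ...
}
close($fh);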
For our IIS folks, you could try to pattern-match the lines as well, but we highly recommend downloading and installing Microsoft’s cool free tool, Log Parser (Download Here). It will parse/understand nearly any type of IIS configuration for you. For instance, if you just wanted to pull out a list of agents in a log file, you would run the following:
LogParser -i:IISW3C -o:TSV "SELECT cs(User-Agent) FROM iislogfile.log" > agents_out.txt
That would send all of the agents found in the log file into a text file called “agents_out.txt”.
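If you want Log Parser to do the counting for you as well, a query along these lines should work. This is not from the original exercise, just a sketch using Log Parser's standard SQL-like GROUP BY syntax; the output file name is our own:

LogParser -i:IISW3C -o:TSV "SELECT cs(User-Agent), COUNT(*) AS Hits FROM iislogfile.log GROUP BY cs(User-Agent) ORDER BY Hits DESC" > agent_counts.txt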
Regardless, the next step is to pull all of the agents into a hash, count the number of times each appears, and then run some basic business logic to determine whether the user agent is a real user or a possible robot. We came up with some general rules that pulled out all of the robots we were looking for.
#SNIPPET
if($agent =~ m/(bot|spider|link|java|wordpress)/i){
    $botspect .= $agent."\n";
}elsif(length($agent) < 70){
    $botspect .= $agent."\n";
}
We realize that the code could be more elegant, by combining all of the criteria matches into one “IF” statement, but for the sake of rapid prototyping and ease of testing the rules, this worked best for our needs. Efficiency is not the main purpose of this data mining project.
Our basic rules above check if the agent contains any of the target words (bot, spider, link, java, wordpress) or if the agent string is less than 70 characters.
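Putting the pieces together, a rough sketch of the counting-plus-filtering step might look like the following. The hash name %agent_count is our own; the tally line would sit inside the parsing loop shown earlier.

# Sketch: %agent_count maps agent string => hit count. It would be
# filled inside the parsing loop above with: $agent_count{$agent}++;
my %agent_count;

# After the whole log has been read, apply the rules to the unique agents.
my $botspect = '';
foreach my $agent (keys %agent_count) {
    if ($agent =~ m/(bot|spider|link|java|wordpress)/i) {
        $botspect .= $agent."\n";   # rule 1: contains a target word
    } elsif (length($agent) < 70) {
        $botspect .= $agent."\n";   # rule 2: suspiciously short agent string
    }
}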
Once our list was processed, it was dumped into a basic CSV file containing the agent string and the number of times it appeared in the log. Granted, the “hits” for an agent may be conflated if multiple bots are browsing with the exact same agent string. Our purpose is just to see who is visiting our site and then decide whether having them use our bandwidth is worth it or not.
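The sort-and-dump step can be as simple as the sketch below, which continues from the %agent_count hash above; the output file name "agents_report.csv" is our own example.

# Sketch: dump the tallied agents to a CSV, highest hit count first.
open(my $out, '>', 'agents_report.csv') or die "Cannot open output: $!";
print $out "agent,hits\n";
foreach my $agent (sort { $agent_count{$b} <=> $agent_count{$a} } keys %agent_count) {
    (my $quoted = $agent) =~ s/"/""/g;        # escape embedded quotes for CSV
    print $out "\"$quoted\",$agent_count{$agent}\n";
}
close($out);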
Below is a sample of our robots, sorted by most hits to least.
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 19541
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) 12423
Mozilla/5.0 (compatible; AhrefsBot/4.0; +http://ahrefs.com/robot/) 4217
Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html) 2565
msnbot/2.0b (+http://search.msn.com/msnbot.htm) 1987
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm 1247
checks.panopta.com 694
Mozilla/5.0 (compatible; Ezooms/1.0; [email protected]) 634
Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html) 628
MobileSafari/8536.25 CFNetwork/609.1.4 Darwin/13.0.0 496
Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) 346
ShopWiki/1.0 ( +http://www.shopwiki.com/wiki/Help:Bot) 276
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm) 276
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 234
Mozilla/5.0 (compatible; MJ12bot/v1.4.3; http://www.majestic12.co.uk/bot.php?+) 186
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) 158
AdsBot-Google (+http://www.google.com/adsbot.html) 156
Mozilla/5.0 (compatible; discoverybot/2.0; +http://discoveryengine.com/discoverybot.html) 144
Googlebot-Image/1.0 134
Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/) 119
Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.0.11) Firefox/1.5.0.11; 360Spider 110
WordPress/3.5.1; http://yaggalife.com 109
Mozilla/5.0 (textmode; U; Linux i386; en-US; rv:3.0.110.0) Gecko/20101006 EzineArticlesLinkScanner/3.0.0g 90
WordPress/3.5.1; http://jainkanaknandi.org 81
AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari 70
rarely used 68
Mozilla/5.0 (compatible; Mail.RU_Bot/2.0; +http://go.mail.ru/help/robots) 55
....
The list is pretty basic and contains most of what you would expect: Google, MSN, Bing. The WordPress entries were pretty suspicious in our opinion, and we are still looking into what type of traffic they are bringing to the site.
So, using this basic tool, we were able to go through and selectively write .htaccess rules to ban certain robots whose crawling is excessive and provides zero benefit to our client’s services.
We also blocked the “lazy” crawler authors, i.e. any crawler that left its agent string as the default for its programming language’s HTTP library (e.g. PHP/5.2.10, curl, Java/1.7.0_09, libwww-perl/5.805).
We figured it was a good block, since it looks like somebody was trying to automate some crawling process.
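For illustration, this kind of blocking can be done with mod_setenvif in the site’s .htaccess file. This is just a sketch using Apache 2.2-style access directives; “SomeAggressiveBot” is a placeholder for whichever reported bots you decide to ban, and the library patterns match the default agent strings mentioned above.

# Sketch: flag unwanted user agents and deny them (Apache 2.2 syntax).
SetEnvIfNoCase User-Agent "SomeAggressiveBot" bad_bot
SetEnvIfNoCase User-Agent "libwww-perl"       bad_bot
SetEnvIfNoCase User-Agent "^Java/"            bad_bot
SetEnvIfNoCase User-Agent "^curl"             bad_bot

Order Allow,Deny
Allow from all
Deny from env=bad_bot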
The “All Star” crawler of the bunch was the one that left the literal text “User-Agent:” embedded in its agent string (i.e. “User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31”. Note from us: you had one job!).
So, using a very basic data-mining technique of iteratively simplifying the data pool, we were able to figure out which bots were crawling the site and how frequently. This data sample covered only a few days’ worth of logs, but it could be expanded with a database and trended historically over time to determine whether any of the bots increase their activity significantly. As we mentioned in the beginning, this will NOT trap the really clever content scrapers, just the robots that report themselves.
Note: To trap content scrapers, we have a few tricks up our sleeves using honeypot links. Contact Us for more information.