Scraping the web for fun and profit

Crawling and scraping rarely get discussed in a security context because everyone is too busy creating cute mashups and messaging their MySpace friends.

I recently read Webbots, Spiders, and Screen Scrapers from No Starch Press. The author uses PHP/CURL for all his examples and provides interesting ideas for scrapers. Most of these ideas can be handled by one of two tools: Google Alerts or Dapper.Net.

Google Alerts is one of my favorite web applications. I use it all the time, because it constantly sends me emails about things I am interested in. Choosing keywords for it is extremely important - a skill that I believe is not only necessary, but one that improves with experience and pays off well in the end. There is a limit of 1,000 keywords - but I figure you can open more Gmail accounts if you want more Google Alerts.

Another great aspect of having Google Alerts in your Gmail is that you can not only search them, but also sort them into archived tags and then search on the tags as well. My personal favorite feature of Google Alerts is being notified immediately on certain keywords. I have noticed very little lag (1-2 minutes, possibly less) between when a word first appears on a high-traffic site and when the alert drops into my Gmail.

Marcin uses Dapper, and has run into the same legal issues that Pamela Fox describes in her presentation on Web 2.0 & Mashups: How People can Tap into the "Grid" for Fun & Profit. We talked about how a legal case against this kind of scraping probably won't stand up - unless you are actively leeching content, using it for parasite hosting, and/or making money off it somehow. It could be against an AUP, and therefore your account could be disabled - but as long as you can create a new account I think this sort of activity will continue.

Marcin also got me more interested in scraping when he pointed me towards iOpus and I found out about iMacros. I had used a few similar scraping tools in the past, and wanted to put together a collection of scraping tools for those who are unable to benefit from Google Alerts or Dapper: for example, when the target is at home, on a local machine, on an intranet, or on any other website that Googlebot and Dapper scrapes cannot reach.
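To make the intranet case concrete, here is a minimal sketch of a homemade keyword alert in Python, using only the standard library. It is not how Google Alerts or Dapper work internally; the names TextExtractor and find_keywords and the sample page are made up for illustration, and the sketch assumes you have already fetched the page HTML as a string.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style blocks.
        if not self._skip_depth:
            self.chunks.append(data)


def find_keywords(html, keywords):
    """Return the keywords (hypothetical watch list) found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    hits = []
    for kw in keywords:
        # Whole-word, case-insensitive match.
        if re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE):
            hits.append(kw)
    return hits


# Stand-in for a page pulled from an internal site.
SAMPLE_PAGE = (
    "<html><body><h1>Patch Tuesday</h1>"
    "<script>var x = 'toorcon';</script>"
    "<p>New XSS advisory posted</p></body></html>"
)

matches = find_keywords(SAMPLE_PAGE, ["xss", "toorcon", "advisory"])
print(matches)
```

In practice you would fetch each internal URL on a schedule (urllib.request.urlopen is one option), run the page through find_keywords, and mail yourself whenever a new hit shows up.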

Some say that everything started with Perl, and in the case of scraping this is almost certainly true. Randal Schwartz wrote an article for Linux Magazine almost 5 years ago regarding WWW::Mechanize. Perl has evolved to include a few other parsing modules, including HTML::TokeParser, HTML::PullParser, and [IMO best] XML::Smart::HTMLParser. However, most scrapers in scripting languages evolved or copied from WWW::Mechanize.

In fact, Ruby's primary scraper is called exactly that: mechanize. It relies on Hpricot, an HTML parser for Ruby, which Jonathan Wilkins also recently blogged about while trying to find a Ruby equivalent to Python's setattr. Ruby also has another scraping toolkit, called scRUBYt, that is most certainly worth checking out, even for a novice.

One of the latest toolkits for parsing comes from the Python camp. Called pyparsing, it appears to be something Google would use to scrape the entire Internet. Of course, other Python users will be familiar with BeautifulSoup, which has been a very popular and powerful parsing library over the past few years, mostly because it handles invalid markup well, much like XML::Smart from Perl.
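To show what "handles invalid markup well" means in practice, here is a small Python sketch. It deliberately uses the standard library's html.parser as a stand-in rather than BeautifulSoup or pyparsing, so it runs without installing anything; LinkExtractor, extract_links, and the MESSY sample are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, however sloppy the page is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Unclosed <p>, <b>, and <a> tags, an unquoted attribute value,
# an uppercase attribute name, and no closing </html>.
MESSY = (
    '<html><body><p>Unclosed paragraph <a href="/one">one'
    "<p><b>bold <a href=/two>two</a>"
    '<a HREF="/three">three</body>'
)

links = extract_links(MESSY)
print(links)  # ['/one', '/two', '/three']
```

An event-style parser like this never builds a tree, so unclosed tags simply don't matter to it. BeautifulSoup goes further and repairs the markup into a tree you can navigate, which is what you want for anything beyond pulling out a few attributes.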

So let's say you have malformed, unvalidated HTML. What's the best way to handle it in languages besides Perl and Python? Well, Ruby has RubyfulSoup (same website). For a language-independent option, one could also use HTML TIDY (and here are the bindings for Ruby). Sylvan von Stuppe first mentioned NekoHTML on his website, and then went on to go through his ideas on scraping using Groovy. In this exhaustive list of MITM proxies and web application security testing tool ideas, he also mentions that HTMLUnit uses both the Commons HTTPClient and NekoHTML. We'll talk more about HTMLUnit and related utilities in a future blog post.

I want to wrap this blog post up here because I'm on my way to San Diego to speak at Toorcon. I'll be covering security testing as it relates to the three phases of application lifetime: the programming phase, the testing and pre-deployment phase, and operations/maintenance. Hope to see you there!

Posted by Dre on Thursday, October 18, 2007 in Security.