Don't Scrape Me Bro

2012-01-23

Web browsers have advanced nearly to the state where a web application can feel like a desktop application. Much of this is a result of advanced JavaScript engines in modern browsers. I think we can all agree that these changes have been good for the experience. But one particular area is suffering: generic search.

DevonAgent

I use DevonAgent for more sophisticated searches. It’s a surgical tool for finding a needle in a haystack.[1] But it is becoming increasingly difficult to search with a tool that does not run a JavaScript engine.

Here’s an example. If I want to search Getty Images there is a search string that looks like this:

http://www.gettyimages.com/Search/Search.aspx?contractUrl=2&language=en-US&family=creative&lic=rf&assetType=image&p=sad+clown
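As a rough sketch of why a URL like this is friendly to tools, here is how the query string could be pulled apart and reassembled with Python’s standard library. The helper name `with_search_phrase` is made up for illustration, and treating `p` as the search phrase is an assumption read off the URL, not something Getty documents here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The search URL from the post, split across lines for readability.
SEARCH_URL = (
    "http://www.gettyimages.com/Search/Search.aspx"
    "?contractUrl=2&language=en-US&family=creative"
    "&lic=rf&assetType=image&p=sad+clown"
)


def with_search_phrase(url, phrase):
    """Return url with its 'p' parameter replaced by phrase.

    Hypothetical helper: assumes 'p' carries the search phrase.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # '+' is decoded to a space
    params["p"] = phrase
    # urlencode re-encodes spaces as '+', matching the original form.
    return urlunsplit(parts._replace(query=urlencode(params)))


# The named parameters as a plain dictionary.
params = dict(parse_qsl(urlsplit(SEARCH_URL).query))

if __name__ == "__main__":
    for name, value in params.items():
        print(name, "=", value)
    print(with_search_phrase(SEARCH_URL, "happy clown"))
```

Swapping the phrase is a one-line change, which is exactly what a search tool wants from a site.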

That URL looks very usable for an application like DevonAgent. The parameters are nicely named and there aren’t that many of them. But if I try to scrape or extract image elements from the resulting page, very little is returned. Go ahead, take a second to look at the source of that page. I’ll wait…

[Image: Sad Clown is Sad]

See, no images. All of the images displayed in the web browser are returned from a JavaScript function. That means tried-and-true tools like curl are useless for extracting information. I understand that Getty wants me to purchase the images, and that’s exactly why I search there. I WANT to buy them. I just don’t want to page through their user interface to find an image I like. I also want to leverage more powerful search tools like DevonAgent. I seem to run into sites like this about once a week.
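To make the problem concrete, here is a minimal sketch of the kind of static extraction a tool without a JavaScript engine can do, using only Python’s standard library. Both sample pages are invented for illustration (they are not Getty’s actual markup), and `loadResults` is a hypothetical script function standing in for whatever the real page calls.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return image URLs found in the markup, without running scripts."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


# Invented example: images are present in the HTML the server sends.
STATIC_PAGE = (
    '<html><body>'
    '<img src="/thumbs/clown-1.jpg"><img src="/thumbs/clown-2.jpg">'
    '</body></html>'
)

# Invented example: the server sends an empty container and a script
# that would fill it in later, inside a real browser.
SCRIPTED_PAGE = (
    '<html><body><div id="results"></div>'
    '<script>loadResults("sad clown");</script>'
    '</body></html>'
)

if __name__ == "__main__":
    print(extract_image_sources(STATIC_PAGE))
    print(extract_image_sources(SCRIPTED_PAGE))
```

The second page has nothing to find because the parser never executes the script, and the same is true of anything fetched with curl.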

Sites like these are forcing me to deploy heavier artillery like Fake app or Python libraries that actually run a browser instance.[2] Maybe that’s the point. If scraping is difficult enough, I will just browse the site and look at ads.

<li id="fn:1">I don’t want to dive into the application. It’s very complex and could take several hundred words just to describe what’s possible with DevonAgent. <a class="reversefootnote" title="return to article" href="#fnref:1"> ↩</a></li>

<li id="fn:2">There’s also the <a href="http://pypi.python.org/pypi/selenium">Selenium package</a> for Python and other languages. These tools are much more complex than something like <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, which is fast at extracting from an HTML or XML hierarchy. <a class="reversefootnote" title="return to article" href="#fnref:2"> ↩</a></li>