Don't Scrape Me Bro

January 23, 2012 by Gabe | [mmd] |

Web browsers have advanced nearly to the point where a web application can feel like a desktop application. Much of this is the result of advanced JavaScript engines in modern browsers. I think we can all agree that these changes have been good for the experience. But one particular area that is suffering is generic search.


I use DevonAgent for more sophisticated searches. It’s a surgical tool for finding a needle in a haystack.[1] But it is becoming increasingly difficult to leverage a search tool that does not run a JavaScript engine.

Here’s an example. If I want to search Getty Images there is a search string that looks like this:

That looks very usable for an application like DevonAgent. The parameters are nicely named and there aren’t that many of them. But if I try to scrape or extract image elements from the resulting page, very little is returned. Go ahead, take a second to look at the source of that page. I’ll wait…

[Image: "Sad Clown is Sad". Sad Clown by 'Jerry' on Flickr]

See, no images. All of the images displayed in the web browser are inserted by a JavaScript function after the page loads. That means tried-and-true tools like curl are useless for extracting information, because they fetch the page source but never run its scripts. I understand that Getty wants me to purchase the images, and that's exactly why I search there. I WANT to buy them. I just don't want to page through their user interface to find an image I like. I also want to leverage more powerful search tools like DevonAgent. I seem to run into sites like this about once a week.
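Here is a rough sketch of what a non-JavaScript tool sees. It uses only Python's standard `html.parser` module, and both page snippets are hypothetical stand-ins I made up for illustration, not Getty's actual markup. The first mimics a page whose results are built by a script in the browser. The second mimics an old-style page with the images right in the source.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag found in static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Only tags present in the raw source are seen here; scripts never run.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return the image URLs a static scraper could find in the given source."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical page: the results container is empty until a script fills it.
SCRIPTED_PAGE = """
<html><body>
<div id="results"></div>
<script>
  loadResults(function (items) {
    /* image elements get created here, inside the browser */
  });
</script>
</body></html>
"""

# Hypothetical page: the same results delivered as plain markup.
STATIC_PAGE = """
<html><body>
<div id="results">
  <img src="/thumbs/clown-1.jpg">
  <img src="/thumbs/clown-2.jpg">
</div>
</body></html>
"""

print(extract_image_sources(SCRIPTED_PAGE))
print(extract_image_sources(STATIC_PAGE))
```

The scripted page yields an empty list because the parser only reads what the server sent. Getting at those images means running the script, and that takes a real browser engine.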

That’s forcing me to deploy heavier artillery like Fake app or Python libraries that actually run a browser instance.[2] Maybe that’s the point. If scraping is difficult enough, I will just browse the site and look at ads.

  1. I don’t want to dive into the application. It’s very complex and could take several hundred words just to describe what’s possible with DevonAgent.
  2. There’s also the Selenium package for Python and other languages. Tools like these are much more complex than something like BeautifulSoup, which is fast at extracting elements from an HTML or XML hierarchy but does not run JavaScript.