Wednesday, May 26, 2010

Screen Scraping with Groovy

I'm attempting to write a screen scraper for various websites, so I will document my findings here.

A quick search on the internet turns up examples like the following:


#!/usr/bin/env groovy
// Depends on tagsoup library:
//      http://ccil.org/~cowan/XML/tagsoup/
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def url = new URL("http://fcd.mcw.edu/?module=faculty&func=view&id=1674")
url.withReader { reader ->
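       def html = slurper.parse(reader)
       // We should now have a parsed document; the path below is specific to this page's layout
       def value = html.body.div.div.div[2].ul.li
       value.list().each { f ->
               def text = f.toString()
               // Truncate long entries; slicing [0..80] on a shorter string would throw
               println "\nPub : " + (text.size() > 81 ? text[0..80] + "..." : text)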
       }
}
This is super simple and works really well.
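To get a feel for the GPath expressions used above without hitting a live site, here is a minimal sketch that runs them against an inline, well-formed fragment. The HTML, the id "pubs" and the link paths are made up for illustration, and the import assumes Groovy 3 or later (on older versions, drop the import, since groovy.util.XmlSlurper is available by default).

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; older versions use groovy.util.XmlSlurper without an import

// Made-up, well-formed page fragment (an assumption, not a real site)
def page = '''<html><body>
  <div id="pubs">
    <ul>
      <li><a href="/p/1">First paper</a></li>
      <li><a href="/p/2">Second paper</a></li>
    </ul>
  </div>
</body></html>'''

def html = new XmlSlurper().parseText(page)

// Positional path, in the same style as html.body.div.div.div[2].ul.li
def titles = html.body.div.ul.li.collect { it.a.text() }

// Depth-first search on an attribute, which tends to survive layout changes better
def pubs = html.'**'.find { it.@id == 'pubs' }
def links = pubs.ul.li.a.collect { it.@href.text() }

println titles
println links
```

Positional paths break as soon as the site adds or removes a div, so searching on an id or class is usually the safer choice when the page offers one.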
My problem was that I also needed to post data to the websites (e.g. to log in, enter data, etc.).
For that you need to incorporate HttpClient. For Groovy there is the HTTPBuilder library, which wraps the HttpClient libraries in Groovy syntax. It also lets you use GPath expressions to quickly pick out locations in the response page.
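As a rough sketch of what posting a login form with HTTPBuilder might look like: the helper name fetchAfterLogin, the '/login' path and the form field names are all hypothetical and would need to match the target site's actual form. The version in @Grab is simply the one used elsewhere in this post. It relies on a single HTTPBuilder instance reusing its underlying HttpClient, so cookies set by the login response should be sent on the follow-up request.

```groovy
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.2')
import groovyx.net.http.HTTPBuilder
import static groovyx.net.http.ContentType.URLENC

// Hypothetical helper: log in with a form POST, then fetch a page that needs the session.
def fetchAfterLogin(String baseUrl, Map credentials, String protectedPath) {
    def http = new HTTPBuilder(baseUrl)

    // Send the credentials as application/x-www-form-urlencoded.
    // '/login' is an assumed path; check the action attribute of the real form.
    http.post(path: '/login', body: credentials, requestContentType: URLENC) { resp ->
        println "Login status: ${resp.status}"
    }

    // Same HTTPBuilder instance, so the session cookie is reused.
    http.get(path: protectedPath) { resp, html ->
        // Match the element name case-insensitively, since the HTML parser
        // may report element names in upper case.
        html.'**'.find { it.name().equalsIgnoreCase('title') }?.text()
    }
}

// Usage (needs a real site, so it is left commented out):
// println fetchAfterLogin('http://www.example.com', [username: 'me', password: 'secret'], '/account')
```

By default HTTPBuilder throws an exception for responses with status 400 or above, so a failed login on a site that answers with an error status shows up as an exception rather than a quiet failure.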
I needed to go through a proxy, and this proved to be the first hurdle. After much messing about, the code below worked.
(Note: I added an if statement that checks the current IP address of the machine, so the script works both at work, where we use a proxy, and at home, where I don't.)
 
N.B. HTTPBuilder can also be fetched using Grape. However, I had some problems getting this to work from behind a proxy. Check out this page for some tips:
http://groovy.codehaus.org/modules/http-builder/download.html
Also worth noting: when using @Grab (Grape), all files are pulled down to $HOME/.groovy/grapes

So you could also manually download the latest version of HTTPBuilder and install it by hand.
 
Eventually this worked for me (this was after grape resolve failed; I'm not sure why. Unsurprisingly, the script worked once the library was installed):
>grape install org.codehaus.groovy.modules.http-builder http-builder 0.6
 
 
@Grab(group='org.codehaus.groovy.modules.http-builder', module='http-builder', version='0.5.2' )
import groovyx.net.http.*
import static groovyx.net.http.ContentType.*
import static groovyx.net.http.Method.* 
def http = new HTTPBuilder( 'http://www.twitter.com' ) 
def ip=java.net.InetAddress.getLocalHost().getHostAddress() 
println "IP = $ip"
// Only configure the proxy when running on the work network
if(ip.startsWith("10.5.") || ip.startsWith("10.2.")){
 def proxy = "10.5.0.250"
 def proxyPort = 80
 //Required for HttpClient
 http.setProxy(proxy, proxyPort, "http")
 //http.setProxy(proxy, proxyPort, "https")
 http.auth.basic( proxy, proxyPort, System.properties["user.name"], System.getenv("user.password") )
}
http.get( path: '/', query:[id:'httpbuilder'] ) { resp, xml ->     
 println resp.status  
 println xml     
 xml.status.each {  // iterate over each XML 'status' element in the response:         
  println it.created_at.text()         
  println "  " + it.text.text()     
 } 
}
Like the simple example above (which used the TagSoup SAX parser), HTTPBuilder includes a parser that can handle HTML. Its documentation says:
  • HTML response data will also be parsed automatically, by using NekoHTML which corrects the XML stream before it is passed to the XmlSlurper. The resulting behavior is that you can parse HTML as if it was well-formed XML.
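The "parse HTML as if it was well-formed XML" behaviour can be seen without a network request by feeding a deliberately malformed fragment to a tolerant parser. The sketch below uses TagSoup, the parser from the first script, rather than NekoHTML, which plays the same role inside HTTPBuilder. The HTML fragment is made up, the Maven coordinates in @Grab are an assumption about where the library is published, and the import assumes Groovy 3 or later.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import groovy.xml.XmlSlurper  // Groovy 3+; older versions use groovy.util.XmlSlurper without an import
import org.ccil.cowan.tagsoup.Parser

// Made-up fragment with unclosed <li> and <p> tags and no closing </html>
def messy = '<html><body><ul><li>one<li>two</ul><p>unclosed paragraph</body>'

// TagSoup repairs the tag structure before XmlSlurper sees it
def html = new XmlSlurper(new Parser()).parseText(messy)

// GPath now works as if the input had been well-formed
def items = html.body.ul.li.collect { it.text() }
println items
```

A plain XmlSlurper without the TagSoup parser would reject this input with a parse error, which is why a tolerant parser is needed for most real-world pages.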