chtml-matcher

Description

chtml-matcher is a simple Lisp-based DSL for extracting information from HTML. The fancy way of describing it is that it performs pattern-based unification over HTML via a set of compiled nested closures.
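
To make the "compiled nested closures" idea concrete, here is a toy sketch in plain Common Lisp. It is not chtml-matcher's code, and the names variable-p and compile-pattern are hypothetical. It also performs only one-way matching of a pattern against data, a simplified form of unification, with none of the library's HTML-specific operators.

```lisp
;; Toy illustration only; not chtml-matcher's implementation.
;; VARIABLE-P and COMPILE-PATTERN are hypothetical names.

(defun variable-p (x)
  "True for symbols whose name starts with a question mark, such as ?URI."
  (and (symbolp x)
       (plusp (length (symbol-name x)))
       (char= (char (symbol-name x) 0) #\?)))

(defun compile-pattern (pattern)
  "Compile PATTERN into a closure of (datum bindings).
The closure returns two values: the extended bindings alist and a
success flag."
  (cond ((variable-p pattern)
         ;; A variable binds on first use and must agree on later uses.
         (lambda (datum bindings)
           (let ((seen (assoc pattern bindings)))
             (cond ((null seen) (values (acons pattern datum bindings) t))
                   ((equal (cdr seen) datum) (values bindings t))
                   (t (values nil nil))))))
        ((consp pattern)
         ;; Sub-patterns are compiled once, here, and captured by the
         ;; closure below; this is the nesting of closures.
         (let ((match-car (compile-pattern (car pattern)))
               (match-cdr (compile-pattern (cdr pattern))))
           (lambda (datum bindings)
             (if (consp datum)
                 (multiple-value-bind (new-bindings ok)
                     (funcall match-car (car datum) bindings)
                   (if ok
                       (funcall match-cdr (cdr datum) new-bindings)
                       (values nil nil)))
                 (values nil nil)))))
        (t
         ;; Any other atom (symbol, string, NIL) must match literally.
         (lambda (datum bindings)
           (if (equal pattern datum)
               (values bindings t)
               (values nil nil))))))

;; Usage: compile once, then apply the closure to a datum.
(defparameter *link-matcher*
  (compile-pattern '(a ((href ?uri)) ?text)))

(funcall *link-matcher* '(a ((href "/thread/1")) "Hello") nil)
```

The pattern is walked once at compile time; matching afterwards is only a chain of funcalls, which is the usual reason for compiling templates into closures.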

It uses the closure-html library to parse HTML into lhtml, a Lisp list representation of HTML. You pass a template list and the lhtml to (match-template template lhtml), which returns a bindings object containing an alist of all the extracted information. See the README for more details.
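
For readers who have not seen lhtml, the sketch below shows its general shape as a literal list, (tag attributes . children) with keyword tags and attribute names, and walks it by hand. The sample data and the helpers lhtml-attribute and find-elements are hypothetical and only illustrate the kind of traversal a template replaces. In real use the tree would come from closure-html's parser with its lhtml builder rather than from a literal.

```lisp
;; Hand-written stand-in for parser output; the sample page is made up.
(defparameter *sample-lhtml*
  '(:html nil
    (:head nil)
    (:body nil
     (:p ((:class "intro"))
      "See "
      (:a ((:href "/thread/1")) "thread one")
      "."))))

(defun lhtml-attribute (node name)
  "Return the value of attribute NAME on the lhtml element NODE, or NIL."
  (second (assoc name (second node))))

(defun find-elements (tag tree)
  "Collect, in document order, every element in TREE whose tag is TAG."
  (when (consp tree)                      ; strings are text nodes; skip them
    (append (when (eq (first tree) tag) (list tree))
            (loop for child in (cddr tree)
                  append (find-elements tag child)))))

;; Usage: pull the href out of the first link by hand.
(lhtml-attribute (first (find-elements :a *sample-lhtml*)) :href)
```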

The semantics are fairly intuitive but may take a little experimentation. The API is small, and package.lisp points to where to look. The whole library is under 1,000 lines of code, so it is easy enough to read through.

Download

The library is available via darcs:
darcs get http://www.common-lisp.net/project/chtml-matcher/darcs/chtml-matcher
It depends on my home-brew stdutil (available at common-lisp.net), closure-html, cl-ppcre, and f-underscore, although all but closure-html could be removed if necessary.

Example

I've recently been mining some posts from vBulletin sites. I go to the last day's posts, get a list of all the new posts, then go to the thread and grab the post body. The following two templates do 90% of the work. Of course, I have to write code to convert the data I extract to web page fetches, etc.


(defparameter *vbulletin-search-template*
  '(<tbody nil
    (all ?records
     (tr nil
      (td nil)
      (td ((class "alt1"))
       (div nil
        (a ((href ?thread-uri))
         ?thread-name)))
      (td ((class "alt2") (title ?activity))
       (div nil ?post-date
        (span nil ?post-time)
        (a ((href ?user-uri))
         ?username)
        (a ((href ?last-post-uri)))))))))

This looks for a table body in the search results page, then gets bindings for all matching <tr> elements and puts them within another bindings object bound to :records as specified by 'all'. The pattern pulls out all the user, thread, post and date information for all results. You can match elements on strings, regular expressions and arbitrary function calls as well.

Given a post id, I use subst to customize the following pattern so that it finds a particular post in a page. The substitution replaces 'post_message_?' with the unique id of the post; the customized pattern then returns the post's thread number and the entire post body.

(defparameter *vbulletin-post-template*
  `(<tbody nil
     (tr nil (<a ((name ?post-num))))
     (tr nil)
     (tr nil (?post-body <div ((id "post_message_?"))))))
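
The subst step mentioned above can be sketched as follows. The helper template-for-post and the variable *post-template-sketch* are hypothetical names, the template is a copy of the post template with its three rows written as siblings, and the "post_message_<id>" form of the div id is an assumption about the target page. The one detail worth noting is that subst compares with eql by default, so replacing a string needs :test #'equal.

```lisp
;; Hypothetical helper; not part of chtml-matcher.
(defparameter *post-template-sketch*
  '(<tbody nil
    (tr nil (<a ((name ?post-num))))
    (tr nil)
    (tr nil (?post-body <div ((id "post_message_?"))))))

(defun template-for-post (post-id &optional (template *post-template-sketch*))
  "Return a new template in which the placeholder id names POST-ID.
TEMPLATE itself is left unmodified."
  (subst (format nil "post_message_~A" post-id) ; assumed id format
         "post_message_?"
         template
         :test #'equal))                        ; strings are not EQL

;; Usage: build a template for one specific post.
(template-for-post 12345)
```

Since subst builds a new tree instead of modifying its argument, the same quoted template can be reused for every post.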

 
I use Firebug in Firefox to inspect the HTML tree, identify the most specific enclosing context I can, and then provide just enough structure to capture the data I want unambiguously. This approach is robust to many small HTML changes and should be reasonably fast.

Mailing Lists

Additional announcements, discussion and details may be found in the mailing list archives.