blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I am working towards a PhD in Computer Science at Kyoto University, finishing September 2009. I also work on some free software projects including the Sweep sound editor and the Annodex media system, and various smaller projects which you can read about here.

Saturday, 9 June 2007

Review: TagSoup

This week I've been playing with TagSoup, a Haskell library by Neil Mitchell and Henning Thielemann "for extracting information out of unstructured HTML code, sometimes known as tag-soup". This article introduces the basic usage of TagSoup, and discusses the functional approach to mining XML-like data.

  • Name: TagSoup
  • Version: darcs as at June 9, 2007; prior to release version 0.2
  • Functionality: Parsing possibly malformed HTML/XML
  • Inputs: String (also includes a URL -> IO String helper function)
  • License: BSD3

parseTags :: String -> [Tag]

The first thing you do with TagSoup is parse the document into a list of Tags (type [Tag]). The Tag type is fairly general, and can represent the various things that can occur when reading an HTML document: an opening or closing tag, the text between tags (like <strong>this</strong>), comments, or special tags like <!DOCTYPE ...>. It is also used to mark the location of syntax errors, though of course these are not fatal, as the whole point of the library is to work robustly around badly-formed input.
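As a rough illustration of what that list looks like, here is a minimal sketch. It assumes the tagsoup package as currently published on Hackage, where the tag type is parameterised over its string type (Tag String), unlike the darcs version reviewed here. The fragment and the names fragment and fragmentTags are made up for the example.

```haskell
-- Requires the tagsoup package from Hackage (an assumption about your setup).
import Text.HTML.TagSoup (Tag (..), parseTags)

-- A small, deliberately sloppy fragment: the <p> is never closed.
fragment :: String
fragment = "<p>Hello <strong>world</strong>"

-- In current releases the tag type carries its string type, hence Tag String.
fragmentTags :: [Tag String]
fragmentTags = parseTags fragment

-- Print one tag per line.
main :: IO ()
main = mapM_ print fragmentTags
```

The unclosed <p> does not cause a failure: the result is simply an opening p tag, the text "Hello ", an opening strong tag, the text "world", and a closing strong tag.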

XML Parsing

TagSoup actually contains no HTML-specific code, other than that it knows about HTML entities. To demonstrate that it can be used for other kinds of possibly malformed XML, I added an example which extracts information from an RSS feed (now part of Example.hs):

-- rssCreators Example: prints names of story contributors on
-- sequence.complete.org. This content is RSS (not HTML), and the selected
-- tag uses a different XML namespace "dc:creator".

rssCreators :: IO [String]
rssCreators = do
    tags <- liftM parseTags $ openURL "http://sequence.complete.org/node/feed"
    return $ map names $ partitions (~== "dc:creator") tags
    where
      names xs = innerText $ xs !! 1

This function is of type IO [String]: it uses IO, and returns a list of Strings -- the names of contributors. The first line:

    tags <- liftM parseTags $ openURL "http://sequence.complete.org/node/feed"

uses openURL (part of TagSoup) to read the contents of the given RSS feed into a String, and then runs parseTags on that, calling the result tags.

Extracting information

Now that we can use the XML as a list of Tags, the second line:

    return $ map names $ partitions (~== "dc:creator") tags

splits it up into separate partitions, starting a new partition wherever there is a tag that roughly matches <... dc:creator ...>. It then runs the function names on each partition, and returns the result. names simply grabs the text of the second element of each partition (xs !! 1), i.e. the text that immediately follows the opening <dc:creator> tag. Done:

*Example.Example> rssCreators
["dons","dons","dons","jgoerzen","dons","dons","dons","dons","dons","dons"]
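The same extraction can be tried without a network connection. The following is only a sketch: it assumes the tagsoup package as currently published on Hackage, where the pattern string is written with angle brackets ("<dc:creator>") and the text of a tag is read with fromTagText. The feed content and the names sampleFeed and creators are invented for the example.

```haskell
-- Requires the tagsoup package from Hackage (an assumption about your setup).
import Text.HTML.TagSoup (fromTagText, isTagText, parseTags, partitions, (~==))

-- Hypothetical inline feed standing in for the live RSS document.
sampleFeed :: String
sampleFeed = concat
  [ "<rss><channel>"
  , "<item><title>First</title><dc:creator>dons</dc:creator></item>"
  , "<item><title>Second</title><dc:creator>jgoerzen</dc:creator></item>"
  , "</channel></rss>"
  ]

-- Start a new partition at each <dc:creator> opening tag, then keep the
-- text node that immediately follows it (if there is one).
creators :: String -> [String]
creators doc =
  [ fromTagText t
  | (_ : t : _) <- partitions (~== "<dc:creator>") (parseTags doc)
  , isTagText t
  ]

main :: IO ()
main = print (creators sampleFeed)
```

Pattern matching on the first two elements avoids the partial xs !! 1, so a creator tag with nothing after it is skipped rather than causing a crash.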

A more complex example, using an external HTTP library

Simon Peyton Jones is a Free Software developer working on GHC at Microsoft Research in Cambridge, England. One of the examples given in Example.hs attempts to extract a list of his current research papers. With TagSoup's convenient but simple HTTP helper, that example fails to terminate because the server hangs. Here is a working version of that example using the new lazy ByteString version of the Haskell HTTP library:

-- compile with: ghc --make -o spj -Ldist/build -lHSHTTP1-3000.0.0 spj.hs

module Main where

import Text.HTML.TagSoup

import qualified Data.ByteString.Lazy.Char8 as BS
import Network.HTTP (rspBody)
import Network.HTTP.UserAgent as UA

spjPapers :: IO ()
spjPapers = do
        rsp <- UA.get "http://research.microsoft.com/~simonpj/"
        let tags = parseTags $ BS.unpack $ rspBody rsp
        let links = map f $ sections (~== "a") $
                    takeWhile (~/= TagOpen "a" [("name","haskell")]) $
                    drop 5 $ dropWhile (~/= TagOpen "a" [("name","current")]) tags
        putStr $ unlines links
    where
        f :: [Tag] -> String
        f = dequote . unwords . words . innerText . head . filter isTagText

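        -- the null check avoids calling last on an empty list when the text is a lone quote
        dequote ('\"':xs) | not (null xs) && last xs == '\"' = init xs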
        dequote x = x

main = spjPapers

This example is obviously a little more involved, but this is typical for a real-world example of scraping HTML -- the main content of the page is handwritten, and the enclosing content management system uses many non-standard elements and attributes.

The heart of that example is the links = ... declaration. It looks for the part of the page roughly between <a name="current"> and <a name="haskell">, breaks it up into sections (starting each section wherever there is a new <a ...> tag), and then runs some function f on each of those sections. To see this in more detail, we can read the declaration from the far end backwards. Starting with the parsed list of tags (:: [Tag]), do the following:

  1. drop everything until you get an opening tag matching <a ... name="current" ...>
  2. drop the next 5 tags (no matter what they are)
  3. take everything until you get an opening tag matching <a ... name="haskell" ... >
  4. We now have a big list of tags (:: [Tag]): split it up into sections, starting each section wherever there is a new <a ...> tag. This will give us a list of lists of tags (:: [[Tag]]).
  5. run f on each section's [Tag].
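The steps above can be tried on a toy page using nothing but the Prelude and Data.List. This is a sketch under stated assumptions: Node is a hypothetical stand-in for TagSoup's Tag type, sectionsBy is assumed to behave like TagSoup's sections (every suffix of the list that starts at a matching element), and the toy page needs only 2 tags dropped in step 2 rather than 5.

```haskell
import Data.List (tails)

-- Hypothetical stand-in for TagSoup's Tag type, just enough for this sketch.
data Node = Open String [(String, String)] | Text String | Close String
  deriving (Eq, Show)

-- True for an opening <a> tag whose name attribute equals the given anchor.
isAnchor :: String -> Node -> Bool
isAnchor n (Open "a" attrs) = lookup "name" attrs == Just n
isAnchor _ _ = False

-- True for any opening <a> tag.
isLink :: Node -> Bool
isLink (Open "a" _) = True
isLink _ = False

-- Assumed to mirror TagSoup's sections: every suffix that starts at a match.
sectionsBy :: (a -> Bool) -> [a] -> [[a]]
sectionsBy p xs = [s | s@(x : _) <- tails xs, p x]

-- Steps 1 to 4, read from the bottom line upwards.
linkSections :: [Node] -> [[Node]]
linkSections =
  sectionsBy isLink
    . takeWhile (not . isAnchor "haskell")
    . drop 2
    . dropWhile (not . isAnchor "current")

page :: [Node]
page =
  [ Text "intro"
  , Open "a" [("name", "current")], Close "a"
  , Open "a" [("href", "p1.pdf")], Text "Paper one", Close "a"
  , Open "a" [("href", "p2.pdf")], Text "Paper two", Close "a"
  , Open "a" [("name", "haskell")], Close "a"
  , Open "a" [("href", "old.pdf")], Text "Older paper", Close "a"
  ]

-- Step 5: keep the first text node of each section, if it has one.
titles :: [String]
titles = concatMap (\s -> take 1 [t | Text t <- s]) (linkSections page)

main :: IO ()
main = mapM_ putStrLn titles
```

Note that the first section runs all the way to the end of the selected region, so it also contains the second link. Taking only the first text node of each section is what keeps the titles distinct.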

And what about this magical function f? It takes the first item of TagText (text between tags) in a section, runs unwords . words on it to clean up the whitespace (by splitting it up into words, then joining the words back up with a single space between each), and finally removes any surrounding quotes if present. You could use a function like f to clean up tag text in any page you are scraping. Here, the final result is a nice, clean, plaintext list of the titles of Simon Peyton Jones's current research papers:

Constructor specialisation for Haskell programs
Faster laziness using dynamic pointer tagging
Scrap your type applications
A History of Haskell: being lazy with class
...
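The text-cleaning half of f does not depend on TagSoup at all, so it can be tried on its own. A minimal sketch follows; the names squish and cleanTitle are made up for the example, and the guard against an empty remainder is an extra safety check in this sketch.

```haskell
-- Collapse runs of whitespace (including newlines) into single spaces.
squish :: String -> String
squish = unwords . words

-- Strip one pair of surrounding double quotes, if both are present.
dequote :: String -> String
dequote ('"' : rest)
  | not (null rest) && last rest == '"' = init rest
dequote s = s

-- Clean the whitespace first, so that the quotes end up at the very edges.
cleanTitle :: String -> String
cleanTitle = dequote . squish

main :: IO ()
main = putStrLn (cleanTitle "  \"A History of Haskell:\n   being lazy with class\"  ")
```

The order of composition matters: dequote only looks at the first and last characters, so the surrounding whitespace has to be removed before it runs.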

Functions for extracting information

TagSoup provides a few useful functions for pulling apart web pages. Taken together with Haskell's list processing functions, the result is a very expressive, concise language for extracting information from web data. In the above examples, we've seen how various String and list handling functions from the Haskell Prelude (like words, unlines, unwords, takeWhile, drop and filter) can be used together with predicates like isTagText from TagSoup. TagSoup also provides operators for inexact matching, ~== and its negation ~/=. These allow you to do a loose match on tag contents, so that your HTML-scraping application has some resilience to minor changes in page generation.
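To make the difference between loose and exact matching concrete, here is a small sketch. It assumes the tagsoup package as currently published on Hackage, where the left operand of ~== is the tag found on the page and the right operand is the pattern. The tags actual and wanted are invented for the example.

```haskell
-- Requires the tagsoup package from Hackage (an assumption about your setup).
import Text.HTML.TagSoup (Tag (..), (~/=), (~==))

-- A tag as it might appear on a page, with more attributes than we care about.
actual :: Tag String
actual = TagOpen "a" [("class", "anchor"), ("name", "current")]

-- The pattern lists only the attributes that must be present.
wanted :: Tag String
wanted = TagOpen "a" [("name", "current")]

main :: IO ()
main = do
  print (actual ~== wanted)  -- loose match: extra attributes are ignored
  print (actual == wanted)   -- exact equality: attribute lists must be identical
  print (actual ~/= TagOpen "a" [("name", "haskell")])  -- negated loose match
```

If the page generator later adds or reorders attributes on the anchor, the loose match still succeeds, whereas a comparison with == would silently stop finding the tag.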

Lastly, functions like sections make use of the above predicates to divide the page up into similar parts. As we have seen in the above examples, this is useful for the common case where a page contains a list of items, and we want to extract the same kind of information from each of those items. We can write a function f to handle one item, then simply map f across all the contents of the page.
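The two splitting functions used in this article differ in one important way, which the following sketch tries to show on a plain list of strings. sectionsBy and partitionsBy are stand-ins written for this example; they are assumed to mirror the behaviour of TagSoup's sections and partitions, and the item list is invented.

```haskell
import Data.List (groupBy, tails)

-- Assumed to mirror TagSoup's sections: every suffix whose head matches.
sectionsBy :: (a -> Bool) -> [a] -> [[a]]
sectionsBy p xs = [s | s@(x : _) <- tails xs, p x]

-- Assumed to mirror TagSoup's partitions: disjoint chunks, each starting at a match.
partitionsBy :: (a -> Bool) -> [a] -> [[a]]
partitionsBy p = groupBy (\_ y -> not (p y)) . dropWhile (not . p)

items :: [String]
items = ["skip", "item", "a", "b", "item", "c"]

-- Handle one item (here: join everything after the marker), then map it
-- across all the chunks.
summaries :: [String]
summaries = map (unwords . drop 1) (partitionsBy (== "item") items)

main :: IO ()
main = do
  print (sectionsBy (== "item") items)
  print (partitionsBy (== "item") items)
  print summaries
```

With sectionsBy each result runs to the end of the input, so later items reappear in earlier sections; with partitionsBy every element belongs to exactly one chunk. Which one fits depends on whether the per-item function looks only at the start of its input or at all of it.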

Comparisons

The similarly-named Java TagSoup is "a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild". It tries to make badly formed HTML usable through a conventional SAX interface, which makes it useful for more general applications than the Haskell TagSoup, but far less expressive for the task of scraping a known web page.

The Python Beautiful Soup (and similarly its Ruby counterpart Rubyful Soup) "can turn even invalid markup into a parse tree." This takes quite a similar approach to the Haskell TagSoup, and also provides functions to print out the modified parse tree. It attempts to create a full DOM parse tree (using HTML-specific heuristics), so extracting information can involve walking the tree with syntax like head.nextSibling.contents[0].nextSibling. Nevertheless it provides content-based search functions like soup.find('p', align="center"), and with a few lambda functions can be quite expressive.

Conclusion

This TagSoup does one thing and does it very well: it provides a small set of useful abstractions for extracting information from HTML pages and other XML data. As the markup you are scraping is badly formed, TagSoup provides operators for inexact matching and does not attempt to coax the page into a conventional tree structure. The result is a very fast, list-based representation of page content which can be mined using TagSoup's Tag-specific functions and Haskell's usual list operations.


2 Comments:

Blogger Jim Collins said...

They really should change the name; John Cowan's TagSoup has been around for a long time.

13 June 2007 00:00  
Blogger ctnd said...

This post was interesting and informative.

23 May 2008 05:52  
