blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I am working towards a PhD in Computer Science at Kyoto University, finishing September 2009. I also work on some free software projects including the Sweep sound editor and the Annodex media system, and various smaller projects which you can read about here.

Monday, 14 April 2008

Continuation Fest 2008: Continuations for video decoding and scrubbing

Yesterday was Continuation Fest 2008, at the University of Tokyo's campus in Akihabara (a very nice venue!). It was very well attended; latecomers overflowed to a second room and participated by video conference. It was a little strange to see so many people interested in such an obscure, troublesome and malignant expressively powerful programming construct; the breadth of talks made for a very inspiring and practical introduction to the theory, applications and implementation of continuations in many different languages.

I recommend reading Kenichi Asai's introduction to delimited continuations (slides [PDF]). He introduced the shift and reset operators through the problem of expressing exceptional control flow, and then explained how to use these to type (ie. determine a concrete type for) printf. The main point was that shift/reset provide a high-level abstraction over control flow, with minimal impact on the implementation of your existing functions.

Oleg Kiselyov demonstrated some new code for transactional web applications, using delimited continuations for explicit state sharing between parallel connections. The result is that the user has a consistent view across multiple tabs are open on the same site, and the state is transactional so that there is no need for warnings like "Do not press the BUY button more than once!". He said that everyone already understands delimited continuations, they just don't realize it.

The topic of my presentation at Continuation Fest was Continuations for video decoding and scrubbing:

Playback of encoded video involves scheduling the decoding of audio and video frames and synchronizing their playback. "Scrubbing" is the ability to quickly seek to and display an arbitrary frame, and is a common user interface requirement for a video editor. The implementation of playback and scrubbing is complicated by data dependencies in compressed video formats, which require setup and manipulation of decoder state.

We present the preliminary design of a continuation-based system for video decoding, reified as a cursor into a stream of decoded video frames. Frames are decoded lazily, and decoder states are recreated or restored when seeking. To reduce space requirements, a sequence of decoded frames can be replaced after use by the continuation which created them. We outline implementations in Haskell and C.

I'll be introducing the code for this over the next few months. Whereas in my presentation about video player internals at BOSSA I outlined the problem space in designing a multimedia architecture, at Continuation Fest I tried to break it down into subproblems and considered useful data structures and programming techniques for dealing with them.

I got a lot of great feedback, and I think I succeeded in my mission to introduce this problem space to some really smart people. Thanks particularly to Chung-chieh Shan for some insightful ideas about how to deal with existing stateful codec implementations. It was also very interesting to talk with Shinji Kono about Continuation-based C (cBc) (slides [HTML tarball]), a C-like language capable of expressing continuations, non-local jumps, multiple function entry-points, and assorted other ways to shoot yourself in the foot. He suggested that it was designed for exactly the kind of thing I'm doing, and I'll be interested to try it out. It is implemented in a modifed GCC 4.x as an RTL code generator, so should now be (fairly) architecture-independent.

Thanks to the organizers of Continuation Fest 2008 for putting together such a useful and interesting event. I look forward to implementing just some of the things I learned :-)

Labels: ,

Tuesday, 25 March 2008

Release: HOgg 0.4.0

HOgg is a Haskell library and commandline tool for manipulating Ogg files. This release contains a bunch of code written during FOMS and LCA 2008, including a new sort subcommand and proper handling of Skeleton when merging and ripping files. Full details are in the release notes.

sort implementation

My favourite part is the implementation of the new sort subcommand:

sort :: [OggPage] -> [OggPage]
sort = sortHeaders . listMerge . demux

This is somewhat shorter than the equivalent C implementation, oggz-sort.cHaskell affords abstraction whereas in C it's a trade-off. sortHeaders is a long (21 line) function that re-orders header pages according to the Theora and Skeleton specifications, and listMerge is a generic list merging function, also used in the merge subcommand. demux is tiny:

demux :: (Serialled a) => [a] -> [[a]]
demux = classify serialEq
You can read that as "demux is classification by serial number": classify is a generic list function, classifying list elements according to some criterion you give it. Here, for example, the list of pages:
[Video0, Audio0, Video1, Audio1, Audio2, Audio3, Video2, Audio4, Video3, ...]
will get classified into two separate lists:
[[Video0, Video1, Video2, Video3, ...],
 [Audio0, Audio1, Audio2, Audio3, Audio4, ...]]
This is done lazily, meaning that the processing is done on the fly and big intermediate lists are not constructed in memory. Video0, Audio0 will be passed through listMerge and sortHeaders and written to disk by the consumer of sort well before Video103 and Audio5007 are seen.

Documentation improvements and self-checking

The help for each subcommand now contains long descriptions, mostly similar to the man pages of the Oggz tools. The descriptions also have explicit sections describing how Theora, Skeleton and chained files are handled. The example commandlines for each subcommand use the Ogg MIME types and file extensions that we are now recommending in Xiph.Org.

The best bit though is hogg selfcheck, which checks that the help examples are valid. It checks that all the example commandlines pass through getOpt without errors, and that all file extensions used in options are valid. This is the kind of nice touch which would have been a pain to code up in C, but fell out cleanly in the Haskell implementation. As it is fairly cheap to run (and printing help text is hardly a performance-critical operation), this option is also silently run after printing out any help output at all, so that such errors are more likely to be found and reported. The same commit that introduced hogg selfcheck also fixed two such documentation errors which were found by this option :-)

Labels: ,

Thursday, 6 December 2007

Release: HOgg 0.3.0

Hogg is a commandline tool for manipulating Ogg files. It has subcommands, like hogg chop for cutting out bits of video, hogg info for telling you about the codecs, and hogg dump for hexdumping the packet data. It's basically a re-implementation of most of the stuff in liboggz, but the new features in hogg 0.3.0 such as chopping out a section of a file and adding Ogg Skeleton metadata, are not yet in oggz-tools.
$ hogg help chop
chop: Extract a section (specify start and/or end time)
Usage: hogg chop [options] filename ...

Examples:
  Extract the first minute of file.ogg:
    hogg chop -e 1:00 file.ogg

  Extract from the second to the fifth minute of file.ogg:
    hogg chop -s 2:00 -e 5:00 -o output.ogg file.ogg

  Extract only the Theora video stream, from 02:00 to 05:00, of file.ogg:
    hogg chop -c theora -s 2:00 -e 5:00 -o output.ogg file.ogg

  Extract, specifying SMPTE-25 frame offsets:
    hogg chop -c theora -s smpte-25:00:02:03::12 -e smpte-25:00:05:02::04 -o output.ogg file.ogg
Nevertheless, I'm continuing to work on both liboggz and hogg. liboggz, in pure C, is faster; hogg, in pure (but unoptimised) Haskell, is more correct. I spent a few hours earlier today tracking down a corner case in liboggz, coincidentally triggered by the chopping routines in libannodex. It reminded me that one of my first realizations about Haskell was that its sanity-checker often tells you about forgotten corner cases of algorithms.

Labels: ,

Thursday, 15 November 2007

Survey: Haskell Unicode support

Haskell source is interpreted as UTF-8, but internally the data is stored as Unicode code points. However the generic show method does not serialize Strings as UTF-8 (when using GHC). So, when reading or writing documents it is necessary to introduce an explicit conversion from or to the desired character set. This article outlines how to use Unicode in Haskell, and surveys three alternatives for character set conversion: iconv, utf8-string and encoding, providing working examples for each.

Unicode in Haskell source

The Haskell Prime standardization wiki contains discussions of Unicode in Haskell Source, and of ways of handling Char as Unicode. In particular, GHC (as of release 6.6, early Jan 2006) interprets source files as UTF-8. Hence the following is a valid source file:
import System.Time

main :: IO ()
main = do
  time <- getClockTime
  cal <- toCalendarTime time
  putStrLn $ dayName $ ctWDay cal

dayName :: Day -> String
dayName d = case d of
              Monday -> "月曜日"
              Tuesday -> "火曜日"
              Wednesday -> "水曜日"
              Thursday -> "木曜日"
              Friday -> "金曜日"
              Saturday -> "土曜日"
              Sunday -> "日曜日"
The dayName function provides the Japanese name for a given Day. However the main function, which tries to print that onto stdout, dumps it without any character set conversion, truncating each character to 8 bits. In order to control the output charset, we need to use a Unicode conversion library. The three libraries iconv, utf8-string and encoding have similar purposes but some different features.

iconv

Description:Binding to C iconv() function
Author:Duncan Coutts
darcs gethttp://code.haskell.org/iconv/
Exports:Codec.Text.IConv
Interface:ByteString.Lazy
Advantages:Speed, coverage of charset support
Disadvantages:Portability: requires POSIX iconv()
This is a Haskell binding to the iconv() C library function, providing a lazy ByteString interface. The only module exported is Codec.Text.IConv, which provides a single function:
-- | Convert fromCharset toCharset input output
convert :: String -> String -> Lazy.ByteString -> Lazy.ByteString
where fromCharset and toCharset are the names of the input and output character set encodings, and input and output are the input and output text as lazy ByteStrings. An example program to convert the encoding of an input file, similar to the GNU iconv program, is given in examples/hiconv.hs. The guts of that program is:
        output = convert (fromEncoding config) (toEncoding config) input
which is somewhat clearer than the brain-damaged interface exported by the C library. Exceptions are provided for handling unsupported conversions, invalid and incomplete characters. These errors can be silently ignored if desired by calling convertFuzzy instead. As this library wraps the system iconv() implementation, all character sets supported on the underlying system are available. The Lazy.ByteString interface works directly on the memory buffers used by the C library, which may give a speed advantage for large conversions. Note however that the iconv() C library function is defined by POSIX.1-2001 and may not be available on some older systems. In most such cases it should be possible to install GNU libiconv separately.

utf8-string

Description:Simple UTF-8 conversion library
Author:Eric Mertens
darcs gethttp://code.haskell.org/utf8-string/
Exports:Codec.Binary.UTF8.String, System.IO.UTF8
Interface:String
Advantages:Simplicity
Disadvantages:Only supports UTF-8 conversions
This library contains both a simple module for data conversion with a String interface, and a useful IO module. The String conversion module, Codec.Binary.UTF8.String, provides two pairs of complementary encoding and decoding functions:
-- | Encode a string using 'encode' and store the result in a 'String'.
encodeString :: String -> String

-- | Decode a string using 'decode' using a 'String' as input.
-- | This is not safe but it is necessary if UTF-8 encoded text
-- | has been loaded into a 'String' prior to being decoded.
decodeString :: String -> String

-- | Encode a Haskell String to a list of Word8 values, in UTF8 format.
encode :: String -> [Word8]

-- | Decode a UTF8 string packed into a list of Word8 values, directly to String
decode :: [Word8] -> String
I guess "not safe" in the comment for decodeString refers to type-safety; for example this function doesn't stop you from trying to decode the same text twice, whereas if you tried that with the plain decode function, the compiler would point out your bug for you. To see how this might look in the wild, the following is a complete "Hello World" web application (err, CGI script) in Japanese:
import Codec.Binary.UTF8.String
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=utf-8"
    output $ renderHtml $ h1 << encodeString "おはよう御座います!"
The utf8-string library also includes an entire IO module, System.IO.UTF8, exporting print, putStr, putStrLn, getLine, readLn, readFile, writeFile, appendFile, getContents, hGetLine, hGetContents, hPutStr, hPutStrLn. These essentially wrap the default IO functions in encodeString and decodeString, which you may find convenient if you are doing lots of UTF-8 processing. This library is tiny, and implemented natively in Haskell so there are no portability issues. As it works directly on ByteStrings it should be sufficiently fast for practical purposes. Of course, if you need to do conversions to or from character sets other than UTF-8, you will need to use a different library.

encoding

Description:Native Haskell charset conversion library
Author:Henning Günther
darcs gethttp://code.haskell.org/encoding/
Exports:Data.Encoding.*, System.IO.Encoding
Interface:ByteString.Lazy
Advantages:Portable; covers more charsets than utf8-string
Disadvantages:Covers fewer charsets than iconv
Data.Encoding provides native Haskell implementations for encoding and decoding of many common character sets: ASCII, UTF8, UTF16, UTF32, ISO8859[1-16], CP125[0-8], KOI8R, and GB18030, as well as BootString (for Punycode). For each of these, it implements an Encoding interface:
{- | Represents an encoding, supporting various methods of de- and encoding.
     Minimal complete definition: encode, decode
 -}
class Encoding enc where
        -- | Encode a 'String' into a strict 'ByteString'. Throws the
        --   'HasNoRepresentation'-Exception if it encounters an unrepresentable
        --   character.
        encode :: enc -> String -> ByteString
        -- | Encode a 'String' into a lazy 'Data.ByteString.Lazy.ByteString'.
        encodeLazy :: enc -> String -> LBS.ByteString
        encodeLazy e str = LBS.fromChunks [encode e str]
        -- | Whether or not the given 'Char' is representable in this encoding. Default: 'True'.
        encodable :: enc -> Char -> Bool
        encodable _ _ = True
        -- | Decode a strict 'ByteString' into a 'String'. If the string is not
        --   decodable, a 'DecodingException' is thrown.
        decode :: enc -> ByteString -> String
        decodeLazy :: enc -> LBS.ByteString -> String
        decodeLazy e str = concatMap (decode e) (LBS.toChunks str)
        -- | Whether or no a given 'ByteString' is decodable. Default: 'True'.
        decodable :: enc -> ByteString -> Bool
        decodable _ _ = True
Notice that this interface provides exceptions for handling unrepresentable characters. Instances of Encoding can be found by importing charset-specific modules; each simply exports a value with the same name as the module, ie. Data.Encoding.ISO88592 exports ISO88592, which is an instance of Encoding. Here is a "Hello World" CGI in Polish, using ISO-8859-2:
import Data.Encoding
import Data.Encoding.ISO88592
import Data.ByteString.Char8
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=iso-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode ISO88592 "Cześć")
You'll notice the call to the unpack to convert the ByteString into a plain String as expected by Html. The encoding library also provides a way to select an encoding by name:
-- | Takes the name of an encoding and creates a dynamic encoding from it.
encodingFromString :: String -> DynEncoding
(Anything which is a DynEncoding is by definition an instance of Encoding). So we could choose the encoding at runtime, or we can just be lazy and pick encodings by name. If we do this, we don't need to import the charset-specific module, and we can replace the last line of our CGI with:
    let enc = encodingFromString "ISO-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode enc "Cześć")
The encoding library also provides a pair of functions for converting character sets directly between two ByteStrings:
-- | This decodes a string from one encoding and encodes it into another.
recode :: (Encoding from,Encoding to) => from -> to -> ByteString -> ByteString

recodeLazy :: (Encoding from,Encoding to) => from -> to -> Lazy.ByteString -> Lazy.ByteString
The System.IO.Encoding module does not try to provide as many convenience functions as the similar module provided by utf8-string, providing only the generic hGetContents and hPutStr. However, it does provide a way of retrieving the current system's default encoding (when used on systems supporting POSIX.1-2001 nl_langinfo()), which utf8-string lacks.
-- | Like the normal 'System.IO.hGetContents', but decodes the input using an
--   encoding.
hGetContents :: Encoding e => e -> Handle -> IO String

-- | Like the normal 'System.IO.hPutStr', but encodes the output using an
--   encoding.
hPutStr :: Encoding e => e -> Handle -> String -> IO ()

-- | Returns the encoding used on the current system.
getSystemEncoding :: IO DynEncoding
As this library is native Haskell it is portable, and as it uses lazy ByteStrings it can be fast. While it does not (yet) provide as many character sets as your system's iconv(), it does support many of the most commonly used ones.

Notes

The libraries surveyed here are under fairly active maintenance, and there are rumours of unifying their implementations. Nevertheless the existing interfaces are fairly similar where common functionality exists. Historically, all serialized data was handled in Haskell as Strings, and there was a legitimate concern that transparently converting the character set of arbitrary Strings could mangle data. The newer ByteString and Binary interfaces may allow future Haskell standards to clearly disambiguate binary and textual data, and simply serialize Strings as UTF-8 by default. Although it might be nice to "simply" serialize Strings as UTF-8, show is the wrong place to do it. Haskell's Read/Show serialization serializes to String, which is a list of Char, ie. a list of abstract Unicode code points. Character set conversion should rather happen on conversion to [Word8], at which point byte values become significant. This also encompasses direct conversions to ByteString, and the internals of primitive IO functions such as:
putChar    :: Char -> IO ()
putChar    =  primPutChar

getChar    :: IO Char
getChar    =  primGetChar
, getContents, readFile, writeFile, and appendFile defined in the Haskell Prelude, and the various character IO functions on Handles defined in System.IO. Whether or not this conversion can be done everywhere transparently, and backwards-compatibly, is an open issue for Haskell Prime. Meanwhile these libraries provide useful interfaces for explicit [Word8] and ByteString conversion, and various IO wrappers.

Summary

Although all Haskell Strings are Unicode, Haskell98 does not specify a character set representation for their IO. Unicode strings can be written directly into Haskell source files and hence exist as data within a program, but character set conversion is required if you wish to read or write these Strings in files, user input or on the network. We looked at ways of dealing with Unicode in Haskell, surveyed some useful libraries and provided working examples. Although we might hope that a future version of Haskell will provide a way to handle UTF-8 conversions, in the meantime we need to choose an appropriate library for each project that handles Unicode text.

Updates

Fri Nov 16: Edited to incorporate some feedback from #haskell:
  • Thanks to Tim Newsham for clarifying GHC's default character encoding when printing Strings.
  • Thanks to Stefan N. O'Rear for pointing out that Show/Read is not the right place for serialization, but that it should instead occur on conversion to/from [Word8].

Labels: , ,

Saturday, 9 June 2007

Review: TagSoup

This week I've been playing with TagSoup, a Haskell library by Neil Mitchell and Henning Thielemann "for extracting information out of unstructured HTML code, sometimes known as tag-soup". This article introduces the basic usage of TagSoup, and discusses the functional approach to mining XML-like data.

  • Name: TagSoup
  • Version: darcs as at June 9, 2007; prior to release version 0.2
  • Functionality: Parsing possibly malformed HTML/XML
  • Inputs: String (also includes a URL -> IO String helper function)
  • License: BSD3

parseTags :: String -> [Tag]

The first thing you do with TagSoup is parse the document into a list of [Tag]. The Tag type is fairly general, and can represent the various things that can occur when reading an HTML document: an opening or closing tag, the text between tags (like <strong>this</strong>), comments, or special tags like <!DOCTYPE ...>. It is also used to mark the location of syntax errors, though of course these are not fatal as the whole point of the library is to robustly work around badly-formed input.

XML Parsing

TagSoup actually contains no HTML-specific code, other than that it knows about HTML entities. To demonstrate that it can be used for other kinds of possibly malformed XML, I added an example which extracts information from an RSS feed (now part of Example.hs):

-- rssCreators Example: prints names of story contributors on
-- sequence.complete.org. This content is RSS (not HTML), and the selected
-- tag uses a different XML namespace "dc:creator".

rssCreators :: IO [String]
rssCreators = do
    tags <- liftM parseTags $ openURL "http://sequence.complete.org/node/feed"
    return $ map names $ partitions (~== "dc:creator") tags
    where
      names xs = innerText $ xs !! 1

This function is of type IO [String]: it uses IO, and returns a list of Strings -- the names of contributors. The first line:

    tags <- liftM parseTags $ openURL "http://sequence.complete.org/node/feed"
uses openURL (part of TagSoup) to read the contents of the given RSS feed into a String, and then runs parseTags on that, calling the result tags.

Extracting information

Now that we can use the XML as a list of Tags, the second line:

    return $ map names $ partitions (~== "dc:creator") tags
splits it up into separate partitions, starting a new partition wherever there is a tag that roughly matches <... dc:creator ...>. It then runs the function names on each partition, and returns the result. names simply grabs the text inside the first thing in a partition, ie. the content of the <dc:creator> tag itself. Done:
*Example.Example> rssCreators 
["dons","dons","dons","jgoerzen","dons","dons","dons","dons","dons","dons"]

A more complex example, using an external HTTP library

Simon Peyton-Jones is a Free Software developer working on the GHC compiler at Microsoft Research in Cambridge, England. One of the examples given in Example.hs attempts to extract a list of his current research papers. Using TagSoup's convenient but simple HTTP library, it fails to terminate due to a hanging server. Here is a working version of that example using the new lazy ByteString version of the Haskell HTTP library:

-- compile with: ghc --make -o spj -Ldist/build -lHSHTTP1-3000.0.0 spj.hs

module Main where

import Text.HTML.TagSoup

import qualified Data.ByteString.Lazy.Char8 as BS
import Network.HTTP (rspBody)
import Network.HTTP.UserAgent as UA

spjPapers :: IO ()
spjPapers = do
        rsp <- UA.get "http://research.microsoft.com/~simonpj/"
        let tags = parseTags $ BS.unpack $ rspBody rsp
        let links = map f $ sections (~== "a") $
                    takeWhile (~/= TagOpen "a" [("name","haskell")]) $
                    drop 5 $ dropWhile (~/= TagOpen "a" [("name","current")]) tags
        putStr $ unlines links
    where
        f :: [Tag] -> String
        f = dequote . unwords . words . innerText . head . filter isTagText

        dequote ('\"':xs) | last xs == '\"' = init xs
        dequote x = x

main = spjPapers

This example is obviously a little more involved, but this is typical for a real-world example of scraping HTML -- the main content of the page is handwritten, and the enclosing content management system uses many non-standard elements and attributes.

The guts of that example is the links = ... declaration. It looks for the part of the page roughly between <a name="current"> and <a name="haskell">, breaks it up into sections (starting each section wherever there is a new <a ...> tag), and then runs some function f on each of those sections. To see this in more detail, we can read the declaration from the far end backwards: with the parsed list of tags :: [Tag], do the following:

  1. drop everything until you get an opening tag matching <a ... name="current" ...>
  2. drop the next 5 tags (no matter what they are)
  3. take everything until you get an opening tag matching <a ... name="haskell" ... >
  4. We now have a big list of tags (:: [Tag]): split it up into sections, starting each section wherever there is a new <a ...> tag. This will give us a list of lists of tags (:: [[Tag]]).
  5. run f on each section's [Tag].

And what about this magical function f? It takes the first item of TagText (text between tags) in a section, runs unwords . words on it to clean up the whitespace (by splitting it up into words, then joining the words back up with a single space between each), and finally removes any surrounding quotes if present. You could use a function like f to clean up tag text in any page you are scraping. Here, the final result is a nice, clean, plaintext list of the titles of Simon Peyton-Jones' current research papers:

Constructor specialisation for Haskell programs
Faster laziness using dynamic pointer tagging
Scrap your type applications
A History of Haskell: being lazy with class
...

Functions for extracting information

TagSoup provides a few useful functions for pulling apart web pages. Taken together with Haskell's list processing functions, the result is a very expressive, concise language for extracting information from web data. In the above examples, we've seen how various String and list handling functions (like lines, words, unlines, unwords, takeWhile, drop, filter, defined in the Haskell Prelude), can be used together with predicates like isTagText from TagSoup. TagSoup also provides operators for inexact matching, ~== and its negative ~/=. These allow you to do a loose match on tag contents, so that your HTML-scraping application has some resilience to minor changes in page generation.

Lastly, functions like sections make use of the above predicates to divide the page up into similar parts. As we have seen in the above examples, this is useful for the common case where a page contains a list of items, and we want to extract the same kind of information from each of those items. We can write a function f to handle one item, then simply map f across all the contents of the page.

Comparisons

The similarly-named Java TagSoup is "a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild". This tries to make badly formed HTML usable through a conventional SAX interface. This makes it useful for more general applications than the Haskell TagSoup, but far less expressive for the task of scraping a known web page.

The Python Beatiful Soup (and similarly its Ruby counterpart Rubyful Soup) "can turn even invalid markup into a parse tree." This takes quite a similar approach to the Haskell TagSoup, and also provides functions to print out the modified parse tree. It attempts to create a full DOM parse tree (using HTML-specific heuristics), so extracting information can involve walking the tree with syntax like head.nextSibling.contents[0].nextSibling. Nevertheless it provides content-based search functions like soup.find('p', align="center"), and with a few lambda functions can be quite expressive.

Conclusion

This TagSoup does one thing and does it very well: it provides a small set of useful abstractions for extracting information from HTML pages and other XML data, As the markup you are scraping is badly formed, TagSoup provides operators for inexact matching and does not attempt to coax the page into a conventional tree structure. The result is a very fast, list-based representation of page content which can be mined using TagSoup's Tag-specific functions and Haskell's usual list operations.

Labels: , ,

Sunday, 17 December 2006

Introductory Haskell Programming in the UNIX Environment

A few months back I was chatting to Don Stewart about scripting in Haskell, and he pointed me towards some Haskell shell scripts he's written.

This weekend, Don wrote some introductory tutorials. Part 1 introduces Haskell in a similar style to how the Camel book introduces Perl -- quite readable, and fairly low on mathematical jargon. Part 2 introduces character and file IO, which I'll dig into below.

Why bother?

It turns out that you can re-implement the core of many simple UNIX tools as one-liners in Haskell. This is interesting because, like C, Haskell compiles to a binary and runs like a real program. Its also interesting because, unlike C, Haskell provides lots of error checking, as well as guarantees against segfaults and memory leaks, for free.

Lazy evaluation

Consider the following implementation of cp (from Part 2), which copies its standard input to standard output:

import System.Environment

main = do
  [infile, outfile] <- getArgs
  s <- readFile infile 
  writeFile outfile s

Although this is pretty simple to understand, it looks like it reads the entire contents of the input file into the variable s, and then writes that to the output file. That would be a huge memory hog, so let's take a look at what's actually going on.

Haskell compiles to a binary, so we can strace the resulting program:

$ strace -o /tmp/cp.out ./cp bigfile.ogg /tmp/bigfile-copy.ogg
$ less /tmp/cp.out
...
read(3, "\300\23n\261\205\v\fD$\r\330,\260\2172Zp\241h\306<\216"..., 8192) = 8192
write(4, "\300\23n\261\205\v\fD$\r\330,\260\2172Zp\241h\306<\216"..., 8192) = 8192 
read(3, "\2646\353t\304\300\f9|\36\10|O@r|\3149\3\340v{4\366|\17"..., 8192) = 8192
write(4, "\2646\353t\304\300\f9|\36\10|O@r|\3149\3\340v{4\366|\17"..., 8192) = 8192
...

We see that it has actually set up an 8K temporary buffer to funnel data back and forth, keeping the memory requirements very low. So the code was not a memory hog at all, even though its pretty simple to understand.

The way this works is that s is not a normal String variable at all. It is an IO String, the embodiment of everything String-like in IO. It lives in a very beautiful, transient and continually changing state of interaction where it might read some chars, write some, read some, write some, and so on until EOF. This is all that an IO String could want from its brief yet pristine existence, and nothing more.

Pass the pipe

Giving our instance of this IO String a name is conceptually similar to the use of named pipes in shell scripts. A direct translation of the above Haskell script into sh might be:

#!/bin/sh

infile=$1
outfile=$2

s="${TMPDIR-/tmp}/$$.fifo"
mkfifo $s

cat < $s > $outfile &
cat < $infile > $s

rm $s

Of course, this example is trivial; you'd only use named pipes for more complex tasks, such as setting up transcoding pipelines, where you might not know the names or parameters of the commands to be run up front. So, what if your shell script doesn't need to be so complex? What if you don't need to name your intermediate pipe?

cat $infile | cat > $outfile

Well, that's fine in Haskell too:

readFile f >>= writeFile g

No more naming our intermediate IO String. But now we know that it's still there, lurking inside that little >>=. This uses lazy evaluation, and we read in the Camel book that laziness is the first virtue of a programmer; Haskell gives it to you in spades.

Labels: ,