blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I am working towards a PhD in Computer Science at Kyoto University, finishing September 2009. I also work on some free software projects including the Sweep sound editor and the Annodex media system, and various smaller projects which you can read about here.

Sunday, 17 December 2006

Introductory Haskell Programming in the UNIX Environment

A few months back I was chatting to Don Stewart about scripting in Haskell, and he pointed me towards some Haskell shell scripts he's written.

This weekend, Don wrote some introductory tutorials. Part 1 introduces Haskell in a similar style to how the Camel book introduces Perl -- quite readable, and fairly low on mathematical jargon. Part 2 introduces character and file IO, which I'll dig into below.

Why bother?

It turns out that you can re-implement the core of many simple UNIX tools as one-liners in Haskell. This is interesting because, like C, Haskell compiles to a binary and runs like a real program. It's also interesting because, unlike C, Haskell provides lots of error checking, as well as protection against segfaults and most memory-management errors, for free.
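
As a rough illustration, and not something taken from the tutorials, the core of wc -l might be sketched like this. The helper name countLines is made up for the example; interact and lines are standard Prelude functions.

```haskell
-- A sketch of the core of `wc -l`: count the lines arriving on standard input.
-- `countLines` is a hypothetical helper name used only for this example.
countLines :: String -> String
countLines = (++ "\n") . show . length . lines

main :: IO ()
main = interact countLines
```

Saved as wcl.hs (an assumed file name), running printf 'a\nb\n' | runghc wcl.hs prints 2.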

Lazy evaluation

Consider the following implementation of cp (from Part 2), which copies the file named by its first argument to the file named by its second:

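import System.Environment (getArgs)

-- Copy the file named by the first argument to the file named by the second.
main :: IO ()
main = do
  [infile, outfile] <- getArgs   -- expects exactly two arguments
  s <- readFile infile           -- bind the file's contents to s
  writeFile outfile s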

Although this is pretty simple to understand, it looks like it reads the entire contents of the input file into the variable s, and then writes that to the output file. That would be a huge memory hog, so let's take a look at what's actually going on.

Haskell compiles to a binary, so we can strace the resulting program:

$ strace -o /tmp/cp.out ./cp bigfile.ogg /tmp/bigfile-copy.ogg
$ less /tmp/cp.out
...
read(3, "\300\23n\261\205\v\fD$\r\330,\260\2172Zp\241h\306<\216"..., 8192) = 8192
write(4, "\300\23n\261\205\v\fD$\r\330,\260\2172Zp\241h\306<\216"..., 8192) = 8192 
read(3, "\2646\353t\304\300\f9|\36\10|O@r|\3149\3\340v{4\366|\17"..., 8192) = 8192
write(4, "\2646\353t\304\300\f9|\36\10|O@r|\3149\3\340v{4\366|\17"..., 8192) = 8192
...

We see that it actually reads and writes the data in 8192-byte chunks, passing each one through a small buffer, so the memory requirements stay very low. The code is not a memory hog at all, even though it's pretty simple to understand.

The way this works is that s is not an ordinary, fully loaded string at all. The action readFile infile has type IO String, and the String it hands back is lazy: its characters are only read from the file when something asks for them. As writeFile consumes s, it lives in a very beautiful, transient and continually changing state of interaction where it might read some chars, write some, read some, write some, and so on until EOF. This is all that a lazy String could want from its brief yet pristine existence, and nothing more.
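
To see the same on-demand behaviour without touching the file system, here is a minimal sketch that assumes nothing beyond the Prelude. The helper firstLines is a made-up name, and the endless input merely stands in for a very large file.

```haskell
-- Keep only the first n lines of a string. Because a String is a lazy list
-- of Char, only as much of the input is forced as those lines require.
-- `firstLines` is a hypothetical helper used for illustration.
firstLines :: Int -> String -> String
firstLines n = unlines . take n . lines

main :: IO ()
main = do
  -- An infinite input: the program still terminates, because the rest of
  -- the string is never demanded.
  let endless = cycle "spam\n"
  putStr (firstLines 3 endless)
```

The program prints three lines of spam and exits. A strict string could never hold the endless input, just as it could not hold a file larger than memory.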

Pass the pipe

Giving this lazily read string a name is conceptually similar to the use of named pipes in shell scripts. A direct translation of the above Haskell program into sh might be:

#!/bin/sh

infile=$1
outfile=$2

s="${TMPDIR-/tmp}/$$.fifo"
mkfifo "$s"

cat < "$s" > "$outfile" &
cat < "$infile" > "$s"

rm "$s"

Of course, this example is trivial; you'd only use named pipes for more complex tasks, such as setting up transcoding pipelines, where you might not know the names or parameters of the commands to be run up front. So, what if your shell script doesn't need to be so complex? What if you don't need to name your intermediate pipe?

cat "$infile" | cat > "$outfile"

Well, that's fine in Haskell too:

readFile f >>= writeFile g

No more naming our intermediate string. But now we know that it's still there, lurking inside that little >>=. This uses lazy evaluation, and we read in the Camel book that laziness is the first virtue of a programmer; Haskell gives it to you in spades.
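
The named and unnamed versions can be put side by side as a small sketch. The function names copyNamed and copyPiped are invented for this example; the do block is just notation for the same >>= that the one-liner spells out.

```haskell
import System.Environment (getArgs)

-- The named version, as in the cp example: the contents are bound to s.
-- `copyNamed` is a hypothetical name for illustration.
copyNamed :: FilePath -> FilePath -> IO ()
copyNamed infile outfile = do
  s <- readFile infile
  writeFile outfile s

-- The same action with the name removed: >>= hands the String from
-- readFile straight to writeFile. `copyPiped` is likewise hypothetical.
copyPiped :: FilePath -> FilePath -> IO ()
copyPiped infile outfile = readFile infile >>= writeFile outfile

main :: IO ()
main = do
  [infile, outfile] <- getArgs   -- expects exactly two arguments
  copyPiped infile outfile
```

Either function can be called from main; both should produce the same output file.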
