blog.kfish.org

My name is Conrad Parker, and I live in Kyoto, Japan. I am working towards a PhD in Computer Science at Kyoto University, finishing September 2009. I also work on some free software projects including the Sweep sound editor and the Annodex media system, and various smaller projects which you can read about here.

Thursday, 15 November 2007

Survey: Haskell Unicode support

Haskell source is interpreted as UTF-8, but internally the data is stored as Unicode code points. However the generic show method does not serialize Strings as UTF-8 (when using GHC). So, when reading or writing documents it is necessary to introduce an explicit conversion from or to the desired character set. This article outlines how to use Unicode in Haskell, and surveys three alternatives for character set conversion: iconv, utf8-string and encoding, providing working examples for each.

Unicode in Haskell source

The Haskell Prime standardization wiki contains discussions of Unicode in Haskell Source, and of ways of handling Char as Unicode. In particular, GHC (as of release 6.6, October 2006) interprets source files as UTF-8. Hence the following is a valid source file:
import System.Time

main :: IO ()
main = do
  time <- getClockTime
  cal <- toCalendarTime time
  putStrLn $ dayName $ ctWDay cal

dayName :: Day -> String
dayName d = case d of
              Monday -> "月曜日"
              Tuesday -> "火曜日"
              Wednesday -> "水曜日"
              Thursday -> "木曜日"
              Friday -> "金曜日"
              Saturday -> "土曜日"
              Sunday -> "日曜日"
The dayName function provides the Japanese name for a given Day. However the main function, which tries to print that onto stdout, dumps it without any character set conversion, truncating each character to 8 bits. In order to control the output charset, we need to use a Unicode conversion library. The three libraries iconv, utf8-string and encoding have similar purposes but some different features.
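To make the truncation concrete, here is a minimal sketch using only the base library. The names truncate8 and utf8Char are hypothetical helpers written for this illustration and are not part of any library surveyed here; they contrast the low 8 bits of a code point with the UTF-8 byte sequence we would actually want on stdout.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Hypothetical helper: keep only the low 8 bits of a code point,
--   mimicking output that truncates each character to 8 bits.
truncate8 :: Char -> Word8
truncate8 = fromIntegral . ord

-- | Hypothetical helper: encode one code point as UTF-8 bytes.
--   This is a sketch; it does not reject surrogate code points.
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6, cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the leading byte (the caller ORs in the length marker).
    lead :: Int -> Word8
    lead s = fromIntegral (n `shiftR` s)
    -- A continuation byte: marker 10xxxxxx plus six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  let getsu = '\x6708'        -- U+6708, the first character of the Monday string
  print (truncate8 getsu)     -- 8
  print (utf8Char getsu)      -- [230,156,136]
```

The single byte 8 is what a truncating output routine emits for this character, whereas a UTF-8 terminal expects the three bytes E6 9C 88.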

iconv

Description: Binding to C iconv() function
Author: Duncan Coutts
darcs get: http://code.haskell.org/iconv/
Exports: Codec.Text.IConv
Interface: ByteString.Lazy
Advantages: Speed, coverage of charset support
Disadvantages: Portability: requires POSIX iconv()
This is a Haskell binding to the iconv() C library function, providing a lazy ByteString interface. The only module exported is Codec.Text.IConv, which provides a single function:
-- | Convert fromCharset toCharset input output
convert :: String -> String -> Lazy.ByteString -> Lazy.ByteString
where fromCharset and toCharset are the names of the input and output character set encodings, and input and output are the input and output text as lazy ByteStrings. An example program to convert the encoding of an input file, similar to the GNU iconv program, is given in examples/hiconv.hs. The core of that program is:
        output = convert (fromEncoding config) (toEncoding config) input
which is somewhat clearer than the brain-damaged interface exported by the C library. Exceptions are provided for handling unsupported conversions, invalid and incomplete characters. These errors can be silently ignored if desired by calling convertFuzzy instead. As this library wraps the system iconv() implementation, all character sets supported on the underlying system are available. The Lazy.ByteString interface works directly on the memory buffers used by the C library, which may give a speed advantage for large conversions. Note however that the iconv() C library function is defined by POSIX.1-2001 and may not be available on some older systems. In most such cases it should be possible to install GNU libiconv separately.
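As a small usage sketch of convert, the following re-encodes a single Polish character from UTF-8 to ISO-8859-2. It assumes the iconv package is installed and that the system iconv() recognises the charset names "UTF-8" and "ISO-8859-2"; the helper name utf8ToLatin2 is made up for this example.

```haskell
import qualified Data.ByteString.Lazy as L
import Codec.Text.IConv (convert)

-- | Hypothetical helper: re-encode UTF-8 bytes as ISO-8859-2.
--   The charset names are passed straight to the system iconv().
utf8ToLatin2 :: L.ByteString -> L.ByteString
utf8ToLatin2 = convert "UTF-8" "ISO-8859-2"

main :: IO ()
main = do
  -- U+015B (s with acute) is the two bytes C5 9B in UTF-8 ...
  let input = L.pack [0xC5, 0x9B]
  -- ... and the single byte B6 in ISO-8859-2.
  print (L.unpack (utf8ToLatin2 input))
```

Because the function works on lazy ByteStrings, the same helper could be handed to L.interact to convert stdin to stdout in a streaming fashion.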

utf8-string

Description: Simple UTF-8 conversion library
Author: Eric Mertens
darcs get: http://code.haskell.org/utf8-string/
Exports: Codec.Binary.UTF8.String, System.IO.UTF8
Interface: String
Advantages: Simplicity
Disadvantages: Only supports UTF-8 conversions
This library contains both a simple module for data conversion with a String interface, and a useful IO module. The String conversion module, Codec.Binary.UTF8.String, provides two pairs of complementary encoding and decoding functions:
-- | Encode a string using 'encode' and store the result in a 'String'.
encodeString :: String -> String

-- | Decode a string using 'decode' using a 'String' as input.
-- | This is not safe but it is necessary if UTF-8 encoded text
-- | has been loaded into a 'String' prior to being decoded.
decodeString :: String -> String

-- | Encode a Haskell String to a list of Word8 values, in UTF8 format.
encode :: String -> [Word8]

-- | Decode a UTF8 string packed into a list of Word8 values, directly to String
decode :: [Word8] -> String
I guess "not safe" in the comment for decodeString refers to type-safety; for example this function doesn't stop you from trying to decode the same text twice, whereas if you tried that with the plain decode function, the compiler would point out your bug for you. To see how this might look in the wild, the following is a complete "Hello World" web application (err, CGI script) in Japanese:
import Codec.Binary.UTF8.String
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=utf-8"
    output $ renderHtml $ h1 << encodeString "おはよう御座います!"
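Returning to the two pairs of conversion functions, the following sketch shows how they relate. It assumes the utf8-string package is installed and uses only the four functions listed earlier: encodeString produces the same byte values as encode, but carries them in a String, which is why the types cannot tell encoded and decoded text apart.

```haskell
import Codec.Binary.UTF8.String (decode, decodeString, encode, encodeString)
import Data.Char (ord)

main :: IO ()
main = do
  let s = "\x6708\x66DC\x65E5"   -- the Monday string from the first example
  -- encode yields a list of bytes, and the type says so.
  print (encode s)
  -- encodeString yields the same byte values, disguised as Chars.
  print (map ord (encodeString s) == map fromIntegral (encode s))
  -- Each decoder undoes its matching encoder.
  print (decode (encode s) == s)
  print (decodeString (encodeString s) == s)
```

With the [Word8] pair, applying decode to an already decoded String is a type error; with the String pair the compiler accepts it, so keeping track of which Strings hold bytes is left to the programmer.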
The utf8-string library also includes an entire IO module, System.IO.UTF8, exporting print, putStr, putStrLn, getLine, readLn, readFile, writeFile, appendFile, getContents, hGetLine, hGetContents, hPutStr, hPutStrLn. These essentially wrap the default IO functions in encodeString and decodeString, which you may find convenient if you are doing lots of UTF-8 processing. This library is tiny, and implemented natively in Haskell so there are no portability issues. As it works directly on Strings, with no intermediate representation, it should be sufficiently fast for practical purposes. Of course, if you need to do conversions to or from character sets other than UTF-8, you will need to use a different library.

encoding

Description: Native Haskell charset conversion library
Author: Henning Günther
darcs get: http://code.haskell.org/encoding/
Exports: Data.Encoding.*, System.IO.Encoding
Interface: ByteString.Lazy
Advantages: Portable; covers more charsets than utf8-string
Disadvantages: Covers fewer charsets than iconv
Data.Encoding provides native Haskell implementations for encoding and decoding of many common character sets: ASCII, UTF8, UTF16, UTF32, ISO8859[1-16], CP125[0-8], KOI8R, and GB18030, as well as BootString (for Punycode). For each of these, it implements an Encoding interface:
{- | Represents an encoding, supporting various methods of de- and encoding.
     Minimal complete definition: encode, decode
 -}
class Encoding enc where
        -- | Encode a 'String' into a strict 'ByteString'. Throws the
        --   'HasNoRepresentation'-Exception if it encounters an unrepresentable
        --   character.
        encode :: enc -> String -> ByteString
        -- | Encode a 'String' into a lazy 'Data.ByteString.Lazy.ByteString'.
        encodeLazy :: enc -> String -> LBS.ByteString
        encodeLazy e str = LBS.fromChunks [encode e str]
        -- | Whether or not the given 'Char' is representable in this encoding. Default: 'True'.
        encodable :: enc -> Char -> Bool
        encodable _ _ = True
        -- | Decode a strict 'ByteString' into a 'String'. If the string is not
        --   decodable, a 'DecodingException' is thrown.
        decode :: enc -> ByteString -> String
        decodeLazy :: enc -> LBS.ByteString -> String
        decodeLazy e str = concatMap (decode e) (LBS.toChunks str)
        -- | Whether or not a given 'ByteString' is decodable. Default: 'True'.
        decodable :: enc -> ByteString -> Bool
        decodable _ _ = True
Notice that this interface provides exceptions for handling unrepresentable characters. Instances of Encoding can be found by importing charset-specific modules; each simply exports a value with the same name as the module, e.g. Data.Encoding.ISO88592 exports ISO88592, whose type is an instance of Encoding. Here is a "Hello World" CGI in Polish, using ISO-8859-2:
import Data.Encoding
import Data.Encoding.ISO88592
import Data.ByteString.Char8 (unpack)
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=iso-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode ISO88592 "Cześć")
You'll notice the call to unpack, which converts the ByteString into a plain String as expected by Html. The encoding library also provides a way to select an encoding by name:
-- | Takes the name of an encoding and creates a dynamic encoding from it.
encodingFromString :: String -> DynEncoding
(Anything which is a DynEncoding is by definition an instance of Encoding). So we could choose the encoding at runtime, or we can just be lazy and pick encodings by name. If we do this, we don't need to import the charset-specific module, and we can replace the last line of our CGI with:
    let enc = encodingFromString "ISO-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode enc "Cześć")
The encoding library also provides a pair of functions for converting character sets directly between two ByteStrings:
-- | This decodes a string from one encoding and encodes it into another.
recode :: (Encoding from,Encoding to) => from -> to -> ByteString -> ByteString

recodeLazy :: (Encoding from,Encoding to) => from -> to -> Lazy.ByteString -> Lazy.ByteString
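The ideas in this section (a class of encodings, one value per charset, a dynamic wrapper chosen by name, and recoding as decode followed by encode) can be sketched in a few lines. This is a deliberately simplified stand-in, not the library's code: it returns Maybe instead of throwing exceptions, covers only two trivial one-byte charsets, and lookupEncoding is a hypothetical name that, unlike encodingFromString, reports unknown names with Nothing.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import qualified Data.ByteString as B
import Data.Char (chr, ord)

-- | Simplified stand-in for the Encoding class; Nothing plays the
--   role of the HasNoRepresentation exception.
class Encoding enc where
  encode :: enc -> String -> Maybe B.ByteString
  decode :: enc -> B.ByteString -> String

-- | One value per charset, named after the charset.
data ASCII = ASCII
data ISO88591 = ISO88591

-- | Encode when every code point is below the limit and so equals its byte.
encodeBelow :: Int -> String -> Maybe B.ByteString
encodeBelow limit = fmap B.pack . mapM toByte
  where
    toByte c
      | ord c < limit = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- | Decode bytes that equal their code points (no validation in this sketch).
decodeBytes :: B.ByteString -> String
decodeBytes = map (chr . fromIntegral) . B.unpack

instance Encoding ASCII where
  encode _ = encodeBelow 0x80
  decode _ = decodeBytes

instance Encoding ISO88591 where
  encode _ = encodeBelow 0x100
  decode _ = decodeBytes

-- | Existential wrapper: any encoding, chosen at runtime.
data DynEncoding = forall e. Encoding e => DynEncoding e

instance Encoding DynEncoding where
  encode (DynEncoding e) = encode e
  decode (DynEncoding e) = decode e

-- | Hypothetical lookup by name.
lookupEncoding :: String -> Maybe DynEncoding
lookupEncoding "ASCII"      = Just (DynEncoding ASCII)
lookupEncoding "ISO-8859-1" = Just (DynEncoding ISO88591)
lookupEncoding _            = Nothing

-- | Recoding is decoding from one charset, then encoding into another.
recode :: (Encoding from, Encoding to)
       => from -> to -> B.ByteString -> Maybe B.ByteString
recode from to = encode to . decode from

main :: IO ()
main = do
  let latin1 = B.pack [0x63, 0x61, 0x66, 0xE9]   -- "cafe" with e-acute, ISO-8859-1
  print (fmap B.unpack (recode ISO88591 ISO88591 latin1))  -- Just [99,97,102,233]
  print (fmap B.unpack (recode ISO88591 ASCII latin1))     -- Nothing
  case lookupEncoding "ISO-8859-1" of
    Just enc -> print (decode enc latin1 == "caf\xE9")     -- True
    Nothing  -> putStrLn "unknown encoding"
```

Because DynEncoding is itself an instance of the class, code written against the class works unchanged whether the charset is fixed at compile time or picked by name at runtime.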
The System.IO.Encoding module does not try to provide as many convenience functions as the similar module provided by utf8-string, providing only the generic hGetContents and hPutStr. However, it does provide a way of retrieving the current system's default encoding (when used on systems supporting POSIX.1-2001 nl_langinfo()), which utf8-string lacks.
-- | Like the normal 'System.IO.hGetContents', but decodes the input using an
--   encoding.
hGetContents :: Encoding e => e -> Handle -> IO String

-- | Like the normal 'System.IO.hPutStr', but encodes the output using an
--   encoding.
hPutStr :: Encoding e => e -> Handle -> String -> IO ()

-- | Returns the encoding used on the current system.
getSystemEncoding :: IO DynEncoding
As this library is native Haskell it is portable, and as it uses lazy ByteStrings it can be fast. While it does not (yet) provide as many character sets as your system's iconv(), it does support many of the most commonly used ones.

Notes

The libraries surveyed here are under fairly active maintenance, and there are rumours of unifying their implementations. Nevertheless, the existing interfaces are fairly similar where common functionality exists.

Historically, all serialized data was handled in Haskell as Strings, and there was a legitimate concern that transparently converting the character set of arbitrary Strings could mangle data. The newer ByteString and Binary interfaces may allow future Haskell standards to clearly disambiguate binary and textual data, and simply serialize Strings as UTF-8 by default.

Although it might be nice to "simply" serialize Strings as UTF-8, show is the wrong place to do it. Haskell's Read/Show serialization serializes to String, which is a list of Char, i.e. a list of abstract Unicode code points. Character set conversion should rather happen on conversion to [Word8], at which point byte values become significant. This also encompasses direct conversions to ByteString, and the internals of primitive IO functions such as:
putChar    :: Char -> IO ()
putChar    =  primPutChar

getChar    :: IO Char
getChar    =  primGetChar
The same holds for getContents, readFile, writeFile, and appendFile defined in the Haskell Prelude, and for the various character IO functions on Handles defined in System.IO. Whether or not this conversion can be done everywhere transparently, and backwards-compatibly, is an open issue for Haskell Prime. Meanwhile these libraries provide useful interfaces for explicit [Word8] and ByteString conversion, and various IO wrappers.
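A short base-only sketch illustrates why Show/Read sits at the level of code points rather than bytes: show escapes non-ASCII characters numerically, so its result is a plain String that read can recover exactly, and no character set is involved until that String is turned into bytes.

```haskell
import Data.Char (isAscii)

main :: IO ()
main = do
  let s     = "\x6708\x66DC\x65E5"   -- the Monday string from the first example
      shown = show s
  putStrLn shown                 -- "\26376\26332\26085"
  print (all isAscii shown)      -- True
  print (read shown == s)        -- True
```

The round trip through show and read never touches an encoding; a conversion such as UTF-8 only becomes necessary when the String is written out as [Word8] or a ByteString.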

Summary

Although all Haskell Strings are Unicode, Haskell98 does not specify a character set representation for their IO. Unicode strings can be written directly into Haskell source files and hence exist as data within a program, but character set conversion is required if you wish to read or write these Strings in files, user input or on the network. We looked at ways of dealing with Unicode in Haskell, surveyed some useful libraries and provided working examples. Although we might hope that a future version of Haskell will provide a way to handle UTF-8 conversions, in the meantime we need to choose an appropriate library for each project that handles Unicode text.

Updates

Fri Nov 16: Edited to incorporate some feedback from #haskell:
  • Thanks to Tim Newsham for clarifying GHC's default character encoding when printing Strings.
  • Thanks to Stefan N. O'Rear for pointing out that Show/Read is not the right place for serialization, but that it should instead occur on conversion to/from [Word8].


Friday, 9 November 2007

Release: libfishsound 0.8.1

libfishsound provides a simple programming interface for decoding and encoding audio data using Xiph.Org codecs (Vorbis and Speex). libfishsound 0.8.1 is a maintenance release, fixing a build error when configured with encoding disabled. Full documentation of the FishSound API, customization and installation, and complete examples of Ogg Vorbis and Speex decoding and encoding are provided.