Haskell source is interpreted as UTF-8, but internally the data is stored as Unicode code points. However, the generic show method does not serialize Strings as UTF-8 (when using GHC).
So, when reading or writing documents, it is necessary to introduce an explicit conversion from or to the desired character set. This article outlines how to use Unicode in Haskell, and surveys three alternatives for character set conversion: iconv, utf8-string and encoding, providing working examples for each.
Unicode in Haskell source
The Haskell Prime standardization wiki contains discussions of Unicode in Haskell Source, and of ways of handling Char as Unicode.
In particular, GHC (as of release 6.6, October 2006) interprets source files as UTF-8. Hence the following is a valid source file:
import System.Time

main :: IO ()
main = do
    time <- getClockTime
    cal <- toCalendarTime time
    putStrLn $ dayName $ ctWDay cal

dayName :: Day -> String
dayName d = case d of
    Monday    -> "月曜日"
    Tuesday   -> "火曜日"
    Wednesday -> "水曜日"
    Thursday  -> "木曜日"
    Friday    -> "金曜日"
    Saturday  -> "土曜日"
    Sunday    -> "日曜日"
The dayName function provides the Japanese name for a given Day. However, the main function, which tries to print that onto stdout, dumps it without any character set conversion, truncating each character to 8 bits. In order to control the output charset, we need to use a Unicode conversion library. The three libraries iconv, utf8-string and encoding have similar purposes but some different features.
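To make the difference concrete, here is a minimal sketch using only base. It contrasts truncating a code point to 8 bits with encoding it as UTF-8. The names utf8Char and truncate8 are hypothetical helpers written for this illustration; they are not part of any of the libraries discussed, and the encoder does no validation (for example of surrogate code points).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Keep only the low 8 bits of a code point (what an unconverted dump does).
truncate8 :: Char -> Word8
truncate8 = fromIntegral . ord

-- | Encode one code point as UTF-8 bytes (hand-rolled, illustration only).
utf8Char :: Char -> [Word8]
utf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- leading bits of the code point, placed in the first byte
    hi s = fromIntegral (n `shiftR` s)
    -- a continuation byte: 10xxxxxx carrying six bits of the code point
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

main :: IO ()
main = do
  print (map truncate8 "月")      -- [8]
  print (concatMap utf8Char "月") -- [230,156,136]
```

The character 月 is U+6708, so truncation leaves the single byte 0x08, while UTF-8 needs the three bytes E6 9C 88.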
iconv
| Description: | Binding to C iconv() function |
| Author: | Duncan Coutts |
| darcs get | http://code.haskell.org/iconv/ |
| Exports: | Codec.Text.IConv |
| Interface: | ByteString.Lazy |
| Advantages: | Speed, coverage of charset support |
| Disadvantages: | Portability: requires POSIX iconv() |
This is a Haskell binding to the iconv() C library function, providing a lazy ByteString interface. The only module exported is Codec.Text.IConv, whose core function is:
-- | Convert fromCharset toCharset input output
convert :: String -> String -> Lazy.ByteString -> Lazy.ByteString
where fromCharset and toCharset are the names of the input and output character set encodings, and input and output are the input and output text as lazy ByteStrings.
An example program to convert the encoding of an input file, similar to the GNU iconv program, is given in examples/hiconv.hs. The core of that program is:
output = convert (fromEncoding config) (toEncoding config) input
which is somewhat clearer than the brain-damaged interface exported by the C library. Exceptions are provided for handling unsupported conversions, and invalid or incomplete characters. These errors can be silently ignored if desired by calling convertFuzzy instead.
As this library wraps the system iconv() implementation, all character sets supported on the underlying system are available. The Lazy.ByteString interface works directly on the memory buffers used by the C library, which may give a speed advantage for large conversions.
Note however that the iconv() C library function is defined by POSIX.1-2001 and may not be available on some older systems. In most such cases it should be possible to install GNU libiconv separately.
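As a minimal sketch of how convert might be used as a stdin-to-stdout filter: the helper name latin1ToUtf8 is hypothetical, and the charset names "ISO-8859-1" and "UTF-8" are assumed to be recognised by the system's iconv(), which is usual but depends on the platform.

```haskell
import qualified Data.ByteString.Lazy as L
import Codec.Text.IConv (convert)

-- | Re-encode ISO-8859-1 bytes as UTF-8.
-- The charset names are passed straight through to the system iconv().
latin1ToUtf8 :: L.ByteString -> L.ByteString
latin1ToUtf8 = convert "ISO-8859-1" "UTF-8"

main :: IO ()
main = L.interact latin1ToUtf8
```

Because the input and output are lazy ByteStrings, a filter like this should be able to process large inputs in chunks rather than loading them whole.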
utf8-string
| Description: | Simple UTF-8 conversion library |
| Author: | Eric Mertens |
| darcs get | http://code.haskell.org/utf8-string/ |
| Exports: | Codec.Binary.UTF8.String, System.IO.UTF8 |
| Interface: | String |
| Advantages: | Simplicity |
| Disadvantages: | Only supports UTF-8 conversions |
This library contains both a simple module for data conversion with a String interface, and a useful IO module.
The String conversion module, Codec.Binary.UTF8.String, provides two pairs of complementary encoding and decoding functions:
-- | Encode a string using 'encode' and store the result in a 'String'.
encodeString :: String -> String
-- | Decode a string using 'decode' using a 'String' as input.
-- | This is not safe but it is necessary if UTF-8 encoded text
-- | has been loaded into a 'String' prior to being decoded.
decodeString :: String -> String
-- | Encode a Haskell String to a list of Word8 values, in UTF8 format.
encode :: String -> [Word8]
-- | Decode a UTF8 string packed into a list of Word8 values, directly to String
decode :: [Word8] -> String
I guess "not safe" in the comment for decodeString refers to type-safety; for example, this function doesn't stop you from trying to decode the same text twice, whereas if you tried that with the plain decode function, the compiler would point out your bug for you.
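The type-safety point can be sketched with the four functions quoted above. This is only an illustration, assuming a utf8-string version that exports them from Codec.Binary.UTF8.String as shown; the sample string is arbitrary.

```haskell
import Codec.Binary.UTF8.String (decode, decodeString, encode, encodeString)

main :: IO ()
main = do
  let s = "é"  -- a single code point, U+00E9
  print (encode s)                               -- [195,169]
  print (decode (encode s) == s)                 -- True
  print (decodeString (encodeString s) == s)     -- True
  -- String -> String lets a double encoding slip through unnoticed:
  print (length (encodeString (encodeString s))) -- 4
  -- The same mistake with the [Word8] version does not type-check:
  -- encode (encode s)
```

With encodeString, both the text and its encoded form are Strings, so applying it twice compiles and silently produces four bytes where two were intended.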
To see how this might look in the wild, the following is a complete "Hello World" web application (err, CGI script) in Japanese:
import Codec.Binary.UTF8.String
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=utf-8"
    output $ renderHtml $ h1 << encodeString "おはよう御座います!"
The utf8-string library also includes an entire IO module, System.IO.UTF8, exporting print, putStr, putStrLn, getLine, readLn, readFile, writeFile, appendFile, getContents, hGetLine, hGetContents, hPutStr, hPutStrLn. These essentially wrap the default IO functions in encodeString and decodeString, which you may find convenient if you are doing lots of UTF-8 processing.
This library is tiny, and implemented natively in Haskell so there are no portability issues. It works on plain Strings and lists of Word8 rather than ByteStrings, which should still be sufficiently fast for most practical purposes. Of course, if you need to do conversions to or from character sets other than UTF-8, you will need to use a different library.
encoding
| Description: | Native Haskell charset conversion library |
| Author: | Henning Günther |
| darcs get | http://code.haskell.org/encoding/ |
| Exports: | Data.Encoding.*, System.IO.Encoding |
| Interface: | ByteString.Lazy |
| Advantages: | Portable; covers more charsets than utf8-string |
| Disadvantages: | Covers fewer charsets than iconv |
Data.Encoding provides native Haskell implementations for encoding and decoding of many common character sets: ASCII, UTF8, UTF16, UTF32, ISO8859[1-16], CP125[0-8], KOI8R, and GB18030, as well as BootString (for Punycode). For each of these, it implements an Encoding interface:
{- | Represents an encoding, supporting various methods of de- and encoding.
     Minimal complete definition: encode, decode
-}
class Encoding enc where
    -- | Encode a 'String' into a strict 'ByteString'. Throws the
    --   'HasNoRepresentation'-Exception if it encounters an unrepresentable
    --   character.
    encode :: enc -> String -> ByteString
    -- | Encode a 'String' into a lazy 'Data.ByteString.Lazy.ByteString'.
    encodeLazy :: enc -> String -> LBS.ByteString
    encodeLazy e str = LBS.fromChunks [encode e str]
    -- | Whether or not the given 'Char' is representable in this encoding. Default: 'True'.
    encodable :: enc -> Char -> Bool
    encodable _ _ = True
    -- | Decode a strict 'ByteString' into a 'String'. If the string is not
    --   decodable, a 'DecodingException' is thrown.
    decode :: enc -> ByteString -> String
    decodeLazy :: enc -> LBS.ByteString -> String
    decodeLazy e str = concatMap (decode e) (LBS.toChunks str)
    -- | Whether or not a given 'ByteString' is decodable. Default: 'True'.
    decodable :: enc -> ByteString -> Bool
    decodable _ _ = True
Notice that this interface provides exceptions for handling unrepresentable characters.
Instances of Encoding can be found by importing charset-specific modules; each simply exports a value with the same name as the module, e.g. Data.Encoding.ISO88592 exports ISO88592, which is an instance of Encoding. Here is a "Hello World" CGI in Polish, using ISO-8859-2:
import Data.Encoding
import Data.Encoding.ISO88592
import Data.ByteString.Char8 (unpack)
import Network.CGI hiding (Html)
import Text.Html

main :: IO ()
main = runCGI $ handleErrors cgiMain

cgiMain :: CGI CGIResult
cgiMain = do
    setHeader "Content-Type" "text/html; charset=iso-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode ISO88592 "Cześć")
You'll notice the call to unpack, which converts the ByteString into a plain String as expected by Html.
The encoding library also provides a way to select an encoding by name:
-- | Takes the name of an encoding and creates a dynamic encoding from it.
encodingFromString :: String -> DynEncoding
(Anything which is a DynEncoding is by definition an instance of Encoding). So we could choose the encoding at runtime, or we can just be lazy and pick encodings by name. If we do this, we don't need to import the charset-specific module, and we can replace the last line of our CGI with:
    let enc = encodingFromString "ISO-8859-2"
    output $ renderHtml $ h1 << (unpack $ encode enc "Cześć")
The encoding library also provides a pair of functions for converting character sets directly between two ByteStrings:
-- | This decodes a string from one encoding and encodes it into another.
recode :: (Encoding from,Encoding to) => from -> to -> ByteString -> ByteString
recodeLazy :: (Encoding from,Encoding to) => from -> to -> Lazy.ByteString -> Lazy.ByteString
The System.IO.Encoding module does not try to provide as many convenience functions as the similar module provided by utf8-string, providing only the generic hGetContents and hPutStr. However, it does provide a way of retrieving the current system's default encoding (when used on systems supporting POSIX.1-2001 nl_langinfo()), which utf8-string lacks.
-- | Like the normal 'System.IO.hGetContents', but decodes the input using an
-- encoding.
hGetContents :: Encoding e => e -> Handle -> IO String
-- | Like the normal 'System.IO.hPutStr', but encodes the output using an
-- encoding.
hPutStr :: Encoding e => e -> Handle -> String -> IO ()
-- | Returns the encoding used on the current system.
getSystemEncoding :: IO DynEncoding
As this library is native Haskell it is portable, and as it uses lazy ByteStrings it can be fast. While it does not (yet) provide as many character sets as your system's iconv(), it does support many of the most commonly used ones.
Notes
The libraries surveyed here are under fairly active maintenance, and there are rumours of unifying their implementations. Nevertheless the existing interfaces are fairly similar where common functionality exists.
Historically, all serialized data was handled in Haskell as Strings, and there was a legitimate concern that transparently converting the character set of arbitrary Strings could mangle data.
The newer ByteString and Binary interfaces may allow future Haskell standards to clearly disambiguate binary and textual data, and simply serialize Strings as UTF-8 by default.
Although it might be nice to "simply" serialize Strings as UTF-8, show is the wrong place to do it. Haskell's Read/Show serialization serializes to String, which is a list of Char, i.e. a list of abstract Unicode code points. Character set conversion should rather happen on conversion to [Word8], at which point byte values become significant. This also encompasses direct conversions to ByteString, and the internals of primitive IO functions such as:
putChar :: Char -> IO ()
putChar = primPutChar
getChar :: IO Char
getChar = primGetChar
as well as getContents, readFile, writeFile, and appendFile defined in the Haskell Prelude, and the various character IO functions on Handles defined in System.IO.
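A small base-only sketch may help show why show belongs to the String world rather than the byte world. The name sample is hypothetical, and the escaped output shown is what GHC's Show instance for String produces.

```haskell
import Data.Char (ord)

-- | A sample String containing two non-ASCII code points.
sample :: String
sample = "日本"

main :: IO ()
main = do
  -- show yields another String of ASCII Chars, using numeric escapes
  putStrLn (show sample)            -- "\26085\26412"
  -- read inverts show without any notion of bytes or charsets
  print (read (show sample) == sample) -- True
  -- the String itself is just a list of code points
  print (map ord sample)            -- [26085,26412]
```

No byte values appear anywhere in this round trip, which is why the charset decision can be deferred until the String is turned into [Word8] or a ByteString.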
Whether or not this conversion can be done everywhere transparently, and backwards-compatibly, is an open issue for Haskell Prime. Meanwhile these libraries provide useful interfaces for explicit [Word8] and ByteString conversion, and various IO wrappers.
Summary
Although all Haskell Strings are Unicode, Haskell98 does not specify a character set representation for their IO. Unicode strings can be written directly into Haskell source files and hence exist as data within a program, but character set conversion is required if you wish to read or write these Strings in files, user input or on the network.
We looked at ways of dealing with Unicode in Haskell, surveyed some useful libraries and provided working examples. Although we might hope that a future version of Haskell will provide a way to handle UTF-8 conversions, in the meantime we need to choose an appropriate library for each project that handles Unicode text.
Updates
Fri Nov 16: Edited to incorporate some feedback from #haskell:
- Thanks to Tim Newsham for clarifying GHC's default character encoding when printing Strings.
- Thanks to Stefan N. O'Rear for pointing out that Show/Read is not the right place for serialization, but that it should instead occur on conversion to/from [Word8].