The ultimate guide to Haskell Strings
May 7, 2024, Posted by Julian Ospald

This guide is aimed at Haskellers who seek to improve their understanding of String types, be it beginners or seasoned developers. It is also meant to be a quick reference/cheat sheet for deciding which string type to use in a given situation.
TOC
- Motivation
- String in Prelude
- Unicode
- Back to Haskell String type
- String types
- Lazy vs Strict
- Slicable vs non-slicable
- Pinned vs unpinned
- String Types Cheat Sheet
- Construction
- Conversions
- A word on lazy IO
- Streaming
- A note on FilePaths
- Reflection
- Special thanks to
- Links and relevant stuff
Motivation
In 2022 I implemented the Abstract FilePath proposal, which led to several new String types, such as OsString.
At the time of writing, I'm also serving on the Core Libraries Committee, which oversees the base API. In the context of base, there have been recurring discussions about String types.
When discussing this topic with other Haskellers, I realized it can indeed be quite confusing and we don’t have comprehensive, over-arching documentation. After all, there is no equivalent of The Rust book.
I hope this blog post can fill some of the documentation gaps and also explain the newly introduced types and why I think that we don’t have too many String types.
String in Prelude
The most widely used String type in Haskell is defined by the Haskell Standard in Chapter 9 Standard Prelude:
-- Lists
data [a] = [] | a : [a]
-- Not legal Haskell; for illustration only
-- Character type
data Char = ... 'a' | 'b' ... -- Unicode values
type String = [Char]
Since lists are one of the most idiomatic data types in Haskell, this allows us to easily pattern match on strings, because they are just a list of characters. E.g. the following function returns the first character of a string and its remainder, or Nothing if the list is empty.
uncons :: [a] -> Maybe (a, [a])
uncons []     = Nothing
uncons (x:xs) = Just (x, xs)
Char
If we look closely at the pseudo-code definition of Char from the Haskell standard, we notice the comment saying -- Unicode values. This is, in fact, a bit vague. If we look at the documentation in Data.Char from base, we see that it is actually implemented as a Unicode Code Point.
This can be seen in the smart constructor chr as well:
chr :: Int -> Char
chr i@(I# i#)
 | isTrue# (int2Word# i# `leWord#` 0x10FFFF##) = C# (chr# i#)
 | otherwise
    = errorWithoutStackTrace ("Prelude.chr: bad argument: " ++ showSignedInt (I# 9#) i "")
So Char is basically just an Int with an upper bound of 0x10FFFF. In order to understand this, we have to take a short dive into Unicode.
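A quick ghci session illustrates this correspondence, using ord and chr from Data.Char:
ghci> import Data.Char (chr, ord)
ghci> ord 'a'
97
ghci> ord '쟬'
51180
ghci> chr 51180
'\51180'
ghci> chr 0x110000
*** Exception: Prelude.chr: bad argument: 1114112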
Unicode
The Unicode Standard is a standard to identify and encode visible “characters” that comprise “text”, supporting all of the world’s major writing systems.
The exact terminology can be very confusing. We’ll focus only on a couple of core concepts. If you want to read up yourself on the standard, here are some pointers:
- the actual standard: https://www.unicode.org/versions/latest/
- Wikipedia Unicode article: https://en.wikipedia.org/wiki/Unicode
- brief Unicode introduction for JavaScript developers: https://exploringjs.com/impatient-js/ch_unicode.html
The goal of Unicode is to be universal, efficient and unambiguous. In order to achieve that, it needs:
- a character encoding: translating e.g. a or 쟬 to something unambiguous
- a text encoding: translating a sequence of characters into an efficient byte format
The term “character” is quite overloaded and we will go through different definitions along the way.
Unicode Code Point
Unicode Code Points are a way of encoding a single character through numerical values. They range over the hexadecimal values 0 to 10FFFF, which we saw before in the definition of chr :: Int -> Char. The formal notation of code points is U+0000 to U+10FFFF.
It is essentially a static mapping, e.g.:
character | code point |
---|---|
a | U+0061 |
b | U+0062 |
쟬 | U+C7EC |
🇯 | U+1F1EF |
🇵 | U+1F1F5 |
🇯🇵 | U+1F1EF, U+1F1F5 |
This allows us a couple of observations:
- the hex values 61 for a and 62 for b correspond to the ASCII character set (cool)
- it can express Chinese and other non-Latin characters
- some “characters” (in this case actually emoji) are expressed by multiple code points, such as 🇯🇵
However, this is just a mapping for a single character. In order to efficiently represent a whole text, several Unicode Transformation Formats were developed, most notably:
- UTF-32
- UTF-16
- UTF-8
Such transformation formats are necessary to understand code point boundaries in a sequence of bytes and make searching and splitting feasible. UTF-16 and UTF-8 are also optimized for size.
UTF-32
The simplest encoding for text would be to just use the code point values. The issue with this is that the maximum code point value is U+10FFFF, which only fits into 21 bits.
UTF-32 is a fixed-length encoding that uses 32 bits (four bytes) and as such can hold all possible Unicode values without any actual transformation.
The upside of this is that it’s simple, the downside is that it’s wasting space, because most values don’t need the whole 21 bits (e.g. ASCII just needs 7 bits).
UTF-32 is not ASCII compatible, meaning a program that only understands ASCII won't accidentally work with UTF-32 text, even if all of the characters used are in the ASCII set (e.g. only Latin characters from [a-zA-Z]).
UTF-16
This is a variable-width character encoding, most notably used on Windows.
Code points from U+0000 to U+FFFF are expressed "directly" via 2 bytes (16 bits), with the exception of surrogates, which I will explain later.
Code points from U+10000 to U+10FFFF don't fit into 2 bytes. In order to encode these without accidental ambiguity, surrogates were introduced (another option would have been magic bits as used by UTF-8, but I guess the format wasn't designed with extension in mind). These surrogates must always come in pairs (so 4 bytes) and are in the following ranges:
- low surrogates: U+DC00 to U+DFFF
- high surrogates: U+D800 to U+DBFF
Through bit shuffling, these 2-byte pairs map to values in the U+10000 to U+10FFFF range. For the interested reader, the algorithm is as follows (cited from Wikipedia; a small Haskell sketch follows the list):
- 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U’) in the hex number range 0x00000–0xFFFFF.
- The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
- The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF
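As an illustration, here is a minimal Haskell sketch of that algorithm (the function name is my own, not from any library):
import Data.Bits (shiftR, (.&.))
import Data.Word (Word16)

-- Encode a code point above U+FFFF as a UTF-16 surrogate pair.
toSurrogatePair :: Int -> (Word16, Word16)
toSurrogatePair cp =
  let u  = cp - 0x10000                           -- 20-bit number U'
      hi = 0xD800 + fromIntegral (u `shiftR` 10)  -- high ten bits
      lo = 0xDC00 + fromIntegral (u .&. 0x3FF)    -- low ten bits
  in (hi, lo)

-- toSurrogatePair 0x1F1EF == (0xD83C, 0xDDEF), matching the table below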
UTF-16 is not ASCII compatible either. It is more space efficient than UTF-32 though. For some languages, it can even be more space efficient than UTF-8.
Unicode Scalar Values
It is important to understand that the Haskell Char type (which is essentially a Code Point) can represent the surrogates that are used in UTF-16.
The Unicode standard also defines the concept of Unicode Scalar Values:
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
So: code points without surrogates. This will become relevant for UTF-8.
UTF-8
Like UTF-16, this is a variable-width character encoding. It's often used in web APIs (most notably JSON) and is often the default on Unix systems.
Here, a Unicode Code Point is represented by a sequence of bytes. The number of bytes required depends on the range of the code point and varies between 1 and 4 bytes. The whole bit conversion between code point and UTF-8 is illustrated in the following table (adapted from Wikipedia):
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | |||
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | ||
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
U+010000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Here we see a different technique than surrogates. UTF-8 uses magic bits in the first byte to signal how many bytes in total must be read for translating to a code point.
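To make the table concrete, here is a minimal Haskell sketch of the encoding direction (my own helper, not a library function; it assumes the input is a valid Unicode Scalar Value):
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Encode a single Unicode Scalar Value to its UTF-8 byte sequence.
utf8Encode :: Int -> [Word8]
utf8Encode cp
  | cp < 0x80    = [byte cp]
  | cp < 0x800   = [0xC0 .|. byte (cp `shiftR` 6), cont cp]
  | cp < 0x10000 = [0xE0 .|. byte (cp `shiftR` 12), cont (cp `shiftR` 6), cont cp]
  | otherwise    = [ 0xF0 .|. byte (cp `shiftR` 18)
                   , cont (cp `shiftR` 12), cont (cp `shiftR` 6), cont cp ]
 where
  byte :: Int -> Word8
  byte = fromIntegral
  cont x = 0x80 .|. (byte x .&. 0x3F)  -- continuation byte: 10xxxxxx

-- utf8Encode 0xC7EC == [0xEC, 0x9F, 0xAC], matching the table above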
Notable properties of UTF-8 are:
- it is ASCII backwards compatible: a program written for UTF-8 will also understand plain ASCII encoding
- Unicode code points in the surrogate range U+D800 to U+DFFF are considered invalid byte sequences
  - as a result: UTF-8 only expresses Unicode Scalar Values
Unicode summary
Given the above encodings, let’s have another look at our table from above:
character | code point | Hex UTF-8 | Hex UTF-16 | Hex UTF-32 |
---|---|---|---|---|
a | U+0061 | 61 | 0061 | 00000061 |
b | U+0062 | 62 | 0062 | 00000062 |
쟬 | U+C7EC | ec 9f ac | c7ec | 0000c7ec |
🇯 | U+1F1EF | f0 9f 87 af | d83c ddef | 0001f1ef |
🇵 | U+1F1F5 | f0 9f 87 b5 | d83c ddf5 | 0001f1f5 |
🇯🇵 | U+1F1EF, U+1F1F5 | f0 9f 87 af, f0 9f 87 b5 | d83c ddef, d83c ddf5 | 0001f1ef, 0001f1f5 |
The interested reader is welcome to verify those values (at least for UTF-8 and UTF-16).
We now understand:
- the character encoding is the mapping of code points to visible characters
- UTF-8, UTF-16 and UTF-32 are text encodings with different trade offs
- surrogates are a special case for UTF-16 (Unicode Scalar Values = Unicode Code Points - surrogates)
Going back to the definition of “character”, we now see the confusion:
- a surrogate can hardly be a visible character
- the visible character 🇯🇵 needs two code points to be expressed (and there are many others)
This has led to yet another definition: "Grapheme Cluster". This is specified by the Unicode Standard Annex #29, which deals with determining boundaries between characters, words and sentences. It is, again, quite technical, but much closer to "user visible character".
Back to Haskell String type
Now that we know what a Unicode Code Point is, we also understand that the Haskell String type has essentially no text encoding. It is just a linked list of those code points (a subset of Int, in fact). This can be a nice property, e.g. as an intermediate representation when converting between encodings (say, UTF-8 to UTF-16).
However, it is a questionable default for a String type, because:
- it is inefficient for large text (it carries the overhead of a linked list with thunks for every Char); the haddock documentation of Data.String goes into more detail
- it is often confusing for users who don't have a good mental model of what a Unicode Code Point is
- it causes problems for certain conversions (e.g. String -> Text), because of surrogates (it should have been Unicode Scalar Values instead, or maybe even Grapheme Clusters)
Unfortunately, since it’s defined by the Haskell Standard and has been around since the beginning of time, we won’t be able to get rid of it ever.
This type should only be used for small projects, prototypes, hello worlds and maybe intermediate representations in some algorithms.
The Show instance of Char/String will print the Unicode Code Point value as a decimal for non-ASCII ranges:
ghci> "a"
"a"
ghci> "쟬"
"\51180"
Show is for debugging, so that seems fine. However, this behavior has been challenged before: Proposal: showLitChar (and show @Char) shouldn't escape readable Unicode characters.
String types
In this section, we will examine each string-like type, its properties and its use cases. String was already discussed and we don't recommend it, so it's omitted here.
- proper Unicode text
  - Text (strict and lazy)
  - ShortText
- byte sequences
  - ByteString (strict and lazy)
  - ShortByteString
  - Bytes
  - Chunks
- byte sequences dealing with platform API differences
  - OsString, PosixString and WindowsString
- FFI types
  - CString and CStringLen
- filepath types (just type synonyms)
  - FilePath
  - OsPath, PosixPath and WindowsPath
If we delve more into filepaths, there are actually even more, e.g. strongly typed filepaths. But those are out of scope.
Text
If you are not sure what you need, you most likely want Text
from the text package, which is shipped with GHC. This type is meant for human readable Unicode text and has all the primitives you need. The API is in fact more complete than the one for String
, containing functions like stripPrefix
and toLower
.
Internally, Text uses a UTF-8 encoded byte array (since version 2.0; it was UTF-16 before). So it is always guaranteed to be valid Unicode.
The current definition for strict Text
is (as of 2.1.1):
-- | A space efficient, packed, unboxed Unicode text type.
data Text = Text
{-# UNPACK #-} !A.Array -- ^ bytearray encoded as UTF-8
{-# UNPACK #-} !Int -- ^ offset in bytes (not in Char!), pointing to a start of UTF-8 sequence
{-# UNPACK #-} !Int -- ^ length in bytes (not in Char!), pointing to an end of UTF-8 sequence
As we can see here, this type allows efficient slicing to avoid unnecessary memcpy for many operations. E.g. init and tail are O(1) in time and space. splitAt is O(1) in space, but O(n) in time, because UTF-8 complicates the offset computation (remember, a Unicode Code Point can take anywhere between 1 and 4 bytes in UTF-8).
We explain more about this later in Slicable vs non-slicable.
The lazy Text variant is as follows:
data Text = Empty
| Chunk {-# UNPACK #-} !T.Text Text
This has the same structure as a list, and as such can also be potentially streamed in constant space or allow the GC to clean up unused chunks after splitting/slicing.
Text cannot represent surrogates: it is a sequence of Unicode Scalar Values. Invalid values are silently converted to the replacement character U+FFFD when using e.g. pack. You might be thinking that's not a problem... but I have to disappoint you. There is a reason String allows surrogates: PEP-383. This is an abomination, and base uses it: on Unix, it uses getFileSystemEncoding and mkTextEncoding to pick a round-trippable encoding for filepaths. E.g. if your locale returns en_US.UTF-8, you'll get a UTF-8//ROUNDTRIP TextEncoding, which is based on PEP-383: invalid bytes get translated to a special representation (lone surrogates) in order to be round-tripped. This has been described in my blog post Fixing 'FilePath' in Haskell.
Text summary
Invariants:
- is always Unicode
- never encodes surrogates (uses replacement char U+FFFD)
- unpinned memory (can be moved by the GC at any time, see the Pinned vs unpinned section)
- strict and lazy variants
Useful for:
- anything that fits ASCII or Unicode
- large human readable text processing that requires efficient formats
- complex Unicode handling via advanced libraries such as text-icu
- quite efficient slicing
Not so useful for:
- dealing with C FFI
- trying to store or deal with non-Unicode encodings
- dealing with filepaths
- lots of small Unicode texts
Lazy variants are useful for streaming and incremental processing, as the strict variant requires the whole content to be in memory.
ShortText
This is an alternative Unicode text type that is meant for lots of small text sequences. It is part of the text-short package. The definition is as follows:
newtype ShortText = ShortText ShortByteString
So there is no length or offset field. This means it has all the same properties as an unpinned ShortByteString, except that the data is guaranteed to be valid UTF-8.
ShortText summary
Invariants:
- is always Unicode
- never encodes surrogates (uses replacement char U+FFFD)
- unpinned memory (can be moved by the GC at any time)
- strict
Useful for:
- anything that fits ASCII or Unicode
- lots of small text sequences
Not so useful for:
- using with the text-icu package, which expects Text
- efficient slicing
- dealing with C FFI
- trying to store or deal with non-Unicode encodings
- dealing with filepaths
ByteString
This is a low level type from the bytestring package, shipped with GHC. It is just a sequence of bytes and carries no encoding information. It uses pinned memory (see the Pinned vs unpinned section). As such, it doesn't require copying when dealing with the FFI, and it is often the more desirable type when interacting with FFI; see the GHC user guide.
ByteString is quite efficient and has a large API, but (obviously) lacks text processing facilities, because it has no knowledge of Unicode (or other textual formats). Most operations work on Word8 boundaries.
The definition for strict ByteString is (as of 0.12.1.0):
data ByteString = BS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
{-# UNPACK #-} !Int -- length
This allows, similar to Text, slicing without copying memory (through pointer arithmetic and the length field). Since we're not dealing with Unicode, but just Word8 boundaries, operations like splitAt are O(1) in time and space. We don't need an offset field, because we can just advance the pointer instead.
And the lazy counterpart, which looks similar to lazy Text:
data ByteString = Empty
| Chunk {-# UNPACK #-} !S.StrictByteString ByteString
There is an API variant Data.ByteString.Char8, which allows operations to work on Char boundaries. However, it can be misleading to newcomers, because it actually truncates all Chars to 8 bits. You should avoid it, unless you know what you are doing. More likely you are looking for decoding libraries, where you can specify which encoding to use, e.g. bytestring-encoding.
It also has to be noted that pinned memory can cause memory fragmentation for lots of small ByteStrings (this is also discussed in Fixing 'FilePath' in Haskell). An alternative type is ShortByteString, which will be discussed next.
ByteString summary
Invariants:
- pinned memory
- strict and lazy variants
Useful for:
- large data
- very efficient slicing
- dealing with raw bytes (e.g. web servers)
- dealing with C FFI
- storing non-Unicode encodings e.g. via a newtype wrapper
- fast parsers, see the excellent blog post from Chris Done on Fast Haskell: Competing with C at parsing XML
Not so useful for:
- dealing with Unicode or human readable text
- dealing with lots of small byte sequences
Lazy variants, again, are useful for streaming and incremental processing, as the strict variant requires the whole content to be in memory.
ShortByteString
This type is from the bytestring package as well and lives under Data.ByteString.Short.
It has had the same API as ByteString since 0.11.3.0, so it can be used as a drop-in replacement. The main difference is that it is usually backed by unpinned memory, so it causes no heap fragmentation. It is possible to construct it pinned via the internal API, but slicing operations like splitAt will return unpinned byte strings.
The definition as of 0.12.1.0 is:
newtype ShortByteString =
  ShortByteString
  { unShortByteString :: ByteArray
  }
This makes it suitable for things like Unix filepaths. But we will explore better filepath types later.
The name is maybe a little misleading. It can very well be used for large data too, if you don't mind its strictness (the whole content is always in memory). However, this type does not allow slicing, unlike Text and ByteString, and so a lot of operations cause a memcpy. This, however, has the advantage that we save at least 2 words compared to e.g. Text, because we don't need an offset or length field.
If you want a similar type, but with slicing capability, use Bytes.
Interfacing with C FFI triggers memory copy as well, because we need pinned memory.
There is no lazy variant.
ShortByteString summary
Invariants:
- unpinned memory (when using the default API)
- always strict
Useful for:
- lots of small to medium sized byte sequences
- large data, if strictness is desired and efficient slicing not needed
- dealing with C FFI (although it incurs memcpy)
- storing non-Unicode encodings e.g. via a newtype wrapper
Not so useful for:
- dealing with Unicode or human readable text
- fast parsers, because no lazy variant and no efficient slicing
Bytes
This type is from the byteslice package and lives under Data.Bytes. It is not shipped with GHC.
It is essentially a ShortByteString with 0-copy slicing (init, splitAt, etc.). It can be constructed as a pinned or unpinned byte sequence, and all the usual operations on it will maintain that invariant.
The definition as of 0.2.13.2 is:
data Bytes = Bytes
  { array :: {-# UNPACK #-} !ByteArray
  , offset :: {-# UNPACK #-} !Int
  , length :: {-# UNPACK #-} !Int
  }
This is exactly the same definition as the Text type, but it does not maintain UTF-8. It uses ByteArray like ShortByteString does. Compared to ShortByteString, however, we have three words more memory overhead.
The API allows converting to ByteString and ShortByteString. Depending on whether the value is pinned or unpinned, sliced or unsliced, those may be 0-copy operations as well.
There’s another variant called Chunks
in Data.Bytes.Chunks:
data Chunks
= ChunksCons {-# UNPACK #-} !Bytes !Chunks
| ChunksNil
Although this is quite similar to how lazy Text is defined, this type is not lazy at all. It has bang patterns on both the value and the recursion, so it is spine-strict.
The only real use case the Chunks type has is when you want to avoid the overhead of constantly appending ByteArrays, because you're e.g. reading a file incrementally.
Bytes summary
Invariants:
- can be both pinned or unpinned
- is always strict
Useful for:
- when you want an unpinned strict ByteString…
- or a slicable ShortByteString
- dealing with C FFI
- parsers, if we don’t mind strictness
Not so useful for:
- dealing with Unicode or human readable text
OsString, PosixString and WindowsString
These are relatively new types, which were first added to filepath-1.4.100.0 as part of a user-space implementation of the Abstract FilePath Proposal. More details here.
Starting with filepath-1.5.0.0, the types were moved to a new home in the os-string package.
These types are meant to abstract over platform differences and their encodings when dealing with operating system API. It is similar to the Rust type OsString, but the implementation is quite different.
Simplified, the Haskell definitions are:
-- | Commonly used Windows string as wide character bytes.
newtype WindowsString = WindowsString ShortByteString
-- | Commonly used Posix string as uninterpreted @char[]@ array.
newtype PosixString = PosixString ShortByteString
-- | Newtype representing short operating system specific strings.
--
-- Internally this is either 'WindowsString' or 'PosixString',
-- depending on the platform. Both use unpinned
-- 'ShortByteString' for efficiency.
newtype OsString = OsString
#if defined(mingw32_HOST_OS)
WindowsString
#else
PosixString
#endif
As we can see, on Unix we're basically dealing with Word8 sequences (char[]), but on Windows we're dealing with Word16 sequences (wchar_t*).
The constructors are internal, and due to the CPP it is impossible to pattern match on the wrong platform's constructor in OsString.
OsString provides a rich API just like ByteString.
This allows packages like unix and Win32 to provide alternatives to String, where the bytes received from operating system API are not transformed, decoded or otherwise round-tripped; they are passed through unchanged.
And at the same time, we are able to write safe, platform-agnostic code utilizing OsString.
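For instance, a sketch against the OsPath API of the directory package (assuming directory >= 1.3.8):
{-# LANGUAGE QuasiQuotes #-}

import System.Directory.OsPath (listDirectory)
import System.OsPath (OsPath, osp, decodeUtf)

-- The same code compiles and runs on both Unix and Windows:
printEntries :: IO ()
printEntries = do
  entries <- listDirectory [osp|.|]
  mapM_ (\e -> decodeUtf e >>= putStrLn) entries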
This strategy has been used for filepaths: the unix package uses PosixString, the Win32 package uses WindowsString, and the platform-agnostic directory and file-io packages use OsString, combining the APIs of Unix and Windows. More information on this, with examples and API explanation, can be found here.
It is not restricted to filepaths, but may be extended to dealing with environment variables, program arguments and other bits of operating system API. It is always safer than String and more type safe than ByteString.
OsString, PosixString and WindowsString summary
Invariants:
- unpinned memory
- OsString abstracts over platforms
- PosixString is char array
- WindowsString is wide char array
Useful for:
- writing type safe operating system APIs
- while maintaining the original bytes without decoding
- abstracting over Unix and Windows
- making minimal assumptions on underlying encodings
Not so useful for:
- very large data
- data that is not platform specific or doesn’t originate from operating system API
- efficient slicing
OsPath, PosixPath and WindowsPath
These are equivalent to OsString, PosixString and WindowsString and are part of the filepath package as of 1.4.100.0. They are just type synonyms:
-- | FilePath for Windows.
type WindowsPath = WindowsString
-- | FilePath for posix systems.
type PosixPath = PosixString
-- | Abstract filepath, depending on current platform.
-- Matching on the wrong constructor is a compile-time error.
type OsPath = OsString
Use them whenever you can with the new filepath API. Refer to the Fixing Haskell filepaths blog post for more details.
CString and CStringLen
These are part of base and are low-level FFI types.
The definitions are very straightforward:
-- | A C string is a reference to an array of C characters terminated by NUL.
type CString = Ptr CChar
-- | A string with explicit length information in bytes instead of a
-- terminating NUL (allowing NUL characters in the middle of the string).
type CStringLen = (Ptr CChar, Int)
The haddock also explains the expected properties.
As an interesting edge case: if you're converting from ByteString to CString and happen to have a NUL byte in your ByteString, then useAsCString will over-allocate bytes:
useAsCString :: ByteString -> (CString -> IO a) -> IO a
useAsCString (BS fp l) action =
  allocaBytes (l+1) $ \buf -> do
    unsafeWithForeignPtr fp $ \p -> copyBytes buf p l
    pokeByteOff buf l (0::Word8)
    action (castPtr buf)
So it can make sense, in some cases, to check your ByteString for NUL bytes.
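A minimal sketch of such a check (BS.notElem is real bytestring API; the helper name is mine):
import qualified Data.ByteString as BS

-- True if the ByteString survives the round-trip through a
-- NUL-terminated CString without silent truncation.
isCStringSafe :: BS.ByteString -> Bool
isCStringSafe = BS.notElem 0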
We won’t dive into the Haskell C FFI, but this is literally the only proper use case. Refer to the wikibook article on Haskell FFI.
FilePath
This type is a legacy filepath type, but is still the most widespread across the ecosystem at the time of writing. It is part of the filepath package, which is also shipped with GHC.
The definition is:
type FilePath = String
This is not a very good choice for filepaths. Use the new OsPath instead.
Lazy vs Strict
The properties of lazy vs strict variants of Text and ByteString might already be obvious to many Haskellers:
- Lazy:
- can be streamed and incrementally processed, potentially in constant space
- can allow the GC to clean up unused chunks after slicing/splitting
- can express infinite data streams
- slightly less efficient in terms of time complexity, depending on number of chunks (compared to their strict counterparts)
- can work with lazy IO (more on that later)
- Strict:
- is the most efficient in terms of time complexity
- is always forced into memory
- has less overhead than lazy types
A lot of the time, people use lazy types in conjunction with lazy IO. However, another use case is to use Builders, which exist for both Text and ByteString.
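As a small sketch of the Builder approach (using the real Data.ByteString.Builder API; the helper names are mine):
import Data.ByteString.Builder (Builder, intDec, stringUtf8, toLazyByteString)
import qualified Data.ByteString.Lazy as BL

-- Builders support cheap appends; the lazy ByteString chunks are
-- materialized once at the end instead of via repeated copying.
row :: Int -> Builder
row n = stringUtf8 "row number: " <> intDec n <> stringUtf8 "\n"

table :: BL.ByteString
table = toLazyByteString (foldMap row [1 .. 1000])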
In general, streaming libraries can be a more elegant and performant alternative to lazy Text/ByteString. We talk about that later in the chapter Streaming. But since much of the ecosystem uses lazy types, these are still relevant for practical purposes.
Slicable vs non-slicable
All strings are slicable, but some strings can slice without copying data. Compare e.g. Text and ShortText:
data Text = Text
{-# UNPACK #-} !A.Array -- ^ bytearray encoded as UTF-8
{-# UNPACK #-} !Int -- ^ offset in bytes (not in Char!), pointing to a start of UTF-8 sequence
{-# UNPACK #-} !Int -- ^ length in bytes (not in Char!), pointing to an end of UTF-8 sequence
newtype ShortText = ShortText ShortByteString
E.g. when we call splitAt on a Text value, we get back two new Text values that just differ in the "offset" and "length" fields, but can point at the same byte array. If we slice a lot, this can save a lot of memcpy, especially on large data.
This means that slicing comes at two costs. First, if we split a text in half, the memory of the original byte array can’t be cleaned up by the GC. We just changed the offset and length fields, nothing else. This can be alleviated by using explicit copy operations when you don’t need the whole data anymore, e.g. via Data.Text.copy.
Second, we carry two unboxed Ints around for the "offset" and "length" fields, which is 2 words of "overhead". For more information on boxed and unboxed types, see the GHC user guide.
ShortText, in contrast, will on e.g. splitAt create two new byte arrays and copy the data. Here we're not only saving two words of memory overhead (no offset and length fields), but also getting a bit less indirection at runtime and a bit less memory pressure (which might help fit into the CPU cache), as explained in this comment.
As such, as the names of the types suggest, a simplified criterion could be:
- slicable type: if you have large strings or need a lot of slicing
- non-slicable type: if you have relatively short strings or don’t need a lot of slicing
In the end, only profiling can really tell which one is better.
Pinned vs unpinned
Pinned memory means it can’t be moved by the GC. This is useful if we want to move the data directly to foreign code (FFI), without first copying the entire unpinned data to a pinned memory region at the FFI boundary. But it also means that we get memory fragmentation, exactly because the GC cannot move stuff around. If you have lots of small pieces of data with pinned memory, that can severely fragment the heap.
This and the problems it can cause is explained in more detail in the Well-Typed blog Understanding Memory Fragmentation.
The problem of memory fragmentation was also one of the things that motivated the original Abstract FilePath proposal and later the new OsPath type.
String Types Cheat Sheet
A few notes on the below table:
- Unicode aware means whether we have access to text processing functions (e.g. split by Unicode Code Point etc.)
- memory overhead means: total words required, excluding the payload
- the overhead for lazy types is multiplied by the number of chunks
- some types are unpinned by default (e.g. ShortByteString) but can manually be constructed as pinned via internal API
The memory overhead measurements are best effort and explained in more detail in this gist.
Type | purpose | Unicode aware | internal representation | memory overhead | pinned | slicing | FFI suitable | streaming |
---|---|---|---|---|---|---|---|---|
String | simplicity | yes | List of Unicode Code Points | 4 words per char + 1 word | no | -- | -- | yes |
Text | human readable text | yes | UTF-8 byte array | 7 words | no | + | - | no |
Lazy Text | human readable text | yes | List of chunks of UTF-8 byte arrays | 9 words per chunk + 1 word | no | + | - | yes |
ShortText | short human readable texts | yes | UTF-8 byte array | 4 words | no | - | - | no |
ByteString | large byte sequences | no | Word8 byte array (pointer) | 10 words | yes | ++ | ++ | no |
Lazy ByteString | large byte sequences | no | List of chunks of Word8 byte arrays | 12 words per chunk + 1 word | yes | ++ | ++ | yes |
ShortByteString | short byte sequences | no | Word8 byte array | 4 words | no | - | + | no |
Bytes | slicable ShortByteString / pinned ByteString | no | Word8 byte array | 7 words | both | ++ | + | no |
Chunks | Like “Bytes”, but for incremental building | no | List of chunks of Word8 byte arrays | 9 words per chunk + 1 word | both | ++ | + | no |
OsString | interfacing with OS API | no | Word8 or Word16 byte array | 4 words | no | - | + | no |
Construction
Now that we know about the different types, we will take a quick look at different ways to construct strings.
String literals
The Haskell report defines Character and String literals as part of the language.
Whenever you write "string" in a Haskell file, the compiler will convert it to/consider it as [Char]. Likewise, 'c' will be considered a Char.
String Classes
A popular String class is IsString, which is defined as:
class IsString a where
fromString :: String -> a
So this allows converting from String to some other compatible type. Note how the type signature does not allow failure, so the conversion must be total.
Text, ByteString and ShortByteString have IsString instances. OsString does not. All these instances have problems though:
- Text: as explained earlier, surrogate Unicode Code Points in a String cannot be converted to Text, so you'll end up with the replacement char U+FFFD
- ByteString/ShortByteString: these instances truncate to 8 bits and are as such arguably broken, see Surprising behavior of ByteString literals via IsString
My personal recommendation is to stay away from this class and use explicit functions like pack instead. However, we could also use QuasiQuoters (more on that later).
OverloadedStrings
This language extension extends the support for string literals to all types that have an IsString instance. This can be convenient when dealing with lots of Text literals. However, it poses two problems:
- it can make type inference harder (since literals are not merely String), so sometimes a type annotation is necessary
- the caveats explained for the IsString class apply here as well: ByteString doesn't behave well
Example use:
{-# LANGUAGE OverloadedStrings #-}

myText = "hello world" :: Text
I personally advise against using it.
QuasiQuoters
This is yet another method to construct string-like types, an alternative to literals. It uses Template Haskell, which essentially runs expressions at compile time. This allows us to validate literals much more rigorously and have GHC fail at compile time if we attempt to e.g. construct an invalid UTF-8 sequence as Text.
There are many libraries that support quasiquotation. Lots of them also support interpolation (using Haskell expressions/variables inside the string).
I personally prefer string-interpolate
. The README gives a nice comparison to some other libraries (copy-pasted for convenience):
| | string-interpolate | interpolate | formatting | Interpolation | interpolatedstring-perl6 | neat-interpolation |
---|---|---|---|---|---|---|
String/Text support | ✅ | ✅ | ✅ | ⚠️ | ✅ | ⚠️ |
ByteString support | ✅ | ✅ | ❌ | ⚠️ | ✅ | ❌ |
Can interpolate arbitrary Show instances | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
Unicode-aware | ✅ | ❌ | ⚠️ | ❌ | ❌ | ⚠️ |
Multiline strings | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Indentation handling | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
Whitespace/newline chomping | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
An example use case:
showWelcomeMessage :: Text -> Integer -> Text
showWelcomeMessage username visits =
  [i|Welcome to my website, #{username}! You are visitor #{visits}!|]
It is important to note that having many quasi-quotations in your source files can slow down compilation time. There are also (sometimes) issues with tooling, such as code formatters or Haskell Language Server.
The OsString type provides its own quasi-quoter, osstr. The main advantage, again, is that quasi-quoters can properly fail, and do so at compile time.
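A small usage sketch (the literal is validated at compile time):
{-# LANGUAGE QuasiQuotes #-}

import System.OsString (OsString, osstr)

-- An invalid literal would be rejected by GHC at compile time,
-- instead of failing (or being silently replaced) at runtime.
configFile :: OsString
configFile = [osstr|config.yaml|]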
Conversions
There are many ways to convert from one type to another. I present here the safest conversions. In some cases, we have to provide the encoding explicitly, because it cannot be guessed.
The Data.ByteString.Encoding module listed further down below is part of the bytestring-encoding package, which is not shipped with GHC. There are other similar packages, like utf8-string.
We’re omitting ShortText
, because conversions are similar to Text
. Bytes
can be converted to ByteString
or ShortByteString
depending on the pinned/unpinned nature and from there we can follow the below strategies.
From String to…
Let’s write a neat conversion module:
module StringConversions where
import Data.ByteString (ByteString)
import Data.ByteString.Encoding (TextEncoding)
import Data.ByteString.Short (ShortByteString)
import Data.Text (Text)
import System.OsString
import System.OsString.Encoding (EncodingException)
import qualified Data.ByteString.Encoding as BE
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T
import qualified System.OsString as OS
toString :: String -> String
toString = id

toText :: String -> Text
toText = T.pack

toByteString :: TextEncoding -> String -> ByteString
toByteString encoding = BE.encode encoding . T.pack

toShortByteString :: TextEncoding -> String -> ShortByteString
toShortByteString encoding = SBS.toShort . BE.encode encoding . T.pack

toOsString :: (TextEncoding, TextEncoding) -> String -> Either EncodingException OsString
toOsString (unixEncoding, windowsEncoding) = OS.encodeWith unixEncoding windowsEncoding
For converting to ByteString and ShortByteString, we have to explicitly specify an encoding for the resulting byte sequence. For OsString we have to provide encodings per platform, since this type is platform agnostic.
The caveat wrt. Text's pack not dealing well with surrogates applies here as well.
From Text to…
module TextConversions where
import Data.ByteString (ByteString)
import Data.ByteString.Encoding (TextEncoding)
import Data.ByteString.Short (ShortByteString)
import Data.Text (Text)
import System.OsString
import System.OsString.Encoding (EncodingException)
import qualified Data.ByteString.Encoding as BE
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T
import qualified System.OsString as OS
toString :: Text -> String
toString = T.unpack

toText :: Text -> Text
toText = id

toByteString :: TextEncoding -> Text -> ByteString
toByteString encoding = BE.encode encoding

toShortByteString :: TextEncoding -> Text -> ShortByteString
toShortByteString encoding = SBS.toShort . BE.encode encoding

toOsString :: (TextEncoding, TextEncoding) -> Text -> Either EncodingException OsString
toOsString (unixEncoding, windowsEncoding) = OS.encodeWith unixEncoding windowsEncoding . T.unpack
When converting from Text, we can essentially reuse all the API that deals with just String, and vice versa.
From ByteString to…
module ByteStringConversions where
import Data.ByteString (ByteString)
import Data.ByteString.Encoding (TextEncoding)
import Data.ByteString.Short (ShortByteString)
import Data.Text (Text)
import System.OsString
import System.OsString.Encoding (EncodingException)
import qualified Data.ByteString.Encoding as BE
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T
import qualified System.OsString as OS
toString :: TextEncoding -> ByteString -> String
toString encoding = T.unpack . BE.decode encoding

toText :: TextEncoding -> ByteString -> Text
toText encoding = BE.decode encoding

toByteString :: ByteString -> ByteString
toByteString = id

toShortByteString :: ByteString -> ShortByteString
toShortByteString = SBS.toShort

-- | This is hard to write correctly. It depends on where the @ByteString@
-- comes from. It may not be possible to interpret it on both platforms.
-- @OsString@ is meant to interface with operating system API, not to manually
-- construct arbitrary strings. Use the @osstr@ quasi quoter if you need
-- literals. Or look at the internals in 'System.OsString.Internal.Types'.
toOsString :: ByteString -> OsString
toOsString = undefined
For converting to String and Text, we have to provide an encoding for the ByteString in order to decode it.
Converting from a byte sequence of unknown origin to OsString is hard. The way this usually happens is at the FFI boundaries in the Win32 and unix packages. The question is what the given byte sequence represents... where does it come from, what is its encoding, if any? If it comes from operating system API, we can just wrap it into our types, see System.OsString.Internal.Types. Otherwise, we may need to decode the bytes first and then pick a target encoding.
From ShortByteString to…
module ShortByteStringConversions where
import Data.ByteString (ByteString)
import Data.ByteString.Encoding (TextEncoding)
import Data.ByteString.Short (ShortByteString)
import Data.Text (Text)
import System.OsString
import System.OsString.Encoding (EncodingException)
import qualified Data.ByteString.Encoding as BE
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T
import qualified System.OsString as OS
toString :: TextEncoding -> ShortByteString -> String
toString encoding = T.unpack . BE.decode encoding . SBS.fromShort

toText :: TextEncoding -> ShortByteString -> Text
toText encoding = BE.decode encoding . SBS.fromShort

toByteString :: ShortByteString -> ByteString
toByteString = SBS.fromShort

toShortByteString :: ShortByteString -> ShortByteString
toShortByteString = id

-- | This is hard to write correctly. It depends on where the @ShortByteString@
-- comes from. It may not be possible to interpret it on both platforms.
-- @OsString@ is meant to interface with operating system API, not to manually
-- construct arbitrary strings. Use the @osstr@ quasi quoter if you need
-- literals. Or look at the internals in 'System.OsString.Internal.Types'.
toOsString :: ShortByteString -> OsString
toOsString = undefined
The same caveats as for ByteString apply.
From OsString to…
module OsStringConversions where
import Control.Monad.Catch (MonadThrow)
import Data.ByteString (ByteString)
import Data.ByteString.Encoding (TextEncoding)
import Data.ByteString.Short (ShortByteString)
import Data.Text (Text)
import System.OsString
import System.OsString.Encoding (EncodingException)
import qualified Data.ByteString.Encoding as BE
import qualified Data.ByteString.Short as SBS
import qualified Data.Text as T
import qualified System.OsString as OS
toString :: MonadThrow m => OsString -> m String
toString = OS.decodeUtf

toText :: MonadThrow m => OsString -> m Text
toText = fmap T.pack . OS.decodeUtf

-- | It depends whether we want the original bytes passed unchanged
-- and platform specific, or whether we want to convert to a unified
-- representation that is the same on both platforms, but in ByteString
-- format.
toByteString :: OsString -> ByteString
toByteString = undefined

-- | Same as 'toByteString'.
toShortByteString :: OsString -> ShortByteString
toShortByteString = undefined

toOsString :: OsString -> OsString
toOsString = id
OsString always comes with 3 families of decoding and encoding functions:
- encodeUtf/decodeUtf: assumes UTF-8 on Unix and UTF-16 LE on Windows
  - we are using this in the code above for simplicity
- encodeWith/decodeWith: here we have to pass the encoding for both platforms explicitly
- encodeFS/decodeFS: this mimics the behavior of the base library, using PEP-383 style encoding on Unix and permissive UTF-16 on Windows
To JSON
A lot of times we want to send our strings over the wire, possibly via JSON. We will examine this via the popular aeson library.
Both Text and String already have ToJSON instances. These are easy, because they are Unicode and JSON demands UTF-8.
For ByteString, ShortByteString and OsString this gets a bit more complicated. It depends on the exact use case: what is the byte sequence used for on the machine receiving the JSON? Also see the discussion Add saner ByteString instances on the aeson issue tracker.
From my perspective, there are 3 possibilities:
- convert to String (e.g. by assuming UTF-8 or UTF-16), use the existing ToJSON instance and hope the receiver knows how to interpret the data
- if you're dealing with binary data, you can convert to e.g. a base64 String or Text and then again use the existing instances (there's the base64-bytestring-type library that does this via a newtype)
- convert the byte sequence to [Word8], which has a valid instance as well
For the case of OsString, keep in mind that the raw bytes depend on the current platform (char[] array on Unix and wchar_t* on Windows). So you may have to attach more information if you choose methods 2 or 3 (e.g. encoding of the byte sequence and platform). And you need a strategy to deal with e.g. a Windows machine sending binary data to a Unix machine. As such, I recommend using decodeUtf to get a String. The target machine can then use encodeUtf to get back an OsString.
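A hedged sketch of that recommendation (the newtype and instance are mine, not part of aeson or os-string):
import Data.Aeson (ToJSON (..))
import System.OsString (OsString)
import qualified System.OsString as OS

newtype JsonOsString = JsonOsString OsString

instance ToJSON JsonOsString where
  toJSON (JsonOsString s) =
    -- decodeUtf is instantiated at Maybe here (via MonadThrow)
    case OS.decodeUtf s of
      Just str -> toJSON str
      Nothing  -> toJSON (show s) -- crude fallback; pick a real strategy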
A word on lazy IO
Some of the named packages expose API for reading and writing files via their lazy variants. Lazy IO is a hack to get incremental reading/processing without the use of a proper streaming library. The bytestring documentation warns about it:
- The program reads a file and writes the same file. This means that the file may be locked because the handler has not been released when writeFile is executed.
- The program reads thousands of files, but due to lazy evaluation, the OS’s file descriptor limit is reached before the handlers can be released.
Lazy IO makes it hard to reason about resources, order of execution etc. It is better to use a proper streaming library.
Streaming
Streaming can not only solve the lazy IO problem, but may also solve some of the inefficiency of the [Char] type and can be more performant than lazy Text/ByteString, while keeping a similarly simple API.
There are many popular streaming libraries. A few of them are:
Via Streamly
A couple of years ago I wrote the blog post From conduit to streamly, which gives an introduction to both streamly and conduit. The streamly API has diverged quite a bit since then, with multiple major versions, so I won't go into much detail about it.
However, streamly is one notable example which provides an alternative to the [Char] type in Streamly.Unicode.Stream:
decodeUtf8 :: Monad m => Stream m Word8 -> Stream m Char
encodeUtf8 :: Monad m => Stream m Char -> Stream m Word8
A very simple program to print the last Unicode char of a file via streamly is:
import System.Environment (getArgs)
import Streamly.Data.Stream (Stream)
import Streamly.Data.Fold (Fold)
import qualified Streamly.Data.Fold as Fold
import qualified Streamly.FileSystem.File as File
import qualified Streamly.Unicode.Stream as Unicode
main :: IO ()
main = do
  (file:_) <- getArgs
  c <- getLastCharFromFile file
  print c

getLastCharFromFile :: FilePath -> IO (Maybe Char)
getLastCharFromFile file = stream `Fold.drive` fold
 where
  stream :: Stream IO Char
  stream = Unicode.decodeUtf8Chunks $ File.readChunks file

  fold :: Monad m => Fold m a (Maybe a)
  fold = Fold.latest
To compile this program you need the streamly-core package. As we can see, we can easily create streams of Unicode Chars while reading from a file... without lazy IO and without the need for the lazy Text type.
If you want to compare the performance of string vs text vs streamly, you can check out the code here in my example repository. My results on a 189MB file are:
- string: 1.152s
- lazy text: 0.654s
- streamly: 0.222s
A note on FilePaths
Just a quick reminder:
- String for filepaths is very wrong
- Text for filepaths is wrong
- ByteString for filepaths is questionable
- OsPath for filepaths is good
For more details, see the links at the end of this post.
Reflection
What we should know
Almost at the end of the post, we should now have some insights into Unicode and understand:
- what a character encoding is (Unicode Code Point)
- what a text encoding is (UTF-8, UTF-16, UTF-32)
- how the different Unicode Transformation Formats work
- and their trade offs (word boundaries, searching, spaces)
- the problems with Code Points and Surrogates
  - and how these affect the Char type, Text and the IsString instance
- that grapheme clusters are the closest definition of “visible symbol”
- and that they can consist of multiple code points
- that only UTF-8 is ASCII compatible
We understand the weird numbers that the Show instance of Char/String sometimes returns.
We have seen a summary of the different string types:
- Text/ShortText for Unicode
- ByteString/ShortByteString for binary data
- The very flexible Bytes type
- OsString for operating systems API
- String for the bin
We know how to construct strings safely, can utilize QuasiQuoters to do compile-time validation and know how to convert between different types and how to deal with JSON.
We know the dangers of lazy IO and how to utilize streaming libraries instead.
Too many Strings
After all these topics, I want to address the opinion that gets thrown around on the internet a lot: “Haskell has too many String types”, e.g. on Hacker News.
If we take another look at the String Types Cheat Sheet, we don't really see any type that could be replaced by another. They all have different properties and trade-offs. ByteString vs ShortByteString may be a bit less intuitive, but e.g. Text is clearly different. OsPath is a specialized type that exists in Rust too.
Maybe people dislike the Lazy variants and prefer proper streaming libraries, which is a fair point. But even if the community decides to shift, you now have another type (the streaming type), you have to learn the streaming library's API, and you have to decide which of the 5+ libraries to use. So while we could technically do away with the lazy variants, they're a useful low-entry-barrier alternative and are still widely used.
In the end, once all these properties are well understood, I find it hard to make an argument for fewer types. However, it is clear that not everyone thinks so.
I am still unable to see the bigger picture, other than more unification of internal representations, but less so of public APIs.
E.g. if we compare the following 3 types, we see a pattern:
data Text = Text
{-# UNPACK #-} !A.Array -- ^ bytearray encoded as UTF-8
{-# UNPACK #-} !Int -- ^ offset in bytes (not in Char!), pointing to a start of UTF-8 sequence
{-# UNPACK #-} !Int -- ^ length in bytes (not in Char!), pointing to an end of UTF-8 sequence
data Vector a = Vector {-# UNPACK #-} !Int
{-# UNPACK #-} !Int
{-# UNPACK #-} !(Array a)
data Bytes = Bytes
  { array :: {-# UNPACK #-} !ByteArray
  , offset :: {-# UNPACK #-} !Int
  , length :: {-# UNPACK #-} !Int
  }
A unification of internal representations would e.g. allow 0-cost conversions, unless an invariant needs to be checked (like valid Unicode).
Text could maybe be a newtype over Bytes. But that won't actually save us a type: we still need the newtype to write an API that maintains the "valid Unicode" invariant, which Bytes does not guarantee.
It is also hard to argue for the removal of the "short" types ShortText and ShortByteString, as described in the section Slicable vs non-slicable.
Writing a new string type from scratch can be really hard. But with the rich APIs of ByteString, ShortByteString and Bytes, coming up with newtypes might not be that difficult.
What are we missing
We don’t have types for:
- Unicode Scalar Values (away with those surrogates)
- Grapheme Clusters
Especially the latter seems potentially useful. We don't just want to know the boundaries of Unicode Code Points, but also those of the actual user-visible symbols. The text-icu package seems to have an API for breaking on grapheme boundaries, but it doesn't look very straightforward. I must admit I haven't looked very hard, though.
We also don’t have a good streaming solution in base. And maybe we never will. But that, probably, also means we will never get rid of lazy IO, which is a foot-gun for newcomers and everyone else.
My next project is likely going to be strongly typed filepaths, which do already exist, just not in combination with OsPath.
Special thanks to
- Andrew Lelechenko
- Jonathan Knowles
- Mike Pilgrem
- John Ericson
- streamly maintainers for their cutting edge API
- all the text, bytestring, byteslice, short-text etc. maintainers
- Other people I pinged about this topic
Links and relevant stuff
String type blog posts
- Fixing Haskell filepaths, by Julian Ospald
- String types, by FPComplete
- Eat Haskell String Types for Breakfast, by Ziyang Liu
- Untangling Haskell’s Strings
- Haskell Strings, by Chris Warburton
- vector: Efficient Packed-Memory Data Representations, by FPComplete
Other blog posts
- From conduit to streamly
- Fast Haskell: Competing with C at parsing XML
- Beating C With 80 Lines Of Haskell: Wc
- Haskell base proposal: unifying vector-like types
- Haskell base proposal, part 2: unifying vector-like types
- The text package: finally with UTF-8, by Andrew Lelechenko
Interesting issues
String types not discussed here
- monoid-subclasses
- Data.Vector
- Data.ByteString.Builder (not really a string type)
- GHC.Data.FastString
- Data.JSString