November 14, 2023,
Posted by Julian Ospald
Misunderstandings
Over the past few years, there have been recurring questions or misunderstandings about GHCup. E.g.:
- GHCup only installs bindists from upstream (e.g. GHC/Cabal/HLS CI)
- GHCup never applies patches to tools it distributes
Neither assumption holds, because GHCup is primarily a distribution channel, not just an installer. The distribution channel is essentially the ghcup-0.0.8.yaml metadata file.
Users who strictly only want upstream bindists (whether they’re broken or not) can use a different distribution channel and opt out of all unofficial things: ghcup-vanilla-0.0.8.yaml. More information is in the README.
Policies and priorities
GHCup follows philosophies and policies similar to those of Debian or Fedora. Most of them are outlined here in more detail. The most important points, simplified, are:
- The end-user experience is our primary concern
- We strive to collaborate with all maintainers of all the tools we support and maintain a good relationship
- We may fix build system or other distribution bugs in upstream bindists
- We may even patch source code of supported tools in very rare cases if that is required to ensure that the end-user experience does not break
As such, we prioritize the end-user over the interests of upstream developers. E.g. it frequently happens that upstream developers want a new release to be ‘recommended’ (that is: installed by default if you run GHCup for the first time). However, experience shows that it’s usually better to wait.
So far, we have never patched source code. In case that ever happens, it would be communicated to both upstream and the end user via post-install messages.
Effects on maintenance
Following these priorities, the codebase sometimes seems to be the smallest part of GHCup maintenance. The following tasks came up repeatedly, until I decreased my workload considerably to avoid burning out:
- building unofficial bindists for missing platforms (e.g. GHC alpine i386, armv7, FreeBSD or stack darwin aarch64)
- patching upstream bindists in case of issues discovered post-release
- tracking releases and bugs of all tools to decide which release is to be ‘recommended’
- being involved in CI code and release issues of most tools
- meetings and communication with HF, GHC HQ, other tooling maintainers and especially HLS
- developing and supporting new ideas (dynamic HLS bindists, GHC nightlies, …)
- advocating and pushing for prioritizing end user experience, e.g. here
- supporting users having installation issues via IRC, Discord, email, different issue trackers, …
Most of this has now stalled, until GHCup gets more support (e.g. from Obsidian, which I’m excited about).
Possible future
GHCup being a distribution channel also means that, theoretically, we might completely stop relying on upstream bindists and roll our own. For this idea, I have already prepared a document about Midstream bindists that could be submitted as an HF tech proposal. As I don’t have the capacity, I have not submitted it yet, and maybe I never will.
In a perfect world, we want full control over the bindists, decide on the provided configurations, distribution support, platform support, etc.
This is what Linux distributions do too. They rarely use upstream bindists, except for bootstrapping purposes.
What we want from upstream is not bindists
What distributors really want from upstream is not bindists, but something else:
- feasibility to run test suites on the end-user’s system (that’s where it matters, not just in CI)
- and have processes and mechanisms to get feedback for failing test suites (send report via cli that gets aggregated and analyzed somewhere)
- awareness that the build system is not just a dev tool for hackers, but an interface for distributors and end users
- mindfulness about platform support (including less common ones)
- not relying on hermetically built binaries: instead make sure a manually compiled binary works on all platforms and have sufficient mechanisms to detect when it wouldn’t (through bindist configure, runtime checks, test suite, …)
- have prereleases as much as possible, including minor releases
- communicate everything that potentially impacts distributors
- longer patch/security maintenance windows for older versions
If the build system interface were stable, we could simply run ghcup compile ghc in our own CI, followed by ghcup test ghc (yes, that exists!), and be done. Then we could tell end users to run ghcup test ghc after installation to make sure it really works with their environment (that’s not a given even with official bindists). However, the test suite is flaky and the test bindists are buggy and not very portable, so this goal is far off.
Conclusion
I hope that this clears up some of the confusion and expectations and that end users understand that they have a choice by utilizing different metadata files.
June 29, 2022,
Posted by Julian Ospald
I’m pleased to announce that the Haskell type FilePath (a synonym for String) has a successor, which was first discussed many years ago as the Abstract FilePath Proposal (AFPP).
The new type shipped with the filepath-1.4.100.0 package is:
-- * Path types
-- | FilePath for windows.
type WindowsPath = WindowsString
-- | FilePath for posix systems.
type PosixPath = PosixString
-- | Abstract filepath, depending on current platform.
-- Matching on the wrong constructor is a compile-time error.
type OsPath = OsString
-- * String types
-- Constructors are not public API.
newtype WindowsString = WindowsString ShortByteString
newtype PosixString = PosixString ShortByteString
newtype OsString = OsString
#if defined(mingw32_HOST_OS)
WindowsString
#else
PosixString
#endif
The reason we have two sets of types here is simply to maintain the current weak distinction in filepath for functions that deal with not-quite-filepaths, e.g. splitSearchPath :: String -> [FilePath]. This also allows us to provide a slightly different API (e.g. the QuasiQuoter for OsString differs from the one for OsPath). OsPath is not a newtype, because it doesn’t provide any additional guarantees over OsString. filepath remains a low-level library and does not provide strong guarantees for filepaths (such as validity).
Libraries with stronger filepath guarantees are listed in the README.
Unlike the original proposal, this is additional API (not part of base) and will not break any existing code. Core libraries are expected to upgrade their API and provide additional variants that support this new type. Migration strategies are discussed further down. The ecosystem might need some time to migrate. This is also a call for help!
But first, let’s look at the reasons why String is problematic.
What’s wrong with String?
Filepaths are resources on the user’s system. We create, delete, and copy them. Any corner case with filepaths can have devastating effects: deleting the wrong file, comparing the wrong files, failing whitelists, security bugs, etc.
To recap, the definition of String is:
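type String = [Char]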
So a String is a list of Char. And Char is encoded as UTF-8, right? Unfortunately not: it’s a Unicode code point.
A Unicode code point is an integer in the Unicode codespace. The standard gets a little technical here, but let’s just say UTF-8 is one of many possible encodings of [Char].
That out of the way, let’s look at how filepaths are actually represented on the system level.
On windows, filepaths are just wide character arrays (wchar_t*, so basically [Word16]). On unix, filepaths are character arrays (char[], so basically [Word8]).
In both cases, there’s no encoding specified, although on windows we can mostly assume UTF-16LE. So… to go from String to CString/CWString at the outer FFI layer, we need to make a decision.
base currently does the following:
- On unix, it uses getFileSystemEncoding and mkTextEncoding to pick a round-trippable encoding for filepaths. E.g. if your locale returns en_US.UTF-8, you’ll get the UTF-8//ROUNDTRIP TextEncoding, which is based on PEP 383: invalid bytes get translated to a special representation (lone surrogates) in order to be roundtripped.
- On windows, it uses a private permissive UTF-16 encoding that allows roundtripping coding errors as well.
Windows isn’t too problematic here. The encoding is total. However, on unix, the interpretation of filepaths depends on the currently set locale. This is wrong for a number of reasons:
- there’s no guarantee that the currently set locale corresponds to the encoding of a specific filepath (the filepath could be on a USB drive that has a japanese encoding, such as CP932)
- as the documentation of mkTextEncoding says, only very specific encodings actually roundtrip properly (CP932 does not)
- on conversion to String, you “lose” the underlying encoding and may end up with weirdly escaped Unicode codepoints. Roundtripping can break if a call to setFileSystemEncoding interleaves the conversions.
- it’s hard to get the original bytes back… this may have security implications, e.g. for filepath whitelists
So, how do other languages solve this? Python simply enforces UTF-8 (with PEP 383 escaping) on unix. That makes the roundtripping almost sound. But this comes with its own set of problems:
- if the underlying filepath is not UTF-8, the [Char] representation is lossless (from CString to [Char]), but may be somewhat nonsensical for further interpretation, because you might have excessive escaping or your Chars don’t correspond to what the user sees on their system
- this has really bad interoperability, because the roundtrip encoding can in fact produce invalid UTF-8. The Unicode consortium itself has voiced concerns about this approach
- since Haskell Char also includes surrogates, the conversion from String to e.g. a UTF-8 CString can in fact fail, so it is not total
I have assembled a list of correctness issues with these approaches for in-depth reading.
The solution
Just stop converting filepaths!
We can just keep the original bytes from the system API. Many filepath operations don’t actually need to know the exact underlying encoding. E.g. the filepath separator / on unix is a pre-defined byte (0x2F). You can just scan the byte array for this byte; the position doesn’t matter, the encoding doesn’t matter. File names cannot include this byte, period.
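A minimal sketch of that idea (containsSeparator is a hypothetical helper, not part of the filepath API):

import qualified Data.ByteString.Short as SBS
import Data.Word (Word8)

-- The posix path separator is a fixed byte; no decoding required.
posixSeparator :: Word8
posixSeparator = 0x2F  -- '/'

-- Scan the raw, uninterpreted bytes for that byte.
containsSeparator :: SBS.ShortByteString -> Bool
containsSeparator = elem posixSeparator . SBS.unpack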
However, since unix and windows are different ([Word8] vs [Word16]), any API that deals with low-level filepaths in a cross-platform manner needs to understand this and write correct code. More on this in the migration strategy section below.
We decided to use ShortByteString as the internal representation of filepaths, because:
- it consists of raw, uninterpreted bytes, a wrapper around ByteArray#, which has many efficient primops
- it’s unpinned, so doesn’t contribute to memory fragmentation (proof)
- it provides convenient API via bytestring, which has been greatly enhanced as part of this proposal
So, in general, the idea is to avoid dealing with String at all. There may still be use cases for String, though, e.g.:
- dealing with legacy APIs
- reading filepaths from a UTF-8 encoded text file (you probably want Text here, but it’s trivial to convert to String)
- a unified representation across platforms (e.g. to send over the wire or to serialize)
How to use the new API
Many examples are here: https://github.com/hasufell/filepath-examples
Note that not all libraries have released support for the new API yet, so have a look at this cabal.project if you want to start right away. Generally, you should be able to use these packages already:
- filepath: provides filepath manipulation and the new OsPath type
- unix: provides new API variants, e.g. System.Posix.Files.PosixString (as an alternative to System.Posix.Files)
- Win32: similarly, provides new variants, e.g. System.Win32.WindowsString.File
- directory: provides the new API under System.Directory.OsPath
- file-io: companion package that provides base-like file reading/writing/opening operations
Most end-users developing applications should be able to convert to the new API with little effort, given that their favorite libraries already support this new type.
System.OsPath exports the same API as System.FilePath, with some additional helpers to convert from and to String. Likewise, System.OsPath.Posix/System.OsPath.Windows are equivalent to System.FilePath.Posix/System.FilePath.Windows.
So, you can just:
- update your dependencies’ lower bounds to the minimum version that supports OsPath (might need source-repository-package stanzas)
- for filepath, import System.OsPath instead of System.FilePath
- use the specialised API from your dependencies (e.g. for unix, System.Posix.Directory.PosixPath instead of System.Posix.Directory)
- to write OsPath literals, use the provided QuasiQuoters (see the sketch after this list). There’s no IsString instance, see the FAQ.
- if you’re just using an ASCII subset or strict unicode scalar values, you can use fromJust . encodeUtf and fromJust . decodeUtf to pack/unpack literals
- since base doesn’t support this new type, you’ll need the already mentioned companion library file-io for opening a Handle and writing/reading files
- if you use legacy APIs that still use FilePath, there are examples on how to deal with them (usually via System.OsPath.encodeFS and System.OsPath.decodeFS)
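For example, a minimal sketch of both literal styles (assuming filepath’s osp QuasiQuoter and encodeUtf, as described above):

{-# LANGUAGE QuasiQuotes #-}

import Data.Maybe (fromJust)
import System.OsPath

-- A compile-time checked literal via the QuasiQuoter.
config :: OsPath
config = [osp|config.yaml|]

-- Packing a plain ASCII String; encodeUtf fails (in MonadThrow)
-- on lone surrogates, hence the partial fromJust.
mkPath :: String -> OsPath
mkPath = fromJust . encodeUtf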
A table of encoding/decoding strategies follows:

| Function | From | To | Unix | Windows | Remarks |
| --- | --- | --- | --- | --- | --- |
| encodeUtf | FilePath | OsPath | UTF-8 (strict) | UTF-16 (strict) | not total |
| encodeWith | FilePath | OsPath | user specified | user specified | depends on input |
| encodeFS | FilePath | OsPath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base for roundtripping |
| decodeUtf | OsPath | FilePath | UTF-8 (strict) | UTF-16 (strict) | not total |
| decodeWith | OsPath | FilePath | user specified | user specified | depends on input |
| decodeFS | OsPath | FilePath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base for roundtripping |
These conversions are particularly useful if you’re dealing with legacy API that is still FilePath based. An example of how to do that with the process package is here.
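As a rough sketch of the pattern (readFileLegacy stands in for any String-based function; the name is hypothetical):

import System.OsPath (OsPath, decodeFS)

-- Some legacy function that still takes a FilePath.
readFileLegacy :: FilePath -> IO String
readFileLegacy = readFile

-- Decode the OsPath at the boundary (in IO, mirroring base's
-- roundtripping behavior) and hand it to the legacy API.
readFileOs :: OsPath -> IO String
readFileOs p = decodeFS p >>= readFileLegacy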
Migration for library authors
Core libraries or other libraries exporting API that is heavy on filepaths generally have 3 options:
1. drop the String based API and just provide OsPath
This is feasible, because users can themselves convert to and from String via System.OsPath.encodeFS and System.OsPath.decodeFS.
2. provide a shim compatibility API for String
This is what this directory PR does: https://github.com/haskell/directory/pull/136/files… see System/Directory.hs.
The idea is to write the core against OsPath and then create a String based API that wraps the core via System.OsPath.encodeFS and System.OsPath.decodeFS to mimic the behavior of base. This usually requires IO, though.
3. use CPP to export two APIs
This is what filepath itself does. It contains an abstract module, which is then imported while setting specific types and platform information (PosixPath, WindowsPath, System.FilePath.Posix and System.FilePath.Windows).
The main trick here is to not use any String based API (e.g. no pattern matching or use of :). Instead, we only use uncons/unsnoc, head/last etc., i.e. the intersection of the String and ShortByteString APIs… and then adjust the imports based on the type.
E.g. the following code:
splitSearchPath :: String -> [FilePath]
splitSearchPath = f
where
f xs = case break isSearchPathSeparator xs of
(pre, [] ) -> g pre
(pre, _:post) -> g pre ++ f post
g "" = ["." | isPosix]
g ('\"':x@(_:_)) | isWindows && last x == '\"' = [init x]
g x = [x]
became:
splitSearchPath :: STRING -> [FILEPATH]
splitSearchPath = f
where
f xs = let (pre, post) = break isSearchPathSeparator xs
in case uncons post of
Nothing -> g pre
Just (_, t) -> g pre ++ f t
g x = case uncons x of
Nothing -> [singleton _period | isPosix]
Just (h, t)
| h == _quotedbl
, (Just _) <- uncons t -- >= 2
, isWindows
, (Just (i, l)) <- unsnoc t
, l == _quotedbl -> [i]
| otherwise -> [x]
The windows include site looks something like this:
-- word16 based bytestring functions
import System.OsPath.Data.ByteString.Short.Word16
-- defining types
#define FILEPATH ShortByteString
#define WINDOWS
-- include the CPP module
#include "Internal.hs"
Then we can have WindowsPath/PosixPath/OsPath wrappers:
splitPath :: FILEPATH_NAME -> [FILEPATH_NAME]
splitPath (OSSTRING_NAME bs) = OSSTRING_NAME <$> C.splitPath bs
And that is included like so:
import System.OsPath.Types
import System.OsString.Windows
import qualified System.OsPath.Windows.Internal as C
#define FILEPATH_NAME WindowsPath
#define WINDOWS
#include "PathWrapper.hs"
Not very pretty, but it avoids a lot of repetition and doesn’t require a partial wrapper layer that converts between ShortByteString and String.
Some libraries might need access to the raw bytes of the filepaths, e.g. because the filepath API is insufficient. It’s important to understand that on unix we’re basically dealing with [Word8] and on windows with [Word16], where both lists are represented as a compact ShortByteString.
E.g. a cross-platform function might look like this:
module MyModule where
import System.OsPath.Types
import System.OsString.Internal.Types
#if defined(mingw32_HOST_OS)
-- word 16 based windows API
import qualified System.OsPath.Data.ByteString.Short.Word16
as SBS
import qualified System.OsPath.Windows as PFP
#else
-- word 8 based posix API
import qualified System.OsPath.Data.ByteString.Short as SBS
import qualified System.OsPath.Posix as PFP
#endif
crossPlatformFunction :: OsPath -> IO ()
#if defined(mingw32_HOST_OS)
crossPlatformFunction (OsString pfp@(WindowsString ba)) = do
-- use filepath functions for windows specific
-- operating system strings
let ext = PFP.takeExtension pfp
-- operate directly on the underlying bytestring
-- (which is a wide character bytestring, so uses Word16)
let foo = SBS.takeWhile
...
#else
crossPlatformFunction (OsString pfp@(PosixString ba)) = do
-- use filepath functions for posix specific
-- operating system strings
let ext = PFP.takeExtension pfp
-- operate directly on the underlying bytestring
-- (which is just Word8 bytestring)
let foo = SBS.takeWhile
...
#endif
History of the proposal
- first wiki proposal: https://gitlab.haskell.org/ghc/ghc/-/wikis/proposal/abstract-file-path
- Revival attempts
- PRs:
- Haskell Foundation thread: https://github.com/haskellfoundation/tech-proposals/issues/35
- Reddit discussion: https://www.reddit.com/r/haskell/comments/vivjdo/abstract_filepath_coming_soon/
Contributors
- Author, filepath maintainer and proposal champion: Julian Ospald (me)
- Bodigrim providing help and support as CLC chair, giving reviews as bytestring maintainer and providing help with questions about encoding
- bytestring maintainers providing review for the ShortByteString PR
- unix maintainers providing PR review
- Tamar Christina (Win32 maintainer) providing PR review and further guidance for the file-io library
- directory maintainer providing PR review
- Ericson2314 via various discussions
- Koz Ross helping with encoding questions
- GHC team helping with getting this into 9.6
- HF encouraging me
- reddit community giving loads of opinions on function names ;)
- various people on IRC discussing alternatives like PEP-383/UTF-8b/WTF-8
Patch load
- filepath: 11126 insertions(+), 3062 deletions(-)
- bytestring: 1795 insertions(+), 145 deletions(-)
- Win32: 2668 insertions(+), 986 deletions(-)
- unix: 8705 insertions(+), 3 deletions(-)
- directory: 2959 insertions(+), 939 deletions(-)
- file-io: 296 insertions(+)
Total: 27549 insertions(+), 5135 deletions(-)
How to help
- create issues for your favorite libraries to support OsPath, linking to this blog post
- create PRs for existing issues:
FAQ
Why is there no IsString instance (OverloadedStrings)?
IsString has a broken API (https://github.com/haskell/bytestring/issues/140): it can’t express failure. Conversion to OsPath can fail. Use the provided QuasiQuoters instead.
Why is this not in base?
Nothing is stopping this from getting into base, but the barrier for doing so is much higher. It may happen eventually.
When will ‘FilePath’ be dropped?
Probably never. It would break loads of code. We don’t want to do that, for now.
Yet another String type?
Right… I suggest using python if you don’t like types ;)
October 22, 2021,
Posted by Julian Ospald
Motivation
At GHCup, I recently put a lot of effort into reducing the dependency footprint to improve build times. Since conduit was not a direct dependency and was only used for yaml parsing and some other things, I replaced those deps with alternatives or re-implemented them (like logging).
yaml, which uses conduit under the hood, was replaced with HsYAML, but to my despair… that turned out to be 10 times slower, which also caused issues for pandoc.
Conduit is an excellent, fully featured streaming library, but I didn’t want to go back to it by re-introducing yaml, since GHCup previously depended on streamly and will likely do so again in the future. So I simply decided to migrate yaml to streamly: https://hackage.haskell.org/package/yaml-streamly.
Streamly is a very general streaming library with a strong focus on performance through inlining and stream fusion optimizations. As such, it may exceed other implementations’ performance, but it also depends quite heavily on GHC behavior, flags, INLINE pragmas etc. It can also be used as an alternative to async, for reactive programming and much more.
So in this post, I will briefly explain conduit and streamly and provide a simple migration guide.
Recap on conduit
There are many approaches to streaming. Conduit and streamly diverge quite heavily in terms of paradigm and API.
Conduit expresses streaming by providing a type that captures the input i, the output o and a possible final result r, all in one type (along with the obligatory effect m):
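data ConduitT i o m r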
As such, it expresses:
Producers
These are generators from a seed value. Conduit defines it generically as such:
unfold :: Monad m => (b -> Maybe (a, b)) -> b -> ConduitT i a m ()
A simple unfold that lets us turn a list into a stream would be:
-- this is also provided by conduit
sourceList :: Monad m => [a] -> ConduitT i a m ()
sourceList = unfold gen
where
gen :: [a] -> Maybe (a, [a])
gen (x:xs) = Just (x,xs)
gen _ = Nothing -- stream aborts
-- our own stream of "output" chars with no final result
chars :: Monad m => ConduitT i Char m ()
chars = sourceList "abc"
As can be seen, the o in data ConduitT i o m r gets fixed to Char. A producer can then be “piped” into another conduit, e.g. a transformer. A producer focuses on the output o.
Transformers
A transformer is like map :: (a -> b) -> [a] -> [b]. It transforms the stream and may yield a different type.
-- provided by conduit, notice how it has only one argument
map :: Monad m => (a -> b) -> ConduitT a b m ()
-- transforms Char to Int
charToInt :: Monad m => ConduitT Char Int m ()
charToInt = map ord
-- applies the transformation to the chars, yielding a Producer
-- we'll explain '.|' shortly
ints :: Monad m => ConduitM a Int m ()
ints = chars .| charToInt
Notably, the Functor fmap isn’t a transformation: it would map over the final result value, not the produced values. That’s why we need Data.Conduit.List.map. Streamly is very different here.
A transformer maps the input i to the output o.
To apply a transformation, we use the (.|) pipe operator, which is reminiscent of shell pipes:
(.|) :: Monad m
=> ConduitM a b m () -- ^ producer of values 'b'
-> ConduitM b c m r -- ^ transformer (b -> c), or consumer
-> ConduitM a c m r
It takes a little while to see what’s going on. The type variables guide us.
Consumers
A consumer works on the input stream, much like a transformer, but may also yield a final result. E.g. if we wanted to return all the Ints we just converted from the Char stream, we’d do:
-- provided by conduit
foldl :: Monad m => (a -> b -> a) -> a -> ConduitT b o m a
-- 'a' (the input) gets folded as a list, so the final result is '[a]'
toList :: Monad m => ConduitT a o m [a]
toList = foldl (\a b -> b:a) []
-- applying the fold on the stream of Ints
foldedInts :: Monad m => ConduitM a c m [Int]
foldedInts = ints .| toList
The consumer focuses on the input i to produce a final result r (however, consumers may also drop elements from the stream).
As demonstrated, one has to look closely at the type parameters in data ConduitT i o m r to understand a conduit.
All concepts are unified in one type. Most operations need specific combinators.
Wrapping up conduit
Finally, we can get our Ints:
ints :: Monad m => m [Int]
ints = runConduit foldedInts
That’s basically conduit. A conduit as such doesn’t really express streams. Instead we’re dealing with stream processors (functions).
Streamly
Streamly’s approach is very different. It focuses on the simple concept of a stream of elements. It has 4 main types:
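-- simplified signatures, reconstructed from the discussion below;
-- see the streamly documentation for the full API
data SerialT m a     -- a stream of 'a' values (abstracted by IsStream t => t m a)
data Unfold m a b    -- a stream producer: generates 'b' values from a seed 'a'
data Fold m a b      -- a stream consumer: folds 'a' values into a result 'b'
newtype Parser m a b -- like a Fold, but with backtracking and a monadic interface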
As can be seen, this is nothing like data ConduitT i o m r. Note that IsStream t is abstract, allowing for different types of streams like SerialT or AsyncT, which I won’t go into detail about here.
We’ll now figure out how these concepts translate to conduit.
Producers
Conduit’s producers are basically Unfolds. The simplest function to create an Unfold is:
unfoldr :: Applicative m => (a -> Maybe (b, a)) -> Unfold m a b
…which actually looks a lot like conduit:
unfold :: Monad m => (b -> Maybe (a, b)) -> b -> ConduitT i a m ()
The difference in streamly is that we provide the initial seed value when we turn the Unfold into a Stream.
So, let’s do the same procedure as above. We’ll create a list of Chars:
-- equivalent to conduits 'sourceList', also provided by streamly
fromList :: Monad m => Unfold m [a] a
fromList = unfoldr gen
where
gen :: [a] -> Maybe (a, [a])
gen (x:xs) = Just (x,xs)
gen _ = Nothing -- stream aborts
-- provided by streamly
-- given a seed value, turn an Unfold into a stream
unfold :: (IsStream t, Monad m) => Unfold m a b -> a -> t m b
-- we turn the unfold into a stream of chars
chars :: (IsStream t, Monad m) => t m Char
chars = Streamly.Prelude.unfold fromList "abc"
This type t m Char looks a lot simpler. It’s basically a glorified list with possible effects run for every element.
Transformers
A transformer doesn’t have its own type, which is, in my opinion, much simpler than conduit. Here, we can simply reuse the Prelude’s fmap. The main difference is that we have an input and an output stream, so:
-- transforms Char to Int
charToInt :: (IsStream t, Monad m, Functor (t m)) => t m Char -> t m Int
charToInt inputStream = fmap ord inputStream
-- applies the transformation to the chars, yielding a stream of Ints
ints :: (IsStream t, Monad m, Functor (t m)) => t m Int
ints = charToInt chars
This feels much more like lists! Compare with fmap ord "abc". Streams can be passed around and transformed just like lists. If you want to run effects for every item, you just use the Monad interface:
charToInt :: (IsStream t, Monad m, Monad (t m)) => t m Char -> t m Int
charToInt inputStream = inputStream >>= pure . ord
However, this creates a data dependency (as we’re used to from Monad). There’s the more general mapM that can run effects in parallel:
mapM :: (IsStream t, MonadAsync m) => (a -> m b) -> t m a -> t m b
Excellent. So Functor, Monad etc. follow our intuition.
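As a small sketch of mapM in use (charToIntM and the qualified import names are my own):

import Data.Char (ord)
import Streamly.Prelude (SerialT)
import qualified Streamly.Prelude as S

-- Run an effect per element; with SerialT the effects run in order,
-- while the async stream types may run them concurrently.
charToIntM :: SerialT IO Char -> SerialT IO Int
charToIntM = S.mapM (\c -> print c >> pure (ord c))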
Consumers
Simple consumers, in streamly terms, are usually Folds. E.g. if we wanted to convert our stream of Ints to an actual list of Ints, we would combine our input stream with a Fold.
Remember the Fold type data Fold m a b, where a is the type of the input stream’s values and b is the final folded value.
-- provided by streamly for creating a Fold
foldl' :: Monad m => (b -> a -> b) -> b -> Fold m a b
-- provided by streamly for executing a fold over a stream
fold :: Monad m => Fold m a b -> Stream m a -> m b
-- A Fold that turns any input stream into a list
toList :: Monad m => Fold m a [a]
toList = foldl' (\a b -> b:a) []
-- Applying the Fold to an actual stream already executes it
foldedInts :: Monad m => m [Int]
foldedInts = fold toList ints
Parsers
Folds don’t have a monadic interface (yet). If we want backtracking and a monadic interface to choose the next step depending on the current element in the stream, we can use a Parser.
In conduit, we can use consumers like head and peek and utilize the Monad interface of ConduitT to make our decisions. Theoretically, we could do the same with streamly’s Stream type via uncons, but a parser feels more idiomatic here.
Note that there is a parser-like package conduit-parse, but the yaml conduit code doesn’t utilize it, and this blog post was written while I converted yaml to streamly.
The streamly Parser type has the same shape as a Fold: newtype Parser m a b. It parses streamed values a into a result b. Much of the API resembles what you’re used to from parsec or attoparsec etc.
Let’s look at this conduit code (not tested to compile):
import qualified Data.Conduit.Combinators as C
chars :: Monad m => ConduitT i Char m ()
chars = sourceList "a1b2c3"
-- We parse '1' from 'a1', '2' from 'b2' and so on, no matter
-- the order the pairs appear in.
parse' :: MonadIO m => ConduitT Char o m [Int]
parse' = do
mc <- C.head
case mc of
Just 'a' -> do
mcn <- C.head
case mcn of
Just '1' -> (1:) <$> parse'
Just cn -> liftIO $ throwIO $ userError $ "Unexpected char: " ++ [cn]
Nothing -> pure []
Just ... -- and so on
Nothing -> pure []
To translate this to streamly, we would write:
chars :: (IsStream t, Monad m) => t m Char
chars = Streamly.Prelude.unfold fromList "a1b2c3"
-- we define a helper that acts like conduits C.head
anyChar :: MonadCatch m => Parser m Char (Maybe Char)
anyChar = (Just <$> satisfy (const True)) <|> pure Nothing
-- We parse '1' from 'a1', '2' from 'b2' and so on, no matter
-- the order the pairs appear in.
parse' :: MonadIO m => Parser m Char [Int]
parse' = do
mc <- anyChar
case mc of
Just 'a' -> do
mcn <- anyChar
case mcn of
Just '1' -> (1:) <$> parse'
Just cn -> liftIO $ throwIO $ userError $ "Unexpected char: " ++ [cn]
Nothing -> pure []
Just ... -- and so on
Nothing -> pure []
This looks exactly like the conduit code, except we replaced head with anyChar (although we could likely simplify it further instead of pattern matching on the chars).
Running a parser is like running a fold. We need an input stream:
parse :: MonadThrow m => Parser m a b -> SerialT m a -> m b
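Putting it together, a sketch of running our parser over the input stream (ints' is a hypothetical name; note that parse lived in an internal streamly module at the time of writing):

-- Run the monadic parser over the stream of chars defined above.
ints' :: IO [Int]
ints' = parse parse' chars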
Wrapping up streamly
Running a stream is usually done by applying a Fold
, as we’ve done above. We can also turn a stream into a list directly:
toList :: Monad m => SerialT m a -> m [a]
Or just evaluate the stream and discard the values:
drain :: Monad m => SerialT m a -> m ()
All these functions also exist as Folds, so these are just convenience wrappers.
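E.g., as a sketch (assuming the Fold-based toList from Streamly.Data.Fold), these two should be equivalent:

import qualified Streamly.Prelude as S
import qualified Streamly.Data.Fold as Fold

-- 'ints' is the stream of Ints defined earlier.
list1, list2 :: IO [Int]
list1 = S.toList ints
list2 = S.fold Fold.toList ints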
As can be seen, streamly isn’t based on stream processors like conduit. Instead it composes stream data directly and behaves pretty much like lists. Usually we don’t need special operators. Functor, Monad etc. follow our intuition from lists.
We’ve also seen that there’s an abstract IsStream class and specific stream types like SerialT (for serially processed streams), AsyncT (for concurrent streams) and so on. These are explained in more detail in the streamly documentation.
Back to yaml
So how does this translate to yaml parsing? Well, the yaml package uses the libyaml C library for parsing, which is an event driven parser. So we get a stream of events, turn that into a single JSON value and then let aeson do its magic.
Finally, for reference, here’s the migration patch: https://github.com/hasufell/streamly-yaml/commit/bfd1da498588af906cbc5d3bb519f1ccdf7ad63e
In fact, it didn’t require a rewrite at all. Simply applying the concepts from above was enough. Figuring out that we needed a Parser type etc. took a while (I tried with Fold first). Thanks to the helpful streamly developers for providing guidance. There were some rough edges here and there, since much of the streamly API is still marked as Internal.
Did it actually improve performance?
On my first attempt, I used the wrong, inefficient internal ParserD type, which seemed to cause exponential allocations. After fixing that, I was still slower than conduit. Since streamly relies heavily on GHC’s inliner, this wasn’t a surprise. It required some effort, but finally the performance was on par with conduit (tested informally via the yaml2json executable on a 100MB YAML file).
Streamly also provides some guidance for optimization.
I guess that since the actual parsing is done by the C code, and the event-to-JSON conversion is really a slow element-by-element monadic parsing transformation, there isn’t much room to improve performance anyway.
If you find ideas about how to improve it further, please let me know.
Did this actually reduce dependency footprint?
Well, no. But the point was to depend on only a single streaming framework. I also note that streamly is planning to split the streamly package into streamly-core (which only depends on boot packages) and further feature packages.
Conclusion
- migrating conduit code to streamly is easier than I thought
- performance optimization in streamly requires some time and effort
- you definitely want performance regression tests with streamly to ensure new GHC versions or refactorings don’t cause regressions
What’s next?
Writing a streamly yaml parser in pure Haskell?
Discussion