June 29, 2022
Fixing ‘FilePath’ in Haskell
I’m pleased to announce that the Haskell definition type FilePath = String has a successor, which was first discussed many years ago as the Abstract FilePath proposal (AFPP).
The new type, shipped with the filepath-1.4.100.0 package, is:
-- * Path types

-- | FilePath for windows.
type WindowsPath = WindowsString

-- | FilePath for posix systems.
type PosixPath = PosixString

-- | Abstract filepath, depending on current platform.
-- Matching on the wrong constructor is a compile-time error.
type OsPath = OsString

-- * String types
-- Constructors are not public API.

newtype WindowsString = WindowsString ShortByteString

newtype PosixString = PosixString ShortByteString

newtype OsString = OsString
#if defined(mingw32_HOST_OS)
  WindowsString
#else
  PosixString
#endif
The reason we have two sets of types here is simply to maintain the current weak distinction in filepath for functions that deal with not-quite-filepaths, e.g. splitSearchPath :: String -> [FilePath]. This also allows us to provide a slightly different API (e.g. the QuasiQuoter for OsString differs from the one for OsPath). OsPath is not a newtype, because it doesn’t provide any additional guarantees over OsString. ‘filepath’ remains a low-level library and does not provide strong guarantees for filepaths (such as validity).
Libraries with stronger filepath guarantees are listed in the README.
Unlike the original proposal, this is an additional API (not part of base) and will not break any existing code. Core libraries are expected to upgrade their API and provide additional variants that support this new type. Migration strategies are discussed further down. The ecosystem might need some time to migrate. This is also a call for help!
But let’s first look at the reasons why String is problematic.
- What’s wrong with String?
- The solution
- How to use the new API
- Migration for library authors
- History of the proposal
- Patch load
- How to help
What’s wrong with String?
Filepaths are resources on the user’s system. We create, delete and copy them. Any corner case with filepaths can have devastating effects: deleting the wrong file, comparing the wrong files, failing whitelists, security bugs, etc.
To recap, the definition of String is:
type String = [Char]
So a String is a list of Char. And Char is encoded as UTF-8, right? Unfortunately not, it’s a Unicode code point.
A Unicode code point is an integer in the Unicode codespace. The standard gets a little technical here, but let’s just say UTF-8 is one of many encodings of Unicode code points.
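To make the distinction concrete, here is a minimal sketch using only base. The names codePoints, utf8Encode and loneSurrogate are made up for illustration and are not part of any library. Note that utf8Encode is deliberately naive: it does not reject surrogates, which a strict UTF-8 encoder would have to do.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- A Char is a code point (an integer), not a sequence of encoded bytes.
codePoints :: String -> [Int]
codePoints = map ord

-- Encoding is a separate step. Naive UTF-8 encoder for illustration only:
-- it happily encodes surrogates, which strict UTF-8 forbids.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- leading bits of the code point, shifted down by s
    hi s = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying six bits of the code point
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- A lone surrogate is a perfectly valid Haskell Char,
-- even though it is not a Unicode scalar value.
loneSurrogate :: Char
loneSurrogate = chr 0xD800
```

The same String can therefore map to different byte sequences depending on which encoding is chosen, and some Chars have no valid UTF-8 encoding at all.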
That out of the way, let’s look at how filepaths are actually represented on the system level.
On windows, filepaths are just wide character arrays (wchar_t*, so basically [Word16]). On unix, filepaths are character arrays (char, so basically [Word8]).
In both cases, there’s no encoding specified, although on windows we can mostly assume UTF-16LE. So… to go from String to CString or CWString at the outer FFI layer, we need to make a decision.
base currently does the following:
- On unix, it uses mkTextEncoding to pick a round-trippable encoding for filepaths. E.g. if your locale is UTF-8, you get the UTF-8//ROUNDTRIP TextEncoding, which is based on PEP 383: invalid bytes get translated to a special representation (lone surrogates) so that they can be roundtripped.
- On windows, it uses a private permissive UTF-16 encoding that allows roundtripping coding errors as well.
Windows isn’t too problematic here. The encoding is total. However, on unix, the interpretation of filepaths depends on the currently set locale. This is wrong for a number of reasons:
- there’s no guarantee that the currently set locale corresponds to the encoding of a specific filepath (the filepath could be on a USB drive that uses a Japanese encoding)
- as the documentation of mkTextEncoding says, only very specific encodings actually roundtrip properly
- on conversion to String, you “lose” the underlying encoding and may end up with weirdly escaped Unicode codepoints. Roundtripping can break if a call to setFileSystemEncoding interleaves the conversions.
- it’s hard to get the original bytes back… this may have security implications for e.g. filepath whitelists
So, how do other languages solve this? Python simply enforces UTF-8 (with PEP 383 escaping) on unix. That makes the roundtripping almost sound. But this comes with its own set of problems:
- if the underlying filepath is not UTF-8, the [Char] representation is lossless (from the raw bytes to [Char]), but may be somewhat nonsensical for further interpretation, because you might have excessive escaping or your Chars don’t correspond to what the user sees on their system
- this has really bad interoperability, because the roundtrip encoding can in fact produce invalid UTF-8. The Unicode Consortium itself has voiced its concerns with this approach
- since Haskell Char also includes surrogates, the conversion from String to e.g. a UTF-8 CString can in fact fail, so it is not total
I have assembled a list of correctness issues with these approaches for in-depth reading.
Just stop converting filepaths!
We can just keep the original bytes from the system API. Many filepath operations actually don’t need to know the exact underlying encoding. E.g. the filepath separator / on unix is a pre-defined byte (0x2F). You can just scan the byte array for this byte. The position doesn’t matter, the encoding doesn’t matter. File names cannot include this byte, period.
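As a rough sketch of that idea, the function below splits a unix path on the separator byte without ever decoding it. It uses a plain [Word8] list from base instead of ShortByteString to stay self-contained, and the name splitOnSlash is hypothetical, not part of the filepath API.

```haskell
import Data.Word (Word8)

-- Split raw unix path bytes on the separator byte 0x2F ('/').
-- No decoding happens, so bytes that are not valid UTF-8 pass through untouched.
splitOnSlash :: [Word8] -> [[Word8]]
splitOnSlash bs = case break (== 0x2F) bs of
  (pre, [])       -> [pre]
  (pre, _ : rest) -> pre : splitOnSlash rest
```

For instance, the bytes of "usr", then 0x2F, then the invalid UTF-8 sequence 0xFF 0xFE are still split into two components, which a String based API could only do after picking an encoding.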
However, since unix and windows are different ([Word8] vs [Word16]), any API that deals with low-level filepaths in a cross-platform manner needs to understand this and write correct code. More on this in the migration strategy section below.
We decided to use ShortByteString as the internal representation of filepaths, because:
- these are raw, uninterpreted bytes, a wrapper around ByteArray#, which has many efficient primops
- it’s unpinned, so doesn’t contribute to memory fragmentation (proof)
- provides convenient API via bytestring, which has been greatly enhanced as part of this proposal
So, in general the idea is to avoid dealing with String at all. There may still be use cases for String though, e.g.:
- dealing with legacy APIs
- reading filepaths from a UTF-8 encoded text file (you probably want Text here, but it’s trivial to convert to String)
- a unified representation across platforms (e.g. to send over the wire or to serialize)
How to use the new API
Many examples are here: https://github.com/hasufell/filepath-examples
Note that not all libraries have released support for the new API yet, so have a look at this cabal.project if you want to start right away. Generally, you should be able to use these packages already:
- filepath: provides filepath manipulation and the new OsPath type
- unix: provides new API variants, e.g. System.Posix.Files.PosixString (as an alternative to System.Posix.Files)
- Win32: similarly, provides new variants, e.g.
- directory: provides the new API under
- file-io: companion package that provides base-like file reading/writing/opening operations
Most end-users developing applications should be able to convert to the new API with little effort, given that their favorite libraries already support this new type.
System.OsPath exports the same API as System.FilePath with some additional helpers to convert from and to String. Likewise, System.OsPath.Posix and System.OsPath.Windows are equivalent to System.FilePath.Posix and System.FilePath.Windows.
So, you can just:
- update your dependencies’ lower bounds to the minimum version that supports OsPath (might need source-repository-package stanzas)
- use the specialised API from your dependencies (e.g. for unix, System.Posix.Files.PosixString)
- to write OsPath literals, use the provided QuasiQuoters. There’s no IsString instance, see the faq.
- if you’re just using an ASCII subset or strict unicode scalar values, you can use fromJust . encodeUtf and fromJust . decodeUtf to pack/unpack literals
- since base doesn’t support this new type, you’ll need the already mentioned companion library file-io for opening a Handle and writing/reading files
- if you use legacy APIs that still use FilePath, there are examples on how to deal with them (usually via System.OsPath.decodeFS)
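A small sketch of what the steps above can look like in application code, assuming a filepath version that ships System.OsPath. The names configDir, configFile and fileNameOf are hypothetical and only serve the example. Instead of fromJust, it keeps the Maybe so that a failed conversion stays visible to the caller.

```haskell
{-# LANGUAGE QuasiQuotes #-}

import System.OsPath (OsPath, decodeUtf, encodeUtf, osp, takeFileName, (</>))

-- An OsPath literal written with the QuasiQuoter (there is no IsString instance).
configDir :: OsPath
configDir = [osp|etc|]

-- encodeUtf is strict and can fail, here specialised to Maybe.
configFile :: String -> Maybe OsPath
configFile name = (configDir </>) <$> encodeUtf name

-- The familiar System.FilePath functions exist for OsPath as well;
-- decodeUtf converts back to String and can fail too.
fileNameOf :: OsPath -> Maybe String
fileNameOf = decodeUtf . takeFileName
```

With these definitions, configFile "app.conf" >>= fileNameOf gives back the original file name, and any conversion failure shows up as Nothing rather than as an exception.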
A table for encoding/decoding strategies follows:
| API function | from | to | posix encoding | windows encoding | remarks |
|---|---|---|---|---|---|
| encodeUtf | FilePath | OsPath | UTF-8 (strict) | UTF-16 (strict) | not total |
| encodeWith | FilePath | OsPath | user specified | user specified | depends on input |
| encodeFS | FilePath | OsPath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base |
| decodeUtf | OsPath | FilePath | UTF-8 (strict) | UTF-16 (strict) | not total |
| decodeWith | OsPath | FilePath | user specified | user specified | depends on input |
| decodeFS | OsPath | FilePath | depends on getFileSystemEncoding | UTF-16 (escapes coding errors) | requires IO, used by base |
These conversions are particularly useful if you’re dealing with legacy API that is still FilePath based. An example on how to do that with the process package is here.
Migration for library authors
Core libraries or other libraries exporting API that is heavy on filepaths generally have 3 options:
1. drop String based API and just provide OsPath
This is feasible, because users can themselves convert to and from String via System.OsPath.encodeFS and System.OsPath.decodeFS.
2. provide a shim compatibility API for String
This is what this
directory PR does: https://github.com/haskell/directory/pull/136/files… see
The idea is to write the core against OsPath and then create a String based API that wraps the core via System.OsPath.decodeFS to mimic behavior of base. This usually requires IO, though.
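A minimal sketch of such a shim, assuming the encodeFS and decodeFS functions exported by System.OsPath. The names parentDirOs and parentDir are hypothetical and are not taken from the directory PR.

```haskell
import System.OsPath (OsPath, decodeFS, encodeFS, takeDirectory)

-- Core implementation, written against OsPath only.
parentDirOs :: OsPath -> OsPath
parentDirOs = takeDirectory

-- String based compatibility wrapper. The conversions consult the
-- filesystem encoding, which is why this lives in IO even though the
-- core function is pure.
parentDir :: FilePath -> IO FilePath
parentDir fp = do
  path <- encodeFS fp
  decodeFS (parentDirOs path)
```

The cost of this approach is visible in the types: a pure OsPath function turns into an IO action once it is exposed through the String API.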
3. using CPP to export two APIs
This is what filepath itself does. It contains an abstract module, which is then imported while setting specific types and platform information (PosixPath, WindowsPath, System.FilePath.Posix and System.FilePath.Windows).
The main trick here is to not use any String based API (e.g. no pattern matching or use of :). Instead, we only use uncons, unsnoc, last etc, so the intersection of String and ShortByteString APIs… and then adjust the imports based on the type.
E.g. the following code:

splitSearchPath :: String -> [FilePath]
splitSearchPath = f
  where
    f xs = case break isSearchPathSeparator xs of
      (pre, []      ) -> g pre
      (pre, _ : post) -> g pre ++ f post

    g "" = ["." | isPosix]
    g ('\"':x@(_:_)) | isWindows && last x == '\"' = [init x]
    g x = [x]

can be written as:

splitSearchPath :: STRING -> [FILEPATH]
splitSearchPath = f
  where
    f xs = let (pre, post) = break isSearchPathSeparator xs
           in case uncons post of
                Nothing     -> g pre
                Just (_, t) -> g pre ++ f t

    g x = case uncons x of
      Nothing -> [singleton _period | isPosix]
      Just (h, t)
        | h == _quotedbl
        , (Just _) <- uncons t -- >= 2
        , isWindows
        , (Just (i, l)) <- unsnoc t
        , l == _quotedbl -> [i]
        | otherwise -> [x]
The windows include site is something like:
-- word16 based bytestring functions
import System.OsPath.Data.ByteString.Short.Word16

-- defining types
#define FILEPATH ShortByteString
#define WINDOWS

-- include the CPP module
#include "Internal.hs"
Then we can have a wrapper module:

splitPath :: FILEPATH_NAME -> [FILEPATH_NAME]
splitPath (OSSTRING_NAME bs) = OSSTRING_NAME <$> C.splitPath bs
And that is included like so:
import System.OsPath.Types
import System.OsString.Windows
import qualified System.OsPath.Windows.Internal as C

#define FILEPATH_NAME WindowsPath
#define WINDOWS

#include "PathWrapper.hs"
Not very pretty, but it avoids a lot of repetition and doesn’t require a partial wrapper layer that converts between ShortByteString and String.
Accessing the raw bytes in a cross-platform manner
Some libraries might need access to the raw bytes of the filepaths, e.g. because the filepath API is insufficient. It’s important to understand that on unix, we’re basically dealing with [Word8] and on windows with [Word16], where both lists are represented as a compact ShortByteString.
E.g. a cross-platform function might look like this:
{-# LANGUAGE CPP #-}

module MyModule where

import System.OsPath.Types
import System.OsString.Internal.Types
#if defined(mingw32_HOST_OS)
-- word 16 based windows API
import qualified System.OsPath.Data.ByteString.Short.Word16 as SBS
import qualified System.OsPath.Windows as PFP
#else
-- word 8 based posix API
import qualified System.OsPath.Data.ByteString.Short as SBS
import qualified System.OsPath.Posix as PFP
#endif

crossPlatformFunction :: OsPath -> IO ()
#if defined(mingw32_HOST_OS)
crossPlatformFunction (OsString pfp@(WindowsString ba)) = do
  -- use filepath functions for windows specific
  -- operating system strings
  let ext = PFP.takeExtension pfp
  -- operate directly on the underlying bytestring
  -- (which is a wide character bytestring, so uses Word16)
  let foo = SBS.takeWhile
  ...
#else
crossPlatformFunction (OsString pfp@(PosixString ba)) = do
  -- use filepath functions for posix specific
  -- operating system strings
  let ext = PFP.takeExtension pfp
  -- operate directly on the underlying bytestring
  -- (which is just Word8 bytestring)
  let foo = SBS.takeWhile
  ...
#endif
History of the proposal
- first wiki proposal: https://gitlab.haskell.org/ghc/ghc/-/wikis/proposal/abstract-file-path
- Revival attempts
- Haskell Foundation thread: https://github.com/haskellfoundation/tech-proposals/issues/35
- Reddit discussion: https://www.reddit.com/r/haskell/comments/vivjdo/abstract_filepath_coming_soon/
- Author, filepath maintainer and proposal champion: Julian Ospald (me)
- Bodigrim providing help and support as CLC chair, giving reviews as bytestring maintainer and providing help with questions about encoding
- bytestring maintainers providing review
- unix maintainers providing PR review
- Tamar Christina (Win32 maintainer) providing PR review and further guidance
- directory maintainer providing PR review
- Ericson2314 via various discussions
- Koz Ross helping with encoding questions
- GHC team helping with getting this into 9.6
- HF encouraging me
- reddit community giving loads of opinions on function names ;)
- various people on IRC discussing alternatives like PEP-383/UTF-8b/WTF-8
Patch load
- filepath: 11126 insertions(+), 3062 deletions(-)
- bytestring: 1795 insertions(+), 145 deletions(-)
- Win32: 2668 insertions(+), 986 deletions(-)
- unix: 8705 insertions(+), 3 deletions(-)
- directory: 2959 insertions(+), 939 deletions(-)
- file-io: 296 insertions(+)
Total: 27549 insertions(+), 5135 deletions(-)
How to help
- create issues for your favorite libraries to support OsPath, linking to this blog
- create PRs for existing issues:
Why is there no IsString instance (OverloadedStrings)?
IsString has a broken API: https://github.com/haskell/bytestring/issues/140
It can’t express failure. Conversion to OsPath can fail. Use the provided QuasiQuoters instead.
Why is this not in base?
Nothing is stopping this from eventually getting into base. But the barrier to doing so is much higher. It may happen eventually.
When will ‘FilePath’ be dropped?
Probably never. It would break loads of code. We don’t want to do that, for now.
Yet another String type?
Right… I suggest using python if you don’t like types ;)