Transpiling a large PureScript codebase into Haskell, part 2: Records are trouble

July 28, 2021
Written by Artyom Kazak @ Monadfix

Intro

As described in the introduction to the series, we wrote a PureScript to Haskell transpiler. It is used by Juspay, an Indian fintech company, to migrate a large backend to Haskell.

In this post we will look at row types and anonymous records. PureScript has them and Haskell doesn't, so we have to do something.

Anonymous records and row types in PureScript

If you don't know PureScript, this section is for you.

Here is an anonymous record:

type Response = { foo :: Int, bar :: Int }

Anonymous records are just like any other values. They can be passed into functions, pattern-matched, etc.

In fact, PureScript doesn't have Haskell-like records at all. They just wrap anonymous records into newtypes:

newtype Response = 
  Response                      -- Constructor
    { foo :: Int, bar :: Int }  -- Anonymous record contained in the newtype

Since our goal after transpilation is idiomatic Haskell code, we want to turn these into normal data declarations.

Now, row types are an important implementation detail of records in PureScript. Specifically, PureScript has (a) built-in syntax, (b) a special kind, and (c) special typing rules for unordered collections of name/type pairs. These lists are called rows, and they are used for more than just records — see this Reddit comment for details.

Rows look similar to records, but using parentheses ( ) instead of braces { }:

type ResponseRow = ( foo :: Int, bar :: Int )
  -- the kind of ResponseRow is 'Row Type'

Records are tagged with rows specifying the content of the record. { foo :: Int, bar :: Int } is syntax sugar for Record ( foo :: Int, bar :: Int ), where Record is a built-in:

data Record :: Row Type -> Type

Rows can be extended:

type ResponseRow = ( foo :: Int, bar :: Int )
type ResponseRow2 = ( qux :: Int | ResponseRow )

Rows (and therefore records) can be partial, and the PureScript compiler has built-in rules for type inference regarding rows. For instance, here we have a function that accepts any record with a field foo :: Int:

changeFoo :: forall r. { foo :: Int | r } -> { foo :: Int | r }
changeFoo x = x {foo = x.foo + 1}

In this last example, r has the kind Row Type — not a record, but a row. If you want to define two record types that overlap heavily, this is how you would do it in PureScript:

type UserCommon r =
  { name :: String
  , workspace :: WorkspaceId
  , subscriptionPlan :: Plan
  , ...
  | r
  }

-- Note that we are passing rows, not records!
type User = UserCommon ( id :: UserId )       -- a row with one element
type UserWithoutId = UserCommon ()            -- () is an empty row

Alright! Let's see how to transpile all of this.

One thing to keep in mind is that we want to output normal records whenever possible. The PureScript idiom of newtype User = User {...} should be transpiled into data User = User {...}.

However, sometimes we will have to resort to anonymous records. For example, if we have a data X = X {...} {...}, containing two anonymous records, we won't be able to convert this definition into a Haskell-style record. Same if a function takes an anonymous records as a parameter.

Anonymous records

There are many ways to handle anonymous records — e.g. we could generate a new data type for each encountered anonymous record. For a while we actually did that.

However, there is an easier solution: use type-level lists to define a type for actual anonymous records. There are many libraries for doing that, e.g. vinyl. We have settled on superrecord, and then had to modify it a bit, resulting in a new library: jrec.

I have written a post about the implementation of jrec — see Forget about lenses, let's all implement array-backed anonymous records for fun. Here's how the end result looks on the user side:

-- PureScript
type User = { name :: String, workspace :: WorkspaceId, ... }

-- Haskell
type User = Rec '["name" := String, "workspace" := WorkspaceId, ...]

Uniform record access

Our transpiler does not do type inference, so we can't distinguish between different record kinds at use sites. We want some way of accessing record fields that will always work.

GHC 9.2 is slated to bring us the -XRecordDotSyntax extension, which is exactly what we need:

it enables PureScript-like record access with a dot (x.foo);
it lets us overload record access for arbitrary types by implementing a HasField instance.

Unfortunately, GHC 9.2 is not out yet — but Neil Mitchell wrote record-dot-preprocessor, which is a GHC plugin that lets us use the same syntax with today's GHC. A few fixes were required (PR #34, PR #41) to get it to work for our usecase.

record-dot-preprocessor generates HasField instances for normal data declarations, and we also have a HasField instance for Rec. This means that when our transpiler emits something like user.name, it works regardless of whether user is an anonymous record or a data declaration.

Partial rows

PureScript records aren't ordered — unlike Haskell records and unlike type-level lists we use for anonymous records.

This means that when a function accepts an {a :: X, b :: Y} and we pass a {b :: Y, a :: X} into it, PureScript will do just fine, while Haskell will complain about mismatched types — assuming that we're talking about anonymous records here.

Most of the time, it turns out we can ignore this problem entirely — the source codebase has very few cases where such order mismatches happen. We can just patch them directly in the source.

However, this becomes a big problem for partial rows. If a function accepts something like {a :: X | r}, meaning "any record that has field a :: X", it's quite likely that the field won't be the first field of the record. We can't just convert that into Rec ("a" := X ': r). So, we work around this by transpiling all partial rows in type signatures into HasField constraints:

-- PureScript
f :: forall r. {a :: X, b :: Y | r} -> {c :: X, d :: Y | r}

-- Haskell
foo :: forall r. (HasField "a" r X, HasField "b" r Y,
                  HasField "c" r X, HasField "d" r Y) 
    => r1 -> r2

Detecting constructor applications

I mentioned that we want to avoid anonymous records whenever we can — data declarations are more idiomatic in Haskell.

This means that we have to detect expressions like Foo $ {...}, recognize that we have record construction going on, check whether Foo is a data constructor or a newtype constructor, and remove / keep the $ accordingly. This seemingly small task is one of the main reasons why we had to implement a name resolver in the transpiler.

I will talk in more detail about the name resolver later.

Mangling field names

PureScript allows arbitrary field names. Even something like "123!!" can be a field name:

type User = { name :: String, "type" :: UserType, "123!!" :: Foo, ... }

Haskell does not support it, so we have to mangle and unmangle record field names. Specifically, we encode all forbidden chars in field names as their ASCII codes: 123!! becomes ____123_33__33_ (33 being the code for '!'). It's crude, but it works.

Many types in the codebase have JSON instances that are derived from the field names, so we also wrote a custom JSON encoder/decoder that recognizes such field names and de-mangles them.

JSON encoding/decoding

Speaking of JSON encoding/decoding — we could not piggyback on Aeson because the default PureScript encoder/decoder is subtly different. We had to reimplement it in Haskell, opting for generics-sop instead of GHC.Generics.

The generics-sop library is somewhat complicated to use, but the resulting code is fairy concise — and once you've learned how to use generics-sop, you have a pretty cool minor superpower in your repertoire.

`large-records`

Right after we implemented all this, it turned out that GHC exhibits quadratic compilation times for records.

You might already know that Generic instances have a quadratic compilation time, but folks from Well-Typed have discovered that even typechecking the generated record accessors (one per field) is already quadratic. AFAIK, the problem is due to pattern-matching that is

generated internally for each field accessor,
looks like Foo _ _ _ _ x _ _ _ _,
and has a type annotation for each field.

Yikes.

Well-Typed wrote a library, large-records, that sidesteps the problem entirely by letting us wrap record declarations into a bit of Template Haskell that parses the declaration and substitutes it with an array-based record. So, we are in the process of switching to that.

Next?

For the next post in the series, we will either briefly go through the death-by-a-thousand-papercuts AST transformations we had to implement, or take a look at the name resolver.

Follow @monadfix on Twitter to get notifications about future posts from this series. You can also leave a comment on Reddit.

All work described in this post has been done for Juspay. In their own words:

Juspay is a leading fintech company in India with over 5Bn txn processed, 150Mn SDK installs & $27Mn in funding.
And we’re hiring. Please check out the open positions here: https://juspay.in/careers.