Transpiling a large PureScript codebase into Haskell, part 1: The pipeline

May 19, 2021
Written by Artyom Kazak @ Monadfix

Intro

As described in the introduction to the series, we wrote a PureScript to Haskell transpiler. It is used by Juspay, an Indian fintech company, to migrate a large backend to Haskell.

In this post, I will describe the overall structure of the transpiler.

Our pipeline for PureScript files is:

parse .purs files into PureScript AST;
build a module dependency graph;
convert everything into our own AST (a union of Haskell and PureScript ASTs);
do name resolution;
apply AST transformations;
pretty-print the resulting Haskell source.

Why did I say "for PureScript files"? Because PureScript has FFI bindings to JavaScript, and we have to handle JavaScript as well. We will look into this in one of the next posts.

The pipeline: parsing

The PureScript compiler is written in Haskell — see the purescript package on Hackage. So, we can reuse the official parser in Language.PureScript.Parser.Declarations.

This is the easiest bit in the whole series. LEX. RUN TOKEN PARSER. Boom, done.

parse str = 
  PS.lex name str >>= 
  PS.runTokenParser name' (PS.parseModule <* Parsec.eof)

We also have our own AST that has to handle both Haskell and PureScript code. Our transformation rules will, step by step, get rid of the PureScript-only constructs. Here is a tiny part of the AST:

data Expr
  -- | Wildcards in operator sections like (1+_) (PS-only)
  = ExprSection
  -- | Variables
  | ExprIdent (QualifiedName TermU)
  -- | Constructors
  | ExprConstructor (QualifiedName TermU)
  -- Overloaded #labels (Haskell-only)
  | ExprLabel Text
  -- ... 29 branches in total, and there are other types as well

Next we have 500 lines of code converting the official AST into our AST. Lots and lots of case statements. Credit where credit's due: the heartbreakingly monumental transition to our own AST was done by Shaurya Gupta, who was an intern at Juspay back then.

One interesting bit here is the comments. We want to preserve as many comments as possible — the resulting codebase has to remain maintainable. However, we don't want to litter the AST with comment annotations in every node — this would complicate the transformation rules. We have opted to preserve comments for top-level declarations and do statements only. For instance, here is how the do statement comments are handled:

data DoStatement
  = DoLet [LetBinding]
  | DoDiscard Expr
  | DoBind Binder Expr
  | DoComment Comment     -- 🌟 NEW 🌟

fromPsDoElement :: PS.DoNotationElement -> [DoStatement]
fromPsDoElement = \case
  PS.DoNotationValue expr -> 
    [DoDiscard $ fromPsExpr expr]
  PS.DoNotationBind binder expr ->
    [DoBind (fromPsBinder binder) (fromPsExpr expr)]
  PS.DoNotationLet decls -> 
    [DoLet $ map fromPsLetBinding decls]
  -- 🌟 Annotations are out, standalone comment nodes are in 🌟
  PS.PositionedDoNotationElement _ comments element ->
    map (DoComment . fromPsComment) comments ++ fromPsDoElement element

Alright. Let's move on to the next bit.

The pipeline: building the dependency graph

Since some of the transformation rules rely on name resolution (we'll talk about this later), we need to transpile the modules in their transitive order.

algebraic-graphs does the trick:

buildDependencyGraph :: Nau.Config -> Nau.DirPath -> IO DependencyGraph
buildDependencyGraph config dir = do
  fmap Graph.overlays $ traversePureScriptModules config dir \mod -> do
    foldFor (mod ^. #functions) \func -> do
      let qualifiedName = QualifiedName (mod ^. #name) (func ^. #name)
      foldFor (func ^. #body) \equation -> do
        let dependencies = getEquationDependencies mod equation
        Streamly.yield $! Graph.star qualifiedName dependencies
  where
    foldFor xs handler = Graph.overlays <$> for xs handler

Honestly, I'm only posting it here because it is so pleasantly rectangular. Like a building block. Probably the most rectangular piece of code in the whole codebase. I'm happy about it.

(What is Streamly? It is the streaming library of choice at Juspay. I've only had limited exposure to it, but it looks good.)

As a bonus, the same code powers some of the diagnostic tools we provide for the transpiler users — e.g. "what are the dependencies of this function?".

The pipeline: name resolution

We looked into reusing the official PureScript name resolver and it turned out to be too cumbersome, so we wrote our own instead. Specifically, Kana did. In a weekend. If he ever leaves Monadfix, I expect that ten Haskell consultancies will fight for him. They should.

Name resolution happens in a Reader:

newtype Resolver a = Resolver
  { unResolver :: 
      WriterT (Set ResolverWarning) (Reader (Scope, Environment)) a
  } deriving newtype (Functor, Applicative, Monad)

local makes the whole thing a breeze. For example, when doing resolution on a lambda expression, we need to express that we want the outer scope names to be shadowed by the locally introduced bindings, and it looks super natural with local:

instance Resolve Expr where
  resolve expr = case expr of
    ...
    ExprLambda binders expr -> do
      binders' <- traverse resolve binders
      expr' <- foldr withBinder (resolve expr) binders'
      pure (ExprLambda binders' expr')
     
-- | Introduce names from a binder into the scope 
withBinder :: Binder -> Resolver a -> Resolver a
withBinder binder next = case binder of
  ...
  BinderVar name -> 
    withTermName 
      (QualifiedName Nothing Nothing name, ResolvedName "#LOCAL#" name)
      name
      next

-- This is where the actual insertion happens
withTermName 
  :: (QualifiedName TermU, ResolvedName TermU)
  -> Resolver a -> Resolver a
withTermName (qname, resolvedName) (Resolver next) = Resolver $
  local (_1 . #termNames %~ HashMap.insert qname resolvedName) next

However, this is not enough. We also have a few Haskell files written by hand — e.g. containing FFI functions we decided to rewrite manually rather than transpile. Those Haskell files bring extra types and functions into scope, and they have to be taken into account when doing name resolution.

So, we just ended up writing a whole separate module for parsing those Haskell files and extracting metadata from them. Thank God for being able to say import GHC.

(This Haskell-parsing code was mostly written by Donya. Fun fact: when he joined Monadfix, the test task was to reimplement the PostgreSQL binary protocol. I can't express how glad I am that I can just tell somebody "yeah let's use GHC API here" and they go ahead and figure it out on their own. This is a special kind of feeling.)

The pipeline: AST transformations

I said we will talk about AST transformations. We have about 90 of them, and many are project-specific (which is why open-sourcing the transpiler is a questionable idea, but if you want it, please shout as loudly as you can and maybe it will move the needle).

The AST transformations deserve a whole post of their own, and additionally I deserve not to stay in the coffee shop for the whole day. So, we will have to skip the AST transformations for now.

The pipeline: rendering Haskell code

The renderer is about 500 lines of prettyprinter code. AST goes in, plain text goes out. Again, a lot of case statements.

The only thing we care about in the renderer is indenting the code a bit where necessary, and then pass it through Ormolu to make it prettier. Initially we tried Brittany to get intelligent line-wrapping for free, but hit a few nondeterministic bugs and gave up.

Here is a tiny sample of how the rendering code looks:

instance Pretty Expr where
  pretty = \case
    ...
    ExprIf cond trueBranch falseBranch ->
      vsep
        [ "if" <+> pretty cond,
          indent 2 ("then" <+> pretty trueBranch),
          indent 2 ("else" <+> pretty falseBranch)
        ]

Fun was had

To give you a taste of the authentic transpiler developer experience, I will also list a few small issues we had to work around. This excludes horrible things like "PureScript allows arbitrary field names" or "PureScript has row types". Just the small things off the top of my head.

We can't use -- line comments in do statements, because if it is the last line of the block and the block is in parentheses, things will break. Luckily there are {- -} comments available.
We can't allow spaces in record update clauses, because we emit code that uses record-dot-preprocessor and it only kicks in for record updates that looks like x{...}, without a space after x. Easy to do, but Ormolu used to break those record updates, so we had to fix Ormolu as well.
Did you know that GHC can sometimes successfully typecheck a function without a type signature, but fail if you have :: _ as the signature? I didn't. Now I do. One more corner case to handle.
PureScript and Haskell have slightly different scoping rules — if function foo is defined locally and also defined in an open import, PureScript will choose the local one while Haskell will complain about the ambiguity. There are like five cases of this in the whole codebase, but we have to detect it anyway and add qualification.
Can't allow -- | Doc comments inside where clauses, or else GHC will complain. Have to detect them and convert into normal comments.
Names like pattern and default are keywords in Haskell, but allowed in PureScript. Alright, let's rename them. One more AST transformation.
If you define foo = bar, GHC will not necessarily infer the same type for foo as it does for bar. Things will break. This makes handling PureScript's operator aliases slightly more complicated than it should have been.
Also in type inference news: flip f will sometimes fail to typecheck, while \x y -> f y x will be fine. Doesn't happen in PureScript.

Many of such things only happen a few times in the codebase. It was tempting to just fix them in the original PureScript codebase, and for a while we'd been doing it — but we had to abandon this approach and start implementing the fixes in the transpiler itself. As I mentioned in the introduction, the codebase is a moving target, and every extra change creates complications. For me, this project really hammered down the idea that automatic things are better than manual things.

Next?

Next we will look at the specific AST transformations needed to bridge the gap between PureScript and Haskell. Maybe specifically the row types. We will see.

P. S.

In a galaxy far far away, before the rest of the Monadfix team joined the project, the transpiler was being written by myself and three interns from the Juspay side. We were not using the official PureScript parser. We were doing brutal things. It was super fun. I will never talk about it in writing, but I wanted to give some credit here anyway.

Follow @monadfix on Twitter to get notifications about future posts from this series. You can also leave a comment on Reddit.

All work described in this post has been done for Juspay. In their own words:

Juspay is a leading fintech company in India with over 5Bn txn processed, 150Mn SDK installs & $27Mn in funding.
And we’re hiring. Please check out the open positions here: https://juspay.in/careers.

If you are still reading, you might be one of the people who reads everything. I like you! I also read everything. So here's a fun completely unrelated fact for you: I was at Functional Conf 2020 and got told about a 17-line GPU-parallel APL compiler written in APL. Its author, Aaron Hsu, claims that the techniques he used are pretty universal, but can not be ported to Haskell because the type system gets in the way. He also says that every Haskeller he met tries to prove him wrong and gets nerdsniped. His PhD thesis, describing the compiler, is 200 pages long. You probably don't know APL. Enjoy!