Efficient parsing of large text files (4)

This post is part of the "Efficient parsing" series.

More careful measurements

The gold standard for benchmarking in Haskell is the criterion library. For this series I did not bother writing smaller test cases that would play well with it, and my measurements have not been rigorous. For this post, I ran each sample ten times and computed an average. This is still not ideal, but it should suffice to give a feeling for the performance.
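
For reference, a minimal criterion benchmark for this kind of program could look roughly like the sketch below (parseEverything and input.txt are placeholders, not names from the actual code):

import Criterion.Main (bench, defaultMain, env, nf)
import qualified Data.ByteString.Char8 as BS8

main :: IO ()
main = defaultMain
    [ -- parseEverything is a stand-in for the real parsing function
      env (BS8.readFile "input.txt") $ \content ->
        bench "full parse" (nf parseEverything content)
    ]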

However, I don't have measurements for the last iteration of the program. I waited until I had time to collect them, but for various reasons that never happened. Gauging the performance gains of the last version is left as an exercise to the reader!

It turned out that the state of the program at the end of the previous episode was much better than I expected! It performed its task in 2.712 seconds, which is just as fast as the discarding C program. Now let's beat it!

Playing with compiler options

Finding the precise set of compiler options that will produce a fast program is more art than science. People have used genetic algorithms to find the best set of compiler flags, but I am only going to focus on a single flag here: -fllvm. It can do wonders for tight loops and numerical code, but it can also slow down both the compilation process and the resulting program.

I have found it best to simply test it on performance-critical code, and keep it only if there is a significant gain.

In this case, the gain is about 3%, which would not be enough for me to keep it under normal circumstances. But this series is about performance, so I will keep this nice "free" speedup.

Alternative encodings of the monad

A fun and sometimes very effective optimization technique is to move the critical parts of the computation to a continuation-passing-style implementation. In our case, that would be the Parser monad.

I am used to calling this process Church encoding, but judging from the responses on Reddit, that might not be correct. I have been pointed to a paper describing how to systematically perform this kind of encoding, called the Boehm-Berarducci encoding. I have not read it (yet), so there might be inaccuracies in my terminology.

As an example, here is how Maybe and Either are usually encoded:

newtype MaybeChurch a = MaybeChurch
      { runMaybe :: forall x. x -> (a -> x) -> x }
newtype EitherChurch l r = EitherChurch
      { runEither :: forall x. (l -> x) -> (r -> x) -> x }

Those types look a lot like the following functions:

maybe :: b -> (a -> b) -> Maybe a -> b
either :: (a -> c) -> (b -> c) -> Either a b -> c

And that is not surprising: for simple ADTs like these, you "just" have to write their fold to get their Church encoding (by the way, it is a fun exercise to newtype them and try to write the usual instances, as sketched below).
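
Here is one possible solution to that exercise for MaybeChurch, written directly against the two continuations (a sketch, using the names from the newtype above):

instance Functor MaybeChurch where
    fmap f (MaybeChurch m) = MaybeChurch $ \nothing just -> m nothing (just . f)

instance Applicative MaybeChurch where
    pure a = MaybeChurch $ \_ just -> just a
    MaybeChurch mf <*> MaybeChurch ma =
        MaybeChurch $ \nothing just -> mf nothing (\f -> ma nothing (just . f))

instance Monad MaybeChurch where
    MaybeChurch ma >>= f =
        MaybeChurch $ \nothing just -> ma nothing (\a -> runMaybe (f a) nothing just)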

But our Parser type is a bit more complicated, as it really is a function:

newtype Parser a = Parser
      { runParser :: ByteString -> Maybe (a, ByteString) }

I don't know how to proceed mechanically here, so I am going to try to explain my intuition. The original parser takes a ByteString and either stops the computation with an empty result (Nothing), or returns a result along with the remaining input.

Once transformed, it will not directly return anything, but will pass the result of the current computation to a pair of continuation functions, one for each case:

newtype Parser a = Parser
      { runParser :: forall r. ByteString -> nothing -> just -> r }

Now, both the nothing and just parts should be functions returning r. The nothing part handles errors. The just part keeps on parsing, based on the result of the current function, so it must be fed the remaining input and said result:

newtype Parser a = Parser
      { runParser :: forall r. ByteString -> (error -> r) -> (ByteString -> a -> r) -> r }

But as a final optimisation, we can note that we don't care about fancy error messages for the "failure" case, so the final form will be:

newtype Parser a = Parser
      { runParser :: forall r. ByteString -> r -> (ByteString -> a -> r) -> r }
      deriving Functor
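
For intuition, the derived Functor instance amounts to the hand-written version below (shown only for illustration; with deriving Functor you would not write it yourself): the function is simply threaded into the success continuation.

instance Functor Parser where
    fmap f (Parser p) = Parser $ \input failure success ->
        p input failure (\rest a -> success rest (f a))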

For a parser with verbose error messages, a suitable type could be:

newtype Parser a = Parser
      { runParser :: forall r. ByteString
                  -> (String -> r) -- ^ Failing case with a String error message
                  -> (ByteString -> a -> r)
                  -> r }
      deriving Functor

Now, most of the combinators are trivial to write:

takeWhile1 :: (Char -> Bool) -> Parser ByteString
takeWhile1 prd = Parser $ \s failure success -> case BS8.span prd s of
                                                    ("", _) -> failure
                                                    (a,b) -> success b a

char :: Char -> Parser ()
char c = Parser $ \input failure success ->
      if BS.null input
          then failure
          else if BS8.head input == c
                   then success (BS.tail input) ()
                   else failure

And running the parser is easy too:

parseOnly :: Parser a -> ByteString -> Maybe a
parseOnly (Parser p) s = p s Nothing $ \b a -> if BS.null b
                                                   then Just a
                                                   else Nothing
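
To get a feel for how the continuations are supplied, here are a few hypothetical calls (assuming OverloadedStrings for the ByteString literals):

-- the failure continuation is a plain value, the success continuation
-- receives the remaining input and the result
runParser (char 'a') "abc" Nothing (\rest () -> Just rest)   -- Just "bc"
runParser (char 'x') "abc" Nothing (\rest () -> Just rest)   -- Nothing
parseOnly (takeWhile1 (/= ' ')) "hello"                      -- Just "hello"
parseOnly (takeWhile1 (/= ' ')) " hello"                     -- Nothing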

Now, one of the fun properties of Haskell is that writing the Applicative and Monad instances can actually be done as an exercise in type Tetris, where you have no idea what the code you are writing actually means, but it works fine once it typechecks:

instance Applicative Parser where
    pure a = Parser $ \b _ s -> s b a
    Parser pf <*> Parser px = Parser $ \input failure success ->
        let succ' input' f = px input' failure (\i a -> success i (f a))
        in  pf input failure succ'
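
The Monad instance is the same kind of exercise (a sketch; return comes from pure):

instance Monad Parser where
    Parser px >>= f = Parser $ \input failure success ->
        -- run the first parser, then feed its result to f and keep going
        px input failure (\input' a -> runParser (f a) input' failure success)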

While this gives no speed gain with the usual compiler options, the performance under LLVM is slightly better (2.56s vs 2.63s). This technique can do wonders in many use cases, but it is clearly not that great here.

Oh well, this was fun anyway :)

Parallel execution

In the previous episode I moved from a parsing function that would handle the whole input to a parsing function that would only handle a single line. This function being pure, it should be very easy to parallelize its execution.

In practice, I had to change this part of the program:

mapM findline

into:

traverse id . withStrategy (parListChunk 2000 rseq) . map findline

Note that traverse id is sequence, but for some reason I didn't go with it... This Strategy machinery is covered very well in Simon Marlow's book (Parallel and Concurrent Programming in Haskell), which I recommend even if you don't think you care about these topics. It is a really good book!

Basically this will cut the list into chunks of 2000 items and create a spark for the evaluation of each of them. The runtime can then evaluate these sparks in parallel, on separate cores.
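
Put together, with the relevant import, this looks roughly like the sketch below (Entry is a placeholder for whatever type findline actually produces):

import Control.Parallel.Strategies (parListChunk, rseq, withStrategy)

-- Entry stands in for the real result type of findline.
parseLines :: [ByteString] -> Maybe [Entry]
parseLines = traverse id . withStrategy (parListChunk 2000 rseq) . map findline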

In practice, the gain is highly dependent on the sequential parts of the program and on the complexity of the parallel chunks of work. The latter means that sometimes the overhead of the parallel runtime costs more than just doing the work directly. As a result, algorithms that look like they should parallelize well sometimes don't, and you should benchmark your program with several values of -N (the runtime option that governs how many cores are used).

For this program, I found out that a value of -N6 gave the best results. The program runs in 1.95s with this value.

There isn't much of a speedup, which is probably related to the fact that the program performs the following operations sequentially:

  • File loading
  • Parsing
  • Map creation
  • Printing the map size

I tried lazy IO to increase the productivity of the program, but it made performance worse. I did not try a streaming library because I suspect the added complexity would not be worth it.

Performance summary

Here is a breakdown of everything that has been discussed in this series. It should be quite obvious that the alternative encoding of the parser could have been left out. It would also be interesting to measure how fast the final version would be while keeping attoparsec. Introducing a custom parser did wonders for the performance of the program, but it might become technical debt if more complex features were required.

In the end, the Haskell program could be made faster than a stripped-down C program that didn't do as much work. You will have to take my word for it, as I will not release the C program, but it was not a bad program selected for an easy win.

Program version        Time spent (seconds)   Memory usage (MB)   Gain over first version (CPU / RAM)
parsec                 95.741                 14542               n/a
attoparsec             19.609                 3154                4.88x / 4.61x
thyme                  18.302                 1975                5.23x / 7.36x
bytestring             13.890                 1959                6.89x / 7.42x
BangPatterns           8.705                  406                 11x / 35.82x
parseDTime             7.7                    406                 12.43x / 35.82x
Custom parser          5.628                  382                 17.01x / 38.07x
Break                  4.754                  396                 20.14x / 36.72x
Inlining               2.712                  396                 35.3x / 36.72x
With LLVM              2.63                   -                   36.4x
Church                 2.71                   -                   35.3x
Church LLVM            2.561                  -                   37.4x
Parallel -N2           2.51                   -                   38.1x
Parallel -N4           2.13                   -                   44.9x
Parallel -N6           1.95                   -                   49.1x
original C (mem)       12                     -                   -
original C (discard)   2.7                    -                   -

Conclusion

This series was meant as a way to showcase some simple and effective optimization techniques, such as:

  • Using efficient libraries. For parsing I suggested attoparsec, but I would also recommend something like the parsers package, as it has a great interface and is really a front end to several other parsers (attoparsec, parsec and trifecta). That way you can have a "slow" version with good error messages and a "fast" one powered by attoparsec (or your own parser!) using the same code base.

  • Using BangPatterns, even if a bit randomly, while roughly benchmarking.

  • Profiling to find the hot spots. This one is obvious, but it is not the first thing I usually do when optimizing. Also be warned that cost centres might not be accounted for the way you would expect when using aggressive optimizations or continuation passing style.

  • Replacing a widely used library with custom, more specialized code. This is something that should only be done as a last resort in production code, as it will reduce maintainability (compared to a well-known library with many useful features).

  • Experimenting with compiler flags. This can give some "free" speedups, but it can also negatively impact compilation times.

  • Embarrassingly parallel tasks are embarrassingly easy to parallelize in Haskell.

One technique that didn't work was the alternative encoding of the custom parser. This is a technique I use a lot, but I suspect mainly because I like the soothing feeling of successful type-Tetris.

Happy optimizing!