Efficient parsing of large text files (2)

The simplest way to increase the performance of a program is to use more efficient libraries. Well, at least in Haskell, where refactoring is easy and strong common abstractions are shared by most of the ecosystem. In this post, I will make trivial changes to the first implementation and measure the associated gains.

Measuring performance

I used the RTS option -s to measure performance for this section. Here is the output for the original (parsec) implementation:

 215,523,656,344 bytes allocated in the heap
  33,301,747,648 bytes copied during GC
   7,227,434,968 bytes maximum residency (15 sample(s))
     209,338,048 bytes maximum slop
           14542 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     417618 colls,     0 par   24.950s  24.940s     0.0001s    0.0016s
  Gen  1        15 colls,     0 par   23.081s  23.115s     1.5410s    10.3409s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   43.614s  ( 43.598s elapsed)
  GC      time   48.031s  ( 48.055s elapsed)
  EXIT    time    4.061s  (  4.087s elapsed)
  Total   time   95.708s  ( 95.741s elapsed)

  %GC     time      50.2%  (50.2% elapsed)

  Alloc rate    4,941,573,670 bytes per MUT second

  Productivity  49.8% of total user, 49.8% of total elapsed

The most important metrics here are the total memory use and total time elapsed. I ran the programs several times to warm the file cache and picked one output.

Speed and memory usage were pretty stable, but these are not very precise benchmarks. They should be more than sufficient for this study though!

A faster parser combinator libraries : attoparsec

Everybody knows parsec isn’t fast, and attoparsec is much faster.

On the top of my head, the most notable difference between the two are:

  • Parsec has actual error messages. With attoparsec it is almost always something like “satisfy failed” without the error location. This is the reason I almost always write a new parser with parsec, and only switch to attoparsec when bugs have been ironed out. If error messages are paramount for your application, I suggest taking a look at megaparsec or trifecta.

  • Attoparsec is much faster than parsec, and uses a lot less memory. By the way, if you are in the never-ending intermediate phase of Haskell proficiency, I really suggest that you learn about Church encoding like transformations. They are at work behind the scenes. We will make use of them in a subsequent episode.

  • Attoparsec is always backtracking. This means you don’t need to use the try keyword.

  • With attoparsec you can grab consecutive characters efficiently. With parsec, you will get a stream of Char that will need to be repacked, whereas attoparsec can extract slices of the parsed text.

  • Attoparsec has a modern interface. Code is generally more idiomatic, mostly thanks to the use of the standard Applicative operators.

Other alternatives are trifecta (quite slow but great features and error messages) and megaparsec (a fork of parsec with a focus on modernizing its interface and with better error messages). We recently ported language-puppet to megaparsec without any hurdle.

As expected with attoparsec, the source is a bit shorter and tidier. You can find the result here, and the diff here.

There is a questionable choice where a pair of parseInt have been turned into scientific. We will see in subsequent episodes that this had a negative performance impact.

On the performance side, this change divided the execution time by 4.88 and the memory usage by 4.61. That is, with my sample output, 19.61s of clock time and 3154MB of memory.

Reducing the memory footprint : thyme

The other library that I knew could make a big difference was the thyme library. It is rewrite of time with a focus on performance, alledgedly at the expense of correctness.

The change is very localized, but resulted in a small speed boost, and 1GB of memory saved! The reason thyme is more memory efficient than time is that it stores the UTCTime as a newtype of Int64 instead of the more involved structure that time uses.

With this change, the program currently uses 1975MB of memory with the sample input. This sample input only weights about 120MB for 570k entries, so I expected a much higher efficiency at that stage. Of course, unevaluated thunks are crippling the program, but this will be handled in the next episode!

From Text to ByteString

The input is readable text, so the correct data type to represent it is Text. However, as I do not need to perform any text manipulation and am not too concerned about properly handling non-ASCII characters, it might make sense to move to bytestring performance-wise.

The change is trivial, and gives a nice speed boost.

In the actual program I wrote the UnixFileGen was parametrized for the user and group names, file path and link target fields. That way, the parsing and initial analysis would use ByteString for efficiency, and the files flagged as problematic would be converted to Text for further processing.

A note about the original C module

The original C module did the same parsing, but stored all data in a Sqlite database for further processing with Ruby. In order to capture the performance difference between my Haskell implementation and the C program, I modified it with two variants:

  • The in-memory database version (dubbed mem): this would do the parsing, and insert all the files in an in-memory database. This is a bit unfair compared to my program, as I don’t need to serialize to external storage.

  • The discard version: this version would just parse the file and not do anything about it. This is also a bit unfair, as my program stores the actual results in a Map.

Optimization summary

Program version Time spent
Memory usage
Gain over first version
parsec 95.741 14542 n/a
attoparsec 19.609 3154 4.88x / 4.61x
thyme 18.302 1975 5.23x / 7.36x
bytestring 13.890 1959 6.89x / 7.42x
original C (mem) 12
original C (discard) 2.7

As we can see, with a sensible choice of libraries we approach the execution speed of the in-memory original C program, at the expense of trivial changes.

In the next episode, I will introduce actual optimisations to the program that will only slightly affect its readability and maintainability.

Can we beat the discard version of the C program? Can we optimize much more without sacrificing readability and maintainability? Without rummaging in the core? All of these questions will be answered in the final episode!