Efficient parsing of large text files (3)
This post is part of the "Efficient parsing" series.
Estimating the expected memory usage
The UnixFile type is made of a single constructor containing five Ints, one sum type, three UTCTimes, three ByteStrings and a Maybe ByteString. Given this source, we can compute the size of a "typical" entry, where the user and group names are less than 8 characters long, there is no link, and the file name has an average length of 57 bytes:
- 5 Int = 5 * 8 = 40 bytes
- 3 UTCTime = 3 * Int64 = 24 bytes
- 2 "small" ByteString = 2 * (9 * 8 + 8) = 160 bytes
- 1 "large" ByteString = 9 * 8 + 64 = 136 bytes
- 1 simple sum type = 8 bytes
- 1 Nothing = 8 bytes
- 1 tag = 8 bytes

Total: 384 bytes, because all the fields are strict. Note that I didn't check that the sizes for my sum type, Nothing and tag are accurate.
As there are around 570k entries, we can expect a baseline of 213MB of live data, so a total memory usage of 427MB to account for GC copying (if I understand this correctly).
When we left our study, the program was consuming about 2GB of memory, so that's something like five times the expected memory usage!
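For reference, the UnixFile record described above would look something like the following hypothetical definition (the field names and the sum type constructors are made up; the post's actual code is not reproduced here):

```haskell
import Data.ByteString (ByteString)
import Data.Time (UTCTime)

-- Hypothetical sum type standing in for the real one.
data FileType = RegularFile | Directory | Symlink
    deriving (Show, Eq)

-- Hypothetical field names; what matters is the shape: five Ints, one sum
-- type, three UTCTimes, three ByteStrings and a Maybe ByteString,
-- with every field strict.
data UnixFile = UnixFile
    { fileInode     :: !Int
    , fileHardLinks :: !Int
    , fileUid       :: !Int
    , fileGid       :: !Int
    , fileSize      :: !Int
    , fileType      :: !FileType
    , fileAtime     :: !UTCTime
    , fileMtime     :: !UTCTime
    , fileCtime     :: !UTCTime
    , fileUser      :: !ByteString
    , fileGroup     :: !ByteString
    , filePath      :: !ByteString
    , fileTarget    :: !(Maybe ByteString)
    } deriving (Show, Eq)
```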
How to fix the memory leak?
The Haskell runtime has built-in options to trace the memory usage of a program and, more importantly, the source of the memory allocations. It is usually a good idea to start by running in profiling mode, at least with -p, to find the hot spots.
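For the record, one typical way to obtain such a profile looks like this (a sketch; the post's actual build setup is not shown):

```
ghc -O2 -prof -fprof-auto program.hs -o program
./program input.txt +RTS -p -s
```

The -p RTS option writes a time and allocation profile to program.prof, and -s prints GC and memory statistics when the program exits.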
I did that, and realized that a lot of time was spent in the timestamp function, along with a large part of the memory allocation.
This function can be divided into several parts: parsing and representing the day, parsing and representing the time, parsing the timezone, and adjusting the UTC time. I decided to extract the first two parts in order to get finer-grained performance metrics. In the process, I made them stricter by adding bangs. I really wish I could act all wise and knowledgeable in the ways of GHC and its core language, but what I often do when I hit a performance problem is to generously pepper the code with bangs. This is cargo-cult optimization, but it tends to work quite well.
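The real timestamp code isn't reproduced in this post, but the kind of change involved looks roughly like this sketch (the details are invented, using attoparsec's decimal and Data.Time's fromGregorian):

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Attoparsec.ByteString.Char8 (Parser, char, decimal)
import Data.Time.Calendar (Day, fromGregorian)

-- Sketch of the extracted "day" part: the bangs force each number as soon
-- as it is parsed, so no thunk accumulates per entry.
parseYMD :: Parser Day
parseYMD = do
    !y <- decimal <* char '-'
    !m <- decimal <* char '-'
    !d <- decimal
    return $! fromGregorian y m d
```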
In this particular case, this gave a 37% reduction in execution time and, more importantly, reduced the memory usage to 406MB, which is very close to the previous guesstimate.
This means that the memory leak was successfully discovered and plugged. In retrospect, it was very obvious that this function was part of the problem, as it is the only place where calculations are performed, and therefore the only place where thunks could pile up.
Anyway, now that I am happy with the memory usage, it is time to gun for the execution speed.
Using the simplest alternative
Now that the function was split into three parts, I decided to run another profiling campaign. Note that profiling isn't accurate in the presence of optimizations or CPS/Church encodings! This example exhibits both sources of profiling inaccuracy: aggressive compilation options (-O2) and the attoparsec implementation of the Parser. Nevertheless, here was the result (cost centre, module, %time, %alloc):
parseDTime Main 57.0 46.4
findline.t Main 22.0 33.3
parseYMD Main 9.3 11.6
main.resultmap Main 4.2 3.5
timestamp Main 2.5 1.7
myOctal Main 2.1 2.0
findline Main 1.4 0.1
What was surprising was that parseDTime was much more expensive than parseYMD. It should be obvious that this was caused by the fact that parseYMD only deals with decimal numbers, whereas I wastefully used the scientific function to parse the hour and minute fields. Using decimal on them gave a good (13%) speed increase for little effort.
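The fix itself is tiny; hypothetically it amounts to something like this (the real parseDTime also handles the seconds and their fractional part, which are left out here):

```haskell
import Data.Attoparsec.ByteString.Char8 (Parser, char, decimal)

-- Hours and minutes are plain integers, so 'decimal' parses them directly,
-- without going through an intermediate Scientific value.
parseHM :: Parser (Int, Int)
parseHM = (,) <$> (decimal <* char ':') <*> decimal
```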
State of the code and performance
Here is a recap of the performance so far:
| Program version | Time spent (seconds) | Memory usage (MB) | Gain over first version (CPU / RAM) |
|---|---|---|---|
| parsec | 95.741 | 14542 | n/a |
| attoparsec | 19.609 | 3154 | 4.88x / 4.61x |
| thyme | 18.302 | 1975 | 5.23x / 7.36x |
| bytestring | 13.890 | 1959 | 6.89x / 7.42x |
| BangPatterns | 8.705 | 406 | 11x / 35.82x |
| parseDTime | 7.7 | 406 | 12.43x / 35.82x |
| original C (mem) | 12 | | |
| original C (discard) | 2.7 | | |
At this point of our optimization journey, the code is still safe, concise and easy to understand for a haskeller who knows about applicative functors, and much faster than the original C code!
Can we do better? Let's try!
Writing a custom parser
Writing a custom parser is a simple yet rewarding kata-like exercise. It is even featured in the NICTA course.
I do not suggest doing it for production code, but I will do it here anyway, because I thought this would be fun (and it will allow for even more fun in the next episode) and really quick.
In this use case, very few features are required of the parser. No backtracking or error messages are needed, and only a few combinators are used. We can start with this:
{-# LANGUAGE DeriveFunctor #-}

newtype Parser a = Parser { getParser :: BS.ByteString -> Maybe (a, BS.ByteString) }
    deriving Functor
Deriving the required instance is trivial, and just a few primitives, all of them one-liners, were useful to me.
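Those instances and primitives aren't reproduced here, but a minimal sketch of what they could look like is below (the combinator names are my own, not necessarily the ones used in the post):

```haskell
import           Control.Applicative (Alternative (..))
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8

instance Applicative Parser where
    pure a = Parser $ \s -> Just (a, s)
    Parser pf <*> Parser px = Parser $ \s -> do
        (f, s')  <- pf s
        (x, s'') <- px s'
        Just (f x, s'')

instance Alternative Parser where
    empty = Parser (const Nothing)
    Parser a <|> Parser b = Parser $ \s -> a s <|> b s

-- Matches a single, known character.
char :: Char -> Parser ()
char c = Parser $ \s -> case BS8.uncons s of
    Just (c', rest) | c' == c -> Just ((), rest)
    _                         -> Nothing

-- Takes the longest non-empty prefix satisfying a predicate.
takeWhile1 :: (Char -> Bool) -> Parser BS.ByteString
takeWhile1 p = Parser $ \s -> case BS8.span p s of
    (prefix, rest) | BS.null prefix -> Nothing
                   | otherwise      -> Just (prefix, rest)
```

The Maybe-based result means a failing branch simply returns Nothing, which is all that is needed when no error messages are required.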
This much simpler parser resulted in a nice speed increase (it runs in 5.6s, which makes the Haskell program only about twice as slow as the "discarding" C program!). It turned out later that the Alternative instance wasn't even useful.

The performance gain is nice, but it comes at a cost in code readability: a quarter of the code is now made of stuff that really belongs in a library.
Not parsing everything
This technique is one I really would like to emphasize here, as it is really simple, gives good speed boosts and can bring other benefits.
First of all, it could be observed that this file format is line-based. Splitting the input into lines and parsing each line individually can be beneficial in conjunction with parsers that give poor error messages. In this particular case, knowing there is a parse error somewhere in the source file is next to useless, but knowing which line is affected can be sufficient to diagnose the problem.
Another benefit is that instead of running the parsing function on a single large piece of data, it is run on many smaller pieces. This makes parallelization trivial, which will be described in the next episode.
Finally, in this case it is also possible to split each field in the line very easily, which will give a nice speed boost.
The resulting implementation isn't difficult to parse or understand. The main takeaway would be the use of pattern matching on the result of the split function on line 103.
The code is not as safe as before, as the getInt function doesn't check that it parses actual digits anymore. This check could be added back, but at this point I felt that the performance of the C "discard" code was within reach, and I started being more aggressive in the optimization techniques.
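The post's implementation isn't reproduced here, but the general shape of the technique is roughly the following (the field layout is invented, and getInt blindly trusts that it is given digits):

```haskell
import qualified Data.ByteString.Char8 as BS8
import           Data.Maybe (mapMaybe)

-- Unchecked digit parsing: no validation that the characters are digits.
getInt :: BS8.ByteString -> Int
getInt = BS8.foldl' (\acc c -> acc * 10 + (fromEnum c - fromEnum '0')) 0

-- Split a line into space-separated fields and pattern match on them.
-- The real format has more fields; this is only an illustration.
parseLine :: BS8.ByteString -> Maybe (Int, BS8.ByteString)
parseLine l = case BS8.split ' ' l of
    (size : _user : _group : path : _) -> Just (getInt size, path)
    _                                  -> Nothing

-- Working line by line also means a failure can be reported with its line.
parseAll :: BS8.ByteString -> [(Int, BS8.ByteString)]
parseAll = mapMaybe parseLine . BS8.lines
```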
Inlining all the things
A final optimization step was to add INLINE pragmas to the instances of the custom parser and the simple combinators. This resulted in a dramatic speed increase, as the updated results below show.
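As a rough illustration, reusing the hypothetical combinators from the earlier sketch (this is not the post's actual code), the change amounts to annotations like these:

```haskell
-- Ask GHC to inline the small top-level combinators at their call sites...
{-# INLINE char #-}
{-# INLINE takeWhile1 #-}

-- ...and do the same for the instance methods of the custom parser.
instance Applicative Parser where
    {-# INLINE pure #-}
    pure a = Parser $ \s -> Just (a, s)
    {-# INLINE (<*>) #-}
    Parser pf <*> Parser px = Parser $ \s -> do
        (f, s')  <- pf s
        (x, s'') <- px s'
        Just (f x, s'')
```

Because the parser is a newtype over a function, inlining lets GHC see through the wrappers and fuse the small combinators into their call sites, which is presumably where the gain comes from.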
| Program version | Time spent (seconds) | Memory usage (MB) | Gain over first version (CPU / RAM) |
|---|---|---|---|
| parsec | 95.741 | 14542 | n/a |
| attoparsec | 19.609 | 3154 | 4.88x / 4.61x |
| thyme | 18.302 | 1975 | 5.23x / 7.36x |
| bytestring | 13.890 | 1959 | 6.89x / 7.42x |
| BangPatterns | 8.705 | 406 | 11x / 35.82x |
| parseDTime | 7.7 | 406 | 12.43x / 35.82x |
| Custom parser | 5.628 | 382 | 17.01x / 38.07x |
| Break | 4.754 | 396 | 20.14x / 36.72x |
| Inlining | 2.871 | 396 | 33.34x / 36.72x |
| original C (mem) | 12 | | |
| original C (discard) | 2.7 | | |
As a reminder, the "discard" C implementation does parse the file and converts each field into a manageable format, but doesn't do anything with them and just discards the results. The Haskell programs store all the files in a Map for further processing.
In the next episode, I will perform more accurate benchmarks and introduce a pair of advanced techniques that will let the program run in less than 2 seconds on my sample input, namely Church-encoding the custom parser and using simple parallelism primitives.