Efficient parsing of large text files (2)
This post is part of the "Efficient parsing" series.
The simplest way to increase the performance of a program is to use more efficient libraries. Well, at least in Haskell, where refactoring is easy and strong common abstractions are shared by most of the ecosystem. In this post, I will make trivial changes to the first implementation and measure the associated gains.
Measuring performance
I used the RTS option `-s` (that is, running the program with `+RTS -s`) to measure performance for this section. Here is the output for the original (`parsec`) implementation:
```
 215,523,656,344 bytes allocated in the heap
  33,301,747,648 bytes copied during GC
   7,227,434,968 bytes maximum residency (15 sample(s))
     209,338,048 bytes maximum slop
           14542 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0     417618 colls,  0 par   24.950s   24.940s    0.0001s     0.0016s
  Gen  1         15 colls,  0 par   23.081s   23.115s    1.5410s    10.3409s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   43.614s  ( 43.598s elapsed)
  GC      time   48.031s  ( 48.055s elapsed)
  EXIT    time    4.061s  (  4.087s elapsed)
  Total   time   95.708s  ( 95.741s elapsed)

  %GC     time      50.2%  (50.2% elapsed)

  Alloc rate    4,941,573,670 bytes per MUT second

  Productivity  49.8% of total user, 49.8% of total elapsed
```
The most important metrics here are the total memory use and total time elapsed. I ran the programs several times to warm the file cache and picked one output.
Speed and memory usage were pretty stable, but these are not very precise benchmarks. They should be more than sufficient for this study though!
A faster parser combinator library: attoparsec
Everybody knows `parsec` isn't fast, and `attoparsec` is much faster.
Off the top of my head, the most notable differences between the two are:

- `Parsec` has actual error messages. With `attoparsec`, it is almost always something like "satisfy failed", without the error location. This is the reason I almost always write a new parser with `parsec`, and only switch to `attoparsec` once the bugs have been ironed out. If error messages are paramount for your application, I suggest taking a look at `megaparsec` or `trifecta`.
- `Attoparsec` is much faster than `parsec`, and uses a lot less memory. By the way, if you are in the never-ending intermediate phase of Haskell proficiency, I really suggest you learn about Church-encoding-like transformations. They are at work behind the scenes, and we will make use of them in a subsequent episode.
- `Attoparsec` always backtracks. This means you don't need the `try` combinator.
- With `attoparsec` you can grab consecutive characters efficiently. With `parsec`, you will get a stream of `Char` that will need to be repacked, whereas `attoparsec` can extract slices of the parsed text (see the sketch after this list).
- `Attoparsec` has a modern interface. Code is generally more idiomatic, mostly thanks to the use of the standard `Applicative` operators.
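To make the slicing and backtracking points concrete, here is a minimal sketch of my own; the `Entry` type and its permission/link-count fields are invented for illustration and do not come from the actual program:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
import qualified Data.Text as T

-- A made-up record standing in for a couple of fields of a parsed line.
data Entry = Entry { permissions :: T.Text, linkCount :: Int }
  deriving Show

-- takeWhile1 returns a slice of the input Text directly, whereas the
-- parsec equivalent (many1 (satisfy ...)) yields a [Char] that has to
-- be repacked. Also note: no 'try' anywhere, attoparsec always backtracks.
entry :: Parser Entry
entry = Entry
    <$> takeWhile1 (inClass "dlrwxst-")
    <*  skipSpace
    <*> decimal

main :: IO ()
main = print (parseOnly entry "drwxr-xr-x 42")
```

Running it prints `Right (Entry {permissions = "drwxr-xr-x", linkCount = 42})`.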
Other alternatives are `trifecta` (quite slow, but with great features and error messages) and `megaparsec` (a fork of `parsec` with a focus on modernizing its interface and with better error messages). We recently ported `language-puppet` to `megaparsec` without any hurdles.
As expected with `attoparsec`, the source is a bit shorter and tidier. You can find the result here, and the diff here.
There is one questionable choice, where a pair of `parseInt` calls were turned into `scientific`. We will see in subsequent episodes that this had a negative performance impact.
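For context, here is a hedged sketch of the two approaches. I am assuming the original `parseInt` pair amounted to two `decimal` parsers around a separator; the real code may differ:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
import Data.Scientific (Scientific)

-- Assumed original approach: two machine-integer parses around the dot.
asPair :: Parser (Int, Int)
asPair = (,) <$> decimal <* char '.' <*> decimal

-- The replacement: a single Scientific parse, more general but heavier,
-- as Scientific carries an arbitrary-precision coefficient.
asScientific :: Parser Scientific
asScientific = scientific

main :: IO ()
main = do
  print (parseOnly asPair "123.456")
  print (parseOnly asScientific "123.456")
```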
On the performance side, this change divided the execution time by 4.88 and the memory usage by 4.61. That is, with my sample input, 19.61s of wall-clock time and 3154 MB of memory.
Reducing the memory footprint: thyme
The other library that I knew could make a big difference was the `thyme` library. It is a rewrite of `time` with a focus on performance, allegedly at the expense of correctness.
The change is very localized, but resulted in a small speed boost and 1GB of memory saved! The reason `thyme` is more memory-efficient than `time` is that it stores `UTCTime` as a `newtype` of `Int64`, instead of the more involved structure that `time` uses.
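To illustrate, here is a simplified sketch of the representational difference. These types are stand-ins of my own, not the libraries' actual definitions:

```haskell
import Data.Int (Int64)

-- time-style: a record of two boxed fields backed by arbitrary-precision
-- Integers (stand-ins for Day and DiffTime).
data HeavyUTCTime = HeavyUTCTime
  { day     :: Integer
  , dayTime :: Integer
  } deriving Show

-- thyme-style: a single machine word counting ticks since an epoch; the
-- newtype wrapper costs nothing at runtime.
newtype LightUTCTime = LightUTCTime Int64
  deriving Show

main :: IO ()
main = print (LightUTCTime 1234567890)
```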
With this change, the program currently uses 1975MB of memory with the sample input. This sample input only weighs about 120MB for 570k entries, so I expected much higher efficiency at this stage. Of course, unevaluated thunks are crippling the program, but this will be handled in the next episode!
From `Text` to `ByteString`
The input is readable text, so the correct data type to represent it is `Text`. However, as I do not need to perform any text manipulation and am not too concerned about properly handling non-ASCII characters, it might make sense, performance-wise, to move to `bytestring`.
The change is trivial, and gives a nice speed boost.
In the actual program I wrote, the `UnixFileGen` type was parametrized over the user and group names, file path and link target fields, as sketched below. That way, the parsing and initial analysis would use `ByteString` for efficiency, and the files flagged as problematic would be converted to `Text` for further processing.
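Here is a minimal sketch of what such a parametrization can look like; the field names and the `toTextFile` helper are hypothetical, not the actual program's code:

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Every textual field shares the type parameter, so the same structure
-- can hold ByteString during parsing and Text for later processing.
data UnixFileGen a = UnixFileGen
  { fileUser   :: a
  , fileGroup  :: a
  , filePath   :: a
  , linkTarget :: Maybe a
  } deriving (Show, Functor)

type RawFile = UnixFileGen BS.ByteString

-- Only the files flagged as problematic pay for the UTF-8 decoding.
toTextFile :: RawFile -> UnixFileGen T.Text
toTextFile = fmap TE.decodeUtf8

main :: IO ()
main = print (toTextFile (UnixFileGen "root" "wheel" "/etc/passwd" Nothing))
```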
A note about the original C module
The original C module did the same parsing, but stored all the data in an SQLite database for further processing with Ruby. In order to capture the performance difference between my Haskell implementation and the C program, I modified it into two variants:
- The in-memory database version (dubbed `mem`): this would do the parsing, and insert all the files into an in-memory database. This is a bit unfair compared to my program, as I don't need to serialize to external storage.
- The `discard` version: this would just parse the file and not do anything with the result. This is also a bit unfair, as my program stores the actual results in a `Map`.
Optimization summary
| Program version | Time spent (seconds) | Memory usage (MB) | Gain over first version (CPU / RAM) |
|---|---|---|---|
| parsec | 95.741 | 14542 | n/a |
| attoparsec | 19.609 | 3154 | 4.88x / 4.61x |
| thyme | 18.302 | 1975 | 5.23x / 7.36x |
| bytestring | 13.890 | 1959 | 6.89x / 7.42x |
| original C (mem) | 12 | | |
| original C (discard) | 2.7 | | |
As we can see, with a sensible choice of libraries, we approach the execution speed of the in-memory version of the original C program, at the cost of only trivial changes.
In the next episode, I will introduce actual optimisations to the program that will only slightly affect its readability and maintainability.
Can we beat the `discard` version of the C program? Can we optimize much more without sacrificing readability and maintainability? Without rummaging in the Core? All of these questions will be answered in the final episode!