Efficient parsing of large text files (1)

Efficiently parsing large files is one of those exercises that is not obvious to beginner Haskellers, especially when one's goal is to keep the source code concise and readable.

In this series, I will describe an actual use case where I had to greatly optimize my initial implementation. This is a literate Haskell file, so you can copy and paste it to experiment.

The problem is analyzing the output of the following command:

$ find / -fstype nfs -prune -o -path '/proc' -prune -o -path '/sys' -prune -o -printf '%i %n %A++%AZ %T++%TZ %C++%CZ %u %g %k %y %#m %s %p -> %l\n'
2 28 2015-11-19+17:49:59.7719267730+CET 2015-11-10+08:38:41.9129851680+CET 2015-11-10+08:38:41.9129851680+CET root root 4 d 0755 4096 / -> 
15 1 2015-11-19+18:35:11.4559739950+CET 2015-11-10+08:38:41.8849851690+CET 2015-11-10+08:38:41.8849851690+CET root root 0 l 0777 33 /initrd.img -> boot/initrd.img-3.13.0-68-generic
1025 16 2015-11-19+16:38:15.0775743760+CET 2015-11-19+16:38:14.8055743710+CET 2015-11-19+16:38:14.8055743710+CET root root 0 d 0755 4300 /dev -> 
14749 1 2015-11-19+16:38:21.0655744800+CET 2015-11-19+16:38:21.0055744790+CET 2015-11-19+16:38:21.0055744790+CET root disk 0 b 0660 0 /dev/dm-0 -> 
9479 3 2015-11-19+16:38:01.7495741430+CET 2015-11-19+16:38:01.7535741430+CET 2015-11-19+16:38:01.7535741430+CET root vboxusers 0 d 0750 60 /dev/vboxusb -> 
10839 1 2015-11-19+16:38:01.4615741380+CET 2015-11-19+16:38:01.4615741380+CET 2015-11-19+16:39:44.2358533610+CET root root 0 c 0660 0 /dev/kvm -> 
8957 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/dvdrw -> sr0
8954 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/dvd -> sr0
8950 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/cdrw -> sr0
...

Run as root, this command outputs a list of all files on the current system. This list must then be stored in a key-value map, where the key is the file name, for further analysis.

I knew this task would be performance sensitive, as I implemented it when porting a Ruby program to Haskell. The Ruby program used a C module to perform the parsing and storage (in an SQLite database), while the analysis was performed in native Ruby. It ended up being a time- and memory-intensive task.

In this first post, I will write a very direct implementation of the parser and present the data structures that will be used throughout the series. In subsequent posts, I will introduce basic optimization strategies and benchmark them.

The import fest

> {-# LANGUAGE GeneralizedNewtypeDeriving #-}
> {-# LANGUAGE OverloadedStrings #-}
> module Main (main) where
> 
> import Data.Bits
> import Data.Text (Text)
> import qualified Data.Text as T
> import qualified Data.Text.IO as T
> import Data.Time
> import System.Environment
> import qualified Data.Map.Strict as M
> import Control.Applicative
> import Data.Char (isSpace, digitToInt)
> import Data.Functor.Identity
> 
> import Text.Parsec.Text
> import Text.Parsec.Char
> import qualified Text.Parsec.Token as TOK
> import Text.Parsec hiding (many, (<|>), optional)

As can be seen here, the venerable parsec library is used. We recently moved language-puppet to megaparsec for its much nicer error messages and cleaner imports (notice how I had to hide functions here), but I am used to parsec.

I would recommend writing the initial version of performance-sensitive parsers for text-based data with parsec, or even better megaparsec, as they give much better error messages than, say, attoparsec. This can prove invaluable when debugging the first versions, and it is usually straightforward to “upgrade” to attoparsec.

The main data structure

This is the record type that represents the metadata available for a file. Note that the fields are declared in the exact same order as they appear in the parsed file, which helps write nicer code in applicative style.

I chose Text to represent user and group names, and kept file paths as FilePath, which is just an alias for String. The reason is that most functions in base that deal with file paths use this type (yes, this sucks). Time is represented with the time library. File permissions are stored in an Int, and bit operations are used to query them, instead of a more idiomatic data structure that could look like:

-- sketch only: assumes a Special type holding the setuid, setgid and
-- sticky bits, and Set from Data.Set
data Permissions = Permissions { _special :: Special
                               , _user    :: Perms
                               , _group   :: Perms
                               , _others  :: Perms
                               }

data AtomicPerm = R | W | X
                deriving (Eq, Ord)

newtype Perms = Perms (Set AtomicPerm)

The reason is that memory usage must be kept low, and manipulating the permissions with lenses is just as convenient as with a more explicit data structure (because you can write a Prism between the two representations).
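
To see that the plain Int representation stays manageable, here is a minimal sketch of permission queries using Data.Bits and the usual octal masks (the helper names are mine, not part of the program below):

```haskell
import Data.Bits ((.&.))

-- Hypothetical helpers (not part of this post's program) showing that
-- querying permission bits on a plain Int stays simple:
isSetuid, isWorldWritable :: Int -> Bool
isSetuid p        = p .&. 0o4000 /= 0  -- setuid bit
isWorldWritable p = p .&. 0o002  /= 0  -- write bit for "others"

main :: IO ()
main = do
  print (isSetuid 0o4755)        -- True
  print (isWorldWritable 0o644)  -- False
```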

> data UnixFile = UnixFileGen { _fileInode     :: Int
>                             , _fileHardLinks :: Int
>                             , _fileAtime     :: UTCTime
>                             , _fileMtime     :: UTCTime
>                             , _fileCtime     :: UTCTime
>                             , _fileUser      :: Text
>                             , _fileGroup     :: Text
>                             , _fileBlocks    :: Int
>                             , _fileType      :: FileType
>                             , _filePerms     :: FPerms
>                             , _fileSize      :: Int
>                             , _filePath      :: FilePath
>                             , _fileTarget    :: Maybe FilePath
>                             } deriving (Show, Eq)
> 
> data FileType = TFile
>               | TDirectory
>               | TLink
>               | TPipe
>               | TSocket
>               | TBlock
>               | TChar
>               | TDoor
>               deriving (Eq, Show)
> 
> newtype FPerms = FPerms Int
>                deriving (Show, Ord, Eq, Num, Bits)

The token declaration

With parsec, when you want to use built-in parsers such as integer, you have to generate a GenTokenParser structure by using a language definition.

This is quite cumbersome in our use case, as we will only be parsing very simple types. Also the pre-defined language definitions only work for String parsers, whereas I would like to use a Text parser! The following code is just copy-pasta from the default language definition.

> tok :: TOK.GenTokenParser Text () Identity
> tok = TOK.makeTokenParser TOK.LanguageDef
>         { TOK.commentStart   = ""
>         , TOK.commentEnd     = ""
>         , TOK.commentLine    = ""
>         , TOK.nestedComments = True
>         , TOK.identStart     = letter <|> char '_'
>         , TOK.identLetter    = alphaNum <|> oneOf "_'"
>         , TOK.opStart        = alphaNum
>         , TOK.opLetter       = oneOf ":!#$%&*+./<=>?@\\^|-~"
>         , TOK.reservedOpNames= []
>         , TOK.reservedNames  = []
>         , TOK.caseSensitive  = True
>         }

It is now possible to write an arbitrary numeric parser this way:

> parseInt :: Num a => Parser a
> parseInt = fromIntegral <$> TOK.integer tok

I had to write my own octal parser as the parsec one expected those literals to start with 0o.

> myOctal :: Parser Int
> myOctal = foldl (\acc n -> acc * 8 + digitToInt n) 0 <$> some digit
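
To see why the fold computes the right value, here is the same accumulation as a standalone function (a sketch; digit validity is left to the digit parser above):

```haskell
import Data.Char (digitToInt)

-- Same accumulation as in myOctal: each new digit shifts the
-- accumulator one octal place to the left before being added.
octalValue :: String -> Int
octalValue = foldl (\acc n -> acc * 8 + digitToInt n) 0

main :: IO ()
main = print (octalValue "755")  -- 7*64 + 5*8 + 5 = 493
```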

Parsing the file type

> char2ft :: Char -> Maybe FileType
> char2ft x = case x of
>                 '-' -> Just TFile
>                 'f' -> Just TFile
>                 'd' -> Just TDirectory
>                 'l' -> Just TLink
>                 'p' -> Just TPipe
>                 's' -> Just TSocket
>                 'b' -> Just TBlock
>                 'c' -> Just TChar
>                 'D' -> Just TDoor
>                 _ -> Nothing
> 
> filetype :: Parser FileType
> filetype = anyChar >>= maybe (fail "invalid file type") return . char2ft

Pretty straightforward. The interested reader might notice the door file type, which isn’t that common.
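
The pattern of reading one character and validating it in the parser monad can be tried out in isolation; here is a trimmed-down, self-contained version (only three file types kept, and the names are mine):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Text.Parsec
import Text.Parsec.Text (Parser)

data FT = F | D | L deriving (Show, Eq)

-- Trimmed-down version of the filetype parser: read any character,
-- then fail in the parser monad if it is not a known file type.
ft :: Parser FT
ft = anyChar >>= \c -> case c of
       '-' -> pure F
       'd' -> pure D
       'l' -> pure L
       _   -> fail "invalid file type"

main :: IO ()
main = print (parse ft "" "d")  -- Right D
```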

Parsing the time stamp

I don’t really know how to manipulate timezones, so you can see I did something terrible in the timestamp function. I would be interested in a correction!

> timestamp :: Parser UTCTime
> timestamp = do
>     y <- parseInt <* char '-'
>     m <- parseInt <* char '-'
>     d <- parseInt <* char '+'
>     h <- parseInt <* char ':'
>     mi <- parseInt <* char ':'
>     s <- realToFrac <$> TOK.float tok <* char '+'
>     let day = fromGregorian y m d
>         difftime = h * 3600 + mi * 60 + s
>         tm = UTCTime day difftime
>     tz <- some upper <* spaces
>     return $ case tz of
>                  "CEST" -> addUTCTime (-7200) tm
>                  "CET" -> addUTCTime (-3600) tm
>                  _ -> tm
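
To make the offset arithmetic concrete, here is what the CET correction does to the first timestamp of the sample output (a sketch using the time library only; CET is UTC+1, so local wall-clock time minus 3600 seconds is UTC):

```haskell
import Data.Time

main :: IO ()
main = do
  let day = fromGregorian 2015 11 19
      -- local wall-clock time 17:49:59.771926773, as parsed
      tm  = UTCTime day (17*3600 + 49*60 + 59.771926773)
  -- CET is UTC+1: subtract one hour to get actual UTC
  print (addUTCTime (-3600) tm)
```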

Parsing the actual file description

There is a trick to parsing the final part, as it can take any of the following forms:

/foo -> 
/foo -> bar
/foo bar -> 
/foo bar -> baz
/foo bar -> baz qux

The end of the line is split using the words function and broken around the -> token to separate the file path from the link target. This is definitely not correct, as file names containing multiple spaces, tabs, or the -> sequence will be incorrectly decoded. This will however suffice for this discussion.
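
The behaviour of break on the token list is easy to check on its own; note that the matched "->" stays at the head of the second component, so it must be dropped before rebuilding the target:

```haskell
main :: IO ()
main = do
  -- break keeps the "->" token at the head of the second component:
  print (break (== "->") (words "/foo bar -> baz"))
  -- no arrow at all: the second component is empty
  print (break (== "->") (words "/foo"))
```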

> findline :: Parser UnixFile
> findline = do
>     let t :: Parser a -> Parser a
>         t parser = parser <* spaces
>     meta <- UnixFileGen <$> parseInt
>                         <*> parseInt
>                         <*> timestamp
>                         <*> timestamp
>                         <*> timestamp
>                         <*> t (T.pack <$> some (satisfy (not . isSpace)))
>                         <*> t (T.pack <$> some (satisfy (not . isSpace)))
>                         <*> parseInt
>                         <*> t filetype
>                         <*> (FPerms <$> myOctal)
>                         <*> parseInt
>     rst <- words <$> t (some (satisfy ( /= '\n' )))
>     return $ case break (== "->") rst of
>                  (a, []) -> meta (unwords a) Nothing
>                  (a, ["->"]) -> meta (unwords a) Nothing
>                  (a, _ : b) -> meta (unwords a) (Just (unwords b))

Annnnnnnnnd the actual parsing

The file is loaded at once using Data.Text.IO's readFile, resulting in a strict Text. This means that processing will only start once the file has been completely read.

> parseFile :: FilePath -> IO [UnixFile]
> parseFile fp = either (error . show) id . parse (many findline <* eof) fp <$> T.readFile fp

The resulting list of files is then stored in a strict map (where the key is the file path), and its size computed to make sure all entries have been inserted in the map.

> main :: IO ()
> main = do
>     parsed <- getArgs >>= mapM parseFile
>     let resultmap = M.fromList $ map (\f -> (_filePath f, f)) $ concat parsed
>     print $ M.size resultmap

Performance goals

The file list that needs to be parsed can be quite large (usually around 100MB, but I have seen gigabyte specimens), and must be processed on a typical laptop. I would like the end program to take at most 1GB of memory for a 100MB source file, and the parsing and map creation to be as fast as possible, and in any case under 5 seconds.

On my workstation, and with my test input of 570k entries, this program consumes 14.5GB of memory and takes around 95 seconds to finish.

In the following episodes, I will expose common optimization techniques and the impact they will have on program performance. The final program will consume 400MB of memory and run in less than 3 seconds, while remaining concise.