Efficient parsing of large text files (1)
This post is part of the "Efficient parsing" series.
Efficiently parsing large files is one of those exercises that is not obvious for beginner Haskellers, especially if one's goal is to keep the source code concise and readable.
In this series, I will describe an actual use case where I had to greatly optimize my initial implementation. This is a literate Haskell file, so you can copy and paste it to experiment.
The problem is analyzing the output of the following command:
$ find / -fstype nfs -prune -o -path '/proc' -prune -o -path '/sys' -prune -o -printf '%i %n %A++%AZ %T++%TZ %C++%CZ %u %g %k %y %#m %s %p -> %l\n'
2 28 2015-11-19+17:49:59.7719267730+CET 2015-11-10+08:38:41.9129851680+CET 2015-11-10+08:38:41.9129851680+CET root root 4 d 0755 4096 / ->
15 1 2015-11-19+18:35:11.4559739950+CET 2015-11-10+08:38:41.8849851690+CET 2015-11-10+08:38:41.8849851690+CET root root 0 l 0777 33 /initrd.img -> boot/initrd.img-3.13.0-68-generic
1025 16 2015-11-19+16:38:15.0775743760+CET 2015-11-19+16:38:14.8055743710+CET 2015-11-19+16:38:14.8055743710+CET root root 0 d 0755 4300 /dev ->
14749 1 2015-11-19+16:38:21.0655744800+CET 2015-11-19+16:38:21.0055744790+CET 2015-11-19+16:38:21.0055744790+CET root disk 0 b 0660 0 /dev/dm-0 ->
9479 3 2015-11-19+16:38:01.7495741430+CET 2015-11-19+16:38:01.7535741430+CET 2015-11-19+16:38:01.7535741430+CET root vboxusers 0 d 0750 60 /dev/vboxusb ->
10839 1 2015-11-19+16:38:01.4615741380+CET 2015-11-19+16:38:01.4615741380+CET 2015-11-19+16:39:44.2358533610+CET root root 0 c 0660 0 /dev/kvm ->
8957 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/dvdrw -> sr0
8954 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/dvd -> sr0
8950 1 2015-11-19+17:35:12.8439113270+CET 2015-11-19+16:38:01.4415741380+CET 2015-11-19+16:38:01.4415741380+CET root root 0 l 0777 3 /dev/cdrw -> sr0
...
This will output the list of all files on the current system if run as root. This list must then be stored in a key-value map, where the key is the file name, for further analysis.
I knew this task would be performance sensitive, as I implemented it when porting a Ruby program to Haskell. The Ruby program used a C module to perform the parsing and storage (in a SQLite database), while the analysis was performed in native Ruby. It ended up being a time- and memory-intensive task.
In this first post, I will write a very direct implementation of the parser and expose the data structures that will be used throughout the series. In subsequent posts, I will introduce basic optimization strategies and benchmark them.
The import fest
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
import Data.Bits
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Data.Time
import System.Environment
import qualified Data.Map.Strict as M
import Control.Applicative
import Data.Char (isSpace, digitToInt)
import Data.Functor.Identity
import Text.Parsec.Text
import Text.Parsec.Char
import qualified Text.Parsec.Token as TOK
import Text.Parsec hiding (many, (<|>), optional)
As can be seen here, the venerable parsec library is used. We recently moved language-puppet to megaparsec for its much nicer error messages and imports (notice how I had to hide functions here), but I am used to parsec...

I would recommend writing the initial version of performance-sensitive parsers for text-based data with parsec, or even better megaparsec, as they give much better error messages than, say, attoparsec. This can prove invaluable when debugging the first versions, and it is usually straightforward to "upgrade" to attoparsec.
The main data structure
This is the record type that represents the meta-data that is available for a file. Note that the fields are declared in the exact same order as they appear in the parsed file, which helps write nicer code in applicative style.
I chose Text to represent user and group names, and left the file paths as FilePath, which is just an alias for String. The reason is that most functions in base that deal with file paths use this type (yes, this sucks). Time is represented with the time library. File permissions are stored in an Int, and bit operations are used to query them, instead of a more idiomatic data structure that could look like:
data Permissions = Permissions { _special :: Special
                               , _user    :: Perms
                               , _group   :: Perms
                               , _others  :: Perms
                               }

data AtomicPerm = R | W | X
                deriving (Eq, Ord)

newtype Perms = Perms (Set AtomicPerm)
The reason is that memory usage must be kept low, and manipulating the permissions with lenses is just as convenient as with a more explicit data structure (because you can write a Prism between the two representations).
data UnixFile = UnixFileGen { _fileInode     :: Int
                            , _fileHardLinks :: Int
                            , _fileAtime     :: UTCTime
                            , _fileMtime     :: UTCTime
                            , _fileCtime     :: UTCTime
                            , _fileUser      :: Text
                            , _fileGroup     :: Text
                            , _fileBlocks    :: Int
                            , _fileType      :: FileType
                            , _filePerms     :: FPerms
                            , _fileSize      :: Int
                            , _filePath      :: FilePath
                            , _fileTarget    :: Maybe FilePath
                            } deriving (Show, Eq)
data FileType = TFile
| TDirectory
| TLink
| TPipe
| TSocket
| TBlock
| TChar
| TDoor
deriving (Eq, Show)
newtype FPerms = FPerms Int
deriving (Show, Ord, Eq, Num, Bits)
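Thanks to the derived Num and Bits instances, querying individual bits stays terse. As a minimal sketch (these helper names are mine, not part of the original program):

-- Hypothetical helpers built on the derived Num and Bits instances.
isSetuid :: FPerms -> Bool
isSetuid p = p .&. FPerms 0o4000 /= FPerms 0

-- World-writable: the write bit of the "others" triplet.
isWorldWritable :: FPerms -> Bool
isWorldWritable p = p .&. FPerms 0o002 /= FPerms 0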
The token declaration
With parsec, when you want to use built-in parsers such as integer, you have to generate a GenTokenParser structure by using a language definition. This is quite cumbersome in our use case, as we will only be parsing very simple types. Also, the pre-defined language definitions only work for String parsers, whereas I would like to use a Text parser! The following code is just copy-pasta from the default language definition.
tok :: TOK.GenTokenParser Text () Identity
tok = TOK.makeTokenParser TOK.LanguageDef
    { TOK.commentStart    = ""
    , TOK.commentEnd      = ""
    , TOK.commentLine     = ""
    , TOK.nestedComments  = True
    , TOK.identStart      = letter <|> char '_'
    , TOK.identLetter     = alphaNum <|> oneOf "_'"
    , TOK.opStart         = oneOf ":!#$%&*+./<=>?@\\^|-~"
    , TOK.opLetter        = oneOf ":!#$%&*+./<=>?@\\^|-~"
    , TOK.reservedOpNames = []
    , TOK.reservedNames   = []
    , TOK.caseSensitive   = True
    }
It is now possible to write an arbitrary numeric parser this way:
parseInt :: Num a => Parser a
parseInt = fromIntegral <$> TOK.integer tok
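Note that TOK.integer is a lexeme parser: it also handles an optional sign and skips trailing white space. A hypothetical GHCi session:

*Main> parseTest (parseInt :: Parser Int) "42  "
42
*Main> parseTest (parseInt :: Parser Int) "-7"
-7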
I had to write my own octal parser, as the parsec one expected those literals to start with 0o.
myOctal :: Parser Int
myOctal = foldl (\acc n -> acc * 8 + digitToInt n) 0 <$> some digit
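The fold accumulates digits from the left, so "755" is read as ((0 * 8 + 7) * 8 + 5) * 8 + 5 = 493, the decimal value of the octal literal. A hypothetical GHCi session:

*Main> parseTest myOctal "755"
493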
Parsing the file type
char2ft :: Char -> Maybe FileType
char2ft x = case x of
    '-' -> Just TFile
    'f' -> Just TFile
    'd' -> Just TDirectory
    'l' -> Just TLink
    'p' -> Just TPipe
    's' -> Just TSocket
    'b' -> Just TBlock
    'c' -> Just TChar
    'D' -> Just TDoor
    _   -> Nothing
filetype :: Parser FileType
filetype = anyChar >>= maybe (fail "invalid file type") return . char2ft
Pretty straightforward. The interested reader might notice the door file type, which isn't that common.
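A hypothetical GHCi session, showing the success path (a character with no corresponding FileType, such as 'x', would instead trigger the "invalid file type" failure):

*Main> parseTest filetype "d"
TDirectory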
Parsing the time stamp
I don't really know how to manipulate timezones, so you can see I did something terrible in the timestamp function. I would be interested in a correction!
timestamp :: Parser UTCTime
timestamp = do
    y <- parseInt <* char '-'
    m <- parseInt <* char '-'
    d <- parseInt <* char '+'
    h <- parseInt <* char ':'
    mi <- parseInt <* char ':'
    s <- realToFrac <$> TOK.float tok <* char '+'
    let day = fromGregorian y m d
        difftime = h * 3600 + mi * 60 + s
        tm = UTCTime day difftime
    tz <- some upper <* spaces
    return $ case tz of
        "CEST" -> addUTCTime (-7200) tm
        "CET"  -> addUTCTime (-3600) tm
        _      -> tm
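As an illustration, here is how the first timestamp of the sample output goes through this parser (a hypothetical GHCi session; the displayed instant is the CET wall-clock time shifted back by one hour):

*Main> parseTest timestamp "2015-11-19+17:49:59.7719267730+CET"
2015-11-19 16:49:59.771926773 UTC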
Parsing the actual file description
There is a trick here to parse the final part, as it could look like the following:
/foo ->
/foo -> bar
/foo bar ->
/foo bar -> baz
/foo bar -> baz qux
The end of the line is split using the words function and broken around the -> token to separate the file path from the link target. This is definitely not correct, as file names containing multiple spaces, tabs, or the -> sequence will be incorrectly decoded. This will however suffice for this discussion.
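To make this concrete, here is what the splitting produces on one of the examples above (hypothetical GHCi session); the pattern match in findline below then drops the leading "->" of the second component:

*Main> break (== "->") (words "/foo bar -> baz qux")
(["/foo","bar"],["->","baz","qux"])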
findline :: Parser UnixFile
findline = do
    let t :: Parser a -> Parser a
        t parser = parser <* spaces
    meta <- UnixFileGen <$> parseInt
                        <*> parseInt
                        <*> timestamp
                        <*> timestamp
                        <*> timestamp
                        <*> t (T.pack <$> some (satisfy (not . isSpace)))
                        <*> t (T.pack <$> some (satisfy (not . isSpace)))
                        <*> parseInt
                        <*> t filetype
                        <*> (FPerms <$> myOctal)
                        <*> parseInt
    rst <- words <$> t (some (satisfy ( /= '\n' )))
    return $ case break (== "->") rst of
        (a, [])     -> meta (unwords a) Nothing
        (a, ["->"]) -> meta (unwords a) Nothing
        (a, (_:b))  -> meta (unwords a) (Just (unwords b))
Annnnnnnnnd the actual parsing
The file is loaded at once using the readFile function, resulting in a strict Text. This means that the processing will only start once the file has been completely read.
parseFile :: FilePath -> IO [UnixFile]
parseFile fp = either (error . show) id . parse (many findline <* eof) fp <$> T.readFile fp
The resulting list of files is then stored in a strict map (where the key is the file path), and its size is computed to make sure all entries have been inserted in the map.
main :: IO ()
main = do
    parsed <- getArgs >>= mapM parseFile
    let resultmap = M.fromList $ map (\f -> (_filePath f, f)) $ concat parsed
    print $ M.size resultmap
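Assuming the literate source is saved as parse1.lhs and fed a capture of the find command above (file names are mine), a run could look like:

$ ghc -O2 parse1.lhs -o parse1
$ ./parse1 findoutput.txt

The single number printed is the size of the resulting map.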
Performance goals
The file list that needs to be parsed can be quite large (usually around 100 MB, but I have had gigabyte specimens), and must be processed on a typical laptop. I would like the end program to take at most 1 GB of memory for a 100 MB source file, and the parsing and map creation to be as fast as possible, and in any case under 5 seconds.
On my workstation, and with my test input of 570k entries, this program consumes 14.5 GB of memory and takes around 95 seconds to finish.
In the following episodes, I will expose common optimization techniques and the impact they will have on program performance. The final program will consume 400 MB of memory and run in less than 3 seconds, while remaining concise.