
split is inefficient; can be sped up over 2x for single character delimiter #39

@joeyh

Description


If you profile Data.List.Utils.split, it does a lot of allocation.

With a 600 MB "data" file containing short words, I profiled this program:

import Data.List.Utils
main = do
        s <- readFile "data"
        let l = split " " s
        print (length l)

Tue Jan 31 19:49 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   46.2   59.3
startswith  Data.List.Utils   28.4   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.3    6.4
breakList   Data.List.Utils    2.5    2.4
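(For reference: a report like the one above is what GHC produces when the program is built with profiling support, e.g. ghc -prof -fprof-auto, and run with +RTS -p. The exact flags used for these runs aren't shown here.)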

Compare with this program using a splitc specialized to a single character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest  -- drop the delimiter and keep splitting
    (i, [])     -> [i]                -- no delimiter left: i is the last chunk
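A quick sanity check in GHCi (my own example, not part of the profiled run):

ghci> splitc ' ' "a b c"
["a","b","c"]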

    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =       54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)

So over twice as fast, and with roughly 30% less allocation!

A simple improvement, then, would be to make split dispatch to a splitc-style fast path when the delimiter is a single character. But it might be better to investigate why the helper functions (spanList, startswith, breakList) are doing so much allocation.
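
For concreteness, here is a minimal sketch of what that dispatch could look like, wrapping the existing Data.List.Utils.split rather than replacing it. splitSingle is a hypothetical name, and its behaviour on edge cases such as empty input or a trailing delimiter would need to be reconciled with split's before this could be a drop-in change:

import qualified Data.List.Utils as U

-- Generalisation of splitc above to any Eq element type.
splitSingle :: Eq a => a -> [a] -> [[a]]
splitSingle _ [] = []
splitSingle c s = case break (== c) s of
    (i, _:rest) -> i : splitSingle c rest
    (i, [])     -> [i]

split :: Eq a => [a] -> [a] -> [[a]]
split [d] s   = splitSingle d s  -- fast path for a single-element delimiter
split delim s = U.split delim s  -- fall back to the existing implementation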
