
split is inefficient; can be sped up over 2x for single character delimiter #39

@joeyh

Description


If you profile Data.List.Utils.split, it does a lot of allocation.

With a 600 MB "data" file containing short words, I profiled this program:

import Data.List.Utils
main = do
        s <- readFile "data"
        let l = split " " s
        print (length l)

Tue Jan 31 19:49 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE MODULE           %time %alloc

spanList    Data.List.Utils   46.2   59.3
startswith  Data.List.Utils   28.4   14.1
MAIN        MAIN              14.7   17.8
split       Data.List.Utils    8.3    6.4
breakList   Data.List.Utils    2.5    2.4
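(For reference: a report like the one above is what GHC produces when the program is built with profiling support, e.g. ghc -prof -fprof-auto, and run with +RTS -p. The exact flags used for these runs aren't shown here.)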

Compare with this program using a splitc specialized to a single character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest  -- drop the delimiter and keep splitting
    (i, [])     -> [i]                -- no delimiter left: i is the last chunk
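A quick sanity check in GHCi (my own example, not part of the profiled run):

ghci> splitc ' ' "a b c"
["a","b","c"]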

    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report  (Final)

       foo +RTS -p -RTS

    total time  =       54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)

So over twice as fast, and with roughly 30% less allocation!

A simple improvement, then, would be to make split dispatch to a splitc-style fast path when the delimiter is a single character. But it might be better to investigate why the helper functions (spanList, startswith, breakList) are doing so much allocation.
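
For concreteness, here is a minimal sketch of what that dispatch could look like, wrapping the existing Data.List.Utils.split rather than replacing it. splitSingle is a hypothetical name, and its behaviour on edge cases such as empty input or a trailing delimiter would need to be reconciled with split's before this could be a drop-in change:

import qualified Data.List.Utils as U

-- Generalisation of splitc above to any Eq element type.
splitSingle :: Eq a => a -> [a] -> [[a]]
splitSingle _ [] = []
splitSingle c s = case break (== c) s of
    (i, _:rest) -> i : splitSingle c rest
    (i, [])     -> [i]

split :: Eq a => [a] -> [a] -> [[a]]
split [d] s   = splitSingle d s  -- fast path for a single-element delimiter
split delim s = U.split delim s  -- fall back to the existing implementation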
