Very skewed distribution of duration on large datasets #11

@franc00018

Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm getting great results in terms of accuracy, but I ran it on several samples of GPS tracklog data and I see a very skewed distribution of duration:

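| Metric   | Min     | 25th percentile | Median   | 75th percentile | Max      |
|----------|---------|-----------------|----------|-----------------|----------|
| Duration | 10 s    | 1.2 min         | 6.2 min  | 12 min          | 51 min   |
| GC Time  | 0.2 s   | 2 s             | 4 s      | 4 s             | 10 s     |
| Input    | 25.5 MB | 128.1 MB        | 128.1 MB | 128.1 MB        | 128.1 MB |
| Output   | 18.4 MB | 93.3 MB         | 93.6 MB  | 93.7 MB         | 94.0 MB  |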

I would like to know if this is expected behaviour for this algorithm, or if you have any tips and tricks for getting more stable results on any dataset (that is, less variance, maybe at the cost of a higher average duration).
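To put a number on the skew, here is a minimal sketch in plain Scala (no Spark needed) that turns the duration quantiles from the table above into ratios against the median. The object name `SkewCheck` and the helper `ratioToMedian` are just names made up for this illustration, and the sketch assumes the table rows are per-task durations.

```scala
// Hypothetical helper for this issue only: quantifies the duration skew
// using the quantiles reported in the table above.
object SkewCheck {
  // Duration quantiles copied from the table, converted to seconds.
  val durationSec: Map[String, Double] = Map(
    "min"    -> 10.0,
    "p25"    -> 1.2 * 60,
    "median" -> 6.2 * 60,
    "p75"    -> 12.0 * 60,
    "max"    -> 51.0 * 60
  )

  // Ratio of a quantile to the median; 1.0 means "as long as the median".
  def ratioToMedian(quantile: String): Double =
    durationSec(quantile) / durationSec("median")

  val maxOverMedian: Double = ratioToMedian("max")
  val p75OverMedian: Double = ratioToMedian("p75")

  def main(args: Array[String]): Unit = {
    println(f"max / median = $maxOverMedian%.1f")
    println(f"p75 / median = $p75OverMedian%.1f")
  }
}
```

With these numbers the longest duration is roughly 8 times the median, while the input size is the same 128.1 MB from the 25th percentile up to the maximum, so the spread in duration does not seem to come from a spread in input size.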
