Very skewed distribution of duration on large datasets

Hi, I'm trying to use your code as part of an algorithm on a large dataset using Scala and Apache Spark. I'm getting great results in terms of accuracy, but when I ran it on several samples of GPS tracklog data, I got a very skewed distribution of duration:

| Metric | Min | 25th percentile | Median | 75th percentile | Max |
| --- | --- | --- | --- | --- | --- |
| Duration | 10 s | 1.2 min | 6.2 min | 12 min | 51 min |
| GC Time | 0.2 s | 2 s | 4 s | 4 s | 10 s |
| Input | 25.5 MB | 128.1 MB | 128.1 MB | 128.1 MB | 128.1 MB |
| Output | 18.4 MB | 93.3 MB | 93.6 MB | 93.7 MB | 94.0 MB |

I would like to know if this is expected behaviour for this algorithm, or if you have any tips and tricks for getting more stable results on any dataset (less variance, maybe at the cost of a higher average duration).

