
@parkertimmins
Contributor

No description provided.

@parkertimmins
Contributor Author

parkertimmins commented Jul 8, 2025

Ran some benchmarks using the datasets from the paper: https://github.com/cwida/fsst/tree/master/paper/dbtext
The notable difference is that fsst compression times are slower. This is because I have (so far) been unable to write a decent Java SIMD implementation of fsst.

One caveat: I didn't include the symbol table in the fsst compressed size. The table is around 500 bytes, but since the smallest dataset is 133 KB and many are over a megabyte, this doesn't bias the results too much.

| dataset | comp factor (fsst) | comp factor (lz4_fast) | comp time (ms, fsst) | comp time (ms, lz4_fast) | decomp time (ms, fsst) | decomp time (ms, lz4_fast) |
|---|---|---|---|---|---|---|
| c_name | 4.07 | 3.19 | 4.50 | 3.95 | 0.44 | 1.96 |
| chinese | 1.68 | 1.39 | 5.35 | 3.18 | 0.36 | 1.66 |
| city | 1.95 | 1.36 | 2.45 | 0.58 | 0.05 | 0.30 |
| credentials | 2.09 | 1.45 | 2.37 | 0.77 | 0.07 | 0.34 |
| email | 2.05 | 1.53 | 8.69 | 6.73 | 0.80 | 4.06 |
| faust | 1.82 | 1.43 | 3.35 | 1.67 | 0.17 | 0.76 |
| firstname | 1.86 | 1.19 | 3.46 | 1.82 | 0.20 | 0.98 |
| genome | 2.89 | 1.42 | 3.50 | 3.51 | 0.35 | 1.28 |
| hamlet | 2.22 | 2.15 | 2.68 | 1.00 | 0.11 | 0.48 |
| hex | 1.85 | 1.04 | 3.45 | 2.26 | 0.46 | 1.20 |
| japanese | 2.00 | 1.70 | 2.62 | 0.95 | 0.10 | 0.42 |
| l_comment | 2.75 | 2.16 | 10.81 | 8.69 | 0.77 | 4.27 |
| lastname | 1.79 | 1.24 | 10.06 | 9.01 | 1.03 | 5.17 |
| location | 2.69 | 1.57 | 7.06 | 11.66 | 0.78 | 6.52 |
| movies | 1.59 | 1.17 | 9.03 | 7.39 | 1.06 | 4.07 |
| ps_comment | 3.27 | 2.58 | 8.45 | 5.56 | 0.68 | 3.26 |
| street | 2.18 | 1.66 | 2.19 | 0.61 | 0.05 | 0.28 |
| urls | 2.35 | 2.74 | 18.51 | 14.56 | 2.10 | 8.29 |
| urls2 | 1.97 | 1.70 | 7.62 | 4.44 | 0.69 | 2.50 |
| uuid | 2.33 | 1.50 | 11.61 | 9.54 | 1.35 | 5.38 |
| wiki | 1.56 | 1.25 | 10.13 | 7.72 | 1.15 | 4.28 |
| wikipedia | 1.81 | 1.43 | 14.03 | 13.44 | 1.28 | 6.79 |
| yago | 1.54 | 1.20 | 8.21 | 6.44 | 0.97 | 3.56 |
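
For reference, each row above is a per-dataset measurement of compression factor plus wall-clock compression/decompression time. A minimal sketch of that measurement is below; the `Codec` interface and its fsst / lz4_fast bindings are placeholders, not the PR's actual benchmark harness:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class CompressionBench {

    /** Placeholder for the compressor under test (fsst or lz4_fast). */
    interface Codec {
        byte[] compress(byte[] input);
        byte[] decompress(byte[] compressed, int originalLength);
    }

    static void run(String name, Codec codec, Path dataset) throws IOException {
        byte[] input = Files.readAllBytes(dataset);

        long t0 = System.nanoTime();
        byte[] compressed = codec.compress(input);
        long compressNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        byte[] restored = codec.decompress(compressed, input.length);
        long decompressNanos = System.nanoTime() - t1;

        // Compression factor = original size / compressed size.
        // As noted above, the ~500-byte fsst symbol table is not counted here.
        double factor = (double) input.length / compressed.length;

        System.out.printf("%-12s factor=%.2f comp=%.2f ms decomp=%.2f ms%n",
                name, factor, compressNanos / 1e6, decompressNanos / 1e6);

        if (!Arrays.equals(input, restored)) {
            throw new AssertionError("round-trip mismatch for " + name);
        }
    }
}
```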

@parkertimmins
Contributor Author

parkertimmins commented Jul 8, 2025

The original benchmarks from the paper concatenated each file with itself until the input was 8 MB. I reran the benchmarks with that setup, and also converted the compression and decompression times to throughput so the results can be compared with the paper more easily.
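
The 8 MB setup and the throughput conversion amount to roughly the following sketch; the class and helper names are illustrative, not the actual benchmark code:

```java
import java.io.ByteArrayOutputStream;

final class EightMbSetup {
    // The paper's benchmarks run on ~8 MB inputs; mirror that target here.
    static final int TARGET_BYTES = 8 * 1024 * 1024;

    /** Concatenate the dataset with itself until it is at least 8 MB long. */
    static byte[] repeatToTarget(byte[] original) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(TARGET_BYTES);
        while (out.size() < TARGET_BYTES) {
            out.write(original, 0, original.length);
        }
        return out.toByteArray();
    }

    /** Convert an elapsed time over inputBytes into throughput in MB/s. */
    static double throughputMbPerSec(long inputBytes, long elapsedNanos) {
        return (inputBytes / 1e6) / (elapsedNanos / 1e9);
    }
}
```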

Though lz4 compression is still faster on many datasets, these results look more even.

| dataset | comp factor (fsst) | comp factor (lz4_fast) | comp (MB/s, fsst) | comp (MB/s, lz4_fast) | decomp (MB/s, fsst) | decomp (MB/s, lz4_fast) |
|---|---|---|---|---|---|---|
| c_name | 4.05 | 3.19 | 494 | 447 | 4457 | 843 |
| chinese | 1.68 | 1.39 | 210 | 224 | 2111 | 418 |
| city | 1.96 | 1.37 | 268 | 205 | 2487 | 406 |
| credentials | 2.13 | 1.47 | 220 | 174 | 2706 | 419 |
| email | 2.02 | 1.53 | 218 | 317 | 2588 | 511 |
| faust | 1.81 | 1.44 | 232 | 179 | 2261 | 385 |
| firstname | 1.85 | 1.19 | 266 | 243 | 2265 | 403 |
| genome | 2.87 | 1.42 | 372 | 271 | 3190 | 728 |
| hamlet | 2.22 | 2.16 | 284 | 253 | 2787 | 522 |
| hex | 1.85 | 1.04 | 551 | 353 | 2043 | 676 |
| japanese | 2.00 | 1.71 | 208 | 207 | 2481 | 443 |
| l_comment | 2.73 | 2.16 | 267 | 302 | 3413 | 581 |
| lastname | 1.81 | 1.24 | 277 | 260 | 2207 | 442 |
| location | 2.67 | 1.57 | 446 | 225 | 3404 | 385 |
| movies | 1.60 | 1.17 | 264 | 263 | 1886 | 478 |
| ps_comment | 3.36 | 2.58 | 337 | 405 | 3723 | 648 |
| street | 2.20 | 1.68 | 250 | 204 | 2762 | 453 |
| urls | 2.35 | 2.74 | 331 | 417 | 2865 | 715 |
| urls2 | 1.99 | 1.70 | 247 | 347 | 2451 | 618 |
| uuid | 2.34 | 1.50 | 329 | 369 | 2609 | 611 |
| wiki | 1.58 | 1.25 | 262 | 287 | 1922 | 508 |
| wikipedia | 1.81 | 1.43 | 217 | 202 | 2192 | 394 |
| yago | 1.56 | 1.20 | 267 | 281 | 1886 | 495 |
