Commit ceedf81

author
Johan Tibell
committed
Add a first version of the developer guide
1 parent be721ad commit ceedf81

1 file changed

+65
-0
lines changed

docs/developer-guide.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Developer Guide

This guide is meant as an entry point for developers. It gives both the
philosophy behind the design of this package and some concrete details, such as
invariants.

## Why does this package exist?

This package exists to offer a different performance/functionality
trade-off vis-a-vis ordered container packages
(e.g. [containers](http://hackage.haskell.org/package/containers)). Hashing-based
data structures tend to be faster than comparison-based ones, at the cost of not
providing operations that rely on the data being ordered.

This means that this package must be faster than ordered containers, or there
would be no reason for it to exist, given that its functionality is a strict
subset of ordered containers. This might seem obvious, but the author has
rejected several proposals in the past (e.g. to switch to higher-quality but
slower hash functions) that would have made unordered-containers too slow to
motivate its existence.

## A note on hash functions

While the [hashable](http://hackage.haskell.org/package/hashable) package is a
separate package, it was co-designed with this package. Its main role is to
support this package and not to provide good general-purpose hash functions
(e.g. to use when fingerprinting a text file).

The hash functions used (by default) were picked to make data structures
fast. The actual functions used often surprise developers who have learned
about hashing during their studies but haven't looked at which functions are
actually used in practice.

For example, integers are hashed to themselves. This might seem contrary to
what you have learned about hashing (e.g. that you need avalanche behavior,
where changing one bit of the input changes about half of the bits in the
output). It turns out that this isn't what is typically done in practice (take
a little tour of various programming languages' standard libraries to see this
for yourself). Hashing integers to themselves is faster (i.e. free), and the
improved locality can be helpful given common input patterns.
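
The locality point can be illustrated with a minimal sketch that uses only
`base`. The names `identityHash`, `lowBitsIndex` and `consecutiveIndices` are
hypothetical and not part of this package or of hashable, and the 5-bit index
width is an arbitrary assumption made for the illustration. The sketch only
shows that an identity hash sends a run of consecutive keys to distinct,
adjacent indices.

```haskell
import Data.Bits ((.&.))

-- Hypothetical stand-in for an integer hash that maps a key to itself.
identityHash :: Int -> Int
identityHash = id

-- Hypothetical helper: keep the low 5 bits of the hash, giving an index
-- in the range 0..31. The width of 5 bits is an assumption for this sketch.
lowBitsIndex :: Int -> Int
lowBitsIndex h = h .&. 31

-- Indices for a run of consecutive keys, a common input pattern.
-- With the identity hash these are the distinct, adjacent indices 0..7.
consecutiveIndices :: [Int]
consecutiveIndices = map (lowBitsIndex . identityHash) [64 .. 71]
```

A hash with avalanche behavior would instead scatter the same run of keys
across unrelated indices.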

Another interesting example of hashing is string hashing, where
[FNV](https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function)
is used. FNV is a decent hash function, but has worse properties than, say,
[MurmurHash](https://en.wikipedia.org/wiki/MurmurHash). However, it's much
faster. The fact that it's faster is not obvious given the way hash function
performance is often quoted, namely by giving the average throughput on large
inputs. Most inputs (e.g. keys) aren't large, often no more than 10 characters
long. Hash functions typically have a start-up cost and many functions that have
high throughput (such as MurmurHash) are more expensive for short strings than
FNV.
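
To give a feel for how little work FNV does per character, here is a minimal
sketch of the 32-bit FNV-1a variant as described on the Wikipedia page linked
above. It is not taken from hashable, which may use a different variant and
different constants, and the name `fnv1a32` is hypothetical. The sketch
assumes ASCII input, so each `Char` is treated as a single byte.

```haskell
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word32)

-- Standard 32-bit FNV offset basis and prime.
fnvOffsetBasis :: Word32
fnvOffsetBasis = 2166136261

fnvPrime :: Word32
fnvPrime = 16777619

-- FNV-1a: for each byte, xor it into the state, then multiply by the prime.
-- Word32 arithmetic wraps around, which gives the reduction modulo 2^32.
-- Assumes every Char is ASCII, i.e. fits in one byte.
fnv1a32 :: String -> Word32
fnv1a32 = foldl' step fnvOffsetBasis
  where
    step h c = (h `xor` fromIntegral (ord c)) * fnvPrime
```

There is no set-up or finalization step: an empty string simply yields the
offset basis, and each additional character costs one xor and one multiply.
That is why short keys are cheap to hash this way.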

### Security

There's an uncomfortable trade-off with regard to security threats posed by
e.g. denial-of-service attacks. Always using a more secure hash function, like
[SipHash](https://en.wikipedia.org/wiki/SipHash), would provide security by
default. However, such functions would make the performance of the data
structures no better than that of ordered containers, which defeats the purpose
of this package.

The current, somewhat frustrating, state is that you have to know which data
structures can be tampered with by users and either use SipHash just for those
or switch to ordered containers, which don't have collision problems. This
package uses fast hash functions by default.
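
As a minimal sketch of the second option, the following keeps user-supplied
keys in an ordered map from the containers package instead of a hash-based
one. The function name `countKeys` is hypothetical and the counting task is
only an assumed example of attacker-controlled input. Lookups and inserts in
`Data.Map.Strict` depend on comparisons rather than hashes, so crafted hash
collisions have no effect on it.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Count how often each user-supplied key occurs, using an ordered map
-- so that the cost does not depend on any hash function.
countKeys :: [String] -> Map.Map String Int
countKeys = foldl' bump Map.empty
  where
    -- Insert the key with count 1, or add 1 to its existing count.
    bump acc key = Map.insertWith (+) key 1 acc
```

Data structures whose keys are not under user control can keep using the
hash-based containers from this package.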
