Commit 7751ef6

committed
Update docs
1 parent 1abae3f commit 7751ef6

3 files changed: 73 additions and 31 deletions

Docs/DataStructures.md

Lines changed: 47 additions & 26 deletions
@@ -1,52 +1,73 @@
11
# Data structures
22

33
By default, FastData chooses the optimal data structure for your data, but you can also set it manually with
4-
`FastData -s <type>`. See the details of each structure type below.
4+
`FastData -s <type>`.
55

6-
## Auto
7-
This is the default option. It automatically selects the best data structure based on the number of items you provide.
6+
Below you'll see details of each structure.
7+
8+
- **Memory:** The memory overhead of the data structure. _Native_ means it does not have any overhead.
9+
- **Complexity:** The data structure's lookup complexity expressed in [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation).
810

911
## Array
1012

11-
* Memory: Low
12-
* Latency: Low
13+
### Overview
14+
15+
* Memory: Native
1316
* Complexity: O(n)
1417

15-
This data structure uses an array as the backing store. It is often faster than a normal array due to efficient early
16-
exits (value/length range checks).
17-
It works well for small amounts of data since the array is scanned linearly, but for larger datasets, the O(n)
18-
complexity hurts performance a lot.
18+
This data structure uses an array as the backing store. It is faster than scanning a plain array because lookups can exit early (value/length range checks).
19+
It works well for small amounts of data, but for larger datasets, the linear scan hurts performance.
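
The early-exit idea can be sketched roughly as follows. This is an illustrative Python sketch, not FastData's generated output: the dataset, the `MIN_KEY`/`MAX_KEY` constants, and the `array_contains` name are all assumptions made for the example.

```python
# Hypothetical static dataset; in a generated structure these values
# would be fixed at generation time.
KEYS = (3, 17, 42, 58, 91)

# Range metadata computed once, up front (assumed, for illustration).
MIN_KEY = min(KEYS)
MAX_KEY = max(KEYS)


def array_contains(value: int) -> bool:
    # Early exit: a value outside [MIN_KEY, MAX_KEY] cannot be in the set,
    # so the linear scan is skipped entirely.
    if value < MIN_KEY or value > MAX_KEY:
        return False
    # Fallback: O(n) linear scan over the backing array.
    for key in KEYS:
        if key == value:
            return True
    return False
```

For string keys, the same kind of check could be done on the key length instead of the value.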
1920

2021
## BinarySearch
2122

22-
* Memory: Low
23-
* Latency: Medium
23+
* Memory: Native
2424
* Complexity: O(log n)
2525

2626
This data structure sorts your data and does a binary search on it. Since data is sorted at compile time, there is no
27-
overhead at runtime. Each lookup
28-
has a higher latency than a simple array, but once the dataset gets to a few hundred items, it beats the array due to a
29-
lower complexity.
27+
overhead at runtime. It is good for medium-sized datasets when memory usage is a concern.
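
A minimal Python sketch of a lookup over pre-sorted data, assuming the keys were already sorted when the structure was generated. The dataset and the `binary_search_contains` name are hypothetical.

```python
from bisect import bisect_left

# Hypothetical dataset, already sorted ahead of time so that no sorting
# happens at lookup time.
SORTED_KEYS = ("beagle", "collie", "corgi", "poodle", "pug")


def binary_search_contains(value: str) -> bool:
    # bisect_left returns the position where value would be inserted to
    # keep the sequence sorted; the lookup takes O(log n) comparisons.
    index = bisect_left(SORTED_KEYS, value)
    return index < len(SORTED_KEYS) and SORTED_KEYS[index] == value
```

Only the sorted keys themselves are stored, which is consistent with the "Native" memory rating above.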
3028

3129
## Conditional
3230

33-
* Memory: Low
34-
* Latency: Low
31+
* Memory: Native
3532
* Complexity: O(n)
3633

37-
This data structure relies on built-in logic in the programming language. It produces if/switch statements which
38-
ultimately become machine instructions on the CPU, rather than data
39-
that resides in memory.
40-
Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
41-
becomes more efficient to have
42-
the data reside in memory.
34+
This data structure relies on built-in logic in the programming language. It produces logic statements which
35+
ultimately become machine instructions. It is faster than an array for small amounts of data, but performance starts to decline once there are more than roughly 400-500 keys.
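
As a rough illustration of what "logic statements" means here, the keys are embedded in the code itself rather than stored in a data array. The Python sketch below uses hypothetical keys and a hypothetical function name; the code FastData actually emits depends on the output language.

```python
def conditional_contains(value: str) -> bool:
    # Each key becomes its own comparison; there is no backing array.
    # The keys below are made up for the example.
    if value == "beagle":
        return True
    if value == "collie":
        return True
    if value == "pug":
        return True
    return False
```

The number of branches grows with the number of keys, which fits the observation that this approach stops paying off for larger datasets.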
4336

44-
## HashSet
37+
## HashTable
4538

46-
* Memory: Medium
47-
* Latency: Medium
39+
* Memory: Up to 16 bytes per key
4840
* Complexity: O(1)
4941

5042
This data structure is based on a hash table with separate chaining collision resolution. It uses a separate array for
5143
buckets to stay cache coherent, but it also uses more
52-
memory since it needs to keep track of indices.
44+
memory since it needs to keep track of indices.
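
One common way to lay out separate chaining with a separate bucket array is sketched below in Python. The layout (1-based indices, with 0 meaning "empty"), the toy hash function, and all names are assumptions for illustration; the source does not specify FastData's exact layout.

```python
# Hypothetical dataset and bucket count.
KEYS = ("beagle", "collie", "corgi", "poodle", "pug")
BUCKET_COUNT = 4


def toy_hash(key: str) -> int:
    # Deterministic toy hash for the example (Python's built-in hash()
    # for strings is randomized per process).
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


# buckets[i] holds the 1-based index of the first entry in chain i (0 = empty).
buckets = [0] * BUCKET_COUNT
# Each entry is (hash, key, next), where next is the 1-based index of the
# next entry in the same chain (0 = end of chain).
entries = []

for key in KEYS:
    h = toy_hash(key)
    b = h % BUCKET_COUNT
    entries.append((h, key, buckets[b]))  # link to the previous chain head
    buckets[b] = len(entries)             # this entry becomes the new head


def chained_contains(value: str) -> bool:
    h = toy_hash(value)
    i = buckets[h % BUCKET_COUNT]
    while i != 0:
        entry_hash, entry_key, next_index = entries[i - 1]
        if entry_hash == h and entry_key == value:
            return True
        i = next_index
    return False
```

The bucket array and the per-entry `next` index are the "indices" that cost extra memory compared with a plain array.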
45+
46+
### Special cases
47+
48+
#### Small hash table type optimization
49+
Hash table implementations usually need to store some bookkeeping data to do their job correctly. However, this adds
50+
quite a lot of memory overhead. FastData detects when the hash table is small enough to use smaller types, thereby saving some memory.
51+
52+
#### Keys are floating point numbers, but with no special values
53+
Floating point numbers have the concept of Not a Number (NaN) as well as multiple binary representations of zero.
54+
Because of this, a good float hash function will fold the many representations into a single representation. This ensures correctness.
55+
56+
However, the check adds overhead. When FastData does not find zero or NaN in the dataset, it uses a faster hash function.
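
The folding step can be illustrated with a small Python sketch that works on the raw IEEE 754 bit pattern of a 64-bit float. The function names and the chosen canonical NaN pattern are assumptions; this only shows the kind of check being described, not FastData's hash function.

```python
import math
import struct

# Assumed canonical bit pattern for "any NaN" (a quiet NaN).
CANONICAL_NAN_BITS = 0x7FF8000000000000


def raw_bits(value: float) -> int:
    # Fast path: reinterpret the 64-bit float as an unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def canonical_bits(value: float) -> int:
    # Slow path: fold every NaN into one pattern and -0.0 into +0.0,
    # so keys that should be treated as the same key hash identically.
    if math.isnan(value):
        return CANONICAL_NAN_BITS
    if value == 0.0:  # true for both +0.0 and -0.0
        return 0
    return raw_bits(value)
```

With `raw_bits`, `0.0` and `-0.0` produce different bit patterns even though they compare equal, which is why the folding check is needed whenever such values can appear in the dataset.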
57+
58+
#### Keys are identity hashed
59+
When the input is integer based, FastData uses an identity hash function (the hash of a key is the key itself).
60+
Because of that, we don't need to store both the key and the hash of the key. This saves 8 bytes per key.
61+
62+
#### No collision on keys
63+
If the keys have no collisions among them, a special data structure called PerfectHashTable is produced.
64+
It is like a normal HashTable, but without any logic for collision resolution, which makes it faster and saves up to 4 bytes per key.
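
A minimal Python sketch of a collision-free table, combining an identity hash with one slot per bucket. The keys, table size, and names are hypothetical and were picked so that no two keys share a slot; how FastData finds such a layout is not described here.

```python
# Hypothetical integer keys; with an identity hash, hash(key) == key.
KEYS = (12, 25, 31, 46, 58)
TABLE_SIZE = 8  # chosen so that key % TABLE_SIZE is unique for every key

# One slot per bucket: no chains and no "next" indices are needed.
SLOTS = [None] * TABLE_SIZE
for key in KEYS:
    index = key % TABLE_SIZE
    assert SLOTS[index] is None, "keys must not collide"
    SLOTS[index] = key


def perfect_contains(value: int) -> bool:
    # A single probe: compute the slot and compare the stored key.
    return SLOTS[value % TABLE_SIZE] == value
```

Because each lookup touches exactly one slot, there is no chain to walk and no per-entry link to store.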
65+
66+
## Comparison
67+
We can compare each data structure in a graph and see which one is the fastest for a given number of keys.
68+
The Y-axis is the number of queries per second (QPS). The X-axis is the number of keys.
69+
70+
![StructuresGraph.png](StructuresGraph.png)
71+
72+
We can see that Conditional starts out as the fastest, but at ~500 keys, it declines sharply.
73+
HashTable has stable lookup performance no matter how many keys it contains, but for fewer than 10 items, it is not a good choice.

Docs/StructuresGraph.png

Lines changed: 3 additions & 0 deletions

README.md

Lines changed: 23 additions & 5 deletions
@@ -24,15 +24,17 @@ It supports many output languages (C#, C++, Rust, etc.), ready for inclusion in
2424
### Using the executable
2525

2626
1. [Download](https://github.com/Genbox/FastData/releases/latest) the executable
27-
2. Run `FastData rust dogs.txt`
27+
2. Run `FastData <lang> dogs.txt`
28+
29+
`<lang>` can be one of _rust_, _cpp_, or _csharp_.
2830

2931
### Using the .NET CLI tool
3032
1. Install the [Genbox.FastData.Cli tool](https://www.nuget.org/packages/Genbox.FastData.Cli/): `dotnet tool install --global Genbox.FastData.Cli`
31-
2. Run `FastData cpp dogs.txt`
33+
2. Run `FastData <lang> dogs.txt`
3234

3335
### Using the PowerShell module
3436
1. Install the [PowerShell module](https://www.powershellgallery.com/packages/Genbox.FastData/): `Install-Module -Name Genbox.FastData`
35-
2. Run `Invoke-FastData -Language CSharp -InputFile dogs.txt`
37+
2. Run `Invoke-FastData -Language <lang> -InputFile dogs.txt`
3638

3739
### Using the .NET Source Generator
3840
1. Add the [Genbox.FastData.SourceGenerator](https://www.nuget.org/packages/Genbox.FastData.SourceGenerator/) package to your project
@@ -127,7 +129,9 @@ As a bonus, we also get some metadata about the dataset as constants, which, whe
127129
- **High-performance:** The data structures are generated without unnecessary branching or virtualization, allowing the compiler to produce optimal code.
128130
- **Key/Value support:** FastData can produce key/value lookup data structures
129131

130-
It supports several output programming languages.
132+
For more details about the data structures, see [data structures](Docs/DataStructures.md).
133+
134+
FastData supports several output programming languages.
131135

132136
* C#: `FastData csharp <input-file>`
133137
* C++: `FastData cpp <input-file>`
@@ -139,6 +143,7 @@ Each output language has different settings. Run `FastData <lang> --help` to see
139143

140144
A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's auto-generated data structure really illustrates the difference.
141145

146+
### Membership queries
142147
| Method | Categories | Mean | Factor |
143148
|-----------|------------|----------:|-------:|
144149
| Array | InSet | 6.5198 ns | - |
@@ -151,7 +156,16 @@ A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's
151156
| FrozenSet | NotInSet | 1.5816 ns | 4.68x |
152157
| FastData | NotInSet | 0.5284 ns | 14.01x |
153158

154-
Bigger factor means faster query times.
159+
### Keyed queries
160+
| Method | Categories | Mean | Factor |
161+
|--------------------|------------|---------:|-------:|
162+
| Dictionary | InSet | 6.890 ns | - |
163+
| FrozenDictionary | InSet | 1.484 ns | 4.64x |
164+
| FastData | InSet | 1.375 ns | 5.01x |
165+
| | | | |
166+
| DictionaryNF | NotInSet | 5.832 ns | - |
167+
| FrozenDictionaryNF | NotInSet | 1.376 ns | 4.24x |
168+
| FastDataNF | NotInSet | 1.349 ns | 4.32x |
155169

156170
## FAQ
157171

@@ -162,6 +176,9 @@ There are several reasons:
162176
* Frozen only provides a few of the optimizations provided in FastData
163177
* Frozen is only available in C#. FastData can produce data structures in many languages.
164178

179+
#### Does FastData use less memory than runtime structures?
180+
Yes and no. For some data structures like Array, it uses the same amount of memory. For others, like HashTable, depending on the data, it can use considerably less memory.
181+
165182
#### Does it support case-insensitive lookups?
166183
No, not yet.
167184

@@ -173,6 +190,7 @@ Yes, you can specify key/value arrays as input data and FastData will generate a
173190

174191
#### Are there any best practices for using FastData?
175192
* Put the most often queried items first in the input data. This can speed up queries for some data structures.
193+
* Enable string analysis when using string keys to produce a more efficient hash function.
176194

177195
#### Can I use it for dynamic data?
178196
No, FastData is designed for static data only. It generates code at compile time, so the data must be known beforehand.
