By default, FastData chooses the optimal data structure for your data, but you can also set it manually with
4
-
`FastData -s <type>`. See the details of each structure type below.
4
+
`FastData -s <type>`.
5
5
6
-
## Auto
7
-
This is the default option. It automatically selects the best data structure based on the number of items you provide.
6
+
Below you'll see details of each structure.
7
+
8
+
- **Memory:** The memory overhead of the data structure. _Native_ means it does not have any overhead.
9
+
- **Complexity:** The data structure's complexity expressed in [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation).
8
10
9
11
## Array
10
12
11
-
* Memory: Low
12
-
* Latency: Low
13
+
### Overview
14
+
15
+
* Memory: Native
13
16
* Complexity: O(n)
14
17
15
-
This data structure uses an array as the backing store. It is often faster than a normal array due to efficient early
16
-
exits (value/length range checks).
17
-
It works well for small amounts of data since the array is scanned linearly, but for larger datasets, the O(n)
18
-
complexity hurts performance a lot.
18
+
This data structure uses an array as the backing store. It is faster than a normal array due to early exits (value/length range checks).
19
+
It works well for small amounts of data, but for larger datasets, the linear scan over the data hurts performance.
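
The early-exit idea can be sketched as follows. This is a minimal illustration in Python, not FastData's generated code: the dataset, the constant names, and `array_contains` are all hypothetical, and only a length range check is shown.

```python
# Hypothetical dataset; a generator would bake these values in ahead of time.
ITEMS = ("beagle", "corgi", "poodle", "labrador")

# Length range computed once, up front (assumed here to be the "early exit" data).
MIN_LEN = min(len(s) for s in ITEMS)
MAX_LEN = max(len(s) for s in ITEMS)


def array_contains(key):
    """Linear scan over ITEMS, skipped entirely when the key length is out of range."""
    if len(key) < MIN_LEN or len(key) > MAX_LEN:
        return False  # early exit: no item can have this length
    for item in ITEMS:
        if item == key:
            return True
    return False
```

A key such as `"pug"` is rejected by the range check without touching the array, while a key whose length is in range still pays for the full O(n) scan.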
19
20
20
21
## BinarySearch
21
22
22
-
* Memory: Low
23
-
* Latency: Medium
23
+
* Memory: Native
24
24
* Complexity: O(log n)
25
25
26
26
This data structure sorts your data and does a binary search on it. Since data is sorted at compile time, there is no
27
-
overhead at runtime. Each lookup
28
-
has a higher latency than a simple array, but once the dataset gets to a few hundred items, it beats the array due to a
29
-
lower complexity.
27
+
overhead at runtime. It is good for medium-sized datasets when memory usage is a concern.
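
As a rough sketch of the lookup described above (written in Python for brevity, with a hypothetical dataset and function name), the keys are sorted once ahead of time and each query is a binary search over the sorted sequence:

```python
from bisect import bisect_left

# Sorted once, ahead of time. A generator would emit the keys already sorted,
# so no sorting happens at lookup time.
SORTED_KEYS = tuple(sorted(("poodle", "beagle", "labrador", "corgi")))


def binary_contains(key):
    """O(log n) membership test on the pre-sorted keys."""
    i = bisect_left(SORTED_KEYS, key)  # leftmost position where key could be inserted
    return i < len(SORTED_KEYS) and SORTED_KEYS[i] == key
```

No extra memory is needed beyond the keys themselves, which matches the _Native_ memory rating.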
30
28
31
29
## Conditional
32
30
33
-
* Memory: Low
34
-
* Latency: Low
31
+
* Memory: Native
35
32
* Complexity: O(n)
36
33
37
-
This data structure relies on built-in logic in the programming language. It produces if/switch statements which
38
-
ultimately become machine instructions on the CPU, rather than data
39
-
that resides in memory.
40
-
Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
41
-
becomes more efficient to have
42
-
the data reside in memory.
34
+
This data structure relies on built-in logic in the programming language. It produces conditional statements (such as if/switch) which
35
+
ultimately become machine instructions. It is faster than an array for small amounts of data, but once there are more than 400-500 keys, its performance starts to decline.
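
To make the idea concrete, here is a hand-written Python sketch of what such generated logic could look like for a tiny, hypothetical dataset. The function name and keys are assumptions, and the code FastData emits for its target languages will differ.

```python
def conditional_contains(key):
    """Membership test expressed purely as branches; the keys live in the code, not in a data array."""
    if key == "beagle":
        return True
    if key == "corgi":
        return True
    if key == "poodle":
        return True
    return False
```

Every additional key adds another branch, which is why this approach only pays off while the key count stays small.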
43
36
44
-
## HashSet
37
+
## HashTable
45
38
46
-
* Memory: Medium
47
-
* Latency: Medium
39
+
* Memory: up to 16 bytes per key
48
40
* Complexity: O(1)
49
41
50
42
This data structure is based on a hash table with separate chaining collision resolution. It uses a separate array for
51
43
buckets to stay cache-friendly, but it also uses more
52
-
memory since it needs to keep track of indices.
44
+
memory since it needs to keep track of indices.
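
The layout described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than FastData's implementation: the hash function, the bucket count, and every name below are hypothetical.

```python
KEYS = ("beagle", "corgi", "poodle", "labrador")
BUCKET_COUNT = 4  # hypothetical; a real generator would pick this from the data


def simple_hash(s):
    """Small deterministic string hash, used only for this illustration."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


# buckets[b] holds the 1-based index of the first entry in the chain (0 = empty).
# entries holds (hash, key, next) tuples, where next is again a 1-based index.
buckets = [0] * BUCKET_COUNT
entries = []
for key in KEYS:
    h = simple_hash(key)
    b = h % BUCKET_COUNT
    entries.append((h, key, buckets[b]))  # new entry points at the previous chain head
    buckets[b] = len(entries)             # bucket now points at the new entry


def chained_contains(key):
    """Follow the chain for the key's bucket until a match or the end of the chain."""
    h = simple_hash(key)
    i = buckets[h % BUCKET_COUNT]
    while i != 0:
        entry_hash, entry_key, next_i = entries[i - 1]
        if entry_hash == h and entry_key == key:
            return True
        i = next_i
    return False
```

The `buckets` list and the `next` field in each entry are the extra indices that cost memory compared to a plain array of keys.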
45
+
46
+
### Special cases
47
+
48
+
#### Small hash table type optimization
49
+
Hash table implementations usually need to store some bookkeeping data to do their job correctly. However, this adds
50
+
quite a lot of memory overhead. FastData detects when the hash table is small enough to use smaller types, thereby saving some memory.
51
+
52
+
#### Keys are floating point numbers, but with no special values
53
+
Floating point numbers have the concept of Not a Number (NaN) as well as multiple binary representations of zero.
54
+
Because of this, a good float hash function will fold the many representations into a single representation. This ensures correctness.
55
+
56
+
However, the check adds overhead. When the dataset contains neither zero nor NaN, FastData uses a faster hash function.
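
The difference between the two hash functions can be illustrated with a small Python sketch. The function names and the canonical NaN bit pattern are assumptions made for this example; FastData's actual hash functions are not shown in this document.

```python
import math
import struct

CANONICAL_NAN = 0x7FF8000000000000  # assumed canonical quiet-NaN bit pattern


def float_bits(x):
    """Raw IEEE 754 bit pattern of a 64-bit float as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def hash_float_safe(x):
    """Folds +0.0/-0.0 and all NaN payloads into one value each before hashing."""
    if x == 0.0:  # true for both +0.0 and -0.0
        return 0
    if math.isnan(x):
        return CANONICAL_NAN
    return float_bits(x)


def hash_float_fast(x):
    """No special-value checks; only valid when the dataset has no zero or NaN."""
    return float_bits(x)
```

The safe variant is needed because `0.0 == -0.0` is true even though the two values have different bit patterns, so hashing the raw bits would place equal keys in different buckets.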
57
+
58
+
#### Keys are identity hashed
59
+
When the input is integer-based, FastData uses an identity hash function (the hash of a key is the key itself).
60
+
Because of that, we don't need to store both the key and the hash of the key. This saves 8 bytes per key.
61
+
62
+
#### No collision on keys
63
+
If the keys have no collisions among them, a special data structure called PerfectHashTable is produced.
64
+
It is like a normal HashTable, but without any logic for collision resolution, thereby making it faster and saving up to 4 bytes per key.
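
A collision-free table combined with an identity hash can be sketched in Python like this. It is a simplified illustration, not the PerfectHashTable that FastData produces: the keys, the brute-force size search, and all names are hypothetical.

```python
KEYS = (3, 17, 42, 58, 91)  # hypothetical integer dataset


def find_collision_free_size(keys):
    """Smallest table size (>= len(keys)) where `key % size` is unique for every key."""
    size = len(keys)
    while len({k % size for k in keys}) != len(keys):
        size += 1
    return size


SIZE = find_collision_free_size(KEYS)

# One slot per possible index. Identity hash: the key is its own hash,
# so only the key is stored, and there is no chain to follow.
SLOTS = [None] * SIZE
for k in KEYS:
    SLOTS[k % SIZE] = k


def perfect_contains(key):
    """A single slot comparison; no collision resolution is needed."""
    return SLOTS[key % SIZE] == key
```

Because each slot holds at most one key, a lookup is one index computation and one comparison, and the `next` index used for chaining is no longer stored.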
65
+
66
+
## Comparison
67
+
We can compare each data structure in a graph and see which one is the fastest for a given number of keys.
68
+
The Y-axis is the number of queries per second (QPS). The X-axis is the number of keys.
69
+
70
+

71
+
72
+
We can see that Conditional starts out by being the fastest, but at ~500 keys, it declines sharply.
73
+
HashTable has stable lookup performance no matter how many keys it contains, but for fewer than 10 items, it is not a good choice.
README.md (23 additions & 5 deletions)
@@ -24,15 +24,17 @@ It supports many output languages (C#, C++, Rust, etc.), ready for inclusion in
24
24
### Using the executable
25
25
26
26
1. [Download](https://github.com/Genbox/FastData/releases/latest) the executable
27
-
2. Run `FastData rust dogs.txt`
27
+
2. Run `FastData <lang> dogs.txt`
28
+
29
+
`<lang>` can be one of _rust_, _cpp_, or _csharp_.
28
30
29
31
### Using the .NET CLI tool
30
32
1. Install the [Genbox.FastData.Cli tool](https://www.nuget.org/packages/Genbox.FastData.Cli/): `dotnet tool install --global Genbox.FastData.Cli`
31
-
2. Run `FastData cpp dogs.txt`
33
+
2. Run `FastData <lang> dogs.txt`
32
34
33
35
### Using the PowerShell module
34
36
1. Install the [PowerShell module](https://www.powershellgallery.com/packages/Genbox.FastData/): `Install-Module -Name Genbox.FastData`
35
-
2. Run `Invoke-FastData -Language CSharp -InputFile dogs.txt`
37
+
2. Run `Invoke-FastData -Language <lang> -InputFile dogs.txt`
36
38
37
39
### Using the .NET Source Generator
38
40
1. Add the [Genbox.FastData.SourceGenerator](https://www.nuget.org/packages/Genbox.FastData.SourceGenerator/) package to your project
@@ -127,7 +129,9 @@ As a bonus, we also get some metadata about the dataset as constants, which, whe
127
129
- **High-performance:** The data structures are generated without unnecessary branching or virtualization, allowing the compiler to produce optimal code.
128
130
- **Key/Value support:** FastData can produce key/value lookup data structures.
129
131
130
-
It supports several output programming languages.
132
+
For more details about the data structures, see [data structures](Docs/DataStructures.md).
133
+
134
+
FastData supports several output programming languages.
131
135
132
136
* C#: `FastData csharp <input-file>`
133
137
* C++: `FastData cpp <input-file>`
@@ -139,6 +143,7 @@ Each output language has different settings. Run `FastData <lang> --help` to see
139
143
140
144
A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's auto-generated data structure really illustrates the difference.
141
145
146
+
### Membership queries
142
147
| Method | Categories | Mean | Factor |
143
148
|-----------|------------|----------:|-------:|
144
149
| Array | InSet | 6.5198 ns | - |
@@ -151,7 +156,16 @@ A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's
* Frozen only provides a few of the optimizations provided in FastData
163
177
* Frozen is only available in C#. FastData can produce data structures in many languages.
164
178
179
+
#### Does FastData use less memory than runtime structures?
180
+
Yes and no. For some data structures like Array, it uses the same amount of memory. For others, like HashTable, depending on the data, it can use considerably less memory.
181
+
165
182
#### Does it support case-insensitive lookups?
166
183
No, not yet.
167
184
@@ -173,6 +190,7 @@ Yes, you can specify key/value arrays as input data and FastData will generate a
173
190
174
191
#### Are there any best practices for using FastData?
175
192
* Put the most often queried items first in the input data. It can improve query speed for some data structures.
193
+
* Enable string analysis when using string keys to produce a more efficient hash function.
176
194
177
195
#### Can I use it for dynamic data?
178
196
No, FastData is designed for static data only. It generates code at compile time, so the data must be known beforehand.