Update docs

Genbox · Genbox · commit 7751ef6a5f26 · 2025-08-03T19:15:13.000+02:00
diff --git a/Docs/DataStructures.md b/Docs/DataStructures.md
@@ -1,52 +1,73 @@
 # Data structures
 
 By default, FastData chooses the optimal data structure for your data, but you can also set it manually with
-`FastData -s <type>`. See the details of each structure type below.
+`FastData -s <type>`.
 
-## Auto
-This is the default option. It automatically selects the best data structure based on the number of items you provide.
+Below you'll see details of each structure.
+
+- **Memory:** The memory overhead of the data structure. _Native_ means it does not have any overhead.
+- **Complexity:** The data structure's lookup complexity expressed in [Big O notation](https://en.wikipedia.org/wiki/Big_O_notation).
 
 ## Array
 
-* Memory: Low
-* Latency: Low
+### Overview
+
+* Memory: Native
 * Complexity: O(n)
 
-This data structure uses an array as the backing store. It is often faster than a normal array due to efficient early
-exits (value/length range checks).
-It works well for small amounts of data since the array is scanned linearly, but for larger datasets, the O(n)
-complexity hurts performance a lot.
+This data structure uses an array as the backing store. It is faster than a normal array due to early exits (value/length range checks).
+It works well for small amounts of data, but for larger datasets, the linear scan hurts performance.
 
 ## BinarySearch
 
-* Memory: Low
-* Latency: Medium
+* Memory: Native
 * Complexity: O(log n)
 
 This data structure sorts your data and does a binary search on it. Since data is sorted at compile time, there is no
-overhead at runtime. Each lookup
-has a higher latency than a simple array, but once the dataset gets to a few hundred items, it beats the array due to a
-lower complexity.
+overhead at runtime. It is a good fit for medium-sized datasets when memory usage is a concern.
 
 ## Conditional
 
-* Memory: Low
-* Latency: Low
+* Memory: Native
 * Complexity: O(n)
 
-This data structure relies on built-in logic in the programming language. It produces if/switch statements which
-ultimately become machine instructions on the CPU, rather than data
-that resides in memory.
-Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
-becomes more efficient to have
-the data reside in memory.
+This data structure relies on built-in logic in the programming language. It produces logic statements which
+ultimately become machine instructions. It is faster than an array for small amounts of data, but performance starts to decline once there are more than roughly 400-500 keys.
 
-## HashSet
+## HashTable
 
-* Memory: Medium
-* Latency: Medium
+* Memory: up to 16 bytes per key
 * Complexity: O(1)
 
 This data structure is based on a hash table with separate chaining collision resolution. It uses a separate array for
 buckets to stay cache coherent, but it also uses more
-memory since it needs to keep track of indices.
+memory since it needs to keep track of indices.
+
+### Special cases
+
+#### Small hash table type optimization
+Hash table implementations usually need to store some bookkeeping data to do their job correctly. However, this adds
+quite a lot of memory overhead. FastData detects when the hash table is small enough to use smaller types, thereby saving some memory.
+
+#### Keys are floating point numbers, but with no special values
+Floating point numbers have the concept of Not a Number (NaN) as well as multiple binary representations of zero.
+Because of this, a good float hash function will fold the many representations into a single representation. This ensures correctness.
+
+However, the check adds overhead. When FastData does not see zero or NaN in the dataset, it uses a faster hash function.
+
+#### Keys are identity hashed
+When the input is integer-based, FastData uses an identity hash function (the hash of a key is the key itself).
+Because of that, we don't need to store both the key and the hash of the key. This saves 8 bytes per key.
+
+#### No collision on keys
+If the keys have no collisions among them, a special data structure called PerfectHashTable is produced.
+It is like a normal HashTable, but without any logic for collision resolution, thereby making it faster and saving up to 4 bytes per key.
+
+## Comparison
+We can compare each data structure in a graph and see which one is the fastest for a given number of keys.
+The Y-axis is the number of queries per second (QPS). The X-axis is the number of keys.
+
+![StructuresGraph.png](StructuresGraph.png)
+
+We can see that Conditional starts out by being the fastest, but at ~500 keys, it declines sharply.
+HashTable has stable lookup performance no matter how many keys it contains, but for fewer than 10 items, it is not a good choice.
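The floating-point special case described in `Docs/DataStructures.md` can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not FastData's generated code: the names `hash_f64_safe`, `hash_f64_fast`, and `mix`, as well as the mixing constant, are hypothetical and chosen only for this example.

```rust
/// Hypothetical mixing step; the constant is an arbitrary odd number,
/// not something taken from FastData.
fn mix(bits: u64) -> u64 {
    let h = bits.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    h ^ (h >> 32)
}

/// Safe variant: folds every NaN payload into one canonical NaN and
/// both zeros into +0.0 before hashing the bit pattern.
fn hash_f64_safe(key: f64) -> u64 {
    let canonical = if key.is_nan() {
        f64::NAN
    } else if key == 0.0 {
        0.0 // -0.0 == 0.0 compares true, so both zeros land here
    } else {
        key
    };
    mix(canonical.to_bits())
}

/// Fast variant: hashes the raw bit pattern. It is only appropriate
/// when the dataset is known to contain neither zero nor NaN.
fn hash_f64_fast(key: f64) -> u64 {
    mix(key.to_bits())
}
```

With these definitions, `hash_f64_safe(0.0)` and `hash_f64_safe(-0.0)` agree, while the fast variant gives different hashes for the two zeros. For any key that is neither zero nor NaN, both variants return the same value, which is why skipping the check is safe for such datasets.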
diff --git a/Docs/StructuresGraph.png b/Docs/StructuresGraph.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:dde0d8fc3891df3ba023eb05aec6adcf2ebe0b2f3a7e873063575ccb71cd0391
+size 132027
diff --git a/README.md b/README.md
@@ -24,15 +24,17 @@ It supports many output languages (C#, C++, Rust, etc.), ready for inclusion in
 ### Using the executable
 
 1. [Download](https://github.com/Genbox/FastData/releases/latest) the executable
-2. Run `FastData rust dogs.txt`
+2. Run `FastData <lang> dogs.txt`
+
+`<lang>` can be one of _rust_, _cpp_, or _csharp_.
 
 ### Using the .NET CLI tool
 1. Install the [Genbox.FastData.Cli tool](https://www.nuget.org/packages/Genbox.FastData.Cli/): `dotnet tool install --global Genbox.FastData.Cli`
-2. Run `FastData cpp dogs.txt`
+2. Run `FastData <lang> dogs.txt`
 
 ### Using the PowerShell module
 1. Install the [PowerShell module](https://www.powershellgallery.com/packages/Genbox.FastData/): `Install-Module -Name Genbox.FastData`
-2. Run `Invoke-FastData -Language CSharp -InputFile dogs.txt`
+2. Run `Invoke-FastData -Language <lang> -InputFile dogs.txt`
 
 ### Using the .NET Source Generator
 1. Add the [Genbox.FastData.SourceGenerator](https://www.nuget.org/packages/Genbox.FastData.SourceGenerator/) package to your project
@@ -127,7 +129,9 @@ As a bonus, we also get some metadata about the dataset as constants, which, whe
 - **High-performance:** The data structures are generated without unnecessary branching or virtualization, letting the compiler produce optimal code.
 - **Key/Value support:** FastData can produce key/value lookup data structures
 
-It supports several output programming languages.
+For more details about the data structures, see [data structures](Docs/DataStructures.md).
+
+FastData supports several output programming languages.
 
 * C#: `FastData csharp <input-file>`
 * C++: `FastData cpp <input-file>`
@@ -139,6 +143,7 @@ Each output language has different settings. Run `FastData <lang> --help` to see
 
 A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's auto-generated data structure really illustrates the difference.
 
+### Membership queries
 | Method    | Categories |      Mean | Factor |
 |-----------|------------|----------:|-------:|
 | Array     | InSet      | 6.5198 ns |      - |
@@ -151,7 +156,16 @@ A benchmark of .NET's `Array`, `HashSet<T>` and `FrozenSet<T>` versus FastData's
 | FrozenSet | NotInSet   | 1.5816 ns |  4.68x |
 | FastData  | NotInSet   | 0.5284 ns | 14.01x |
 
-Bigger factor means faster query times.
+### Keyed queries
+| Method             | Categories |     Mean | Factor |
+|--------------------|------------|---------:|-------:|
+| Dictionary         | InSet      | 6.890 ns |      - |
+| FrozenDictionary   | InSet      | 1.484 ns |  4.64x |
+| FastData           | InSet      | 1.375 ns |  5.01x |
+|                    |            |          |        |
+| DictionaryNF       | NotInSet   | 5.832 ns |      - |
+| FrozenDictionaryNF | NotInSet   | 1.376 ns |  4.24x |
+| FastDataNF         | NotInSet   | 1.349 ns |  4.32x |
 
 ## FAQ
 
@@ -162,6 +176,9 @@ There are several reasons:
 * Frozen only provides a few of the optimizations provided in FastData
 * Frozen is only available in C#. FastData can produce data structures in many languages.
 
+#### Does FastData use less memory than runtime structures?
+Yes and no. For some data structures like Array, it uses the same amount of memory. For others, like HashTable, depending on the data, it can use considerably less memory.
+
 #### Does it support case-insensitive lookups?
 No, not yet.
 
@@ -173,6 +190,7 @@ Yes, you can specify key/value arrays as input data and FastData will generate a
 
 #### Are there any best practices for using FastData?
 * Put the most often queried items first in the input data. It can speed up query speed for some data structures.
+* Enable string analysis when using string keys to produce a more efficient hash function.
 
 #### Can I use it for dynamic data?
 No, FastData is designed for static data only. It generates code at compile time, so the data must be known beforehand.
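The tip about putting the most often queried items first matters most for linearly scanned structures such as Array. The sketch below shows a linear scan with the length-range early exit that the data structure docs mention. It is a hand-written illustration, not output from FastData: the key list, the constants, and the function names `contains` and `comparisons_until_hit` are assumptions.

```rust
// Hypothetical dataset, ordered with the most frequently queried key first.
const KEYS: [&str; 4] = ["labrador", "beagle", "poodle", "pug"];
// Shortest and longest key lengths, known when the code is generated.
const MIN_LEN: usize = 3;
const MAX_LEN: usize = 8;

/// Membership test: reject keys outside the length range without
/// touching the array, otherwise scan the keys in order.
fn contains(key: &str) -> bool {
    if key.len() < MIN_LEN || key.len() > MAX_LEN {
        return false; // early exit, no comparisons needed
    }
    KEYS.iter().any(|k| *k == key)
}

/// Number of string comparisons a linear scan performs before a hit,
/// or None when the key is absent.
fn comparisons_until_hit(key: &str) -> Option<usize> {
    KEYS.iter().position(|k| *k == key).map(|i| i + 1)
}
```

Here a query for `"labrador"` is answered after one comparison and `"pug"` after four, while `"chihuahua"` is rejected by the length check alone. Ordering the input by query frequency therefore lowers the average cost of a hit in a linear scan.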