11# FastData
22
3- [ ![ NuGet] ( https://img.shields.io/nuget/v/Genbox.FastData.svg?style=flat-square&label=nuget )] ( https://www.nuget.org/packages/Genbox.FastData/ )
43[ ![ License] ( https://img.shields.io/github/license/Genbox/FastData )] ( https://github.com/Genbox/FastData/blob/master/LICENSE.txt )
54
5+ ![ Docs/FastData.png] ( Docs/FastData.png )
6+
67## Description
78
89FastData is a code generator that analyzes your data and creates high-performance, read-only lookup data structures for
910static data. It can output the data structures
1011in many different languages (C#, C++, Rust, etc.), ready for inclusion in your project with zero dependencies.
1112
13+ ## Download
14+
15+ [ ![ C# library] ( https://img.shields.io/nuget/v/Genbox.FastData.Generator.CSharp.svg?style=flat-square&label=nuget )] ( https://www.nuget.org/packages/Genbox.FastData.Generator.CSharp/ )
16+ [ ![ .NET Tool] ( https://img.shields.io/nuget/v/Genbox.FastData.Cli.svg?style=flat-square&label=nuget )] ( https://www.nuget.org/packages/Genbox.FastData.Cli/ )
17+
18+
1219## Use case
1320
1421Imagine a scenario where you have a predefined list of words (e.g., dog breeds) and need to check whether a specific dog
1522breed exists in the set.
16- Usually you create an array and look up the value. However, this is far from optimal and is missing a few optimizations.
23+ Usually you create an array and look up the value. However, this is far from optimal and lacks several optimizations.
1724
1825``` csharp
1926string [] breeds = [" Labrador" , " German Shepherd" , " Golden Retriever" ];
@@ -22,7 +29,7 @@ if (breeds.Contains("Beagle"))
2229 Console .WriteLine (" It contains Beagle" );
2330```
2431
25- We can do better by analyzing the dataset and generating a data structure optimized for the data.
32+ We can do better by analyzing the dataset and generating an optimized data structure .
2633
27341 . Create a file ` Dogs.txt ` with the following contents:
2835
@@ -32,7 +39,7 @@ German Shepherd
3239Golden Retriever
3340```
3441
35- 2 . Run ` FastData csharp Dogs.txt ` . It produces the output:
42+ 2 . Run ` FastData csharp Dogs.txt ` . It produces the following output:
3643
3744``` csharp
3845internal static class Dogs
@@ -62,8 +69,8 @@ internal static class Dogs
6269Benefits of the generated code:
6370
6471- ** Fast Early Exit:** A bitmap of string lengths allows early termination for out-of-range values.
65- - ** Efficient Lookups:** A switch-based data structure. It uses more advanced structures for larger data sets .
66- - ** Additional Metadata:** Provides item count and minimum/maximum string length .
72+ - ** Efficient Lookups:** A switch-based data structure which is fast for small datasets .
73+ - ** Additional Metadata:** Provides item count and other useful properties .
6774
6875A benchmark of the array versus our generated structure illustrates the difference: the generated structure is 13x faster.
6976
@@ -82,91 +89,71 @@ There are several ways of running FastData. See the sections below for details.
82892 . Create a file with an item per line
83903 . Run ` FastData csharp File.txt `
8491
85- ### Using it in a C# application
92+ ### Using the .NET Source Generator
8693
87- 1 . Add the ` Genbox.FastData.Generator.CSharp ` nuget package to your project.
88- 2 . Use the ` FastDataGenerator.TryGenerate() ` method. Give it your data as an array .
94+ 1 . Add the ` Genbox.FastData.SourceGenerator ` package to your project
95+ 2 . Add ` FastDataAttribute ` as an assembly level attribute .
8996
9097``` csharp
98+ using Genbox .FastData .SourceGenerator ;
99+
100+ [assembly : FastData <string >(" Dogs" , [" Labrador" , " German Shepherd" , " Golden Retriever" ])]
101+
91102internal static class Program
92103{
93104 private static void Main ()
94105 {
95- FastDataConfig config = new FastDataConfig ();
96- config .StringComparison = StringComparison .OrdinalIgnoreCase ;
97-
98- CSharpCodeGenerator generator = new CSharpCodeGenerator (new CSharpGeneratorConfig (" Dogs" ));
99-
100- if (! FastDataGenerator .TryGenerate ([" Labrador" , " German Shepherd" , " Golden Retriever" ], config , generator , out string ? source ))
101- Console .WriteLine (" Failed to generate source code" );
102-
103- Console .WriteLine (source );
106+ Console .WriteLine (Dogs .Contains (" Labrador" ));
107+ Console .WriteLine (Dogs .Contains (" Beagle" ));
104108 }
105109}
106110```
107111
108- ### Using the .NET Source Generator
112+ ### Using it as a C# library
109113
110- 1 . Add the ` Genbox.FastData.SourceGenerator ` package to your project
111- 2 . Add ` FastDataAttribute ` as an assembly level attribute .
114+ 1 . Add the ` Genbox.FastData.Generator.CSharp ` NuGet package to your project.
115+ 2 . Use the ` FastDataGenerator.TryGenerate() ` method. Give it your data as an array .
112116
113117``` csharp
114- using Genbox .FastData .SourceGenerator ;
115-
116- [assembly : FastData <string >(" Dogs" , [" Labrador" , " German Shepherd" , " Golden Retriever" ])]
117-
118118internal static class Program
119119{
120120 private static void Main ()
121121 {
122- Console .WriteLine (Dogs .Contains (" Labrador" ));
123- Console .WriteLine (Dogs .Contains (" Beagle" ));
122+ FastDataConfig config = new FastDataConfig ();
123+
124+ CSharpCodeGenerator generator = new CSharpCodeGenerator (new CSharpGeneratorConfig (" Dogs" ));
125+
126+ if (! FastDataGenerator .TryGenerate ([" Labrador" , " German Shepherd" , " Golden Retriever" ], config , generator , out string ? source ))
127+ Console .WriteLine (" Failed to generate source code" );
128+
129+ Console .WriteLine (source );
124130 }
125131}
126132```
127133
128- Whenever you change the array, it automatically generates the new source code and includes it in your project .
134+ Whenever you change the array, it automatically generates the new source code.
129135
130136## Features
131137
132138- ** Data Analysis:** Optimizes the structure based on the inherent properties of the dataset.
133- - ** Multiple Indexing Structures:** FastData automatically chooses the best structure for your data.
139+ - ** Multiple Structures:** FastData automatically chooses the best data structure for your data.
140+ - ** Fast hashing:** String lookups use a fast string hash function.
134141
135142It supports several output programming languages.
136143
137- * C# output : ` fastdata csharp <input-file>`
138- * C++ output : ` fastdata cplusplus <input-file>`
139- * Rust output : ` fastdata rust <input-file>`
144+ * C#: ` FastData csharp <input-file>`
145+ * C++: ` FastData cplusplus <input-file>`
146+ * Rust: ` FastData rust <input-file>`
140147
141- Each output language has different settings. Type ` fastdata <lang> --help` to see the options.
148+ Each output language has different settings. Type ` FastData <lang> --help` to see the options.
142149
143150### Data structures
144151
145152By default, FastData chooses the optimal data structure for your data, but you can also set it manually with
146- ` fastdata -s <type> ` . See the details of each structure type below.
147-
148- #### SingleValue
153+ ` FastData -s <type> ` . See the details of each structure type below.
149154
150- * Memory: Low
151- * Latency: Low
152- * Complexity: O(1)
153-
154- This data structure only supports a single value. It is much faster than an array with a single item and has no overhead
155- associated with it.
156- FastData always selects this data structure whenever your dataset only contains one item.
157-
158- #### Conditional
159-
160- * Memory: Low
161- * Latency: Low
162- * Complexity: O(n)
163-
164- This data structure relies on built-in logic in the programming language. It produces if/switch statements which
165- ultimately become machine instructions on the CPU, rather than data
166- that resides in memory.
167- Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
168- becomes more efficient to have
169- the data reside in memory.
155+ #### Auto
156+ This is the default option. It automatically selects the best data structure based on the number of items you provide.
170157
171158#### Array
172159
@@ -190,26 +177,20 @@ overhead at runtime. Each lookup
190177has a higher latency than a simple array, but once the dataset gets to a few hundred items, it beats the array due to a
191178lower complexity.
192179
193- #### EytzingerSearch
194-
195- * Memory: Low
196- * Latency: Medium
197- * Complexity: O(n* log(n))
198-
199- This data structure sorts data using an Eytzinger layout. It has better cache-locality than binary search. Under some
200- circumstances it has better performance.
201-
202- #### KeyLength
180+ #### Conditional
203181
204182* Memory: Low
205183* Latency: Low
206- * Complexity: O(1 )
184+ * Complexity: O(n )
207185
208- This data structure only works on strings, but it indexes them after their length, rather than a hash. In the case all
209- the strings have unique lengths, the
210- data structure further optimizes for latency.
186+ This data structure relies on built-in logic in the programming language. It produces if/switch statements which
187+ ultimately become machine instructions on the CPU, rather than data
188+ that resides in memory.
189+ Latency is therefore incredibly low, but the higher number of instructions bloats the assembly, and at a certain point it
190+ becomes more efficient to have
191+ the data reside in memory.
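
To make the Conditional idea concrete, here is a minimal hand-written sketch of a switch-based lookup. The class name `DogsConditional` is hypothetical, and this is not the actual FastData output; it only illustrates how the items can be expressed as branching code rather than as an array that is scanned at runtime.

```csharp
using System;

// Hypothetical illustration of a switch-based lookup; not generated by FastData.
internal static class DogsConditional
{
    public static bool Contains(string value)
    {
        // No array or table is scanned at runtime;
        // the lookup is compiled into branching code.
        switch (value)
        {
            case "Labrador":
            case "German Shepherd":
            case "Golden Retriever":
                return true;
            default:
                return false; // also reached when value is null
        }
    }
}
```

Calling `DogsConditional.Contains("Labrador")` returns `true`, while `DogsConditional.Contains("Beagle")` returns `false`. Every additional item adds another case to the method body, which is why this approach is best suited to small datasets.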
211192
212- #### HashSetChain
193+ #### HashSet
213194
214195* Memory: Medium
215196* Latency: Medium
@@ -219,35 +200,6 @@ This data structure is based on a hash table with separate chaining collision re
219200buckets to stay cache coherent, but it also uses more
220201memory since it needs to keep track of indices.
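
As a rough illustration of separate chaining, the sketch below keeps one array of bucket heads and one array of entries that link to each other by index. The names `DogsHashSet`, `Entry`, and `Hash` are assumptions made for this example, the hash is a plain DJB2 variant, and the tables are built at startup for brevity, whereas a generator would emit them precomputed. It is not the code FastData produces.

```csharp
using System;

// Hypothetical sketch of a read-only hash set with separate chaining.
internal static class DogsHashSet
{
    private struct Entry
    {
        public string Value;
        public int Next; // 1-based index of the next entry in the chain, 0 = end of chain
    }

    private static readonly int[] Buckets;   // 1-based index of the first entry per bucket, 0 = empty
    private static readonly Entry[] Entries; // all items stored contiguously

    static DogsHashSet()
    {
        string[] items = { "Labrador", "German Shepherd", "Golden Retriever" };
        Buckets = new int[items.Length];
        Entries = new Entry[items.Length];

        for (int i = 0; i < items.Length; i++)
        {
            uint bucket = Hash(items[i]) % (uint)Buckets.Length;
            Entries[i].Value = items[i];
            Entries[i].Next = Buckets[bucket]; // link to the previous head of this bucket
            Buckets[bucket] = i + 1;           // the new entry becomes the head
        }
    }

    // DJB2-style string hash; overflow wraps around on purpose.
    private static uint Hash(string value)
    {
        uint hash = 5381;
        foreach (char c in value)
            hash = unchecked(hash * 33 + c);
        return hash;
    }

    public static bool Contains(string value)
    {
        int index = Buckets[Hash(value) % (uint)Buckets.Length] - 1;

        // Walk the chain of entries that share this bucket.
        while (index >= 0)
        {
            if (string.Equals(Entries[index].Value, value, StringComparison.Ordinal))
                return true;
            index = Entries[index].Next - 1;
        }
        return false;
    }
}
```

The extra `Buckets` array and the `Next` field are the index bookkeeping that costs additional memory compared to a plain array.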
221202
222- #### HashSetLinear
223-
224- * Memory: Medium
225- * Latency: Medium
226- * Complexity: O(1)
227-
228- This data structure is also a hash table, but with linear collision resolution.
229-
230- #### PerfectHashBruteForce
231-
232- * Memory: Low
233- * Latency: Low
234- * Complexity: O(1)
235-
236- This data structure tries to create a perfect hash for the dataset. It does so by brute-forcing a seed for a simple hash
237- function
238- until it hits the right combination. If the dataset is small enough, it can even produce a minimal perfect hash.
239-
240- #### PerfectHashGPerf
241-
242- * Memory: Low
243- * Latency: Low
244- * Complexity: O(1)
245-
246- This data structure uses the same algorithm as gperf to derive a perfect hash. It uses Richard J. Cichelli's method for
247- creating an associative table,
248- which is augmented using alpha increments to resolve collisions. It only works on strings, but it is great for
249- medium-sized datasets.
250-
251203## How does it work?
252204
253205The idea behind the project is to generate a data-dependent optimized data structure for read-only lookup. When data is
@@ -258,7 +210,7 @@ of different data structures, indexing, and comparison methods that are tailor-b
258210
259211There are many benefits gained from generating data structures at compile time:
260212
261- * Enables analysis the data
213+ * Enables otherwise time-consuming data analysis
262214* Zero runtime overhead
263215* No defensive copying of data (takes time and needs double the memory)
264216* No virtual dispatching (virtual method calls & inheritance)
@@ -276,19 +228,6 @@ FastData uses advanced data analysis techniques to generate optimized data struc
276228It uses the analysis to create so-called early-exits, which are fast ` O(1) ` checks on your input before doing any ` O(n) `
277229checks on the actual dataset.
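
The early-exit idea can be sketched as follows, assuming a bitmap of string lengths like the one mentioned under the benefits of the generated code. The class name `DogsEarlyExit` and its members are hypothetical, and the bitmap is computed at startup here for readability, whereas generated code would more likely embed it as a constant.

```csharp
using System;

// Hypothetical sketch of a length-bitmap early exit; not the actual FastData output.
internal static class DogsEarlyExit
{
    private static readonly string[] Items = { "Labrador", "German Shepherd", "Golden Retriever" };

    // Bit n is set when at least one item has length n.
    private static readonly ulong LengthBitmap = BuildBitmap(Items);

    private static ulong BuildBitmap(string[] items)
    {
        ulong bitmap = 0;
        foreach (string item in items)
            bitmap |= 1UL << item.Length; // assumes every item is shorter than 64 characters
        return bitmap;
    }

    public static bool Contains(string value)
    {
        // O(1) early exit: reject any input whose length matches no item.
        if (value.Length >= 64 || (LengthBitmap & (1UL << value.Length)) == 0)
            return false;

        // O(n) check on the actual dataset, only reached for plausible inputs.
        foreach (string item in Items)
        {
            if (string.Equals(item, value, StringComparison.Ordinal))
                return true;
        }
        return false;
    }
}
```

With this sketch, `"Beagle"` (length 6) is rejected by the bitmap without any string comparison, while an input of length 8, 15, or 16 falls through to the full comparison.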
278230
279- #### Hash function generators
280-
281- Hash functions come in many flavors. Some are designed for low latency, some for throughput, others for low collision
282- rate.
283- Programming language runtimes come with a hash function that is a tradeoff between these parameters. FastData builds a
284- hash function specifically tailored to the dataset.
285- It has support for several techniques:
286-
287- 1 . ** Default:** If no technique is selected, FastData uses a hash function by Daniel Bernstein (DJB2)
288- 2 . ** Brute force:** It spends some time on trying increasingly stronger hash functions
289- 3 . ** Heuristic:** It tries to build a hash function that selects for entropy in strings
290- 4 . ** Genetic algorithm:** It uses machine learning to evolve a hash function that matches the data effectively
291-
292231## Best practices
293232
294233* Put the most often queried items first in the input data. It can speed up query speed for some data structures.
@@ -300,4 +239,4 @@ It has support for several techniques:
300239 * Frozen comes with considerable runtime overhead
301240 * Frozen is only available in .NET 8.0+
302241 * Frozen only provides a few of the optimizations provided in FastData
303- * Frozen is only available in C#. FastData can produce data structures in many langauges .
242+ * Frozen is only available in C#. FastData can produce data structures in many languages .