
Commit cac54b4

Update docs/readme
1 parent 3b32b58 commit cac54b4

File tree

4 files changed

+57
-115
lines changed

Docs/BinarySearch.xlsx

-17.3 KB
Binary file not shown.

Docs/FastData.png

Lines changed: 3 additions & 0 deletions

Docs/Indexes.xlsx

-26.7 KB
Binary file not shown.

README.md

Lines changed: 54 additions & 115 deletions
Original file line number | Diff line number | Diff line change
@@ -1,19 +1,26 @@
11
# FastData
22

3-
[![NuGet](https://img.shields.io/nuget/v/Genbox.FastData.svg?style=flat-square&label=nuget)](https://www.nuget.org/packages/Genbox.FastData/)
43
[![License](https://img.shields.io/github/license/Genbox/FastData)](https://github.com/Genbox/FastData/blob/master/LICENSE.txt)
54

5+
![Docs/FastData.png](Docs/FastData.png)
6+
67
## Description
78

89
FastData is a code generator that analyzes your data and creates high-performance, read-only lookup data structures for
910
static data. It can output the data structures
1011
in many different languages (C#, C++, Rust, etc.), ready for inclusion in your project with zero dependencies.
1112

13+
## Download
14+
15+
[![C# library](https://img.shields.io/nuget/v/Genbox.FastData.Generator.CSharp.svg?style=flat-square&label=nuget)](https://www.nuget.org/packages/Genbox.FastData.Generator.CSharp/)
16+
[![.NET Tool](https://img.shields.io/nuget/v/Genbox.FastData.Cli.svg?style=flat-square&label=nuget)](https://www.nuget.org/packages/Genbox.FastData.Cli/)
17+
18+
1219
## Use case
1320

1421
Imagine a scenario where you have a predefined list of words (e.g., dog breeds) and need to check whether a specific dog
1522
breed exists in the set.
16-
Usually you create an array and look up the value. However, this is far from optimal and is missing a few optimizations.
23+
Usually you create an array and look up the value. However, this is far from optimal and lacks several optimizations.
1724

1825
```csharp
1926
string[] breeds = ["Labrador", "German Shepherd", "Golden Retriever"];
@@ -22,7 +29,7 @@ if (breeds.Contains("Beagle"))
2229
Console.WriteLine("It contains Beagle");
2330
```
2431

25-
We can do better by analyzing the dataset and generating a data structure optimized for the data.
32+
We can do better by analyzing the dataset and generating an optimized data structure.
2633

2734
1. Create a file `Dogs.txt` with the following contents:
2835

@@ -32,7 +39,7 @@ German Shepherd
3239
Golden Retriever
3340
```
3441

35-
2. Run `FastData csharp Dogs.txt`. It produces the output:
42+
2. Run `FastData csharp Dogs.txt`. It produces the following output:
3643

3744
```csharp
3845
internal static class Dogs
@@ -62,8 +69,8 @@ internal static class Dogs
6269
Benefits of the generated code:
6370

6471
- **Fast Early Exit:** A bitmap of string lengths allows early termination for out-of-range values.
65-
- **Efficient Lookups:** A switch-based data structure. It uses more advanced structures for larger data sets.
66-
- **Additional Metadata:** Provides item count and minimum/maximum string length.
72+
- **Efficient Lookups:** A switch-based data structure which is fast for small datasets.
73+
- **Additional Metadata:** Provides item count and other useful properties.
6774

6875
A benchmark of the array versus our generated structure really illustrates the difference. It is 13x faster.
6976

@@ -82,91 +89,71 @@ There are several ways of running FastData. See the sections below for details.
8289
2. Create a file with an item per line
8390
3. Run `FastData csharp File.txt`
8491

85-
### Using it in a C# application
92+
### Using the .NET Source Generator
8693

87-
1. Add the `Genbox.FastData.Generator.CSharp` nuget package to your project.
88-
2. Use the `FastDataGenerator.TryGenerate()` method. Give it your data as an array.
94+
1. Add the `Genbox.FastData.SourceGenerator` package to your project
95+
2. Add `FastDataAttribute` as an assembly level attribute.
8996

9097
```csharp
98+
using Genbox.FastData.SourceGenerator;
99+
100+
[assembly: FastData<string>("Dogs", ["Labrador", "German Shepherd", "Golden Retriever"])]
101+
91102
internal static class Program
92103
{
93104
private static void Main()
94105
{
95-
FastDataConfig config = new FastDataConfig();
96-
config.StringComparison = StringComparison.OrdinalIgnoreCase;
97-
98-
CSharpCodeGenerator generator = new CSharpCodeGenerator(new CSharpGeneratorConfig("Dogs"));
99-
100-
if (!FastDataGenerator.TryGenerate(["Labrador", "German Shepherd", "Golden Retriever"], config, generator, out string? source))
101-
Console.WriteLine("Failed to generate source code");
102-
103-
Console.WriteLine(source);
106+
Console.WriteLine(Dogs.Contains("Labrador"));
107+
Console.WriteLine(Dogs.Contains("Beagle"));
104108
}
105109
}
106110
```
107111

108-
### Using the .NET Source Generator
112+
### Using it as a C# library
109113

110-
1. Add the `Genbox.FastData.SourceGenerator` package to your project
111-
2. Add `FastDataAttribute` as an assembly level attribute.
114+
1. Add the `Genbox.FastData.Generator.CSharp` NuGet package to your project.
115+
2. Use the `FastDataGenerator.TryGenerate()` method. Give it your data as an array.
112116

113117
```csharp
114-
using Genbox.FastData.SourceGenerator;
115-
116-
[assembly: FastData<string>("Dogs", ["Labrador", "German Shepherd", "Golden Retriever"])]
117-
118118
internal static class Program
119119
{
120120
private static void Main()
121121
{
122-
Console.WriteLine(Dogs.Contains("Labrador"));
123-
Console.WriteLine(Dogs.Contains("Beagle"));
122+
FastDataConfig config = new FastDataConfig();
123+
124+
CSharpCodeGenerator generator = new CSharpCodeGenerator(new CSharpGeneratorConfig("Dogs"));
125+
126+
if (!FastDataGenerator.TryGenerate(["Labrador", "German Shepherd", "Golden Retriever"], config, generator, out string? source))
127+
Console.WriteLine("Failed to generate source code");
128+
129+
Console.WriteLine(source);
124130
}
125131
}
126132
```
127133

128-
Whenever you change the array, it automatically generates the new source code and includes it in your project.
134+
Whenever you change the array, it automatically generates the new source code.
129135

130136
## Features
131137

132138
- **Data Analysis:** Optimizes the structure based on the inherent properties of the dataset.
133-
- **Multiple Indexing Structures:** FastData automatically chooses the best structure for your data.
139+
- **Multiple Structures:** FastData automatically chooses the best data structure for your data.
140+
- **Fast hashing:** String lookups are fast due to a fast string hash function.
134141

135142
It supports several output programming languages.
136143

137-
* C# output: `fastdata csharp <input-file>`
138-
* C++ output: `fastdata cplusplus <input-file>`
139-
* Rust output: `fastdata rust <input-file>`
144+
* C#: `FastData csharp <input-file>`
145+
* C++: `FastData cplusplus <input-file>`
146+
* Rust: `FastData rust <input-file>`
140147

141-
Each output language has different settings. Type `fastdata <lang> --help` to see the options.
148+
Each output language has different settings. Type `FastData <lang> --help` to see the options.
142149

143150
### Data structures
144151

145152
By default, FastData chooses the optimal data structure for your data, but you can also set it manually with
146-
`fastdata -s <type>`. See the details of each structure type below.
147-
148-
#### SingleValue
153+
`FastData -s <type>`. See the details of each structure type below.
149154

150-
* Memory: Low
151-
* Latency: Low
152-
* Complexity: O(1)
153-
154-
This data structure only supports a single value. It is much faster than an array with a single item and has no overhead
155-
associated with it.
156-
FastData always selects this data structure whenever your dataset only contains one item.
157-
158-
#### Conditional
159-
160-
* Memory: Low
161-
* Latency: Low
162-
* Complexity: O(n)
163-
164-
This data structure relies on built-in logic in the programming language. It produces if/switch statements which
165-
ultimately become machine instructions on the CPU, rather than data
166-
that resides in memory.
167-
Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
168-
becomes more efficient to have
169-
the data reside in memory.
155+
#### Auto
156+
This is the default option. It autoselects the best data structure based on the number of items you provide.
170157

171158
#### Array
172159

@@ -190,26 +177,20 @@ overhead at runtime. Each lookup
190177
has a higher latency than a simple array, but once the dataset gets to a few hundred items, it beats the array due to a
191178
lower complexity.
192179

193-
#### EytzingerSearch
194-
195-
* Memory: Low
196-
* Latency: Medium
197-
* Complexity: O(n*log(n))
198-
199-
This data structure sorts data using an Eytzinger layout. It has better cache-locality than binary search. Under some
200-
circumstances it has better performance.
201-
202-
#### KeyLength
180+
#### Conditional
203181

204182
* Memory: Low
205183
* Latency: Low
206-
* Complexity: O(1)
184+
* Complexity: O(n)
207185

208-
This data structure only works on strings, but it indexes them after their length, rather than a hash. In the case all
209-
the strings have unique lengths, the
210-
data structure further optimizes for latency.
186+
This data structure relies on built-in logic in the programming language. It produces if/switch statements which
187+
ultimately become machine instructions on the CPU, rather than data
188+
that resides in memory.
189+
Latency is therefore incredibly low, but the higher number of instructions bloat the assembly, and at a certain point it
190+
becomes more efficient to have
191+
the data reside in memory.
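
As a rough illustration of the Conditional idea described above, the sketch below encodes the dog breed dataset as a `switch` statement instead of an array. The class name `DogsConditionalSketch` is hypothetical, and this is not the actual FastData output, only a hand-written approximation of what "data as instructions" could look like.

```csharp
internal static class DogsConditionalSketch
{
    // The dataset lives in the control flow itself; there is no array to load from memory.
    public static bool Contains(string value)
    {
        switch (value)
        {
            case "Labrador":
            case "German Shepherd":
            case "Golden Retriever":
                return true;
            default:
                return false;
        }
    }
}
```

Every extra item adds another branch, which is why this approach is best suited to small datasets.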
211192

212-
#### HashSetChain
193+
#### HashSet
213194

214195
* Memory: Medium
215196
* Latency: Medium
@@ -219,35 +200,6 @@ This data structure is based on a hash table with separate chaining collision re
219200
buckets to stay cache coherent, but it also uses more
220201
memory since it needs to keep track of indices.
221202

222-
#### HashSetLinear
223-
224-
* Memory: Medium
225-
* Latency: Medium
226-
* Complexity: O(1)
227-
228-
This data structure is also a hash table, but with linear collision resolution.
229-
230-
#### PerfectHashBruteForce
231-
232-
* Memory: Low
233-
* Latency: Low
234-
* Complexity: O(1)
235-
236-
This data structure tries to create a perfect hash for the dataset. It does so by brute-forcing a seed for a simple hash
237-
function
238-
until it hits the right combination. If the dataset is small enough, it can even produce a minimal perfect hash.
239-
240-
#### PerfectHashGPerf
241-
242-
* Memory: Low
243-
* Latency: Low
244-
* Complexity: O(1)
245-
246-
This data structure uses the same algorithm as gperf to derive a perfect hash. It uses Richard J. Cichelli's method for
247-
creating an associative table,
248-
which is augmented using alpha increments to resolve collisions. It only works on strings, but it is great for
249-
medium-sized datasets.
250-
251203
## How does it work?
252204

253205
The idea behind the project is to generate a data-dependent optimized data structure for read-only lookup. When data is
@@ -258,7 +210,7 @@ of different data structures, indexing, and comparison methods that are tailor-b
258210

259211
There are many benefits gained from generating data structures at compile time:
260212

261-
* Enables analysis the data
213+
* Enables otherwise time-consuming data analysis
262214
* Zero runtime overhead
263215
* No defensive copying of data (takes time and needs double the memory)
264216
* No virtual dispatching (virtual method calls & inheritance)
@@ -276,19 +228,6 @@ FastData uses advanced data analysis techniques to generate optimized data struc
276228
It uses the analysis to create so-called early-exits, which are fast `O(1)` checks on your input before doing any `O(n)`
277229
checks on the actual dataset.
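
To make the early-exit idea concrete, here is a minimal sketch that assumes the length-bitmap approach mentioned under the benefits of the generated code. The names `DogsEarlyExitSketch`, `Items`, and `LengthBitmap` are hypothetical, and the real generated code may differ. The bitmap check is `O(1)` and rejects most misses before the `O(n)` scan runs.

```csharp
using System;

internal static class DogsEarlyExitSketch
{
    private static readonly string[] Items = { "Labrador", "German Shepherd", "Golden Retriever" };

    // Bit n is set when at least one item has length n (the items have lengths 8, 15 and 16).
    private const ulong LengthBitmap = (1UL << 8) | (1UL << 15) | (1UL << 16);

    public static bool Contains(string value)
    {
        // Early exit: O(1) rejection of any input whose length matches no item.
        // The length guard is needed because a 64-bit bitmap can only describe lengths 0..63.
        if (value.Length > 63 || (LengthBitmap & (1UL << value.Length)) == 0)
            return false;

        // Only inputs with a plausible length reach the O(n) scan.
        foreach (string item in Items)
        {
            if (string.Equals(item, value, StringComparison.Ordinal))
                return true;
        }

        return false;
    }
}
```

With this dataset, a query such as `"Beagle"` (length 6) is rejected by the bitmap alone and never touches the array.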
278230

279-
#### Hash function generators
280-
281-
Hash functions come in many flavors. Some are designed for low latency, some for throughput, others for low collision
282-
rate.
283-
Programming language runtimes come with a hash function that is a tradeoff between these parameters. FastData builds a
284-
hash function specifically tailored to the dataset.
285-
It has support for several techniques:
286-
287-
1. **Default:** If no technique is selected, FastData uses a hash function by Daniel Bernstein (DJB2)
288-
2. **Brute force:** It spends some time on trying increasingly stronger hash functions
289-
3. **Heuristic:** It tries to build a hash function that selects for entropy in strings
290-
4. **Genetic algorithm:** It uses machine learning to evolve a hash function that matches the data effectively
291-
292231
## Best practices
293232

294233
* Put the most often queried items first in the input data. It can speed up query speed for some data structures.
@@ -300,4 +239,4 @@ It has support for several techniques:
300239
* Frozen comes with considerable runtime overhead
301240
* Frozen is only available in .NET 8.0+
302241
* Frozen only provides a few of the optimizations provided in FastData
303-
* Frozen is only available in C#. FastData can produce data structures in many langauges.
242+
* Frozen is only available in C#. FastData can produce data structures in many languages.
