The datagenerator tool generates random data. It aims to be flexible enough for developers, analysts or anybody else who needs test data, optionally with dependencies between individual fields and a definable distribution of field values.
The tool requires a YAML file that contains configuration details for the tool itself, including attributes for exporting the generated data to files. A second YAML file defines how the data is generated in terms of fields, field value weights and other attributes. Some of the configuration attributes may also be passed as arguments when starting the datagenerator tool. In that case they override the same attributes from the configuration files.
Samples, word lists (category files) and details for the configuration can be found in the samples folder in this repository.
- select random values from word lists (where values can have an assigned weight)
- generate UUIDs, random strings, whole numbers or floating point numbers
- generate random dates and timestamps, and date fields that reference another date field
- generate random data according to a given regular expression
- transform the generated data values: uppercase, lowercase, base64 encode, negate, round, encrypt and more
- export rows of generated data in CSV, Excel or JSON
- export rows of generated data in Parquet format - including partitioning
- define nested structures for the data output
Different types of generators are available to generate different types of data such as strings, numbers, dates, etc.
For each field specified in the yaml configuration file one of the generators has to be defined by specifying a field type in the data configuration file.
The type attribute for each field can be one of the following values:
- category
- randomstring
- randomlong
- randomdouble
- randomdate
- randomtimestamp
- datereference
- regularexpression
- randomuuid
If no type is specified then type=category is assumed.
Some generators support one or more transformations, which are applied after a value is generated. When a transformation lists one or more parameters, these must be specified in the data configuration YAML file. The transformations available for each generator type are listed below. If an error occurs during a transformation, the original value passed to that transformation is returned instead of the transformed one.
The transformations toLong, toBoolean and toDouble convert a value to a different type. They must be defined as the last transformation for a given field in the configuration.
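The fallback rule and the "type conversion last" rule can be illustrated with a minimal Java sketch. This is not the tool's implementation: the class `TransformationChainSketch` and the method `applyAll` are hypothetical names, and transformations are modelled as plain `Function` objects for the sake of the example.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the tool's code: applies transformations in order
// and keeps the input of a step whenever that step fails.
class TransformationChainSketch {

    static Object applyAll(Object value, List<Function<Object, Object>> transformations) {
        Object current = value;
        for (Function<Object, Object> transformation : transformations) {
            try {
                current = transformation.apply(current);
            } catch (RuntimeException e) {
                // on error the value passed into the failing step is kept unchanged
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Function<Object, Object> trim = v -> ((String) v).trim();
        Function<Object, Object> toLong = v -> Long.parseLong((String) v);

        // the type conversion comes last, after the string transformation
        System.out.println(applyAll(" 42 ", List.of(trim, toLong)));
        // "abc" cannot be converted, so the original string is returned
        System.out.println(applyAll("abc", List.of(toLong)));
    }
}
```

Placing the conversion last matters in this model because every later step would otherwise receive a `Long` instead of a `String` and fail.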
This generator (type=randomstring) generates purely random text. The options in the YAML configuration file specify the characters used to construct the text as well as its minimum and maximum length. Setting minLength=maxLength produces a string of constant length.
| Option | Description | Data Type | Default |
|---|---|---|---|
| minLength | minimum length of the value | long | 1 |
| maxLength | maximum length of the value | long | 40 |
| randomCharacters | characters to be used when generating the value | String | [a-z] + [A-Z] + [0-9] + [-_] |
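How the three options interact can be sketched in a few lines of Java. The class `RandomStringSketch` and its `generate` method are hypothetical and only mirror the option names from the table; the sketch uses `int` lengths for brevity although the table lists `long`.

```java
import java.util.Random;

// Illustrative sketch only; class and method names are hypothetical.
class RandomStringSketch {

    // Default pool as listed in the options table: [a-z] + [A-Z] + [0-9] + [-_]
    static final String DEFAULT_CHARACTERS =
            "abcdefghijklmnopqrstuvwxyz"
            + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "0123456789"
            + "-_";

    static String generate(Random random, int minLength, int maxLength, String characters) {
        if (minLength < 0 || maxLength < minLength || characters.isEmpty()) {
            throw new IllegalArgumentException("invalid length bounds or empty character pool");
        }
        // the length is drawn uniformly from [minLength, maxLength]
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder text = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            text.append(characters.charAt(random.nextInt(characters.length())));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        // defaults from the table
        System.out.println(generate(random, 1, 40, DEFAULT_CHARACTERS));
        // minLength == maxLength gives a constant length of 8
        System.out.println(generate(random, 8, 8, "ABCDEF0123456789"));
    }
}
```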
| Transformation | Description | Parameters |
|---|---|---|
| uppercase | convert the value to uppercase | none |
| lowercase | convert the value to lowercase | none |
| reverse | reverse the characters of the value | none |
| base64encode | encode the value to base64 format | none |
| trim | remove leading and trailing spaces | none |
| maskLeading | mask leading characters of the value using a mask character | number of characters to mask (long), mask character(s) to use (string) |
| maskTrailing | mask trailing characters of the value using a mask character | number of characters to mask (long), mask character(s) to use (string) |
| replaceAll | replaces each substring of the value that matches the given regular expression with the given replacement | regular expression (string), replacement (string) |
| remove | remove all specified characters from the value | a string containing all characters to remove |
This generator (type=randomlong) generates whole numbers. Its options specify a lower bound and an upper bound for the generated value.
| Option | Description | Data Type | Default |
|---|---|---|---|
| minValue | minimum value | long | 0 |
| maxValue | maximum value | long | 1000000 |
| Transformation | Description | Parameters |
|---|---|---|
| toBoolean | convert the value to a boolean value. Values greater than 0 are converted to "true", all others to "false" | none |
This generator (type=randomdouble) generates floating point numbers. Its options specify a lower bound and an upper bound for the generated value.
| Option | Description | Data Type | Default |
|---|---|---|---|
| minValue | minimum value | long | 0 |
| maxValue | maximum value | long | 1000000 |
| Transformation | Description | Parameters |
|---|---|---|
| round | round the value using rounding mode HALF_UP | number of decimal places (integer) |
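For readers unfamiliar with HALF_UP, the following sketch shows what the rounding mode does using the standard `BigDecimal` API. The class `RoundHalfUpSketch` is a hypothetical helper, and drawing the value with `nextDouble()` is an assumption about how a value between the bounds could be produced, not the tool's code.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Random;

// Hypothetical helper that demonstrates HALF_UP rounding on a double value.
class RoundHalfUpSketch {

    // draws a value in [minValue, maxValue); assumption for illustration only
    static double generate(Random random, double minValue, double maxValue) {
        return minValue + random.nextDouble() * (maxValue - minValue);
    }

    static double round(double value, int decimalPlaces) {
        // BigDecimal.valueOf uses the shortest decimal representation of the double,
        // which avoids binary artifacts such as 2.345 being stored as 2.34499...
        return BigDecimal.valueOf(value)
                .setScale(decimalPlaces, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(round(2.345, 2));  // 2.35
        System.out.println(round(2.344, 2));  // 2.34
        System.out.println(round(-2.5, 0));   // -3.0, HALF_UP rounds ties away from zero
        System.out.println(round(generate(new Random(), 0, 1_000_000), 2));
    }
}
```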
This generator (type=randomdate) generates dates. Its options specify a minimum and maximum year as well as the output format of the generated value. If outputType=long, the date is output as the equivalent long value in milliseconds.
| Option | Description | Data Type | Default |
|---|---|---|---|
| minYear | minimum year | long | 2020 |
| maxYear | maximum year | long | 2030 |
| dateFormat | output format of the date (Java DateTimeFormatter) | string | yyyy-MM-dd |
| outputType | how the date should be output. Possible values: varchar or long | string | varchar |
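The two output types can be sketched with `java.time`. The class `RandomDateSketch` is hypothetical, and two details are assumptions rather than documented behaviour: maxYear is treated as inclusive, and the long value is taken as milliseconds since the epoch at the start of the day in UTC.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Random;

// Hypothetical sketch of the randomdate options; not the tool's implementation.
class RandomDateSketch {

    // assumption: both minYear and maxYear are inclusive
    static LocalDate randomDate(Random random, int minYear, int maxYear) {
        long first = LocalDate.of(minYear, 1, 1).toEpochDay();
        long last = LocalDate.of(maxYear, 12, 31).toEpochDay();
        int days = Math.toIntExact(last - first + 1);
        return LocalDate.ofEpochDay(first + random.nextInt(days));
    }

    // outputType=varchar: format with a DateTimeFormatter pattern
    static String asVarchar(LocalDate date, String dateFormat) {
        return date.format(DateTimeFormatter.ofPattern(dateFormat));
    }

    // outputType=long: assumption: epoch milliseconds at start of day, UTC
    static long asLong(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        LocalDate date = randomDate(new Random(), 2020, 2030);
        System.out.println(asVarchar(date, "yyyy-MM-dd"));
        System.out.println(asLong(date));
    }
}
```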
This generator (type=randomtimestamp) generates timestamps. Its options specify a minimum and maximum year as well as the output format of the generated value.
| Option | Description | Data Type | Default |
|---|---|---|---|
| minYear | minimum year | long | 2020 |
| maxYear | maximum year | long | 2030 |
| dateFormat | output format of the date (Java DateTimeFormatter) | string | yyyy-MM-dd HH:mm:ss |
This generator (type=datereference) generates a date string based on another date field, so that the values of this field and the referenced field correspond to each other. Its options specify the date field to reference as well as the output format of the generated value.
| Option | Description | Data Type | Default |
|---|---|---|---|
| reference | name of the field which is the reference date | string | |
| dateFormat | output format of the date (Java DateTimeFormatter) | string | yyyy-MM-dd |
| Transformation | Description | Parameters |
|---|---|---|
| toQuarter | if the dateFormat of the field is "MM" it will be converted to the relevant quarter (Q1, Q2, Q3, Q4) | none |
| toHalfYear | if the dateFormat of the field is "MM" it will be converted to the relevant half year (H1, H2) | none |
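A typical use is a field that references a date and outputs only its month ("MM"), which is then mapped to a quarter or half year. The mapping itself is simple arithmetic, sketched below with a hypothetical class `DatePartSketch`; the method names mirror the transformation names but are not the tool's code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of the month-to-quarter and month-to-half-year mapping.
class DatePartSketch {

    private static int parseMonth(String month) {
        int m = Integer.parseInt(month);
        if (m < 1 || m > 12) {
            throw new IllegalArgumentException("not a month: " + month);
        }
        return m;
    }

    // months 01-03 map to Q1, 04-06 to Q2, 07-09 to Q3, 10-12 to Q4
    static String toQuarter(String month) {
        return "Q" + ((parseMonth(month) - 1) / 3 + 1);
    }

    // months 01-06 map to H1, 07-12 to H2
    static String toHalfYear(String month) {
        return parseMonth(month) <= 6 ? "H1" : "H2";
    }

    public static void main(String[] args) {
        // the referenced date, output with dateFormat "MM"
        String month = LocalDate.of(2024, 8, 15).format(DateTimeFormatter.ofPattern("MM"));
        System.out.println(month + " -> " + toQuarter(month) + ", " + toHalfYear(month)); // 08 -> Q3, H2
    }
}
```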
This generator (type=randomuuid) generates a random UUID.
This generator (type=regularexpression) generates random text based on a regular expression pattern. The pattern option in the YAML configuration file specifies the characters, character ranges and multipliers that make up the pattern.
The following features are available:
- using standard characters like a, B, 9, -, etc.
- using character groups like [A-Z], [F-L], [A-Za-z0-9], [A-Z0-9XYZ], [A-Cd-g0-4], [AbCdE-L123x-z] etc.
- using multipliers for characters like B{1,10}, C{21}, etc.
- using multipliers for character groups like [A-zf-p4-9]{1,10}, [a-z]{7}, etc.
- multipliers can specify a minimum and maximum number of repetitions, e.g. {1,8}. In this case the generated part has a length between 1 and 8
- multipliers can specify a single number of repetitions, e.g. {4}. In this case the generated part has a length of exactly 4
The minimum and maximum value of a multiplier cannot be smaller than 1. The maximum value must be greater than the minimum value.
NOTE: Currently, no regular expression features other than literal characters, character groups and multipliers are supported.
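To make the supported subset concrete, here is a minimal Java sketch of a generator that understands only literal characters, character groups and multipliers. It is an illustration under assumptions, not the tool's parser: the class `PatternSketch` is hypothetical and the sketch does no validation of unbalanced brackets or braces.

```java
import java.util.Random;

// Hypothetical sketch: literals, [groups] and {n} / {n,m} multipliers only.
class PatternSketch {

    static String generate(String pattern, Random random) {
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            String pool;
            if (pattern.charAt(i) == '[') {
                int close = pattern.indexOf(']', i);
                pool = expandGroup(pattern.substring(i + 1, close));
                i = close + 1;
            } else {
                pool = String.valueOf(pattern.charAt(i)); // literal character
                i++;
            }
            int min = 1;
            int max = 1;
            if (i < pattern.length() && pattern.charAt(i) == '{') {
                int close = pattern.indexOf('}', i);
                String[] bounds = pattern.substring(i + 1, close).split(",");
                min = Integer.parseInt(bounds[0].trim());
                max = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : min;
                i = close + 1;
            }
            if (min < 1 || max < min) {
                throw new IllegalArgumentException("invalid multiplier in " + pattern);
            }
            int count = min + random.nextInt(max - min + 1);
            for (int n = 0; n < count; n++) {
                result.append(pool.charAt(random.nextInt(pool.length())));
            }
        }
        return result.toString();
    }

    // expands e.g. "A-Cd-g0-4" to "ABCdefg01234"
    static String expandGroup(String group) {
        StringBuilder pool = new StringBuilder();
        for (int i = 0; i < group.length(); i++) {
            char start = group.charAt(i);
            if (i + 2 < group.length() && group.charAt(i + 1) == '-') {
                for (char c = start; c <= group.charAt(i + 2); c++) {
                    pool.append(c);
                }
                i += 2;
            } else {
                pool.append(start);
            }
        }
        return pool.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(generate("[A-Za-z0-9]{1,10}", random)); // the default pattern
        System.out.println(generate("ID-[A-F]{2}[0-9]{4}", random));
    }
}
```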
| Option | Description | Data Type | Default |
|---|---|---|---|
| pattern | pattern describing a regular expression | String | [A-Za-z0-9]{1,10} |
| Transformation | Description | Parameters |
|---|---|---|
| uppercase | convert the value to uppercase | none |
| lowercase | convert the value to lowercase | none |
| remove | remove all specified characters from the value | a string containing all characters to remove |
| toLong | convert the value to a long value | none |
| toBoolean | convert the value to a boolean value | none |
| toDouble | convert the value to a double value | none |
Word lists define the values for a category such as "weekdays", "seasons", "car types" or "first names" in a file. The generator (type=category) randomly picks a value from the configured word list file. Word lists are simple text files where each row contains one value. All values of a word list are therefore treated as strings, even if the word list contains e.g. numbers.
Using word lists offers a few advantages:
- word lists can be stored in a directory hierarchy where e.g. different directories contain the same word lists but in different languages or the structure defines word lists for different environments (test/production)
- word lists can be created from a data extract from a database, such as a select distinct on a certain column
- word lists can be constructed from a script processing a data file or consuming a Rest API
- word lists can be constructed or changed easily using a simple text editor
In the YAML configuration, additional values for a given word list (including values which are already defined in the word list file) may be defined, each with an optional weight. This makes it possible to give selected values a higher or lower probability. The weight of a value is always specified as a percentage, on a base of 100 percent.
For example, one may define the days of the week in a word list file and, in the configuration file, "Saturday" with a weight of 5 percent and "Sunday" with a weight of 5 percent. The remaining 90 percent is then split across the other five days, so "Monday" to "Friday" are each assigned a weight of (100 - 5 - 5) / 5 = 18 percent and the overall sum of percentages is 100 %.
If a value for a given word list appears both in the word list file and the yaml configuration file, the setting from the configuration will overrule the value from the word list file.
The datagenerator then produces random data (picks random values from the word list) according to the assigned weights. In the example above, "Saturday" and "Sunday" occur less often in the generated rows than the other days because these values have a lower weight.
A word list is optional. All values to be used for randomly generating data can also be defined solely in the yaml configuration file. The sum of the weight definitions must be 100 percent (and can not exceed 100 percent). Individual values can not have negative percentage values.
NOTE: If weights are specified for some values of a word list but not for others, the datagenerator distributes the remaining percentage equally among the values that have no weight definition. Depending on the number of such values, an exactly even distribution might not be possible. In this case some values from the word list get a slightly higher weight. If weights are assigned in such a way that the remaining percentage for the other values is less than 1 percent, an error occurs.
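The distribution rule and the weighted pick can be sketched as follows. This is an illustration under assumptions, not the tool's algorithm: weights are modelled as whole percentages, leftover percentage points go to the first unweighted values, and the class `CategoryWeightSketch` with its methods is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of weight distribution and weighted random selection.
class CategoryWeightSketch {

    static Map<String, Integer> distribute(List<String> values, Map<String, Integer> explicit) {
        int assigned = 0;
        for (int weight : explicit.values()) {
            if (weight < 0) throw new IllegalArgumentException("negative weight");
            assigned += weight;
        }
        if (assigned > 100) throw new IllegalArgumentException("weights exceed 100 percent");

        List<String> open = values.stream().filter(v -> !explicit.containsKey(v)).toList();
        Map<String, Integer> result = new LinkedHashMap<>(explicit);
        if (open.isEmpty()) {
            if (assigned != 100) throw new IllegalArgumentException("weights must sum to 100 percent");
            return result;
        }
        int remaining = 100 - assigned;
        int share = remaining / open.size();
        if (share < 1) throw new IllegalArgumentException("less than 1 percent left per value");
        int extra = remaining % open.size(); // points that cannot be split evenly
        for (int i = 0; i < open.size(); i++) {
            result.put(open.get(i), share + (i < extra ? 1 : 0));
        }
        return result;
    }

    // walks the cumulative weights until the random roll (0..99) is covered
    static String pick(Map<String, Integer> weights, Random random) {
        int roll = random.nextInt(100);
        int cumulative = 0;
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            cumulative += entry.getValue();
            if (roll < cumulative) return entry.getKey();
        }
        throw new IllegalStateException("weights do not sum to 100 percent");
    }

    public static void main(String[] args) {
        List<String> days = List.of("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday");
        Map<String, Integer> weights = distribute(days, Map.of("Saturday", 5, "Sunday", 5));
        System.out.println(weights);
        System.out.println(pick(weights, new Random()));
    }
}
```

With three unweighted values and nothing else defined, this sketch yields 34, 33 and 33 percent, which is one way to read the "slightly higher weight" remark in the note.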
| Option | Description | Data Type | Default |
|---|---|---|---|
| categoryFileSeparator | separator between value and weight in category file | string | , |
| Transformation | Description | Parameters |
|---|---|---|
| uppercase | convert the value to uppercase | none |
| lowercase | convert the value to lowercase | none |
| reverse | reverse the characters of the value | none |
| prepend | add a prefix to the value | prefix to add (string) |
| append | add a suffix to the value | suffix to add (string) |
| base64encode | encode the value to base64 format | none |
| encrypt | encrypt the value using AES/CBC/PKCS5Padding algorithm | none |
| maskLeading | mask leading characters of the value using a mask character | number of characters to mask (long), mask character(s) to use (string) |
| maskTrailing | mask trailing characters of the value using a mask character | number of characters to mask (long), mask character(s) to use (string) |
| trim | remove leading and trailing spaces | none |
| replaceAll | replaces each substring of the value that matches the given regular expression with the given replacement | regular expression (string), replacement (string) |
| remove | remove all specified characters from the value | a string containing all characters to remove |
| toLong | convert the value to a long value | none |
| toBoolean | convert the value to a boolean value | none |
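The encrypt transformation names the cipher AES/CBC/PKCS5Padding but this section does not describe where the key and initialization vector come from. The sketch below therefore only shows the cipher itself with the standard `javax.crypto` API. Generating a random 128-bit key and a random IV, and Base64-encoding the result, are assumptions made for the example; the class `EncryptSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Hypothetical sketch of AES/CBC/PKCS5Padding; key and IV handling are assumptions.
class EncryptSketch {

    private static final String ALGORITHM = "AES/CBC/PKCS5Padding";

    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // CBC needs a 16 byte initialization vector for AES
    static byte[] newIv() {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    static String encrypt(String value, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(ALGORITHM);
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] encrypted = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(encrypted);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static String decrypt(String encoded, SecretKey key, byte[] iv) {
        try {
            Cipher cipher = Cipher.getInstance(ALGORITHM);
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] decrypted = cipher.doFinal(Base64.getDecoder().decode(encoded));
            return new String(decrypted, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKey key = newKey();
        byte[] iv = newIv();
        String encrypted = encrypt("Monday", key, iv);
        System.out.println(encrypted);
        System.out.println(decrypt(encrypted, key, iv)); // Monday
    }
}
```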
First, the program configuration and data configuration YAML files are checked for correctness. If a database file with the specified name is found, any existing table definitions and data are removed from DuckDB.
After that the value for each field is generated and then transformed (if any transformations are specified). The fields are processed sequentially and build a row of data. The tool generates the desired number of rows and stores them in a local DuckDB instance. Finally, the data is exported to the desired output format.
The DuckDB database is not deleted after the process is completed. You can remove it manually or continue to use the generated data in the database.
The program configuration file contains various attributes that control the behavior of the datagenerator tool:
- the name of the export file for the generated data
- the type of the export file: csv, excel, parquet or json
- the number of rows to generate
- after how many generated rows a log message will be output
- the name of the auto-generated row number field (default: rownumber). This field is always added to every generated table as a LONG column and contains a unique sequential number for each row. It can be used as a primary key or for joins between tables generated in separate runs.
- details for the export to a csv file - delimiter and header settings
- details for the export to a json file - output as separate lines or as array
- details for the export to a parquet file or partitioned file
- details for the export to an excel file
See the sample yaml files in this repository under: samples/programconfiguration.
The data configuration file contains a list of fields/attributes to generate. See the sample YAML files in this repository under: samples/dataconfiguration. For each field, options and transformations may be defined depending on the type of generator used.
There are three generic attributes defined in the configuration file: name, databaseName and tableName. The name attribute assigns a name to the configuration but is otherwise not used. The databaseName attribute defines the path and name for the DuckDB database that is used to collect the generated data. The tableName attribute defines the table of the DuckDB database where the generated data is stored. If you run a configuration multiple times but with different table names, the database will contain the data of both runs. If you run a configuration multiple times but do not change the table name, the data of the second run will overwrite all data of the first run (the data of the first run will be removed).
Fields is a list of fields for which data is to be generated. Each field has a unique name and is assigned a type. A substructure is created by separating the structure name and the field name with a dot, e.g. address.street, address.city, person.country.name. The first two names create a substructure named "address" with the fields street and city. Multiple levels of substructures may be defined. Fields may have additional (optional) options and one or more transformations, which may in turn require parameters. Take care not to create conflicting structures. For example, if you create a structure person.city.name.firstname, you cannot also define person.city or person.city.name as fields. But you can define person.city.location.
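The dot notation and the conflict rule can be illustrated with a small sketch that builds nested maps from dotted field names. It is not the tool's implementation; the class `NestedFieldSketch` and the method `put` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turns dotted field names into nested structures
// and rejects names that are both a field and a structure.
class NestedFieldSketch {

    @SuppressWarnings("unchecked")
    static void put(Map<String, Object> root, String fieldName, Object value) {
        String[] parts = fieldName.split("\\.");
        Map<String, Object> current = root;
        for (int i = 0; i < parts.length - 1; i++) {
            Object child = current.computeIfAbsent(parts[i], k -> new LinkedHashMap<String, Object>());
            if (!(child instanceof Map)) {
                throw new IllegalArgumentException("'" + parts[i] + "' is already a field, not a structure");
            }
            current = (Map<String, Object>) child;
        }
        String leaf = parts[parts.length - 1];
        if (current.containsKey(leaf)) {
            throw new IllegalArgumentException("duplicate field or structure: " + fieldName);
        }
        current.put(leaf, value);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        put(row, "address.street", "Main Street");
        put(row, "address.city", "Berlin");
        put(row, "person.city.name.firstname", "Ada");
        put(row, "person.city.location", "north"); // allowed: sibling of "name"
        System.out.println(row);
        try {
            put(row, "person.city.name", "x");     // conflicts with the structure above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```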
Fields of type=category may either specify valid values in the configuration file or in a category file or both, but one of them must be present. The definition for values contains the value itself and optionally a weight for the value.
To run the tool you must pass at least the mandatory arguments to the program as shown below. These point to the program configuration file and the data configuration file. You may pass the other arguments, which will override the relevant default value as well as the value from the program configuration file.
| Argument | Type | Default | Description |
|---|---|---|---|
| -n=<number> | optional | 10000 | number of rows to generate |
| -l=<number> | optional | 1000 | interval for log messages during data generation |
| -g=<loglevel> | optional | INFO | log level to be used for logging output. Must be one of: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, ALL |
| -xp=<path+filename> | optional | datagenerator_export.csv | path and filename of the export file |
| -xt=<type> | optional | csv | type of the export to generate. possible values: csv, excel, json, parquet |
| -cd=<delimiter> | optional | | delimiter to be used for export files of type CSV |
| -ch | optional | | indicator whether a header row should be output for export files of type CSV |
| -dc=<path+filename> | mandatory | -none- | path and filename of the data configuration yaml file |
| -pc=<path+filename> | mandatory | -none- | path and filename of the program configuration yaml file |
| -s | optional | false | output statistics for the generated field values |
| -h / --help | optional | | display help about the available program arguments |
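The precedence described above (argument, then program configuration file, then default) can be sketched as follows. The class `ArgumentPrecedenceSketch` and its methods are hypothetical and do not reflect how the tool parses its arguments internally; only the `-name=value` and flag syntax is taken from the table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "argument overrides configuration overrides default".
class ArgumentPrecedenceSketch {

    // splits "-n=500" into key "-n" and value "500"; flags such as "-s" map to "true"
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            int separator = arg.indexOf('=');
            if (separator < 0) {
                parsed.put(arg, "true");
            } else {
                parsed.put(arg.substring(0, separator), arg.substring(separator + 1));
            }
        }
        return parsed;
    }

    static String resolve(String argumentValue, String configValue, String defaultValue) {
        if (argumentValue != null) return argumentValue;
        if (configValue != null) return configValue;
        return defaultValue;
    }

    public static void main(String[] args) {
        // file names are placeholders for this example
        Map<String, String> parsed = parse(new String[] {"-pc=program.yml", "-dc=data.yml", "-n=500"});
        // -n was passed, so it wins over a configured value of 2000 and the default of 10000
        System.out.println(resolve(parsed.get("-n"), "2000", "10000")); // 500
        // -l was not passed, so the configured value is used
        System.out.println(resolve(parsed.get("-l"), "250", "1000"));   // 250
    }
}
```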
Run the datagenerator tool:
java -jar datagenerator2-<version>-jar-with-dependencies.jar -pc=<program configuration file> -dc=<data configuration file>
You can get help about the available program arguments by running:
java -jar datagenerator2-<version>-jar-with-dependencies.jar --help
See the sample yaml file for the program configuration in this repository under: samples/programconfiguration
You may also use the tool programmatically by adding it as a dependency to your project. The RowGenerator class provides a lazy infinite stream of rows. The caller controls termination via limit(), takeWhile(), or any other stream operation. Each element is wrapped in a Try; filter on Try::isSuccess to keep only the successful rows.
Add the following dependency to your Maven pom.xml:
<dependency>
<groupId>io.github.uwegeercken</groupId>
<artifactId>datagenerator2</artifactId>
<version>0.4.5</version>
</dependency>

The artifact is available on Maven Central: https://central.sonatype.com/artifact/io.github.uwegeercken/datagenerator2
Example usage:
// generate a fixed number of rows
List<Row> rows = rowGenerator.generateRows(100)
.filter(Try::isSuccess)
.map(Try::getResult)
.toList();
// generate rows lazily until a condition is met
rowGenerator.generateRows()
.filter(Try::isSuccess)
.map(Try::getResult)
.takeWhile(row -> someCondition(row))
.forEach(row -> process(row));
// generate a single row
Try<Row> row = rowGenerator.generateRow();

To obtain the jar file, either download the release from https://github.com/uwegeercken/datagenerator2/tags or clone this repository and build it with:
mvn clean install
last update: uwe geercken - uwe.geercken@web.de - 2026-03-07