datagenerator2

The datagenerator tool generates random data. It aims to be flexible enough to satisfy the needs of developers, analysts or anybody else who needs test data, possibly with dependencies between individual fields and a definable distribution of field values.

The tool requires a yaml file which contains configuration details for the tool itself, including attributes for the export of the generated data to files. A second yaml file defines how the data is generated in terms of fields, field value weight and other attributes. Some of the configuration attributes may also be passed as arguments when starting the datagenerator tool. In this case these will override the same attributes from the configuration files.

Samples, word lists (category files) and details for the configuration can be found in the samples folder in this repository.

Features

select random values from word lists (where values can have an assigned weight)
generate UUIDs, random strings, whole numbers or floating point numbers
generate random dates and timestamps, and date fields that reference another date field
generate random data according to a given regular expression
transform the generated data values: uppercase, lowercase, base64 encode, negate, round, encrypt and more
export rows of generated data in CSV, Excel or JSON
export rows of generated data in Parquet format - including partitioning
define nested structures for the data output

Types of generators

Different types of generators are available to generate different types of data such as strings, numbers, dates, etc.

Each field in the data configuration file is assigned one of the generators by specifying its type attribute.

The type attribute for each field can be one of the following values:

category
randomstring
randomlong
randomdouble
randomdate
randomtimestamp
datereference
regularexpression
randomuuid

If no type is specified then type=category is assumed.

Some of the generators accept one or more transformations, which are applied after a value is generated. Parameters listed for a transformation need to be specified in the data configuration yaml file. The transformations available for each generator type are listed below. If an error occurs during a transformation, the original value passed to the transformation is returned instead of the transformed one.

The transformations toLong, toBoolean and toDouble allow to convert a value to a different type. These transformations need to be defined as the last transformation for a given field in the configuration.
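The fallback behaviour described above can be pictured with a small sketch. The names TransformationChain and applyAll are hypothetical and not part of datagenerator2. The sketch assumes that each transformation is a function on the value and that processing continues with the next transformation after a failure, which the text above does not state.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch, not the tool's actual implementation.
final class TransformationChain {

    // Applies the transformations in order. If one of them throws,
    // the value that was passed into that transformation is kept.
    static Object applyAll(Object value, List<Function<Object, Object>> transformations) {
        Object current = value;
        for (Function<Object, Object> transformation : transformations) {
            try {
                current = transformation.apply(current);
            } catch (RuntimeException e) {
                // keep the input of the failed transformation unchanged
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Function<Object, Object>> chain = List.of(
                v -> v.toString().trim(),
                v -> Long.parseLong(v.toString())   // type conversion comes last
        );
        System.out.println(applyAll(" 42 ", chain));   // prints 42
        System.out.println(applyAll(" abc ", chain));  // prints abc
    }
}
```

Placing the type conversion last matters in this sketch because every earlier step works on the string form of the value.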

Random Strings

This type of generator (type=randomstring) generates purely random text. The options in the yaml configuration file specify the set of characters used to construct the text as well as its minimum and maximum length. Setting minLength=maxLength creates a string of constant length.

Available options:

Option	Description	Data Type	Default
minLength	minimum length of the value	long	1
maxLength	maximum length of the value	long	40
randomCharacters	characters to be used when generating the value	String	[a-z] + [A-Z] + [0-9] + [-_]
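As an illustration of how these three options interact, here is a minimal sketch. RandomStringSketch and its members are hypothetical names; treating both length bounds as inclusive is an assumption, since the table does not say so.

```java
import java.util.Random;

// Hypothetical sketch of a random string generator.
final class RandomStringSketch {

    // Mirrors the documented default: [a-z] + [A-Z] + [0-9] + [-_]
    static final String DEFAULT_CHARACTERS =
            "abcdefghijklmnopqrstuvwxyz"
            + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "0123456789"
            + "-_";

    // Builds a string with a length between minLength and maxLength
    // (both inclusive, an assumption) from the given characters.
    static String generate(Random random, int minLength, int maxLength, String characters) {
        if (minLength < 0 || maxLength < minLength || characters.isEmpty()) {
            throw new IllegalArgumentException("invalid length range or empty character set");
        }
        int length = minLength + random.nextInt(maxLength - minLength + 1);
        StringBuilder result = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            result.append(characters.charAt(random.nextInt(characters.length())));
        }
        return result.toString();
    }
}
```

Calling generate(random, 8, 8, DEFAULT_CHARACTERS) always yields eight characters, which corresponds to the minLength=maxLength case.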

Available transformations:

Transformation	Description	Parameters
uppercase	convert the value to uppercase	none
lowercase	convert the value to lowercase	none
reverse	reverse the characters of the value	none
base64encode	encode the value to base64 format	none
trim	remove leading and trailing spaces	none
maskLeading	mask leading characters of the value using a mask character	number of characters to mask (long), mask character(s) to use (string)
maskTrailing	mask trailing characters of the value using a mask character	number of characters to mask (long), mask character(s) to use (string)
replaceAll	replaces each substring of the value that matches the given regular expression with the given replacement	regular expression (string), replacement (string)
remove	remove all specified characters from the value	a string containing all characters to remove
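The two masking transformations can be sketched as follows. MaskSketch is a hypothetical class; the sketch assumes a single mask character and masks the whole value when the requested count exceeds its length. How the tool treats a mask string with several characters is not covered here.

```java
// Hypothetical sketch of maskLeading and maskTrailing with a single mask character.
final class MaskSketch {

    // Replaces the first 'count' characters with the mask character.
    static String maskLeading(String value, int count, char mask) {
        int n = Math.max(0, Math.min(count, value.length()));
        return String.valueOf(mask).repeat(n) + value.substring(n);
    }

    // Replaces the last 'count' characters with the mask character.
    static String maskTrailing(String value, int count, char mask) {
        int n = Math.max(0, Math.min(count, value.length()));
        return value.substring(0, value.length() - n) + String.valueOf(mask).repeat(n);
    }
}
```

With these definitions, maskLeading("1234567890", 4, '*') gives ****567890 and maskTrailing("1234567890", 4, '*') gives 123456****.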

Random Numbers

This generator (type=randomlong) generates whole numbers. Its options specify a lower and an upper bound for the generated value.

Available options:

Option	Description	Data Type	Default
minValue	minimum value	long	0
maxValue	maximum value	long	1000000

Available transformations:

Transformation	Description	Parameters
toBoolean	convert the value to a boolean value. values greater than 0 are converted to "true", all others to "false"	none
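A minimal sketch of the bounds and the toBoolean rule, under the assumption that both bounds are inclusive (the table does not say). RandomLongSketch is a hypothetical name.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the randomlong generator and its toBoolean transformation.
final class RandomLongSketch {

    // Returns a value between minValue and maxValue, both inclusive (an assumption).
    // Note: maxValue must be smaller than Long.MAX_VALUE in this sketch.
    static long generate(long minValue, long maxValue) {
        return ThreadLocalRandom.current().nextLong(minValue, maxValue + 1);
    }

    // Values greater than 0 become true, all others false.
    static boolean toBoolean(long value) {
        return value > 0;
    }
}
```

Combining minValue=0 and maxValue=1 with toBoolean would give a simple random boolean field in this sketch.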

Random Floating Point Numbers

This generator (type=randomdouble) generates floating point numbers. Its options specify a lower and an upper bound for the generated value.

Available options:

Option	Description	Data Type	Default
minValue	minimum value	long	0
maxValue	maximum value	long	1000000

Available transformations:

Transformation	Description	Parameters
round	round the value using rounding mode HALF_UP	number of decimal places (integer)
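Rounding mode HALF_UP rounds ties away from zero. The sketch below shows one way to get that behaviour with java.math.BigDecimal; RoundSketch is a hypothetical name and this is not necessarily how the tool implements the transformation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch of the round transformation.
final class RoundSketch {

    // Rounds the value to the given number of decimal places using HALF_UP.
    static double round(double value, int decimalPlaces) {
        return BigDecimal.valueOf(value)
                .setScale(decimalPlaces, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(round(2.345, 2));  // prints 2.35
        System.out.println(round(-2.5, 0));   // prints -3.0
    }
}
```

BigDecimal.valueOf is used instead of new BigDecimal(double) so that the decimal digits shown for the double are the ones that get rounded.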

Random Dates

This generator (type=randomdate) generates dates. Its options specify a minimum and maximum year as well as the output format for the generated value. If outputType=long, the date is output as the equivalent long value in milliseconds.

Available options:

Option	Description	Data Type	Default
minYear	minimum year	long	2020
maxYear	maximum year	long	2030
dateFormat	output format of the date (Java DateTimeFormatter)	string	yyyy-MM-dd
outputType	how the date should be output. possible values: varchar or long	string	varchar
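The options above can be illustrated with a short sketch based on java.time. RandomDateSketch is a hypothetical name. Two assumptions are made: the year range is inclusive on both ends, and the long output is the number of milliseconds since 1970-01-01 at the start of the day in UTC.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Random;

// Hypothetical sketch of the randomdate generator.
final class RandomDateSketch {

    // Picks a day between January 1 of minYear and December 31 of maxYear.
    static LocalDate randomDate(Random random, int minYear, int maxYear) {
        long first = LocalDate.of(minYear, 1, 1).toEpochDay();
        long last = LocalDate.of(maxYear, 12, 31).toEpochDay();
        return LocalDate.ofEpochDay(first + random.nextInt((int) (last - first + 1)));
    }

    // outputType=varchar: format the date with a DateTimeFormatter pattern.
    static String asVarchar(LocalDate date, String dateFormat) {
        return date.format(DateTimeFormatter.ofPattern(dateFormat));
    }

    // outputType=long: milliseconds since the epoch, start of day in UTC (an assumption).
    static long asLong(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }
}
```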

Random Timestamps

This generator (type=randomtimestamp) generates timestamps. Its options specify a minimum and maximum year as well as the output format for the generated value.

Available options:

Option	Description	Data Type	Default
minYear	minimum year	long	2020
maxYear	maximum year	long	2030
dateFormat	output format of the date (Java DateTimeFormatter)	string	yyyy-MM-dd HH:mm:ss

Date Reference

This generator (type=datereference) generates a date string based on another date field, so that the values of this field and the referenced field correspond to each other. Its options specify the date field to reference and the output format for the generated value.

Available options:

Option	Description	Data Type	Default
reference	name of the field which is the reference date	string
dateFormat	output format of the date (Java DateTimeFormatter)	string	yyyy-MM-dd

Available transformations:

Transformation	Description	Parameters
toQuarter	if the dateFormat of the field is "MM" it will be converted to the relevant quarter (Q1, Q2, Q3, Q4)	none
toHalfYear	if the dateFormat of the field is "MM" it will be converted to the relevant half year (H1, H2)	none
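A sketch of how a referenced date with dateFormat "MM" could be turned into a quarter or half year. DateReferenceSketch and its methods are hypothetical names and only mirror the descriptions in the tables above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch of a date reference with the toQuarter and toHalfYear transformations.
final class DateReferenceSketch {

    // Formats the referenced date with this field's own pattern, e.g. "MM" or "yyyy".
    static String reference(LocalDate referencedDate, String dateFormat) {
        return referencedDate.format(DateTimeFormatter.ofPattern(dateFormat));
    }

    // Maps a month "01".."12" to Q1..Q4.
    static String toQuarter(String month) {
        int m = Integer.parseInt(month);
        return "Q" + ((m - 1) / 3 + 1);
    }

    // Maps a month "01".."12" to H1 or H2.
    static String toHalfYear(String month) {
        int m = Integer.parseInt(month);
        return m <= 6 ? "H1" : "H2";
    }
}
```

For a referenced date of 2024-05-17, reference(date, "MM") gives 05, which maps to Q2 and H1.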

UUID

This generator (type=randomuuid) generates a random UUID.

Regular Expressions

This type of generator (type=regularexpression) generates random text based on a regular expression pattern. The pattern option in the yaml configuration file specifies the characters, character groups and multipliers which make up the pattern.

The following features are available:

using standard characters like a, B, 9, -, etc.
using character groups like [A-Z], [F-L], [A-Za-z0-9], [A-Z0-9XYZ], [A-Cd-g0-4], [AbCdE-L123x-z] etc.
using multipliers for characters like B{1,10}, C{21}, etc.
using multipliers for character groups like [A-zf-p4-9]{1,10}, [a-z]{7}, etc.
multipliers can specify a minimum and maximum number of repetitions, e.g. {1,8}. In this case the resulting random string has a length between 1 and 8
multipliers can specify a single number of repetitions, e.g. {4}. In this case the resulting random string has a length of exactly 4

The minimum and maximum value of a multiplier cannot be smaller than 1. The maximum value must be greater than the minimum value.

NOTE: Currently no regular expression features other than literal characters, character groups and multipliers are supported.
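To make the supported subset concrete, the following sketch generates text for patterns built from literal characters, character groups and multipliers. PatternSketch is a hypothetical class written for this illustration and does not reflect the tool's actual parser; unlike the rule above it also tolerates a maximum equal to the minimum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a generator for the supported pattern subset.
final class PatternSketch {

    static String generate(String pattern, Random random) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            // 1. read one element: a character group or a single literal character
            List<Character> candidates = new ArrayList<>();
            if (pattern.charAt(i) == '[') {
                int close = pattern.indexOf(']', i);
                if (close < 0) throw new IllegalArgumentException("missing ]");
                String group = pattern.substring(i + 1, close);
                for (int g = 0; g < group.length(); g++) {
                    if (g + 2 < group.length() && group.charAt(g + 1) == '-') {
                        // a range such as A-Z
                        for (char c = group.charAt(g); c <= group.charAt(g + 2); c++) {
                            candidates.add(c);
                        }
                        g += 2;
                    } else {
                        candidates.add(group.charAt(g));
                    }
                }
                i = close + 1;
            } else {
                candidates.add(pattern.charAt(i));
                i++;
            }
            if (candidates.isEmpty()) throw new IllegalArgumentException("empty character group");

            // 2. read an optional multiplier: {n} or {min,max}
            int min = 1;
            int max = 1;
            if (i < pattern.length() && pattern.charAt(i) == '{') {
                int close = pattern.indexOf('}', i);
                if (close < 0) throw new IllegalArgumentException("missing }");
                String[] bounds = pattern.substring(i + 1, close).split(",");
                min = Integer.parseInt(bounds[0].trim());
                max = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : min;
                if (min < 1 || max < min) throw new IllegalArgumentException("invalid multiplier");
                i = close + 1;
            }

            // 3. emit between min and max random characters from the candidates
            int repetitions = min + random.nextInt(max - min + 1);
            for (int r = 0; r < repetitions; r++) {
                out.append(candidates.get(random.nextInt(candidates.size())));
            }
        }
        return out.toString();
    }
}
```

For example, generate("AB-[0-9]{3}", new Random()) returns AB- followed by three random digits.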

Available options:

Option	Description	Data Type	Default
pattern	pattern describing a regular expression	String	[A-Za-z0-9]{1,10}

Available transformations:

Transformation	Description	Parameters
uppercase	convert the value to uppercase	none
lowercase	convert the value to lowercase	none
remove	remove all specified characters from the value	a string containing all characters to remove
toLong	convert the value to a long value	none
toBoolean	convert the value to a boolean value	none
toDouble	convert the value to a double value	none

Word lists

Word lists define values for certain categories such as "weekdays", "seasons", "car types", "first names", etc. in a file. The generator (type=category) randomly picks a value from the configured word list file. Word lists are simple text files where each row contains one value. All values of a word list are therefore treated as strings (even if the word list contains e.g. numbers).

Using word lists offers a few advantages:

word lists can be stored in a directory hierarchy where e.g. different directories contain the same word lists but in different languages or the structure defines word lists for different environments (test/production)
word lists can be created from a data extract from a database, such as a select distinct on a certain column
word lists can be constructed from a script processing a data file or consuming a Rest API
word lists can be constructed or changed easily using a simple text editor

In the yaml configuration, additional values for a given word list may be defined (including values which are already defined in the word list file), optionally with a weight for individual values. This makes it possible to give selected values a higher or lower weight. Weights are specified as percentages of a total of 100 percent.

For example, one may define the days of the week in a word list file and, in the configuration file, assign "Saturday" and "Sunday" a weight of 5 percent each. The remaining 90 percent is then distributed over the other days, so "Monday" to "Friday" are each assigned a weight of 18 percent and the overall sum is 100 percent.

If a value for a given word list appears both in the word list file and the yaml configuration file, the setting from the configuration will overrule the value from the word list file.

The datagenerator then picks random values from the word list according to the assigned weights. In the example above, "Saturday" and "Sunday" will occur less often in the generated rows than the other days, because these values have a lower weight.
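Weighted selection of this kind is commonly done by rolling a number between 0 and 99 and walking through the cumulative weights. The sketch below shows that technique; WeightedPickSketch is a hypothetical name, whole-number percentages are assumed, and the tool may use a different method internally.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of weighted random selection using cumulative weights.
final class WeightedPickSketch {

    // 'weights' maps each value to a whole percentage; the percentages must add up to 100.
    static String pick(Map<String, Integer> weights, Random random) {
        int roll = random.nextInt(100);   // 0..99
        int cumulative = 0;
        for (Map.Entry<String, Integer> entry : weights.entrySet()) {
            cumulative += entry.getValue();
            if (roll < cumulative) {
                return entry.getKey();
            }
        }
        throw new IllegalArgumentException("weights do not add up to 100");
    }
}
```

A value with a weight of 5 is returned for 5 of the 100 possible rolls, so over many rows it appears in roughly 5 percent of them.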

A word list is optional. All values to be used for randomly generating data can also be defined solely in the yaml configuration file. The sum of the weight definitions cannot exceed 100 percent, and the weights of all values must add up to 100 percent. Individual values cannot have negative percentage values.

NOTE: If weights are defined for only some of the values, the datagenerator distributes the remaining percentage equally among the values without a weight definition. Depending on the number of such values, an exactly even distribution might not be possible. In this case some values from the word list get a slightly higher weight. If the weight definitions leave less than 1 percent for the other values, an error occurs.
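The distribution rule in the note can be sketched with integer arithmetic. WeightDistributionSketch is a hypothetical name. The sketch assumes whole-number percentages, gives the leftover percentage points to the first unweighted values, and treats "less than 1 percent" as less than 1 percent per unweighted value. These choices are assumptions and may differ from the tool.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of completing partial weight definitions to 100 percent.
final class WeightDistributionSketch {

    // 'wordList' holds the values from the file, 'defined' the weights from the configuration.
    static Map<String, Integer> distribute(List<String> wordList, Map<String, Integer> defined) {
        // values from the configuration are added to those from the word list
        Set<String> all = new LinkedHashSet<>(wordList);
        all.addAll(defined.keySet());

        int definedSum = 0;
        for (int weight : defined.values()) {
            if (weight < 0) throw new IllegalArgumentException("negative weight");
            definedSum += weight;
        }
        int remaining = 100 - definedSum;
        int unweighted = all.size() - defined.size();
        if (remaining < 0
                || (unweighted == 0 && remaining != 0)
                || (unweighted > 0 && remaining < unweighted)) {
            throw new IllegalArgumentException("weights cannot be completed to 100 percent");
        }
        int share = unweighted == 0 ? 0 : remaining / unweighted;
        int extra = unweighted == 0 ? 0 : remaining % unweighted;

        Map<String, Integer> result = new LinkedHashMap<>();
        for (String value : all) {
            if (defined.containsKey(value)) {
                result.put(value, defined.get(value));
            } else {
                // the first 'extra' unweighted values get one additional percentage point
                result.put(value, share + (extra-- > 0 ? 1 : 0));
            }
        }
        return result;
    }
}
```

With a word list of spring, summer, autumn and winter and a configured weight of 40 for winter, the other three seasons get 20 each. With three values and no defined weights the result is 34, 33 and 33, which is the uneven case the note mentions.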

Available options:

Option	Description	Data Type	Default
categoryFileSeparator	separator between value and weight in category file	string	,

Available transformations:

Transformation	Description	Parameters
uppercase	convert the value to uppercase	none
lowercase	convert the value to lowercase	none
reverse	reverse the characters of the value	none
prepend	add a prefix to the value	prefix to add (string)
append	add a suffix to the value	suffix to add (string)
base64encode	encode the value to base64 format	none
encrypt	encrypt the value using AES/CBC/PKCS5Padding algorithm	none
maskLeading	mask leading characters of the value using a mask character	number of characters to mask (long), mask character(s) to use (string)
maskTrailing	mask trailing characters of the value using a mask character	number of characters to mask (long), mask character(s) to use (string)
trim	remove leading and trailing spaces	none
replaceAll	replaces each substring of the value that matches the given regular expression with the given replacement	regular expression (string), replacement (string)
remove	remove all specified characters from the value	a string containing all characters to remove
toLong	convert the value to a long value	none
toBoolean	convert the value to a boolean value	none

Processing steps

First, the program configuration and data configuration yaml files are checked for correctness. If a database file with the specified name is found, any existing table definitions and data are removed from DuckDB.

After that the value for each field is generated and then transformed (if any transformations are specified). The fields are processed sequentially and build a row of data. The tool generates the desired number of rows and stores them in a local DuckDB instance. Finally, the data is exported to the desired output format.

The DuckDB database is not deleted after the process is completed. You can remove it manually or keep it to work further with the generated data.

Yaml configuration for the datagenerator2 tool

The configuration file contains various attributes to steer the behavior of the datagenerator tool.

the name of the export file for the generated data
the type of the export file: csv, excel, parquet or json
the number of rows to generate
after how many generated rows a log message will be output
the name of the auto-generated row number field (default: rownumber). This field is always added to every generated table as a LONG column and contains a unique sequential number for each row. It can be used as a primary key or for joins between tables generated in separate runs.
details for the export to a csv file - delimiter and header settings
details for the export to a json file - output as separate lines or as array
details for the export to a parquet file or partitioned file
details for the export to an excel file

See the sample yaml files in this repository under: samples/programconfiguration.

Yaml configuration for the definition of fields to generate

The configuration file contains a list of fields/attributes to generate - see the sample yaml files in this repository under: samples/dataconfiguration. For each field, options and transformations may be defined depending on the type of generator used.

There are three generic attributes defined in the configuration file: name, databaseName and tableName. The name attribute assigns a name to the configuration but is otherwise not used. The databaseName attribute defines the path and name for the DuckDB database that is used to collect the generated data. The tableName attribute defines the table of the DuckDB database where the generated data is stored. If you run a configuration multiple times but with different table names, the database will contain the data of both runs. If you run a configuration multiple times but do not change the table name, the data of the second run will overwrite all data of the first run (the data of the first run will be removed).

Fields is a list of fields for which data is to be generated. Each field has a unique name and is assigned a type. Fields may have additional (optional) options. Fields may have one or more transformations assigned, and the transformations may require additional parameters.

A substructure can be created by separating the structure and the field name with a dot, e.g. address.street, address.city, person.country.name, etc. The first two create a substructure named "address" with the fields street and city. Multiple levels of substructures may be defined.

Take care not to create conflicting structures. For example, if you create a structure person.city.name.firstname then you cannot also define person.city or person.city.name as a field. But you can have a structure person.city.location.
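The way dotted field names map to a nested output, and why person.city.name cannot be both a field and a structure, can be sketched as follows. NestedRowSketch is a hypothetical class that works on plain maps and is only an illustration, not the tool's export code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turns dotted field names into nested maps.
final class NestedRowSketch {

    @SuppressWarnings("unchecked")
    static Map<String, Object> nest(Map<String, Object> flatRow) {
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : flatRow.entrySet()) {
            String[] path = field.getKey().split("\\.");
            Map<String, Object> node = root;
            // walk or create the substructures for all but the last path element
            for (int i = 0; i < path.length - 1; i++) {
                Object child = node.computeIfAbsent(path[i], k -> new LinkedHashMap<String, Object>());
                if (!(child instanceof Map)) {
                    throw new IllegalArgumentException(path[i] + " is both a field and a structure");
                }
                node = (Map<String, Object>) child;
            }
            String leaf = path[path.length - 1];
            if (node.containsKey(leaf)) {
                throw new IllegalArgumentException(field.getKey() + " clashes with an existing structure or field");
            }
            node.put(leaf, field.getValue());
        }
        return root;
    }
}
```

A row with address.street and address.city ends up as one address entry holding both fields, while a row containing both person.city.name.firstname and person.city.name is rejected.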

Fields of type=category may either specify valid values in the configuration file or in a category file or both, but one of them must be present. The definition for values contains the value itself and optionally a weight for the value.

Running the datagenerator tool

To run the tool you must pass at least the mandatory arguments to the program as shown below. These point to the program configuration file and the data configuration file. You may pass the other arguments, which will override the relevant default value as well as the value from the program configuration file.

Program arguments:

Argument	Type	Default	Description
-n=<number>	optional	10000	number of rows to generate
-l=<number>	optional	1000	interval for log messages during data generation
-g=<loglevel>	optional	INFO	log level to be used for logging output. must be one of: OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE, ALL
-xp=<path+filename>	optional	datagenerator_export.csv	path and filename of the export file
-xt=<type>	optional	csv	type of the export to generate. possible values: csv, excel, json, parquet
-cd=<delimiter>	optional		delimiter to be used for export files of type CSV
-ch	optional		indicator if a header row should be output for export files of type CSV
-dc=<path+filename>	mandatory	-none-	path and filename of the data configuration yaml file
-pc=<path+filename>	mandatory	-none-	path and filename of the program configuration yaml file
-s	optional	false	output statistics for the generated field values
-h / --help	optional		display help about the available program arguments
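The precedence described above (default value, then program configuration file, then command line argument) can be sketched with simple maps. SettingsSketch and the use of the bare argument names as map keys are hypothetical; the tool's own argument parsing is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the override order for settings.
final class SettingsSketch {

    // Parses arguments of the form -key=value or -flag into a map.
    static Map<String, String> parseArguments(String[] args) {
        Map<String, String> parsed = new HashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("-")) {
                continue;
            }
            int eq = arg.indexOf('=');
            if (eq < 0) {
                parsed.put(arg.substring(1), "true");            // flag such as -ch or -s
            } else {
                parsed.put(arg.substring(1, eq), arg.substring(eq + 1));
            }
        }
        return parsed;
    }

    // Later sources win: defaults < program configuration file < command line.
    static Map<String, String> resolve(Map<String, String> defaults,
                                       Map<String, String> fromFile,
                                       Map<String, String> fromArguments) {
        Map<String, String> result = new HashMap<>(defaults);
        result.putAll(fromFile);
        result.putAll(fromArguments);
        return result;
    }
}
```

With a default of 10000 rows, a configured value of 2000 and the argument -n=500, the resolved number of rows in this sketch is 500.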

Run the datagenerator tool:

java -jar datagenerator2-<version>-jar-with-dependencies.jar -pc=<program configuration file> -dc=<data configuration file>

You can get help about the available program arguments by running:

java -jar datagenerator2-<version>-jar-with-dependencies.jar --help

See the sample yaml file for the program configuration in this repository under: samples/programconfiguration

Using datagenerator2 programmatically

You may also use the tool programmatically by adding it as a dependency to your project. The RowGenerator class provides a lazy infinite stream of rows. The caller controls termination via limit(), takeWhile(), or any other stream operation. Each element is wrapped in a Try; filter on Try::isSuccess to keep only the successful rows.

Add the following dependency to your Maven pom.xml:

<dependency>
    <groupId>io.github.uwegeercken</groupId>
    <artifactId>datagenerator2</artifactId>
    <version>0.4.5</version>
</dependency>

The artifact is available on Maven Central: https://central.sonatype.com/artifact/io.github.uwegeercken/datagenerator2

Example usage:

// generate a fixed number of rows
List<Row> rows = rowGenerator.generateRows(100)
    .filter(Try::isSuccess)
    .map(Try::getResult)
    .toList();

// generate rows lazily until a condition is met
rowGenerator.generateRows()
    .filter(Try::isSuccess)
    .map(Try::getResult)
    .takeWhile(row -> someCondition(row))
    .forEach(row -> process(row));

// generate a single row
Try<Row> row = rowGenerator.generateRow();

Building the datagenerator jar file

To build the jar file, either download the source of a release from https://github.com/uwegeercken/datagenerator2/tags or clone this repository, then run:

mvn clean install

last update: uwe geercken - uwe.geercken@web.de - 2026-03-07
