Commit 6eca600

Update documentation
1 parent ba82154 commit 6eca600

2 files changed

+180
-0
lines changed

docs/USING_THE_APIS.md

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ Note that for Sphinx source documents (.rst docs), correct rendering is only pro
 - [Using data distributions](source/DISTRIBUTIONS.md)
 - [Generating text data](source/textdata.rst)
 - [Repeatable data generation](source/repeatable_data_generation.rst)
+- [Creating data generators from configuration](source/serialized_data_generators.rst)
 - [Generating CDC data](source/generating_cdc_data.rst)
 - [Multi-table data generation](source/multi_table_data.rst)
 - [Troubleshooting](source/troubleshooting.rst)
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
.. Databricks Labs Data Generator documentation master file, created by
   sphinx-quickstart on Sun Jun 21 10:54:30 2020.

Creating Data Generation Specs from Configuration
=================================================

Data generation specifications can be converted to and from configuration (either Python dictionaries or JSON strings).
This section shows conversion between configuration and data generators, columns, and constraints.

Getting Data Generator Configuration Options
--------------------------------------------

The options needed to create a ``DataGenerator`` can be retrieved as a dictionary via the ``constructorOptions`` property.

.. code-block:: python

   from pyspark.sql.types import LongType, IntegerType, StringType
   import dbldatagen as dg

   # Create a sample data generator with a few columns:
   testDataSpec = (
       dg.DataGenerator(spark, name="users_dataset", rows=1000, randomSeedMethod='hash_fieldname')
       .withIdOutput()
       .withColumn("user_name", StringType(), expr="concat('user_', id)")
       .withColumn("email_address", StringType(), expr="concat(user_name, '@email.com')")
       .withColumn("phone_number", StringType(), template="555-DDD-DDDD")
   )

   # Get the data generation options as a Python dictionary:
   dataSpecOptions = testDataSpec.constructorOptions

Accessing ``constructorOptions`` returns the properties of the ``DataGenerator`` (e.g. ``rows``, ``randomSeedMethod``) as
root-level keys. The dictionaries for the associated ``ColumnGenerationSpec`` and ``Constraint`` objects are returned under
the ``columns`` and ``constraints`` keys.

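As a rough illustration of the layout described above, the following sketch walks a hand-written dictionary shaped like
the result of ``constructorOptions``. The dictionary literal and the helper ``summarize_options`` are hypothetical and are
not part of ``dbldatagen``; the exact keys returned by the library may differ.

```python
# Hand-written stand-in for the dictionary returned by constructorOptions.
# The exact keys and values produced by dbldatagen may differ.
sample_options = {
    "name": "users_dataset",
    "rows": 1000,
    "randomSeedMethod": "hash_fieldname",
    "columns": [
        {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
        {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"},
    ],
    "constraints": [],
}


def summarize_options(options):
    """Split an options dictionary into generator-level settings and column names."""
    nested_keys = ("columns", "constraints")
    generator_settings = {k: v for k, v in options.items() if k not in nested_keys}
    column_names = [col["colName"] for col in options.get("columns", [])]
    return generator_settings, column_names


settings, column_names = summarize_options(sample_options)
print(settings)
print(column_names)
```

Separating the root-level keys from the nested lists in this way makes it easier to review or diff a stored configuration.
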
Creating Data Generators from Configuration
-------------------------------------------

``DataGenerator`` objects and their associated objects can be created from configuration by calling
``fromConstructorOptions``.

.. code-block:: python

   import dbldatagen as dg

   # Define the data generation options:
   dataSpecOptions = {
       "name": "users_dataset",
       "rows": 1000,
       "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"}
       ]
   }

   # Create the DataGenerator from options and keep a reference to it:
   testDataSpec = dg.DataGenerator.fromConstructorOptions(dataSpecOptions)

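Hand-written options are easy to get wrong, so a quick check before building the generator can help. The sketch below
is plain Python and does not call ``dbldatagen``; the helper ``find_unnamed_columns`` is hypothetical, and treating
``colName`` as mandatory is an assumption based on the examples in this section.

```python
def find_unnamed_columns(options):
    """Return the positions of column entries that lack a ``colName`` key."""
    return [
        index
        for index, column in enumerate(options.get("columns", []))
        if not column.get("colName")
    ]


data_spec_options = {
    "name": "users_dataset",
    "rows": 1000,
    "columns": [
        {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
        {"colType": "string", "template": "555-DDD-DDDD"},  # colName left out on purpose
    ],
}

# Assumption: every column entry needs a colName before it is passed on.
unnamed = find_unnamed_columns(data_spec_options)
print(unnamed)
```
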
Advanced Configuration Syntax
-----------------------------

When adding constraints, distributions, text generators, or data ranges via configuration, specify the object's
constructor arguments as a Python dictionary and include the class name in the ``kind`` property.

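The ``kind`` convention can be wrapped in a small helper so that the class name and the constructor arguments stay
together. This is only a sketch: ``with_kind`` is a hypothetical convenience function, not part of ``dbldatagen``, and
the argument names are copied from the examples in this section.

```python
def with_kind(kind, **constructor_args):
    """Build a nested configuration dictionary tagged with a class name."""
    return {"kind": kind, **constructor_args}


# Text generator options, tagged with the class name in "kind".
text_options = with_kind("ILText", sentences=3, words=10)

# Constraint options follow the same pattern.
constraint_options = with_kind("PositiveValues", columns="total_logins", strict=True)

description_column = {
    "colName": "description",
    "colType": "string",
    "text": text_options,
}

print(description_column)
print(constraint_options)
```
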
To define a column with a data range, pass a dictionary with the ``DateRange`` or ``NRange`` options.

.. code-block:: python

   dataSpecOptions = {
       "name": "users_dataset",
       "rows": 1000,
       "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"},
           {"colName": "created_on", "colType": "date", "dataRange": {
               "kind": "DateRange", "begin": "2020-01-01", "end": "2025-01-01",
               "interval": "1 DAY", "datetime_format": "yyyy-MM-dd"}}
       ]
   }

To define a column with a distribution, pass a dictionary with the ``Distribution`` options.

.. code-block:: python

   dataSpecOptions = {
       "name": "users_dataset", "rows": 1000, "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"},
           {"colName": "total_logins", "colType": "int", "distribution": {
               "kind": "Normal", "mean": "100", "stddev": "10"}}
       ]
   }

To define a column with a text generator, pass a dictionary with the ``TextGenerator`` options.

.. code-block:: python

   dataSpecOptions = {
       "name": "users_dataset", "rows": 1000, "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"},
           {"colName": "description", "colType": "string", "text": {
               "kind": "ILText", "sentences": 3, "words": 10}}
       ]
   }

To define constraints, pass a list of dictionaries with the ``Constraint`` options under the ``constraints`` key.

.. code-block:: python

   dataSpecOptions = {
       "name": "users_dataset", "rows": 1000, "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"},
           {"colName": "total_logins", "colType": "int", "distribution": {
               "kind": "Normal", "mean": "100", "stddev": "10"}}
       ],
       "constraints": [
           {"kind": "PositiveValues", "columns": "total_logins", "strict": True}
       ]
   }

.. note::

   Columns that use ``PyfuncText``, ``PyfuncTextFactory``, and ``FakerTextFactory`` are not currently serializable to
   and from configuration.

Using JSON Configuration
------------------------

Data generators can be converted to and from JSON. This allows users to repeatedly generate datasets via options stored
in files.

Use ``toJson`` to generate a JSON string from a ``DataGenerator``.

.. code-block:: python

   from pyspark.sql.types import LongType, IntegerType, StringType
   import dbldatagen as dg

   # Create a sample data generator with a few columns:
   testDataSpec = (
       dg.DataGenerator(spark, name="users_dataset", rows=1000, randomSeedMethod='hash_fieldname')
       .withIdOutput()
       .withColumn("user_name", StringType(), expr="concat('user_', id)")
       .withColumn("email_address", StringType(), expr="concat(user_name, '@email.com')")
       .withColumn("phone_number", StringType(), template="555-DDD-DDDD")
   )

   # Create a JSON string with the data generation config:
   jsonStr = testDataSpec.toJson()

Use ``fromJson`` to create a ``DataGenerator`` from a JSON string.

.. code-block:: python

   from pyspark.sql.types import LongType, IntegerType, StringType
   import dbldatagen as dg

   # Define the data generation options:
   jsonStr = '''{
       "name": "users_dataset",
       "rows": 1000,
       "randomSeedMethod": "hash_fieldname",
       "columns": [
           {"colName": "user_name", "colType": "string", "expr": "concat('user_', id)"},
           {"colName": "phone_number", "colType": "string", "template": "555-DDD-DDDD"}
       ]
   }'''

   # Create a data generator from the JSON string:
   testDataSpec = dg.DataGenerator.fromJson(jsonStr)
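
Because the configuration is ordinary JSON, it can be stored in a file and read back with the standard library. The
sketch below uses only ``json``, ``tempfile``, and ``pathlib`` and does not call ``dbldatagen``; the file name is
hypothetical. Note that the Python literal ``True`` is written as ``true`` in the JSON text.

```python
import json
import tempfile
from pathlib import Path

options = {
    "name": "users_dataset",
    "rows": 1000,
    "randomSeedMethod": "hash_fieldname",
    "columns": [
        {"colName": "total_logins", "colType": "int"},
    ],
    "constraints": [
        {"kind": "PositiveValues", "columns": "total_logins", "strict": True},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "users_dataset.json"  # hypothetical file name
    # Serialize the options; json.dumps converts Python True to JSON true.
    config_path.write_text(json.dumps(options, indent=2), encoding="utf-8")
    # Read the text back, as a later run would before building a generator.
    json_str = config_path.read_text(encoding="utf-8")

restored = json.loads(json_str)
print('"strict": true' in json_str)
print(restored == options)
```

The string read back from the file is the kind of value that would be passed to ``fromJson``.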
