
Commit 1355bec

Doc updates 032223 (#180)
* wip * wip * wip * wip * wip * wip * wip
1 parent 4e55e2e commit 1355bec

20 files changed: +712 −532 lines changed

README.md

Lines changed: 27 additions & 27 deletions
@@ -19,36 +19,36 @@
 -->

 ## Project Description
-The `dbldatgen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
-environment using Spark. The generated data may be used for testing, benchmarking, demos and many
+The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
+environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
 other uses.

 It operates by defining a data generation specification in code that controls
-how the synthetic data is to be generated.
-The specification may incorporate use of existing schemas, or create data in an adhoc fashion.
+how the synthetic data is generated.
+The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

-It has no dependencies on any libraries that are not already incuded in the Databricks
+It has no dependencies on any libraries that are not already installed in the Databricks
 runtime, and you can use it from Scala, R or other languages by defining
 a view over the generated data.

 ### Feature Summary
 It supports:
 * Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
-* Generating repeatable, predictable data supporting the needs for producing multiple tables, Change Data Capture,
+* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
 merge and join scenarios with consistency between primary and foreign keys
 * Generating synthetic data for all of the
 Spark SQL supported primitive types as a Spark data frame which may be persisted,
 saved to external storage or
 used in other computations
-* Generating ranges of dates, timestamps and numeric values
+* Generating ranges of dates, timestamps, and numeric values
 * Generation of discrete values - both numeric and text
 * Generation of values at random and based on the values of other fields
 (either based on the `hash` of the underlying values or the values themselves)
 * Ability to specify a distribution for random data generation
-* Generating arrays of values for ML style feature arrays
+* Generating arrays of values for ML-style feature arrays
 * Applying weights to the occurrence of values
 * Generating values to conform to a schema or independent of an existing schema
-* use of SQL expressions in test data generation
+* use of SQL expressions in synthetic data generation
 * plugin mechanism to allow use of 3rd party libraries such as Faker
 * Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

@@ -61,26 +61,26 @@ Please refer to the [online documentation](https://databrickslabs.github.io/dbld
 details of use and many examples.

 Release notes and details of the latest changes for this specific release
-can be found in the Github repository
+can be found in the GitHub repository
 [here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.3post2/CHANGELOG.md)

 # Installation

-Use `pip install dbldatagen` to install the PyPi package
+Use `pip install dbldatagen` to install the PyPi package.

 Within a Databricks notebook, invoke the following in a notebook cell
 ```commandline
 %pip install dbldatagen
 ```

-This can be invoked within a Databricks notebook, a Delta Live Tables pipeline and even works on the Databricks
-community edition.
+The Pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline
+and even works on the Databricks community edition.

 The documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
 contains details of installation using alternative mechanisms.

 ## Compatibility
-The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
+The Databricks Labs Data Generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
 compatible with the Databricks runtime 9.1 LTS and later releases.

 Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
@@ -91,9 +91,9 @@ release notes for library compatibility

 - https://docs.databricks.com/release-notes/runtime/releases.html

-When using the Databricks Labs Data Generator on Unity Catalog enabled environments, the Data Generator requires
+When using the Databricks Labs Data Generator on "Unity Catalog" enabled environments, the Data Generator requires
 the use of `Single User` or `No Isolation Shared` access modes as some needed features are not available in `Shared`
-mode (for example, use of 3rd party libraries). Depending on settings, `Custom` access mode may be supported.
+mode (for example, use of 3rd party libraries). Depending on settings, the `Custom` access mode may be supported.

 See the following documentation for more information:
@@ -134,30 +134,30 @@ num_rows=df.count()
 Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
 examples.

-The Github repository also contains further examples in the examples directory
+The GitHub repository also contains further examples in the examples directory.

 ## Spark and Databricks Runtime Compatibility
-The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime including
-older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes
+The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
+older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
 including `current` and `preview`.

-While we dont specifically drop support for older runtimes, changes in Pyspark APIs or
-APIs from dependent packages such as `numpy`, `pandas`, `pyarrow` and `pyparsing` make cause issues with older
+While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
+APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` make cause issues with older
 runtimes.

-Installing `dbldatagen` explicitly does not install releases of dependent packages so as to preserve the curated
-set of packages installed in any Databricks runtime environment.
+By design, installing `dbldatagen` does not install releases of dependent packages in order
+to preserve the curated set of packages pre-installed in any Databricks runtime environment.

-When building on local environments, the `Pipfile` and requirements files are used to determine the versions
-tested against for releases and unit tests.
+When building on local environments, the build process uses the `Pipfile` and requirements files to determine
+the package versions for releases and unit tests.

 ## Project Support
 Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
 are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
-(SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket
+(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
 relating to any issues arising from the use of these projects.

-Any issues discovered through the use of this project should be filed as issues on the Github Repo.
+Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
 They will be reviewed as time permits, but there are no formal SLAs for support.

dbldatagen/column_generation_spec.py

Lines changed: 2 additions & 1 deletion
@@ -56,7 +56,8 @@ class ColumnGenerationSpec(object):
 :param distribution: Instance of distribution, that will control the distribution of the generated values
 :param baseColumn: String or list of strings representing columns used as basis for generating the column data
 :param randomSeed: random seed value used to generate the random value, if column data is random
-:param randomSeedMethod: method for computing random values from the random seed
+:param randomSeedMethod: method for computing random values from the random seed. It may take on the
+values `fixed`, `hash_fieldname` or None

 :param implicit: If True, the specification for the column can be replaced by a later definition.
 If not, a later attempt to replace the definition will flag an error.
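The difference between the two named `randomSeedMethod` values can be illustrated with a small sketch. This is not dbldatagen's actual implementation; the function name `compute_column_seed` and the CRC-based hash are assumptions used only to show how `fixed` and `hash_fieldname` might differ in behavior:

```python
import zlib

def compute_column_seed(base_seed, field_name, method):
    """Illustrative sketch (not the library's code) of the two documented
    seed strategies: `fixed` reuses the base seed for every column, while
    `hash_fieldname` derives a per-column seed from the column name."""
    if method == "fixed" or method is None:
        return base_seed
    if method == "hash_fieldname":
        # crc32 gives a stable, deterministic per-name value across runs
        return zlib.crc32(field_name.encode("utf-8"))
    raise ValueError(f"unknown randomSeedMethod: {method}")

# With `fixed`, every column shares one seed; with `hash_fieldname`,
# different column names yield different (but repeatable) seeds.
print(compute_column_seed(42, "status", "fixed"))
print(compute_column_seed(42, "status", "hash_fieldname"))
```

The practical consequence is that `hash_fieldname` keeps columns statistically independent of each other while remaining repeatable between runs.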

dbldatagen/column_spec_options.py

Lines changed: 4 additions & 4 deletions
@@ -46,15 +46,15 @@ class ColumnSpecOptions(object):

 :param weights: List of discrete weights for the colummn. Should be integer values.
 For example, you might declare a column for status values with a weighted distribution with
-the following statement: \
+the following statement:
 `withColumn("status", StringType(), values=['online', 'offline', 'unknown'], weights=[3,2,1])`

 :param percentNulls: Specifies numeric percentage of generated values to be populated with SQL `null`.
 Value is fraction representing percentage between 0.0 and 1.0
 For example: `percentNulls=0.12` will give approximately 12% nulls for this field in the
 output.
-s
-:param unique_values: Number of unique values for column.
+
+:param uniqueValues: Number of unique values for column.
 If the unique values are specified for a timestamp or date field, the values will be chosen
 working back from the end of the previous month,
 unless `begin`, `end` and `interval` parameters are specified

@@ -76,7 +76,7 @@ class ColumnSpecOptions(object):

 :param template: template controlling how text should be generated

-:param text_separator: string specifying separator to be used when constructing strings with prefix and suffix
+:param textSeparator: string specifying separator to be used when constructing strings with prefix and suffix

 :param prefix: string specifying prefix text to construct field from prefix and numeric value. Both `prefix` and
 `suffix` can be used together
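The effect of the `weights` option documented above can be illustrated without Spark. This is a minimal pure-Python sketch, not dbldatagen's implementation: it only shows how weights `[3, 2, 1]` bias the relative frequency of the discrete values from the docstring's example:

```python
import random
from collections import Counter

# Values and weights taken from the docstring example above:
# 'online' should appear roughly 3x as often as 'unknown'.
values = ["online", "offline", "unknown"]
weights = [3, 2, 1]

random.seed(42)  # fixed seed for repeatable output, in the spirit of `randomSeed`
sample = random.choices(values, weights=weights, k=6000)

counts = Counter(sample)
# Expect roughly 3000 / 2000 / 1000 occurrences respectively.
print(counts["online"], counts["offline"], counts["unknown"])
```

In the library itself the weighting is applied per row by Spark at scale; the sketch only demonstrates the distribution the option describes.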

dbldatagen/data_generator.py

Lines changed: 2 additions & 1 deletion
@@ -238,13 +238,14 @@ def useSeed(cls, seedVal):
 """
 cls._randomSeed = seedVal

-@deprecated('Use `useSeed` instead')
+@deprecated("Use `useSeed` instead")
 @classmethod
 def use_seed(cls, seedVal):
 """ set seed for random number generation

 Arguments:
 :param seedVal: - new value for the random number seed
+
 """
 cls._randomSeed = seedVal

dbldatagen/text_generator_plugins.py

Lines changed: 3 additions & 1 deletion
@@ -63,6 +63,7 @@ class _FnCallContext:
 of the `initFn` calls

 :param txtGen: - reference to outer PyfnText object
+
 """

 def __init__(self, txtGen):

@@ -185,7 +186,8 @@ def initFaker(ctx):
 def __init__(self, name=None):
 """

-:param name:
+:param name: name of generated object (when converted to string via ``str``)
+
 """
 self._initFn = None
 self._rootProperty = None

dbldatagen/utils.py

Lines changed: 7 additions & 7 deletions
@@ -15,11 +15,13 @@


 def deprecated(message=""):
-''' Define a deprecated decorator without dependencies on 3rd party libraries
+""" Define a deprecated decorator without dependencies on 3rd party libraries

 Note there is a 3rd party library called `deprecated` that provides this feature but goal is to only have
 dependencies on packages already used in the Databricks runtime
-'''
+"""
+
+# create closure around function that follows use of the decorator

 def deprecated_decorator(func):
 @functools.wraps(func)
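The closure pattern mentioned in the new comment can be sketched end to end. This is an illustrative reconstruction in the spirit of the `deprecated` decorator in `dbldatagen/utils.py`, not the library's exact code; the warning text and category are assumptions:

```python
import functools
import warnings

def deprecated(message=""):
    """Sketch of a dependency-free `deprecated` decorator (illustrative,
    not the library's exact implementation)."""
    # create closure around the function that follows use of the decorator
    def deprecated_decorator(func):
        @functools.wraps(func)  # preserve __name__ and docstring of the wrapped function
        def wrapper(*args, **kwargs):
            warnings.warn(f"`{func.__name__}` is deprecated. {message}",
                          category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return deprecated_decorator

# Usage mirroring the change in data_generator.py above:
@deprecated("Use `useSeed` instead")
def use_seed(seed_val):
    return seed_val
```

Note how `functools.wraps` keeps the wrapped function's metadata intact, which is why the library can apply the decorator without breaking introspection or Sphinx documentation.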
@@ -156,7 +158,7 @@ def topologicalSort(sources, initial_columns=None, flatten=True):


 def parse_time_interval(spec):
-'''parse time interval from string'''
+"""parse time interval from string"""
 hours = 0
 minutes = 0
 weeks = 0
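The shape of `parse_time_interval` suggested by the accumulator variables above (weeks, hours, minutes, ...) can be sketched as follows. This is an assumption-laden sketch, not the library's parser: the accepted grammar (e.g. `"1 hours, 30 minutes"`) and the regex are illustrative only:

```python
import re
from datetime import timedelta

def parse_time_interval(spec):
    """Sketch: parse a time-interval string such as "1 hours, 30 minutes"
    into a timedelta. Unit names and grammar are assumptions; the library's
    actual parser may accept a different syntax."""
    units = {"weeks": 0, "days": 0, "hours": 0, "minutes": 0, "seconds": 0}
    # accumulate each "<number> <unit>" clause found in the spec string
    for amount, unit in re.findall(
            r"(\d+)\s*(week|day|hour|minute|second)s?", spec.lower()):
        units[unit + "s"] += int(amount)
    return timedelta(**units)

print(parse_time_interval("1 hours, 30 minutes"))  # → 1:30:00
```

Returning a `timedelta` keeps the result directly usable for date and timestamp range arithmetic.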
@@ -220,7 +222,7 @@ def parse_time_interval(spec):
 def split_list_matching_condition(lst, cond):
 """ Split a list on elements that match a condition

-This will find all matches of a specific condition in the list and split the list into sublists around the
+This will find all matches of a specific condition in the list and split the list into sub lists around the
 element that matches this condition.

 It will handle multiple matches performing splits on each match.
@@ -239,6 +241,7 @@ def split_list_matching_condition(lst, cond):
 :arg cond: lambda function or function taking single argument and returning True or False
 :returns: list of sublists
 """
+retval = []

 def match_condition(matchList, matchFn):
 """Return first index of element of list matching condition"""

@@ -251,9 +254,6 @@ def match_condition(matchList, matchFn):

 return -1

-# main code
-retval = []
-
 if lst is None:
 retval = lst
 elif len(lst) == 1:
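The behavior described in the `split_list_matching_condition` docstring can be sketched with an iterative version. The semantics assumed here (each matching element isolated as its own singleton sublist, runs of non-matching elements kept grouped, `None` passed through) are a plausible reading of the docstring, not necessarily identical to the library's recursive implementation:

```python
def split_list_matching_condition(lst, cond):
    """Sketch: split `lst` into sublists around elements matching `cond`.
    Assumed semantics: matching elements become singleton sublists;
    consecutive non-matching elements stay grouped together."""
    if lst is None:
        return lst
    result, current = [], []
    for item in lst:
        if cond(item):
            if current:          # close out the run of non-matching elements
                result.append(current)
                current = []
            result.append([item])  # matching element isolated as its own sublist
        else:
            current.append(item)
    if current:
        result.append(current)
    return result

print(split_list_matching_condition(
    ["id", "city_name", "id", "city_id", "city_pop"],
    lambda el: el == "id"))
```

This handles multiple matches by performing a split at each one, which matches the docstring's stated behavior.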
