
Commit 1355bec

Doc updates 032223 (#180)
* wip * wip * wip * wip * wip * wip * wip
1 parent 4e55e2e commit 1355bec

20 files changed: +712 −532 lines changed

README.md

Lines changed: 27 additions & 27 deletions
@@ -19,36 +19,36 @@
 -->

 ## Project Description
-The `dbldatgen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
-environment using Spark. The generated data may be used for testing, benchmarking, demos and many
+The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
+environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
 other uses.

 It operates by defining a data generation specification in code that controls
-how the synthetic data is to be generated.
-The specification may incorporate use of existing schemas, or create data in an adhoc fashion.
+how the synthetic data is generated.
+The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

-It has no dependencies on any libraries that are not already incuded in the Databricks
+It has no dependencies on any libraries that are not already installed in the Databricks
 runtime, and you can use it from Scala, R or other languages by defining
 a view over the generated data.

 ### Feature Summary
 It supports:
 * Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
-* Generating repeatable, predictable data supporting the needs for producing multiple tables, Change Data Capture,
+* Generating repeatable, predictable data supporting the need for producing multiple tables, Change Data Capture,
 merge and join scenarios with consistency between primary and foreign keys
 * Generating synthetic data for all of the
 Spark SQL supported primitive types as a Spark data frame which may be persisted,
 saved to external storage or
 used in other computations
-* Generating ranges of dates, timestamps and numeric values
+* Generating ranges of dates, timestamps, and numeric values
 * Generation of discrete values - both numeric and text
 * Generation of values at random and based on the values of other fields
 (either based on the `hash` of the underlying values or the values themselves)
 * Ability to specify a distribution for random data generation
-* Generating arrays of values for ML style feature arrays
+* Generating arrays of values for ML-style feature arrays
 * Applying weights to the occurrence of values
 * Generating values to conform to a schema or independent of an existing schema
-* use of SQL expressions in test data generation
+* use of SQL expressions in synthetic data generation
 * plugin mechanism to allow use of 3rd party libraries such as Faker
 * Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

@@ -61,26 +61,26 @@ Please refer to the [online documentation](https://databrickslabs.github.io/dbld
 details of use and many examples.

 Release notes and details of the latest changes for this specific release
-can be found in the Github repository
+can be found in the GitHub repository
 [here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.3post2/CHANGELOG.md)

 # Installation

-Use `pip install dbldatagen` to install the PyPi package
+Use `pip install dbldatagen` to install the PyPi package.

 Within a Databricks notebook, invoke the following in a notebook cell
 ```commandline
 %pip install dbldatagen
 ```

-This can be invoked within a Databricks notebook, a Delta Live Tables pipeline and even works on the Databricks
-community edition.
+The Pip install command can be invoked within a Databricks notebook, a Delta Live Tables pipeline
+and even works on the Databricks community edition.

 The documentation [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html)
 contains details of installation using alternative mechanisms.

 ## Compatibility
-The Databricks Labs data generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
+The Databricks Labs Data Generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
 compatible with the Databricks runtime 9.1 LTS and later releases.

 Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS
@@ -91,9 +91,9 @@ release notes for library compatibility

 - https://docs.databricks.com/release-notes/runtime/releases.html

-When using the Databricks Labs Data Generator on Unity Catalog enabled environments, the Data Generator requires
+When using the Databricks Labs Data Generator on "Unity Catalog" enabled environments, the Data Generator requires
 the use of `Single User` or `No Isolation Shared` access modes as some needed features are not available in `Shared`
-mode (for example, use of 3rd party libraries). Depending on settings, `Custom` access mode may be supported.
+mode (for example, use of 3rd party libraries). Depending on settings, the `Custom` access mode may be supported.

 See the following documentation for more information:
@@ -134,30 +134,30 @@ num_rows=df.count()
 Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
 examples.

-The Github repository also contains further examples in the examples directory
+The GitHub repository also contains further examples in the examples directory.

 ## Spark and Databricks Runtime Compatibility
-The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime including
-older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes
+The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
+older LTS versions at least from 10.4 LTS and later. It also aims to be compatible with Delta Live Table runtimes,
 including `current` and `preview`.

-While we dont specifically drop support for older runtimes, changes in Pyspark APIs or
-APIs from dependent packages such as `numpy`, `pandas`, `pyarrow` and `pyparsing` make cause issues with older
+While we don't specifically drop support for older runtimes, changes in Pyspark APIs or
+APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` make cause issues with older
 runtimes.

-Installing `dbldatagen` explicitly does not install releases of dependent packages so as to preserve the curated
-set of packages installed in any Databricks runtime environment.
+By design, installing `dbldatagen` does not install releases of dependent packages in order
+to preserve the curated set of packages pre-installed in any Databricks runtime environment.

-When building on local environments, the `Pipfile` and requirements files are used to determine the versions
-tested against for releases and unit tests.
+When building on local environments, the build process uses the `Pipfile` and requirements files to determine
+the package versions for releases and unit tests.

 ## Project Support
 Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
 are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
-(SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket
+(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
 relating to any issues arising from the use of these projects.

-Any issues discovered through the use of this project should be filed as issues on the Github Repo.
+Any issues discovered through the use of this project should be filed as issues on the GitHub Repo.
 They will be reviewed as time permits, but there are no formal SLAs for support.

dbldatagen/column_generation_spec.py

Lines changed: 2 additions & 1 deletion
@@ -56,7 +56,8 @@ class ColumnGenerationSpec(object):
 :param distribution: Instance of distribution, that will control the distribution of the generated values
 :param baseColumn: String or list of strings representing columns used as basis for generating the column data
 :param randomSeed: random seed value used to generate the random value, if column data is random
-:param randomSeedMethod: method for computing random values from the random seed
+:param randomSeedMethod: method for computing random values from the random seed. It may take on the
+values `fixed`, `hash_fieldname` or None

 :param implicit: If True, the specification for the column can be replaced by a later definition.
 If not, a later attempt to replace the definition will flag an error.
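The difference between the two named `randomSeedMethod` values can be illustrated with a small sketch. This is not dbldatagen's actual implementation; the function name `compute_column_seed` and the CRC-based hash are assumptions used only to show how `fixed` and `hash_fieldname` might differ in behavior:

```python
import zlib

def compute_column_seed(base_seed, field_name, method):
    """Illustrative sketch (not the library's code) of the two documented
    seed strategies: `fixed` reuses the base seed for every column, while
    `hash_fieldname` derives a per-column seed from the column name."""
    if method == "fixed" or method is None:
        return base_seed
    if method == "hash_fieldname":
        # crc32 gives a stable, deterministic per-name value across runs
        return zlib.crc32(field_name.encode("utf-8"))
    raise ValueError(f"unknown randomSeedMethod: {method}")

# With `fixed`, every column shares one seed; with `hash_fieldname`,
# different column names yield different (but repeatable) seeds.
print(compute_column_seed(42, "status", "fixed"))
print(compute_column_seed(42, "status", "hash_fieldname"))
```

The practical consequence is that `hash_fieldname` keeps columns statistically independent of each other while remaining repeatable between runs.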

dbldatagen/column_spec_options.py

Lines changed: 4 additions & 4 deletions
@@ -46,15 +46,15 @@ class ColumnSpecOptions(object):

 :param weights: List of discrete weights for the colummn. Should be integer values.
 For example, you might declare a column for status values with a weighted distribution with
-the following statement: \
+the following statement:
 `withColumn("status", StringType(), values=['online', 'offline', 'unknown'], weights=[3,2,1])`

 :param percentNulls: Specifies numeric percentage of generated values to be populated with SQL `null`.
 Value is fraction representing percentage between 0.0 and 1.0
 For example: `percentNulls=0.12` will give approximately 12% nulls for this field in the
 output.
-s
-:param unique_values: Number of unique values for column.
+
+:param uniqueValues: Number of unique values for column.
 If the unique values are specified for a timestamp or date field, the values will be chosen
 working back from the end of the previous month,
 unless `begin`, `end` and `interval` parameters are specified

@@ -76,7 +76,7 @@ class ColumnSpecOptions(object):

 :param template: template controlling how text should be generated

-:param text_separator: string specifying separator to be used when constructing strings with prefix and suffix
+:param textSeparator: string specifying separator to be used when constructing strings with prefix and suffix

 :param prefix: string specifying prefix text to construct field from prefix and numeric value. Both `prefix` and
 `suffix` can be used together
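The effect of the `weights` option documented above can be illustrated without Spark. This is a minimal pure-Python sketch, not dbldatagen's implementation: it only shows how weights `[3, 2, 1]` bias the relative frequency of the discrete values from the docstring's example:

```python
import random
from collections import Counter

# Values and weights taken from the docstring example above:
# 'online' should appear roughly 3x as often as 'unknown'.
values = ["online", "offline", "unknown"]
weights = [3, 2, 1]

random.seed(42)  # fixed seed for repeatable output, in the spirit of `randomSeed`
sample = random.choices(values, weights=weights, k=6000)

counts = Counter(sample)
# Expect roughly 3000 / 2000 / 1000 occurrences respectively.
print(counts["online"], counts["offline"], counts["unknown"])
```

In the library itself the weighting is applied per row by Spark at scale; the sketch only demonstrates the distribution the option describes.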

dbldatagen/data_generator.py

Lines changed: 2 additions & 1 deletion
@@ -238,13 +238,14 @@ def useSeed(cls, seedVal):
 """
 cls._randomSeed = seedVal

-@deprecated('Use `useSeed` instead')
+@deprecated("Use `useSeed` instead")
 @classmethod
 def use_seed(cls, seedVal):
 """ set seed for random number generation

 Arguments:
 :param seedVal: - new value for the random number seed
+
 """
 cls._randomSeed = seedVal

dbldatagen/text_generator_plugins.py

Lines changed: 3 additions & 1 deletion
@@ -63,6 +63,7 @@ class _FnCallContext:
 of the `initFn` calls

 :param txtGen: - reference to outer PyfnText object
+
 """

 def __init__(self, txtGen):

@@ -185,7 +186,8 @@ def initFaker(ctx):
 def __init__(self, name=None):
 """

-:param name:
+:param name: name of generated object (when converted to string via ``str``)
+
 """
 self._initFn = None
 self._rootProperty = None

dbldatagen/utils.py

Lines changed: 7 additions & 7 deletions
@@ -15,11 +15,13 @@


 def deprecated(message=""):
-''' Define a deprecated decorator without dependencies on 3rd party libraries
+""" Define a deprecated decorator without dependencies on 3rd party libraries

 Note there is a 3rd party library called `deprecated` that provides this feature but goal is to only have
 dependencies on packages already used in the Databricks runtime
-'''
+"""
+
+# create closure around function that follows use of the decorator

 def deprecated_decorator(func):
 @functools.wraps(func)
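The closure pattern mentioned in the new comment can be sketched end to end. This is an illustrative reconstruction in the spirit of the `deprecated` decorator in `dbldatagen/utils.py`, not the library's exact code; the warning text and category are assumptions:

```python
import functools
import warnings

def deprecated(message=""):
    """Sketch of a dependency-free `deprecated` decorator (illustrative,
    not the library's exact implementation)."""
    # create closure around the function that follows use of the decorator
    def deprecated_decorator(func):
        @functools.wraps(func)  # preserve __name__ and docstring of the wrapped function
        def wrapper(*args, **kwargs):
            warnings.warn(f"`{func.__name__}` is deprecated. {message}",
                          category=DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return deprecated_decorator

# Usage mirroring the change in data_generator.py above:
@deprecated("Use `useSeed` instead")
def use_seed(seed_val):
    return seed_val
```

Note how `functools.wraps` keeps the wrapped function's metadata intact, which is why the library can apply the decorator without breaking introspection or Sphinx documentation.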
@@ -156,7 +158,7 @@ def topologicalSort(sources, initial_columns=None, flatten=True):


 def parse_time_interval(spec):
-'''parse time interval from string'''
+"""parse time interval from string"""
 hours = 0
 minutes = 0
 weeks = 0
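The shape of `parse_time_interval` suggested by the accumulator variables above (weeks, hours, minutes, ...) can be sketched as follows. This is an assumption-laden sketch, not the library's parser: the accepted grammar (e.g. `"1 hours, 30 minutes"`) and the regex are illustrative only:

```python
import re
from datetime import timedelta

def parse_time_interval(spec):
    """Sketch: parse a time-interval string such as "1 hours, 30 minutes"
    into a timedelta. Unit names and grammar are assumptions; the library's
    actual parser may accept a different syntax."""
    units = {"weeks": 0, "days": 0, "hours": 0, "minutes": 0, "seconds": 0}
    # accumulate each "<number> <unit>" clause found in the spec string
    for amount, unit in re.findall(
            r"(\d+)\s*(week|day|hour|minute|second)s?", spec.lower()):
        units[unit + "s"] += int(amount)
    return timedelta(**units)

print(parse_time_interval("1 hours, 30 minutes"))  # → 1:30:00
```

Returning a `timedelta` keeps the result directly usable for date and timestamp range arithmetic.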
@@ -220,7 +222,7 @@ def parse_time_interval(spec):
 def split_list_matching_condition(lst, cond):
 """ Split a list on elements that match a condition

-This will find all matches of a specific condition in the list and split the list into sublists around the
+This will find all matches of a specific condition in the list and split the list into sub lists around the
 element that matches this condition.

 It will handle multiple matches performing splits on each match.
@@ -239,6 +241,7 @@ def split_list_matching_condition(lst, cond):
 :arg cond: lambda function or function taking single argument and returning True or False
 :returns: list of sublists
 """
+retval = []

 def match_condition(matchList, matchFn):
 """Return first index of element of list matching condition"""

@@ -251,9 +254,6 @@ def match_condition(matchList, matchFn):

 return -1

-# main code
-retval = []
-
 if lst is None:
 retval = lst
 elif len(lst) == 1:
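The behavior described in the `split_list_matching_condition` docstring can be sketched with an iterative version. The semantics assumed here (each matching element isolated as its own singleton sublist, runs of non-matching elements kept grouped, `None` passed through) are a plausible reading of the docstring, not necessarily identical to the library's recursive implementation:

```python
def split_list_matching_condition(lst, cond):
    """Sketch: split `lst` into sublists around elements matching `cond`.
    Assumed semantics: matching elements become singleton sublists;
    consecutive non-matching elements stay grouped together."""
    if lst is None:
        return lst
    result, current = [], []
    for item in lst:
        if cond(item):
            if current:          # close out the run of non-matching elements
                result.append(current)
                current = []
            result.append([item])  # matching element isolated as its own sublist
        else:
            current.append(item)
    if current:
        result.append(current)
    return result

print(split_list_matching_condition(
    ["id", "city_name", "id", "city_id", "city_pop"],
    lambda el: el == "id"))
```

This handles multiple matches by performing a split at each one, which matches the docstring's stated behavior.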
