-->

## Project Description
The `dbldatagen` Databricks Labs project is a Python library for generating synthetic data within the Databricks
environment using Spark. The generated data may be used for testing, benchmarking, demos, and many
other uses.

It operates by defining a data generation specification in code that controls
how the synthetic data is generated.
The specification may incorporate the use of existing schemas or create data in an ad-hoc fashion.

It has no dependencies on any libraries that are not already installed in the Databricks
runtime, and you can use it from Scala, R, or other languages by defining
a view over the generated data.

### Feature Summary
It supports:
* Generating synthetic data at scale up to billions of rows within minutes using appropriately sized clusters
* Generating repeatable, predictable data supporting the need to produce multiple tables, Change Data Capture,
merge and join scenarios with consistency between primary and foreign keys
* Generating synthetic data for all of the
Spark SQL supported primitive types as a Spark data frame which may be persisted,
saved to external storage, or
used in other computations
* Generating ranges of dates, timestamps, and numeric values
* Generation of discrete values - both numeric and text
* Generation of values at random and based on the values of other fields
(either based on the `hash` of the underlying values or the values themselves)
* Ability to specify a distribution for random data generation
* Generating arrays of values for ML-style feature arrays
* Applying weights to the occurrence of values
* Generating values to conform to a schema or independent of an existing schema
* Use of SQL expressions in synthetic data generation
* A plugin mechanism to allow use of 3rd party libraries such as Faker
* Use within a Databricks Delta Live Tables pipeline as a synthetic data generation source

Please refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for
details of use and many examples.

Release notes and details of the latest changes for this specific release
can be found in the GitHub repository
[here](https://github.com/databrickslabs/dbldatagen/blob/release/v0.3.3post2/CHANGELOG.md)

# Installation

Use `pip install dbldatagen` to install the PyPI package.

Within a Databricks notebook, invoke the following in a notebook cell:
```commandline
%pip install dbldatagen
```

The `pip install` command can be invoked within a Databricks notebook, a Delta Live Tables pipeline,
and even works on the Databricks Community Edition.

The [installation notes](https://databrickslabs.github.io/dbldatagen/public_docs/installation_notes.html) in the documentation
contain details of installation using alternative mechanisms.

## Compatibility
The Databricks Labs Data Generator framework can be used with Pyspark 3.1.2 and Python 3.8 or later. These are
compatible with the Databricks runtime 9.1 LTS and later releases.

Older prebuilt releases are tested against Pyspark 3.0.1 (compatible with the Databricks runtime 7.3 LTS).
See the Databricks runtime release notes for library compatibility:

- https://docs.databricks.com/release-notes/runtime/releases.html

When using the Databricks Labs Data Generator in Unity Catalog enabled environments, the Data Generator requires
the use of the `Single User` or `No Isolation Shared` access modes, as some needed features are not available in `Shared`
mode (for example, the use of 3rd party libraries). Depending on settings, the `Custom` access mode may be supported.

See the following documentation for more information:

Refer to the [online documentation](https://databrickslabs.github.io/dbldatagen/public_docs/index.html) for further
examples.

The GitHub repository also contains further examples in the examples directory.

## Spark and Databricks Runtime Compatibility
The `dbldatagen` package is intended to be compatible with recent LTS versions of the Databricks runtime, including
older LTS versions from 10.4 LTS onward. It also aims to be compatible with Delta Live Tables runtimes,
including `current` and `preview`.

While we don't specifically drop support for older runtimes, changes in Pyspark APIs or in
APIs from dependent packages such as `numpy`, `pandas`, `pyarrow`, and `pyparsing` may cause issues with older
runtimes.

By design, installing `dbldatagen` does not install releases of dependent packages, in order
to preserve the curated set of packages pre-installed in any Databricks runtime environment.

When building on local environments, the build process uses the `Pipfile` and requirements files to determine
the package versions for releases and unit tests.

## Project Support
Please note that all projects released under [`Databricks Labs`](https://www.databricks.com/learn/labs)
are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements
(SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket
relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as issues on the GitHub repo.
They will be reviewed as time permits, but there are no formal SLAs for support.