Description
Expected Behavior
When generating array valued column generation spec, use different random seed for each element
Current Behavior
When generating multiple values for array elements, the current default random seed produces the same value for every element of the array.
For example:
import dbldatagen as dg
from pyspark.sql.types import ArrayType, StringType
dataspec = dg.DataGenerator(spark, rows=10 * 1000000)
dataspec = (dataspec
    .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
    .withColumn("serial_number", "string", minValue=1000000, maxValue=10000000,
                prefix="dr", random=True)
    .withColumn("email", "string", template=r'\\w.\\w@\\w.com', random=True, numColumns=5,
                structType="array", omit=True)
    .withColumn("emails", ArrayType(StringType()), expr="slice(email, 1, (abs(hash(id)) % 4)+1)",
                baseColumns=["email"])
    .withColumn("license_plate", "string", template=r'\\n-\\n')
)
dfTestData = dataspec.build()
display(dfTestData)
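The effect can be illustrated outside Spark with a plain-Python sketch (this is an analogy, not dbldatagen's actual internals): if each array element is generated from a generator re-initialized with the same fixed seed, every element comes out identical.

```python
import random

def gen_array_same_seed(seed, num_elements):
    """Illustrative only: reusing one fixed seed per element
    yields the same value for every element of the 'array'."""
    values = []
    for _ in range(num_elements):
        rng = random.Random(seed)  # fresh generator, same seed each time
        values.append(rng.randint(0, 10**6))
    return values

arr = gen_array_same_seed(42, 5)
# every element of arr is the same value
```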
Workaround
Add the randomSeed option with a value of -1 to the array-valued column; however, the data generation is then not repeatable.
import dbldatagen as dg
from pyspark.sql.types import ArrayType, StringType
dataspec = dg.DataGenerator(spark, rows=10 * 1000000)
dataspec = (dataspec
    .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
    .withColumn("serial_number", "string", minValue=1000000, maxValue=10000000,
                prefix="dr", random=True)
    .withColumn("email", "string", template=r'\\w.\\w@\\w.com', random=True, numColumns=5,
                structType="array", omit=True, randomSeed=-1)
    .withColumn("emails", ArrayType(StringType()), expr="slice(email, 1, (abs(hash(id)) % 4)+1)",
                baseColumns=["email"])
    .withColumn("license_plate", "string", template=r'\\n-\\n')
)
dfTestData = dataspec.build()
display(dfTestData)
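A sketch of the behavior requested under Expected Behavior, again in plain Python rather than dbldatagen's internals: derive a distinct but deterministic seed per element (here, hypothetically, by combining a base seed with the element index), so values vary within the array yet regenerate identically across runs.

```python
import random

def gen_array_per_element_seed(base_seed, num_elements):
    """Illustrative only: a per-element seed derived from the base seed
    gives varied elements while keeping generation repeatable."""
    return [
        # string seeds hash deterministically in random.Random
        random.Random(f"{base_seed}:{i}").randint(0, 10**6)
        for i in range(num_elements)
    ]

run1 = gen_array_per_element_seed(42, 5)
run2 = gen_array_per_element_seed(42, 5)
# run1 == run2 (repeatable), yet run1 contains multiple distinct values
```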
Context
Your Environment
- dbldatagen version used:
- Databricks Runtime version:
- Cloud environment used: