When generating array valued column generation spec, use different random seed for each element #178

@ronanstokes-db

Description

Expected Behavior

When a column generation spec produces an array-valued column, each array element should be generated with a different random seed.

Current Behavior

When multiple values are generated for the elements of an array, the current default random seed produces the same value for every element:

For example:

import dbldatagen as dg
from pyspark.sql.types import ArrayType, StringType

dataspec = dg.DataGenerator(spark, rows=10 * 1000000)

dataspec = (dataspec
            .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
            .withColumn("serial_number", "string", minValue=1000000, maxValue=10000000,
                        prefix="dr", random=True)
            .withColumn("email", "string", template=r'\\w.\\w@\\w.com', random=True, numColumns=5,
                        structType="array", omit=True)
            .withColumn("emails", ArrayType(StringType()), expr="slice(email, 1, (abs(hash(id)) % 4)+1)",
                        baseColumns=["email"])
            .withColumn("license_plate", "string", template=r'\\n-\\n')
            )
dfTestData = dataspec.build()

display(dfTestData)
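
The effect can be reproduced outside Spark. The following is a minimal plain-Python sketch, not dbldatagen's actual implementation: `make_array` and the `base_seed + index` derivation are hypothetical, chosen only to show why a shared seed gives identical elements and how a per-element seed could keep the output both varied and repeatable.

```python
import random


def make_array(base_seed, num_elements, per_element_seed):
    """Build one array of random integers (illustrative helper, not a dbldatagen API).

    per_element_seed=False mimics the reported behaviour: every element is
    drawn from a generator seeded with the same value.
    per_element_seed=True derives a distinct seed from the element index.
    """
    values = []
    for index in range(num_elements):
        # Hypothetical derivation: offset the base seed by the element index.
        seed = base_seed + index if per_element_seed else base_seed
        values.append(random.Random(seed).randint(0, 10**9))
    return values


shared = make_array(base_seed=42, num_elements=5, per_element_seed=False)
distinct = make_array(base_seed=42, num_elements=5, per_element_seed=True)

print(shared)    # every element is the same value
print(distinct)  # elements differ, and rerunning gives the same list
```

Because each element's seed depends only on the base seed and the element index, rerunning the sketch yields the same arrays, which is the repeatability that the workaround below gives up.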

Workaround

Set the randomSeed option to -1 on the array-valued column. The array elements then differ, but the generated data is no longer repeatable.

import dbldatagen as dg
from pyspark.sql.types import ArrayType, StringType

dataspec = dg.DataGenerator(spark, rows=10 * 1000000)

dataspec = (dataspec
            .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
            .withColumn("serial_number", "string", minValue=1000000, maxValue=10000000,
                        prefix="dr", random=True)
            .withColumn("email", "string", template=r'\\w.\\w@\\w.com', random=True, numColumns=5,
                        structType="array", omit=True, randomSeed=-1)
            .withColumn("emails", ArrayType(StringType()), expr="slice(email, 1, (abs(hash(id)) % 4)+1)",
                        baseColumns=["email"])
            .withColumn("license_plate", "string", template=r'\\n-\\n')
            )
dfTestData = dataspec.build()

display(dfTestData)
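
The trade-off in this workaround can be illustrated in plain Python. This sketch assumes that randomSeed=-1 behaves like an unseeded generator, as the note about repeatability suggests; `unseeded_array` is a hypothetical helper, not part of dbldatagen.

```python
import random


def unseeded_array(num_elements):
    """Draw array elements from a generator with no fixed seed (illustrative only)."""
    rng = random.Random()  # no seed argument: initialised from system entropy
    return [rng.randint(0, 10**9) for _ in range(num_elements)]


first_run = unseeded_array(5)
second_run = unseeded_array(5)

print(len(set(first_run)) > 1)    # whether the elements within one array vary
print(first_run == second_run)    # whether two runs reproduce the same data
```

The elements within an array vary, but two runs do not reproduce the same data, which is why a distinct fixed seed per element would be preferable.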

Context

Your Environment

  • dbldatagen version used:
  • Databricks Runtime version:
  • Cloud environment used:
