Problems encountered when using sequence generators #2744

@PuJeff

Environment details

  • SDV version: 1.26
  • Python version: 3.12
  • Operating System: Linux

Problem description

I want to use the CPAR model provided by SDV (PARSynthesizer) to synthesize time series data. My real data has a long-tailed distribution: most values are 0 and the non-zero values are sparse. The model learns the share of 0 values quite well, but the generated non-zero values are always much smaller than in the real data. Replacing the 0 values with np.nan before training made the non-zero values somewhat larger.

However, A/B test results on the generated data do not match the real data. When I trained on the two groups ('a' and 'b') separately and ran an A/B test on the generated data, the difference was significant, although it is not significant in the real data. Sometimes the generated mean of 'a' was also greater than that of 'b', whereas in the real data the mean of 'a' is less than that of 'b'. When I later trained on the two groups together, the mean of 'a' was no longer greater than that of 'b', but the significance was still inconsistent with the real data.
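The real-versus-generated comparison described above can be made reproducible with a small helper. The issue does not say which significance test was used, so the sketch below assumes a two-sided permutation test on the difference in means, which makes no normality assumption and therefore suits zero-heavy, long-tailed values. The names `permutation_test` and `compare_conclusions`, the 0.05 threshold, and the toy inputs are illustrative assumptions, not part of SDV.

```python
import random
from statistics import fmean


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on mean(a) - mean(b).

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = fmean(a) - fmean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        # Count relabelings at least as extreme as the observed difference.
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


def compare_conclusions(real_a, real_b, synth_a, synth_b, alpha=0.05):
    """Check whether real and synthetic data lead to the same A/B conclusion."""
    real_diff, real_p = permutation_test(real_a, real_b)
    synth_diff, synth_p = permutation_test(synth_a, synth_b)
    return {
        "sign_matches": (real_diff > 0) == (synth_diff > 0),
        "significance_matches": (real_p < alpha) == (synth_p < alpha),
    }


# Toy data only: the synthetic groups are deliberately swapped.
real_a, real_b = [0.0] * 20, [10.0] * 20
synth_a, synth_b = [10.0] * 20, [0.0] * 20
print(compare_conclusions(real_a, real_b, synth_a, synth_b))
```

Running the same helper on the real and generated `value` columns for groups 'a' and 'b' would give one record per training run, which makes it easier to see whether the mismatch is in direction, in significance, or in both.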

What I already tried

My real data for group 'a' contains 198,110 values, of which 176,200 are 0; the remaining values range from 0.0001 to 35.9779. The real data for group 'b' contains 198,528 values, of which 176,553 are 0; the non-zero values range from 0.0001 to 75.3298.
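For reference, the counts above imply that roughly 89% of the values in each group are 0. The snippet below only redoes that arithmetic from the figures quoted in this issue; the dictionary layout and the helper name `zero_summary` are illustrative.

```python
# Counts quoted in the issue text for the two experiment groups.
counts = {
    "a": {"total": 198_110, "zeros": 176_200},
    "b": {"total": 198_528, "zeros": 176_553},
}


def zero_summary(total, zeros):
    """Return the number of non-zero values and the share of zeros."""
    return {"nonzero": total - zeros, "zero_share": zeros / total}


summary = {group: zero_summary(**c) for group, c in counts.items()}
for group, s in summary.items():
    print(f"{group}: {s['nonzero']} non-zero values, zero share {s['zero_share']:.4f}")
```

Computing the same two numbers on the generated data gives a quick check of whether the share of zeros is reproduced, separately from the size of the non-zero values.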

The exact code I am running:

import numpy as np
import pandas as pd
from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer

df = pd.read_csv("***********************************************")
df.rename(columns={"date": "timestamp", "guest_ID": "entity_id"}, inplace=True)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values(["entity_id", "timestamp"])

df["user_id"] = df.groupby("experiment_group")["entity_id"].transform(
    lambda x: pd.factorize(x)[0] + 1
)

df["user_id_unique"] = df["experiment_group"].astype(str) + "_" + df["user_id"].astype(str)
cols = list(df.columns)
if "experiment_group" in cols and "user_id_unique" in cols:
    exp_idx = cols.index("experiment_group")
    cols.remove("user_id_unique")
    cols.insert(exp_idx + 1, "user_id_unique")
    df = df[cols]

df = df.drop(columns=["user_id"])
df = df.drop(columns=["experiment_group"])
df = df.replace(0, np.nan)

cols_to_log = ['bv', 'rv', 'iv', 'session_time']
for col in cols_to_log:
    if col in df.columns:
        df[col] = np.log1p(df[col])

print(df["user_id_unique"].unique())
print(df)
print(df.dtypes)

metadata = Metadata.detect_from_dataframe(df)
metadata.update_column("user_id_unique", sdtype="categorical")
metadata.update_column("bin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("bv", sdtype="numerical", computer_representation='Float')
metadata.update_column("iin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("iv", sdtype="numerical", computer_representation='Float')
metadata.update_column("rin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("rv", sdtype="numerical", computer_representation='Float')
metadata.update_column("session_time", sdtype="numerical", computer_representation='Float')
metadata.update_column("timestamp", sdtype="datetime")
metadata.update_column("entity_id", sdtype="id")
metadata.set_sequence_key("entity_id")
metadata.set_sequence_index("timestamp")

synthesizer = PARSynthesizer(metadata, epochs=800, verbose=True, context_columns=["user_id_unique"])
synthesizer.fit(df)
synthesizer.save("*******************")
loss_values = synthesizer.get_loss_values()
loss_values.to_csv("loss.csv", index=False)

loaded_synthesizer = PARSynthesizer.load("*****************")

context_columns_df = pd.DataFrame(data={
    'user_id_unique': df["user_id_unique"].unique()
})

generated_data = loaded_synthesizer.sample_sequential_columns(
    sequence_length=22,
    context_columns=context_columns_df,
)

for col in cols_to_log:
    if col in generated_data.columns:
        generated_data[col] = np.expm1(generated_data[col])

generated_data = generated_data.replace(np.nan, 0)
print(generated_data)

generated_data["value"] = generated_data["bv"] + generated_data["iv"] + generated_data["rv"]
generated_data.to_csv("", index=False, float_format='%.6f')
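The preprocessing in the script (0 to NaN, then log1p on the columns in `cols_to_log`) and the postprocessing (expm1, then NaN to 0) are meant to be exact inverses of each other. The standard-library sketch below checks that on single values taken from the ranges quoted earlier. `to_model_space` and `from_model_space` are hypothetical helper names, not SDV functions. The last lines show one thing that may be worth checking on real output, as an assumption rather than a diagnosis: a sampled value below 0 in log space maps back to a negative number.

```python
import math


def to_model_space(x):
    """Mirror the script's preprocessing: 0 becomes NaN, other values get log1p."""
    return math.nan if x == 0 else math.log1p(x)


def from_model_space(y):
    """Mirror the script's postprocessing: expm1, with NaN mapped back to 0."""
    return 0.0 if math.isnan(y) else math.expm1(y)


# Round trip on the extremes of the ranges quoted in the issue.
for value in (0.0, 0.0001, 35.9779, 75.3298):
    restored = from_model_space(to_model_space(value))
    assert math.isclose(restored, value, rel_tol=1e-9, abs_tol=1e-12)

# A hypothetical sampled value below 0 in log space becomes negative.
negative_example = from_model_space(-0.5)
print(negative_example < 0)
```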
