Problems encountered when using sequence generators #2744

### Environment details

* SDV version: 1.26
* Python version: 3.12
* Operating System: Linux

### Problem description

I want to use the CPAR model provided by SDV (`PARSynthesizer`) to synthesize time series data. My real data has a long-tailed distribution: a large proportion of the values are 0 and the non-zero values are sparse. The model now reproduces the share of 0 values quite well, but the generated non-zero values are always much smaller than the real ones. Replacing the 0 values with `np.nan` before training improved the size of the non-zero values somewhat.

The remaining problem is statistical consistency between two groups, 'a' and 'b'. When I trained a separate model on each group and ran an A/B test on the generated data, the difference was significant, although it is not significant in the real data. Sometimes the generated mean of 'a' was also greater than that of 'b', whereas in the real data the mean of 'a' is less than that of 'b'. When I later trained one model on both groups together, the mean of 'a' was no longer greater than that of 'b', but the significance result was still inconsistent with the real data.
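A minimal sketch of the kind of comparison described above, applied once to the real groups and once to the generated ones. It is an illustration only, not the exact test behind the results in this issue: the helper names `summarize` and `permutation_pvalue` and the toy data are hypothetical, and the permutation test treats every value as an independent observation, which is a simplification for per-user sequences.

```python
import numpy as np


def summarize(values):
    """Return the zero share, the mean of the non-zero values, and the overall mean."""
    values = np.asarray(values, dtype=float)
    nonzero = values[values != 0]
    return {
        "zero_share": float(np.mean(values == 0)),
        "nonzero_mean": float(nonzero.mean()) if nonzero.size else float("nan"),
        "mean": float(values.mean()),
    }


def permutation_pvalue(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for the difference in means, mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly 0.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy zero-inflated data standing in for the real (or generated) groups.
rng = np.random.default_rng(42)


def toy_group(n, zero_share, scale):
    values = rng.exponential(scale, size=n)
    values[rng.random(n) < zero_share] = 0.0
    return values


group_a = toy_group(5000, 0.89, 1.0)
group_b = toy_group(5000, 0.89, 1.2)

print(summarize(group_a))
print(summarize(group_b))
diff, p = permutation_pvalue(group_a, group_b, n_permutations=500)
print(f"mean(a) - mean(b) = {diff:.4f}, p = {p:.4f}")
```

Running the same two functions on the real and the generated data for each group makes it visible whether the disagreement comes from the zero share, from the size of the non-zero values, or from both.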

### What I already tried

My real data 'a' contains 198,110 values, of which 176,200 are 0; the remaining values range from 0.0001 to 35.9779. Real data 'b' contains 198,528 values, of which 176,553 are 0; the non-zero values range from 0.0001 to 75.3298.
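As a quick check of how sparse the two groups are, the counts above can be turned into zero shares. The snippet uses only the numbers quoted in this issue; the `counts` dictionary and the `zero_share` helper are names introduced here for illustration.

```python
# Counts quoted in the issue text for the two real groups.
counts = {
    "a": {"total": 198_110, "zeros": 176_200},
    "b": {"total": 198_528, "zeros": 176_553},
}


def zero_share(total, zeros):
    """Fraction of values that are exactly 0."""
    return zeros / total


for group, c in counts.items():
    nonzero = c["total"] - c["zeros"]
    print(f"{group}: {nonzero} non-zero values, zero share = {zero_share(**c):.4f}")
```

Both groups come out at roughly 89% zeros, so any group difference rests on about 22,000 non-zero values per group.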

```python
import numpy as np
import pandas as pd
from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer

# Load the data and standardize the column names.
df = pd.read_csv("***********************************************")
df.rename(columns={"date": "timestamp", "guest_ID": "entity_id"}, inplace=True)
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.sort_values(["entity_id", "timestamp"])

# Number the users within each experiment group, starting at 1.
df["user_id"] = df.groupby("experiment_group")["entity_id"].transform(
    lambda x: pd.factorize(x)[0] + 1
)

# Build a context value of the form "<experiment_group>_<user_id>" and
# place it right after the experiment_group column.
df["user_id_unique"] = df["experiment_group"].astype(str) + "_" + df["user_id"].astype(str)
cols = list(df.columns)
if "experiment_group" in cols and "user_id_unique" in cols:
    exp_idx = cols.index("experiment_group")
    cols.remove("user_id_unique")
    cols.insert(exp_idx + 1, "user_id_unique")
    df = df[cols]

df = df.drop(columns=["user_id"])
df = df.drop(columns=["experiment_group"])

# Treat every 0 (in all columns) as missing so the model learns the
# non-zero values separately.
df = df.replace(0, np.nan)

# Log-transform the long-tailed value columns; NaN stays NaN.
cols_to_log = ['bv', 'rv', 'iv', 'session_time']
for col in cols_to_log:
    if col in df.columns:
        df[col] = np.log1p(df[col])

print(df["user_id_unique"].unique())
print(df)
print(df.dtypes)

# Describe the columns for SDV.
metadata = Metadata.detect_from_dataframe(df)
metadata.update_column("user_id_unique", sdtype="categorical")
metadata.update_column("bin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("bv", sdtype="numerical", computer_representation='Float')
metadata.update_column("iin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("iv", sdtype="numerical", computer_representation='Float')
metadata.update_column("rin", sdtype="numerical", computer_representation='Int64')
metadata.update_column("rv", sdtype="numerical", computer_representation='Float')
metadata.update_column("session_time", sdtype="numerical", computer_representation='Float')
metadata.update_column("timestamp", sdtype="datetime")
metadata.update_column("entity_id", sdtype="id")
metadata.set_sequence_key("entity_id")
metadata.set_sequence_index("timestamp")

# Train the model and keep the loss curve.
synthesizer = PARSynthesizer(metadata, epochs=800, verbose=True, context_columns=["user_id_unique"])
synthesizer.fit(df)
synthesizer.save("*******************")
loss_values = synthesizer.get_loss_values()
loss_values.to_csv("loss.csv", index=False)

loaded_synthesizer = PARSynthesizer.load("*****************")

# Generate one sequence of length 22 for every context value seen in training.
context_columns_df = pd.DataFrame(data={
    'user_id_unique': df["user_id_unique"].unique()
})

generated_data = loaded_synthesizer.sample_sequential_columns(
    sequence_length=22,
    context_columns=context_columns_df,
)

# Undo the log transform, then map missing values back to 0.
for col in cols_to_log:
    if col in generated_data.columns:
        generated_data[col] = np.expm1(generated_data[col])

generated_data = generated_data.replace(np.nan, 0)
print(generated_data)

generated_data["value"] = generated_data["bv"] + generated_data["iv"] + generated_data["rv"]
generated_data.to_csv("", index=False, float_format='%.6f')
```
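The zero masking and log transform in the code above form a pair of inverse steps, and it can help to confirm on a small sample that the round trip is lossless before attributing differences to the model. The sketch below assumes pandas and NumPy; `encode` and `decode` are hypothetical helper names, and unlike the script above they only touch the listed value columns rather than every column.

```python
import numpy as np
import pandas as pd


def encode(df, value_cols):
    """Mask zeros as NaN and apply log1p to the listed columns."""
    out = df.copy()
    for col in value_cols:
        out[col] = np.log1p(out[col].astype(float).replace(0, np.nan))
    return out


def decode(df, value_cols):
    """Invert encode: apply expm1, then map NaN back to 0."""
    out = df.copy()
    for col in value_cols:
        out[col] = np.expm1(out[col]).fillna(0)
    return out


# Toy frame using the value range quoted in the issue for group 'a'.
toy = pd.DataFrame({"bv": [0.0, 0.0001, 35.9779, 0.0]})
encoded = encode(toy, ["bv"])
restored = decode(encoded, ["bv"])

print(encoded)
print(restored)
print("round trip ok:", np.allclose(toy["bv"], restored["bv"]))
```

One caveat of this encoding is that a value that was genuinely missing in the input also comes back as 0 after `decode`, so it is only lossless when the original columns contain no NaN.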
