33 changes: 33 additions & 0 deletions examples/tartan_federer_attack/README.md
@@ -0,0 +1,33 @@
# Tartan Federer Attack

This example runs a Tartan–Federer membership inference attack using trained TabDDPM models. The pipeline optionally performs a data processing step to prepare population datasets and then executes the attack using a binary classifier.

## Data Processing

The data processing step constructs population datasets representing the real data available to the attacker.

A selected subset of `train_with_id.csv` files is collected from `tabddpm_1` to `tabddpm_6` located under:

```
examples/tartan_federer_attack/tabddpm_trained_with_20k/tabddpm_white_box
```

For each selected model, both `train_with_id.csv` and `challenge_with_id.csv` are loaded. All training datasets are merged into a single dataframe and all challenge datasets are merged into a single dataframe. Any training samples that also appear in the challenge dataset are removed, and duplicate samples are dropped based on configured identifier columns.
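
As a minimal sketch of this step (assuming pandas, the directory layout above, and the identifier columns from the configuration below), the merge and anti-join look roughly like the following; the actual implementation is `prepare_population_dataset_for_attack` in `run_attack.py`:

```python
from pathlib import Path

import pandas as pd

base_dir = Path("examples/tartan_federer_attack/tabddpm_trained_with_20k/tabddpm_white_box")
dedup_columns = ["trans_id", "balance"]
model_indices = [1, 2]  # indices of the models whose training data forms the population

# Merge the per-model training and challenge splits and drop duplicates on the identifier columns.
train_df = pd.concat(
    [pd.read_csv(base_dir / f"tabddpm_{i}" / "train_with_id.csv") for i in model_indices], ignore_index=True
).drop_duplicates(subset=dedup_columns)
challenge_df = pd.concat(
    [pd.read_csv(base_dir / f"tabddpm_{i}" / "challenge_with_id.csv") for i in model_indices], ignore_index=True
).drop_duplicates(subset=dedup_columns)

# Remove any training rows that also appear in the challenge data (anti-join on the identifier columns).
population_df = train_df[
    ~train_df.set_index(dedup_columns).index.isin(challenge_df.set_index(dedup_columns).index)
]
```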

The model indices used to build the population datasets for training and validation are specified in the configuration file:

```yaml
data_processing_config:
population_attack_indices_to_collect_for_training: [1, 2]
population_attack_indices_to_collect_for_validation: [3, 4]
model_type: tabddpm
columns_for_deduplication: ['trans_id', 'balance']
```

## Running the Attack

Before running the attack, activate your virtual environment and update `configs/experiment_config.yaml` as needed. From the top-level directory of the library, run:

```bash
python -m examples.tartan_federer_attack.run_attack
```
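
Since `run_attack.py` is a Hydra entry point, individual configuration values can also be overridden on the command line instead of editing the file. For example, to skip the data processing step once the population data has already been generated (standard Hydra override syntax; the key matches `pipeline.run_data_processing` in the config):

```bash
python -m examples.tartan_federer_attack.run_attack pipeline.run_data_processing=false
```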
47 changes: 47 additions & 0 deletions examples/tartan_federer_attack/configs/experiment_config.yaml
@@ -0,0 +1,47 @@
# Tartan-Federer experiment configuration
Collaborator: This should be Tartan-Federer instead 🙂

# This config can be used to run both the Tartan-Federer attack training (``run_attack.py``) and testing phases (``test_attack_model.py``).
base_experiment_dir: examples/tartan_federer_attack/tabddpm_20k_experiment_data # Processed data and experiment artifacts will be stored here
base_data_config_dir: examples/tartan_federer_attack/data_configs # Training and data type configs are saved under this directory

# Pipeline control
pipeline:
run_data_processing: true


# Data paths
data_paths:
midst_data_path: examples/tartan_federer_attack/tabddpm_20k_experiment_data/tabddpm_white_box # Used to collect the data
population_data_path: ${base_experiment_dir}/population_data # Path where the processed population data will be stored
metadata_dir: ${base_experiment_dir}/data_configs # Path where metadata about the experiment will be stored


# Dataset specific information used for processing in this example
data_processing_config:
  population_attack_indices_to_collect_for_training: [1, 2]
  population_attack_indices_to_collect_for_validation: [3, 4]
model_type: tabddpm
columns_for_deduplication: ['trans_id', 'balance']


attack_config:
model_type: tabddpm
models_base_dir: examples/tartan_federer_attack/tabddpm_20k_experiment_data/tabddpm_white_box
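  # "." points the attack at the target models themselves (white-box); per the review discussion in this PR, pointing this at shadow models trained on synthetic data would make the attack effectively black-box.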
target_model_subdir: "."
samples_per_train_model: 3000
sample_per_val_model: 10
num_noise_per_time_step: 30
timesteps: [5, 10, 15]
additional_timesteps: [0]
predictions_file_format: challenge_label_predictions
results_path: ${base_experiment_dir}/tartan_federer_attack_results
test_indices: [3]
train_indices: [1]
val_indices: [2]
columns_for_deduplication: [trans_id, balance]

classifier_config:
hidden_dim: 100
learning_rate: 1e-4
num_epochs: 200
162 changes: 162 additions & 0 deletions examples/tartan_federer_attack/run_attack.py
@@ -0,0 +1,162 @@
import os
from logging import INFO
from pathlib import Path
from typing import Any, cast

import hydra
import pandas as pd
from omegaconf import DictConfig, OmegaConf

from midst_toolkit.attacks.tartan_federer.tartan_federer_attack import tartan_federer_attack
from midst_toolkit.common.logger import log
from midst_toolkit.common.random import set_all_random_seeds, unset_all_random_seeds


def prepare_population_dataset_for_attack(
model_indices: list[int],
model_type: str,
models_base_dir: Path,
columns_for_deduplication: list[str],
) -> pd.DataFrame:
"""
Prepares data for an attack by merging and deduplicating datasets.
Args:
model_indices: List of model indices over which to iterate and for which to gather information.
model_type: Name of the model type for which we're loading data.
models_base_dir: Where the various models' data lives.
columns_for_deduplication: Names of columns to use in de-duplicating the dataframes
Raises:
ValueError: Throws if the list of model indices is empty.
ValueError: Throws if any of the dataframes to be de-duplicated do not have the specified columns in
``columns_for_deduplication``
Returns:
A DataFrame containing the merged trainig data that has been deduplicated and is free from challenge data.
"""
if len(model_indices) == 0:
raise ValueError("The 'indices' list is empty. Please provide indices to process datasets.")

df_merge_list = []
df_challenge_list = []

for model_index in model_indices:
base_path = models_base_dir / f"{model_type}_{model_index}"
df_merge_list.append(pd.read_csv(os.path.join(base_path, "train_with_id.csv")))
df_challenge_list.append(pd.read_csv(os.path.join(base_path, "challenge_with_id.csv")))

df_merge = pd.concat(df_merge_list, ignore_index=True)
df_challenge = pd.concat(df_challenge_list, ignore_index=True)
    # Ensure all keys for deduplication exist in both DataFrames before attempting to deduplicate,
    # so that a missing column raises the documented ValueError rather than a KeyError from pandas.
    missing_keys_merge = [key for key in columns_for_deduplication if key not in df_merge.columns]
    missing_keys_challenge = [key for key in columns_for_deduplication if key not in df_challenge.columns]
    if missing_keys_merge or missing_keys_challenge:
        raise ValueError(f"Missing columns for deduplication: {missing_keys_merge + missing_keys_challenge}")

    # Deduplicate the datasets once
    df_merge = df_merge.drop_duplicates(subset=columns_for_deduplication)
    df_challenge = df_challenge.drop_duplicates(subset=columns_for_deduplication)

# Remove challenge entries from the merged dataset
return df_merge[
~df_merge.set_index(columns_for_deduplication).index.isin(
df_challenge.set_index(columns_for_deduplication).index
)
]


def run_data_processing(config: dict[str, Any]) -> None:
"""
Run the data processing pipeline for the Tartan–Federer attack example.
This function prepares the population datasets required for training and validating the attack.
Args:
config: Attack configuration as an OmegaConf DictConfig object.
"""
log(INFO, "Running data processing pipeline...")

    # Ensure the output directory for the processed population data exists
Path(config["data_paths"]["population_data_path"]).mkdir(parents=True, exist_ok=True)

population_data_for_training_attack = prepare_population_dataset_for_attack(
model_indices=config["data_processing_config"]["population_attack_indices_to_collect_for_training"],
model_type=config["data_processing_config"]["model_type"],
models_base_dir=Path(config["data_paths"]["midst_data_path"]),
columns_for_deduplication=config["data_processing_config"]["columns_for_deduplication"],
)

population_data_for_training_attack.to_csv(
Path(config["data_paths"]["population_data_path"]) / "population_dataset_for_training_attack.csv",
index=False,
)

population_data_for_validating_attack = prepare_population_dataset_for_attack(
model_indices=config["data_processing_config"]["population_attack_indices_to_collect_for_validation"],
model_type=config["data_processing_config"]["model_type"],
models_base_dir=Path(config["data_paths"]["midst_data_path"]),
columns_for_deduplication=config["data_processing_config"]["columns_for_deduplication"],
)

population_data_for_validating_attack.to_csv(
Path(config["data_paths"]["population_data_path"]) / "population_dataset_for_validating_attack.csv",
index=False,
)

log(INFO, "Data processing pipeline finished.")


@hydra.main(config_path="configs", config_name="experiment_config", version_base=None)
def run_attack(config: DictConfig) -> None:
"""
Run the Tartan–Federer attack example pipeline.
Args:
config: Attack configuration as an OmegaConf DictConfig object.
"""
log(INFO, "Running Tartan–Federer attack...")

set_all_random_seeds(
seed=133742,
use_deterministic_torch_algos=True,
disable_torch_benchmarking=True,
)

cfg = cast(dict[str, Any], OmegaConf.to_container(config, resolve=True))

if config["pipeline"]["run_data_processing"]:
run_data_processing(cfg)

data_cfg = cfg["data_paths"]
attack_cfg = cfg["attack_config"]
classifier_cfg = cfg["classifier_config"]

mia_performance_train, mia_performance_val, mia_performance_test = tartan_federer_attack(
Collaborator: Just curious, I am missing where the user can define the type of the attack (white-box or black-box). Do we give this option to the user? This example seems to be black-box, but I might be wrong. Adding this information to the README would also be great!

Collaborator Author (@bzamanlooy, Dec 16, 2025): So there is really not a difference. If we set the target_model_subdir to be the one containing the shadow model trained using synthetic data, the attack is going to be black box essentially, and we currently do not have them trained using the toolkit, so perhaps I can add that as a todo as well for Elahe and update the code then.

Collaborator: Got it! So it depends on what is in the config.target_shadow_model_subdir directory. If there are trained shadow models there, it would be black-box, but in this config it is set to target_shadow_model_subdir: ".", so it will be white-box. It wasn't entirely clear to me at first, but I suppose users will already be aware of this.

Collaborator: Or you can optionally note that in a comment in experiment_config.yaml.

Collaborator: Nitpicking comment (sorry): it might be a bit more readable if the inputs followed the function signature order.
model_type=attack_cfg["model_type"],
model_data_dir=Path(attack_cfg["models_base_dir"]),
target_model_subdir=Path(attack_cfg["target_model_subdir"]),
samples_per_train_model=attack_cfg["samples_per_train_model"],
sample_per_val_model=attack_cfg["sample_per_val_model"],
num_noise_per_time_step=attack_cfg["num_noise_per_time_step"],
timesteps=attack_cfg["timesteps"],
additional_timesteps=attack_cfg["additional_timesteps"],
predictions_file_format=attack_cfg["predictions_file_format"],
results_path=Path(attack_cfg["results_path"]),
test_indices=attack_cfg["test_indices"],
train_indices=attack_cfg["train_indices"],
val_indices=attack_cfg["val_indices"],
columns_for_deduplication=attack_cfg["columns_for_deduplication"],
classifier_hidden_dim=classifier_cfg["hidden_dim"],
classifier_num_epochs=classifier_cfg["num_epochs"],
classifier_learning_rate=classifier_cfg["learning_rate"],
meta_dir=Path(config["data_paths"]["metadata_dir"]),
population_data_dir=Path(data_cfg["population_data_path"]),
)

unset_all_random_seeds()

log(INFO, "Attack finished successfully.")


if __name__ == "__main__":
run_attack()
51 changes: 0 additions & 51 deletions src/midst_toolkit/attacks/tartan_federer/data_utils.py
@@ -157,57 +157,6 @@ def save_results_and_plot_roc_curve(
log(INFO, f"✅ All runs completed. Results saved to {results_summary_path}")


def prepare_population_dataset_for_attack(
model_indices: list[int], model_type: str, models_base_dir: Path, columns_for_deduplication: list[str]
) -> pd.DataFrame:
"""
Prepares data for an attack by merging and deduplicating datasets.

Args:
model_indices: List of model indices over which to iterate and for which to gather information.
model_type: Name of the model type for which we're loading data.
models_base_dir: Where the various models' data lives.
columns_for_deduplication: Names of columns to use in de-duplicating the dataframes

Raises:
ValueError: Throws if the list of model indices is empty.
ValueError: Throws if any of the dataframes to be de-duplicated do not have the specified columns in
``columns_for_deduplication``

Returns:
A DataFrame containing the merged trainig data that has been deduplicated and is free from challenge data.
"""
if len(model_indices) == 0:
raise ValueError("The 'indices' list is empty. Please provide indices to process datasets.")

df_merge_list = []
df_challenge_list = []

for model_index in model_indices:
base_path = models_base_dir / f"{model_type}_{model_index}"
df_merge_list.append(pd.read_csv(os.path.join(base_path, "train_with_id.csv")))
df_challenge_list.append(pd.read_csv(os.path.join(base_path, "challenge_with_id.csv")))

df_merge = pd.concat(df_merge_list, ignore_index=True)
df_challenge = pd.concat(df_challenge_list, ignore_index=True)
# Deduplicate the datasets once
df_merge = df_merge.drop_duplicates(subset=columns_for_deduplication)
df_challenge = df_challenge.drop_duplicates(subset=columns_for_deduplication)

# Ensure all keys for deduplication exist in both DataFrames
missing_keys_merge = [key for key in columns_for_deduplication if key not in df_merge.columns]
missing_keys_challenge = [key for key in columns_for_deduplication if key not in df_challenge.columns]
if missing_keys_merge or missing_keys_challenge:
raise ValueError(f"Missing columns for deduplication: {missing_keys_merge + missing_keys_challenge}")

# Remove challenge entries from the merged dataset
return df_merge[
~df_merge.set_index(columns_for_deduplication).index.isin(
df_challenge.set_index(columns_for_deduplication).index
)
]


def evaluate_attack_performance(
model_indices: list[int],
description: str,