SynEval: Synthetic Data Evaluation Framework

SynEval is a comprehensive evaluation framework for assessing the quality of synthetic data generated by Large Language Models (LLMs). The framework provides quantitative scoring across four key dimensions:

  • Fidelity: Measures how well the synthetic data preserves the statistical properties and patterns of the original data
  • Utility: Evaluates the usefulness of synthetic data for downstream tasks
  • Diversity: Assesses the variety and uniqueness of the generated data
  • Privacy: Analyzes the privacy protection level of the synthetic data

Installation

  1. Clone the repository:
git clone https://github.com/SCU-TrustworthyAI/SynEval.git
cd SynEval
  2. (Optional) Create and activate a conda virtual environment:
conda create -n syneval python=3.10
conda activate syneval
  3. Prepare environment (one command):
python prepare_environment.py

This script will automatically:

  • Install all required Python packages from requirements.txt (with dependency conflict resolution)
  • Download required NLTK data packages (including punkt_tab)
  • Create necessary directories (plots/)
  • Test the installation

Note: You may see dependency conflict warnings during installation. This is normal in environments like Google Colab or when other packages are already installed. These conflicts won't affect SynEval functionality.

For a clean installation without conflicts, consider using a virtual environment:

conda create -n syneval python=3.10
conda activate syneval
python prepare_environment.py

Quick Start

Running SynEval Demo on Google Colab

The easiest way to get started with SynEval is to run the demo notebook on Google Colab:

  1. Open the Demo Notebook:

    • Navigate to SynEval_Demo.ipynb in the repository
    • Click "Open in Colab" or upload the notebook to Google Colab
  2. Install Dependencies:

    • The notebook includes a setup cell that automatically installs all required dependencies
    • Run the setup cell to prepare the environment
  3. Run the Demo:

    • The notebook provides step-by-step examples of running SynEval evaluations
    • Includes sample data and metadata for testing
    • Demonstrates all four evaluation dimensions (fidelity, utility, diversity, privacy)
  4. View Results:

    • Evaluation results are displayed inline with detailed explanations
    • Plots are generated and shown directly in the notebook
    • Results are also saved to files for further analysis

Command Line Usage

After installation, you can use SynEval from the command line:

python run.py \
    --synthetic synthetic_data.csv \
    --original original_data.csv \
    --metadata metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output results.json \
    --plot

Requirements

  • Python 3.10+
  • pandas
  • Additional dependencies will be installed automatically

Usage

Running Evaluations

The main entry point for the framework is run.py. This script allows you to evaluate synthetic data against original data using various metrics.

Basic Usage

Here's an example of running the evaluation framework:

python run.py \
    --synthetic ../Data/claude.csv \
    --original ../Data/real_10k.csv \
    --metadata ../Data/metadata.json \
    --dimensions fidelity \
    --utility-input text \
    --utility-output rating

The general command format is:

python run.py --synthetic <synthetic_data.csv> --original <original_data.csv> --metadata <metadata.json> [evaluation_flags] [--output <results.json>]

Required Arguments

  • --synthetic: Path to the synthetic data CSV file
  • --original: Path to the original data CSV file
  • --metadata: Path to the metadata JSON file

Evaluation Flags

You can select one or more evaluation dimensions to run:

  • --fidelity: Run fidelity evaluation
  • --utility: Run utility evaluation
  • --diversity: Run diversity evaluation
  • --privacy: Run privacy evaluation

Optional Arguments

  • --output: Path to save evaluation results in JSON format. If not specified, results will be printed to stdout.
  • --plot: Generate plots for all evaluation metrics and save them to the ./plots directory. Plots will visualize key metrics from fidelity, utility, diversity, and privacy evaluations.

General Example

python run.py \
    --synthetic synthetic_hotel_data.csv \
    --original original_hotel_data.csv \
    --metadata hotel_metadata.json \
    --fidelity \
    --utility \
    --output evaluation_results.json \
    --plot

Metadata Format

The metadata file should be a JSON file that describes the structure of your data. It should include:

  1. Column names and their types
  2. Dataset name
  3. Primary key information

Example metadata format:

{
  "columns": {
    "_id": {
      "sdtype": "numerical",
      "pii": false,
      "is_primary_key": true
    },
    "rating": {
      "sdtype": "categorical",
      "values": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    "title": {
      "sdtype": "text"
    },
    "text": {
      "sdtype": "text"
    },
    "asin": {
      "sdtype": "numerical",
      "pii": true
    },
    "parent_asin": {
      "sdtype": "numerical",
      "pii": true
    },
    "user_id": {
      "sdtype": "numerical",
      "pii": true
    },
    "timestamp": {
      "sdtype": "datetime"
    },
    "helpful_vote": {
      "sdtype": "numerical"
    },
    "verified_purchase": {
      "sdtype": "boolean",
      "values": [true, false]
    }
  },
  "text_columns": ["title", "text"],
  "utility": {
    "input_columns": ["text"],
    "output_columns": ["rating"],
    "task_type": "classification"
  }
}

Evaluation Results Description

Fidelity Evaluation (fidelity.py)

Fidelity measures how well the synthetic data preserves the statistical properties and patterns of the original data. This evaluation uses both SDV (Synthetic Data Vault) metrics and custom statistical analysis.

Diagnostic Metrics (SDV-based)

Data Type: Structured data only

  • Data Validity: Measures the percentage of valid data in the synthetic dataset (0-1 scale)

    • Algorithm: SDV's diagnostic evaluation checks for data type consistency, missing value patterns, and constraint violations
    • Score Calculation: Number of rows that pass all validity checks divided by the total number of rows
    • Interpretation: Higher scores indicate better data quality and adherence to original data constraints
  • Data Structure: Evaluates how well the synthetic data maintains the structural relationships of the original data (0-1 scale)

    • Algorithm: SDV analyzes primary key uniqueness, foreign key relationships, and referential integrity
    • Score Calculation: Weighted average of structural constraint compliance scores
    • Interpretation: Higher scores indicate better preservation of data relationships and constraints
  • Overall Score: Combined diagnostic score indicating overall data quality

    • Algorithm: Weighted average of Data Validity and Data Structure scores
    • Score Calculation: (Data Validity × 0.6) + (Data Structure × 0.4)
    • Interpretation: Comprehensive measure of basic data quality and structural integrity

Quality Metrics (SDV-based)

Data Type: Structured data only

  • Column Shapes: Measures how well the synthetic data preserves the distribution shapes of individual columns (0-1 scale)

    • Algorithm: SDV uses statistical tests (Kolmogorov-Smirnov for continuous, Chi-square for categorical) to compare distributions
    • Score Calculation: Average of distribution similarity scores across all columns, normalized to 0-1 scale
    • Interpretation: Higher scores indicate better preservation of individual column distributions
  • Column Pair Trends: Evaluates the preservation of relationships between column pairs (0-1 scale)

    • Algorithm: SDV analyzes correlation coefficients, mutual information, and conditional distributions between column pairs
    • Score Calculation: Average of pairwise relationship preservation scores across all column combinations
    • Interpretation: Higher scores indicate better preservation of inter-column relationships and correlations
  • Overall Quality Score: Combined quality score for statistical fidelity

    • Algorithm: Weighted average of Column Shapes and Column Pair Trends
    • Score Calculation: (Column Shapes × 0.7) + (Column Pair Trends × 0.3)
    • Interpretation: Comprehensive measure of statistical fidelity and relationship preservation
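
Both score groups above come from SDV's single-table evaluation API. The snippet below is a minimal standalone sketch of those calls (assuming SDV 1.x); the file paths are placeholders, and the metadata is auto-detected here rather than built from metadata.json. SynEval then combines the reported property scores using the weightings listed above.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality

original = pd.read_csv("original_data.csv")      # placeholder paths
synthetic = pd.read_csv("synthetic_data.csv")

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(original)

# Diagnostic report: Data Validity and Data Structure
diagnostic = run_diagnostic(real_data=original, synthetic_data=synthetic, metadata=metadata)
print(diagnostic.get_score())

# Quality report: Column Shapes and Column Pair Trends
quality = evaluate_quality(real_data=original, synthetic_data=synthetic, metadata=metadata)
print(quality.get_score())
print(quality.get_details(property_name="Column Shapes"))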

Numerical Statistics Analysis (Custom)

Data Type: Numerical columns only

  • Basic Statistics Comparison: Compares fundamental statistical measures between original and synthetic data

    • Measures: Mean, median, standard deviation, min/max, quartiles (Q25, Q75), skewness, kurtosis
    • Algorithm: Direct statistical calculation using pandas/numpy functions
    • Score Calculation: Relative differences calculated as |synthetic - original| / |original|
    • Interpretation: Lower relative differences indicate better statistical preservation
  • Range Coverage: Measures how much of the original data range is covered by synthetic data

    • Algorithm: Calculates overlap between original and synthetic value ranges
    • Score Calculation: Overlap range / Original range, where overlap = min(max_syn, max_orig) - max(min_syn, min_orig)
    • Interpretation: Higher coverage (closer to 1.0) indicates better range preservation
  • Distribution Similarity: Compares the shape and characteristics of data distributions

    • KL Divergence: Measures information loss between original and synthetic distributions
      • Algorithm: Kullback-Leibler divergence using histogram binning
      • Score Calculation: Σ p(x) * log(p(x)/q(x)) where p=original, q=synthetic
      • Interpretation: Lower values indicate more similar distributions (0 = identical)
    • Histogram Intersection: Measures overlap between distribution histograms
      • Algorithm: Calculates intersection of normalized histograms
      • Score Calculation: Σ min(hist_orig[i], hist_syn[i])
      • Interpretation: Higher values (closer to 1.0) indicate better distribution similarity
  • Overall Fidelity Score: Combined numerical fidelity metric

    • Algorithm: Weighted average of multiple preservation metrics
    • Score Calculation: Average of (mean preservation, std preservation, skewness preservation, range coverage, histogram similarity)
    • Interpretation:
      • 0.9-1.0: Excellent fidelity
      • 0.8-0.9: Good fidelity
      • 0.7-0.8: Fair fidelity
      • 0.6-0.7: Poor fidelity
      • <0.6: Very poor fidelity
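
A minimal sketch of the custom numerical comparisons above: range coverage, histogram-based KL divergence, and histogram intersection. The bin count and epsilon smoothing are illustrative choices, not necessarily the framework's defaults.

import numpy as np
import pandas as pd

def numerical_fidelity(orig: pd.Series, syn: pd.Series, bins: int = 20) -> dict:
    # Range coverage: overlap of the two value ranges relative to the original range
    overlap = min(orig.max(), syn.max()) - max(orig.min(), syn.min())
    range_coverage = max(overlap, 0.0) / (orig.max() - orig.min())

    # Shared bin edges so the two histograms are directly comparable
    edges = np.histogram_bin_edges(pd.concat([orig, syn]), bins=bins)
    p, _ = np.histogram(orig, bins=edges)
    q, _ = np.histogram(syn, bins=edges)
    p = p / p.sum()
    q = q / q.sum()

    # KL divergence sum p(x) * log(p(x)/q(x)), with smoothing to avoid log(0)
    eps = 1e-10
    kl_divergence = float(np.sum(p * np.log((p + eps) / (q + eps))))

    # Histogram intersection: sum of element-wise minima of the normalized histograms
    intersection = float(np.minimum(p, q).sum())

    return {"range_coverage": float(range_coverage),
            "kl_divergence": kl_divergence,
            "histogram_intersection": intersection}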

Text-Specific Metrics (for text columns)

Data Type: Text columns only

  • Length Statistics: Compares text length distributions between original and synthetic data

    • Algorithm: Character count and word count analysis using string operations
    • Measures: Mean and standard deviation of text lengths and word counts
    • Score Calculation: Direct statistical comparison of length distributions
    • Interpretation: Similar means and standard deviations indicate good text length preservation
  • Keyword Analysis: Compares the most important keywords (TF-IDF scores) between datasets

    • Algorithm: TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
    • Score Calculation:
      1. Fit TF-IDF vectorizer on original data
      2. Transform both datasets using the same vectorizer
      3. Calculate mean TF-IDF scores for each term
      4. Rank terms by importance scores
    • Interpretation: Similar top keywords and scores indicate good content preservation
  • Sentiment Analysis: Compares sentiment distributions between datasets

    • Algorithm: TextBlob sentiment analysis using polarity scores (-1 to +1)
    • Score Calculation:
      1. Calculate sentiment polarity for each text
      2. Compute mean and standard deviation of sentiment scores
      3. Categorize into negative (<-0.1), neutral (-0.1 to 0.1), positive (>0.1)
    • Interpretation: Similar sentiment distributions indicate good emotional tone preservation
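
A short sketch of the keyword and sentiment comparisons above, using scikit-learn's TfidfVectorizer and TextBlob; the vectorizer settings and the top-k cut-off are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

def top_keywords(original_texts, synthetic_texts, k=10):
    # Fit TF-IDF on the original data, then transform both corpora with the same vectorizer
    vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
    orig_scores = np.asarray(vectorizer.fit_transform(original_texts).mean(axis=0)).ravel()
    syn_scores = np.asarray(vectorizer.transform(synthetic_texts).mean(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())
    top = orig_scores.argsort()[::-1][:k]
    return list(zip(terms[top], orig_scores[top], syn_scores[top]))

def sentiment_summary(texts):
    # TextBlob polarity lies in [-1, 1]; bucket into negative / neutral / positive
    scores = np.array([TextBlob(t).sentiment.polarity for t in texts])
    labels = np.where(scores < -0.1, "negative", np.where(scores > 0.1, "positive", "neutral"))
    values, counts = np.unique(labels, return_counts=True)
    return {"mean": float(scores.mean()), "std": float(scores.std()),
            "distribution": dict(zip(values.tolist(), counts.tolist()))}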

Interpretation: Higher scores (closer to 1.0) indicate better fidelity. Scores above 0.8 are considered good, while scores below 0.6 may indicate significant quality issues.

Utility Evaluation (utility.py)

Utility evaluates how useful the synthetic data is for downstream machine learning tasks using the Train on Synthetic, Test on Real (TSTR) methodology.

Task Information

Data Type: Both structured and text data

  • Task Type: Classification, regression, or text classification
    • Classification: For categorical target variables
    • Regression: For numerical target variables
    • Text Classification: For text input with categorical output
  • Input Columns: Features used for prediction
  • Output Columns: Target variables to predict
  • Training Size: Number of synthetic samples used for training
  • Test Size: Number of real samples used for testing

Model Performance Comparison

Data Type: Both structured and text data

  • Real Data Model: Performance metrics when training on real data
  • Synthetic Data Model: Performance metrics when training on synthetic data

Classification Metrics (for classification tasks)

Data Type: Both structured and text data

  • Accuracy: Overall prediction accuracy

    • Algorithm: sklearn.metrics.accuracy_score
    • Score Calculation: (Correct predictions) / (Total predictions)
    • Interpretation: Higher accuracy indicates better model performance
  • Precision: Precision for each class

    • Algorithm: sklearn.metrics.precision_score with per-class calculation
    • Score Calculation: True Positives / (True Positives + False Positives) for each class
    • Interpretation: Higher precision indicates fewer false positive predictions
  • Recall: Recall for each class

    • Algorithm: sklearn.metrics.recall_score with per-class calculation
    • Score Calculation: True Positives / (True Positives + False Negatives) for each class
    • Interpretation: Higher recall indicates fewer false negative predictions
  • F1-Score: Harmonic mean of precision and recall

    • Algorithm: sklearn.metrics.f1_score
    • Score Calculation: 2 × (Precision × Recall) / (Precision + Recall)
    • Interpretation: Balanced measure of precision and recall
  • Macro/Micro Averages: Overall performance across all classes

    • Macro: Average of per-class metrics (treats all classes equally)
    • Micro: Global metric calculated from total true/false positives/negatives

Regression Metrics (for regression tasks)

Data Type: Numerical target variables only

  • R² Score: Coefficient of determination

    • Algorithm: sklearn.metrics.r2_score
    • Score Calculation: 1 - (SS_res / SS_tot) where SS_res = sum of squared residuals, SS_tot = total sum of squares
    • Interpretation: Higher R² (closer to 1.0) indicates better model fit
  • Mean Squared Error: Average squared prediction error

    • Algorithm: sklearn.metrics.mean_squared_error
    • Score Calculation: Average of (predicted - actual)²
    • Interpretation: Lower MSE indicates better predictions
  • Root Mean Squared Error: Square root of mean squared error

    • Algorithm: Square root of MSE
    • Score Calculation: √(MSE)
    • Interpretation: Lower RMSE indicates better predictions (in same units as target)

Feature Processing

Data Type: Both structured and text data

  • Text Processing: For text input columns

    • Algorithm: TF-IDF vectorization with sklearn
    • Parameters: max_features=1000, min_df=2, max_df=0.95, stop_words='english'
    • Process: Convert text to numerical features using term frequency-inverse document frequency
  • Categorical Processing: For categorical input columns

    • Algorithm: One-hot encoding using pandas.get_dummies()
    • Process: Convert categorical variables to binary columns
  • Numerical Processing: For numerical input columns

    • Algorithm: Direct use with missing value imputation
    • Process: Fill NaN values with 0 or mean/median

Interpretation: The synthetic data model should perform comparably to the real data model. A performance gap of less than 10% is considered good utility, while gaps above 20% may indicate poor synthetic data quality.
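
The TSTR comparison can be sketched as follows for a text-classification task. This is a minimal illustration with scikit-learn; the LogisticRegression stand-in and the 50/50 real split are assumptions, not SynEval's exact model or split.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def train_and_test(train_df, real_test_df, input_col="text", output_col="rating"):
    # Train on the given data (real or synthetic), always evaluate on held-out real data
    vectorizer = TfidfVectorizer(max_features=1000, min_df=2, max_df=0.95, stop_words="english")
    X_train = vectorizer.fit_transform(train_df[input_col].fillna(""))
    X_test = vectorizer.transform(real_test_df[input_col].fillna(""))
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train_df[output_col])
    return classification_report(real_test_df[output_col], model.predict(X_test), output_dict=True)

real = pd.read_csv("original_data.csv")         # placeholder paths
synthetic = pd.read_csv("synthetic_data.csv")
real_train = real.sample(frac=0.5, random_state=0)
real_test = real.drop(real_train.index)

real_report = train_and_test(real_train, real_test)
synthetic_report = train_and_test(synthetic, real_test)
# The gap between the two accuracies is the utility signal
print(real_report["accuracy"], synthetic_report["accuracy"])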

Diversity Evaluation (diversity.py)

Diversity assesses the variety and uniqueness of the generated data across multiple dimensions.

Tabular Diversity Metrics

Data Type: Structured data only

Coverage Metrics

  • Column Coverage: Percentage of original data values/range covered by synthetic data
    • For numerical columns: Range overlap percentage
      • Algorithm: Calculate overlap between original and synthetic value ranges
      • Score Calculation: (Overlap range) / (Original range) × 100
    • For categorical columns: Category overlap percentage
      • Algorithm: Count common categories between original and synthetic data
      • Score Calculation: (Common categories) / (Original categories) × 100

Uniqueness Metrics

  • Synthetic Duplicate Ratio: Percentage of duplicate rows in synthetic data
    • Algorithm: pandas.drop_duplicates() to identify unique rows
    • Score Calculation: (Total rows - Unique rows) / Total rows × 100
  • Original Duplicate Ratio: Percentage of duplicate rows in original data
    • Algorithm: Same as synthetic duplicate ratio
  • Relative Duplication: How synthetic duplication compares to original duplication
    • Score Calculation: (Synthetic duplicate ratio) / (Original duplicate ratio) × 100

Numerical Metrics (for numerical columns)

  • Statistical Differences: Differences in mean, standard deviation, skewness, and kurtosis
    • Algorithm: Direct statistical calculation and comparison
    • Score Calculation: |Synthetic value - Original value| / |Original value| for relative differences
  • Range Coverage: Percentage of original value range covered by synthetic data
    • Algorithm: Calculate overlap between value ranges
    • Score Calculation: (Overlap range) / (Original range) × 100
  • Quartile Coverage: How well synthetic data covers the 25th, 50th, and 75th percentiles
    • Algorithm: Compare quartile values between datasets
    • Score Calculation: Relative difference for each quartile
  • Distribution Similarity: KL divergence and similarity score between distributions
    • Algorithm: Histogram-based KL divergence calculation
    • Score Calculation: KL divergence and similarity = exp(-KL_divergence)

Categorical Metrics (for categorical columns)

  • Category Coverage: Percentage of original categories present in synthetic data
    • Algorithm: Set intersection of category sets
    • Score Calculation: (Common categories) / (Original categories) × 100
  • Distribution Similarity: How well synthetic data preserves category frequency distributions
    • Algorithm: Compare normalized frequency distributions
    • Score Calculation: (1 - (Total absolute difference / 2)) × 100
  • Entropy: Information content comparison between original and synthetic data
    • Algorithm: Shannon entropy calculation: -Σ p(x) × log2(p(x))
    • Score Calculation: Entropy for each dataset and their difference
  • Top Categories Coverage: How well synthetic data covers the most common categories
    • Algorithm: Compare top N most frequent categories
    • Score Calculation: (Common top categories) / (Top N categories) × 100
  • Rare Categories Coverage: How well synthetic data preserves rare categories
    • Algorithm: Identify categories with frequency < 1% and check coverage
    • Score Calculation: (Common rare categories) / (Total rare categories) × 100

Entropy Metrics

  • Column Entropy: Information content for each column
    • Algorithm: Shannon entropy calculation for each column
    • Score Calculation: -Σ p(x) × log2(p(x)) where p(x) is probability of value x
  • Dataset Entropy: Overall information content comparison
    • Algorithm: Average entropy across all columns
    • Score Calculation: Mean of column entropies
  • Entropy Ratio: Synthetic entropy relative to original entropy
    • Score Calculation: (Synthetic entropy) / (Original entropy)
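
As a sketch, the entropy metrics above reduce to a few lines of pandas/NumPy; every column is treated as discrete here for illustration.

import numpy as np
import pandas as pd

def column_entropy(series: pd.Series) -> float:
    # Shannon entropy over observed value frequencies: -sum p(x) * log2 p(x)
    p = series.value_counts(normalize=True, dropna=True)
    return float(-(p * np.log2(p)).sum())

def entropy_ratio(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    orig_entropy = np.mean([column_entropy(original[c]) for c in original.columns])
    syn_entropy = np.mean([column_entropy(synthetic[c]) for c in synthetic.columns])
    return float(syn_entropy / orig_entropy)    # values near 1.0 indicate similar information content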

Text Diversity Metrics

Data Type: Text columns only

Lexical Diversity

  • N-gram Analysis: For n=1 to 5, measures:
    • Algorithm: NLTK n-gram generation and analysis
    • total: Total number of n-grams
    • unique: Number of unique n-grams
    • unique_ratio: Ratio of unique to total n-grams
    • entropy: Information content of n-gram distribution
      • Algorithm: Shannon entropy: -Σ p(x) × log2(p(x))
    • normalized_entropy: Entropy normalized by maximum possible entropy
      • Score Calculation: Entropy / log2(unique_count)
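
A sketch of the n-gram statistics above, assuming NLTK's word_tokenize (which relies on the punkt data downloaded by prepare_environment.py); the lowercasing choice is illustrative.

import math
from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

def ngram_diversity(texts, n):
    counts = Counter()
    for text in texts:
        counts.update(ngrams(word_tokenize(text.lower()), n))
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "unique": 0, "unique_ratio": 0.0, "entropy": 0.0, "normalized_entropy": 0.0}
    unique = len(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"total": total, "unique": unique, "unique_ratio": unique / total,
            "entropy": entropy,
            "normalized_entropy": entropy / math.log2(unique) if unique > 1 else 0.0}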

Semantic Diversity

  • Total MST Weight: Total weight of minimum spanning tree connecting text embeddings
    • Algorithm:
      1. Train Word2Vec model on text corpus
      2. Generate embeddings for each text
      3. Calculate cosine distances between all pairs
      4. Construct minimum spanning tree using Kruskal's algorithm
    • Score Calculation: Sum of edge weights in MST
  • Average Edge Weight: Average distance between semantically similar texts
    • Score Calculation: Total MST weight / Number of edges
  • Distinct Nodes: Number of unique semantic representations
    • Algorithm: Count unique embeddings after rounding to 6 decimal places
  • Distinct Ratio: Ratio of distinct semantic representations to total texts
    • Score Calculation: Distinct nodes / Total texts
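
A sketch of the embedding-and-MST computation above, using gensim's Word2Vec and SciPy's minimum spanning tree (a stand-in for the Kruskal step); vector size, whitespace tokenization, and training settings are illustrative.

import numpy as np
from gensim.models import Word2Vec
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def semantic_diversity(texts, vector_size=100):
    tokenized = [t.lower().split() for t in texts]
    model = Word2Vec(sentences=tokenized, vector_size=vector_size, min_count=1, seed=0)
    # Represent each text as the mean of its word vectors
    embeddings = np.array([
        np.mean([model.wv[w] for w in tokens], axis=0) if tokens else np.zeros(vector_size)
        for tokens in tokenized])
    # Pairwise cosine distances, then the MST over the complete distance graph
    distances = squareform(pdist(embeddings, metric="cosine"))
    mst = minimum_spanning_tree(distances)
    total_weight = float(mst.sum())
    n_edges = mst.nnz
    distinct = len({tuple(np.round(e, 6)) for e in embeddings})
    return {"total_mst_weight": total_weight,
            "average_edge_weight": total_weight / n_edges if n_edges else 0.0,
            "distinct_nodes": distinct,
            "distinct_ratio": distinct / len(texts)}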

Sentiment Diversity

  • Sentiment by Rating: Positive sentiment percentage for each rating level
    • Algorithm: Flair sentiment classifier applied to each text
    • Score Calculation: Percentage of positive sentiment texts for each rating
  • Ideal Sentiment: Expected sentiment distribution based on rating
    • Algorithm: Linear mapping from rating to expected positive sentiment
    • Score Calculation: (rating - 1) / 4 for 1-5 scale ratings
  • Sentiment Alignment Score: How well sentiment aligns with rating expectations
    • Algorithm: Calculate deviation from ideal sentiment distribution
    • Score Calculation: Average of (1 - |actual - ideal|) across all ratings

Interpretation: Higher diversity scores indicate more varied and unique synthetic data. Good diversity should show:

  • Coverage scores above 80%
  • Duplicate ratios below 5%
  • Entropy ratios close to 1.0
  • High lexical and semantic diversity scores

Privacy Evaluation (privacy.py)

Privacy analysis evaluates the protection level of sensitive information in synthetic data.

Exact Match Analysis

Data Type: Both structured and text data

  • Exact Match Percentage: Percentage of synthetic rows that exactly match original rows
    • Algorithm: Row-by-row comparison using pandas equality operations
    • Score Calculation: (Matching rows) / (Total synthetic rows) × 100
    • Risk Level: High if >5% exact matches, Low otherwise
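
As a sketch, the exact-match percentage is essentially an inner join of the two tables on all columns (assuming the column sets match):

import pandas as pd

def exact_match_percentage(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Count synthetic rows that appear verbatim in the original data
    matches = synthetic.merge(original.drop_duplicates(), how="inner", on=list(synthetic.columns))
    return 100.0 * len(matches) / len(synthetic)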

Membership Inference Attack

Data Type: Both structured and text data

  • MIA AUC Score: Area under ROC curve for membership inference classifier (0-1 scale)

    • Algorithm:
      1. Combine synthetic and original data with labels (1=synthetic, 0=original)
      2. Extract features using TF-IDF for text and one-hot encoding for categorical
      3. Train RandomForest classifier to distinguish between datasets
      4. Calculate ROC-AUC score
    • Score Calculation: sklearn.metrics.roc_auc_score
    • Risk Level: High if AUC >0.7, Low otherwise
  • Synthetic Confidence: Average confidence of classifier on synthetic data

    • Algorithm: Mean of classifier prediction probabilities for synthetic samples
  • Original Confidence: Average confidence of classifier on original data

    • Algorithm: Mean of classifier prediction probabilities for original samples
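
A compact sketch of the distinguishing-classifier step behind the MIA AUC score, for purely tabular columns (text columns would be TF-IDF-encoded rather than one-hot-encoded); the split ratio and forest size are illustrative.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mia_auc(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Label rows by source (1 = synthetic, 0 = original) and one-hot encode the features
    data = pd.concat([synthetic.assign(_label=1), original.assign(_label=0)], ignore_index=True)
    y = data.pop("_label")
    X = pd.get_dummies(data).fillna(0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    # AUC near 0.5 means the classifier cannot tell the datasets apart (weaker privacy signal)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])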

Named Entity Recognition (for text data)

Data Type: Text columns only

  • Entity Statistics: Count and density of named entities (persons, organizations, locations)

    • Algorithm: Flair NER model (flair/ner-english-large)
    • Entity Types: PER (Person), ORG (Organization), LOC (Location), MISC (Miscellaneous)
    • Score Calculation:
      • Total entities = sum of all detected entities
      • Entity density = total entities / total tokens
    • Risk Level: High if entity density >0.1 or overlap >50%
  • Entity Overlap: Percentage of entities from original data found in synthetic data

    • Algorithm: Set intersection of detected entities
    • Score Calculation: (Common entities) / (Original entities) × 100
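
A short sketch of the entity-extraction step with Flair's ner-english-large model; the example texts are placeholders, and entity overlap is just a set intersection.

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-large")    # downloaded on first use

def extract_entities(texts):
    entities, total_entities, total_tokens = set(), 0, 0
    for text in texts:
        sentence = Sentence(text)
        tagger.predict(sentence)
        spans = sentence.get_spans("ner")
        total_entities += len(spans)
        total_tokens += len(sentence)
        entities.update(span.text for span in spans)
    density = total_entities / total_tokens if total_tokens else 0.0
    return entities, density

original_texts = ["Nike shoes shipped from Portland by John Smith."]        # placeholder examples
synthetic_texts = ["Adidas sneakers reviewed by a customer in Berlin."]
orig_entities, orig_density = extract_entities(original_texts)
syn_entities, syn_density = extract_entities(synthetic_texts)
overlap = 100.0 * len(orig_entities & syn_entities) / max(len(orig_entities), 1)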

Nominal Mentions Analysis (for text data)

Data Type: Text columns only

  • Nominal Statistics: Count and density of person/role/relationship mentions

    • Algorithm: spaCy NLP pipeline with custom filtering
    • Detection: Nouns and proper nouns in subject positions or matching role/relationship patterns
    • Score Calculation:
      • Total nominals = sum of detected nominal mentions
      • Nominal density = total nominals / total tokens
    • Risk Level: High if nominal density >0.15 or overlap >50%
  • Nominal Overlap: Percentage of nominal mentions from original data found in synthetic data

    • Algorithm: Set intersection of detected nominal mentions
    • Score Calculation: (Common nominals) / (Original nominals) × 100

Stylistic Outliers Analysis (for text data)

Data Type: Text columns only

  • Outlier Statistics: Number and percentage of stylistically unique texts

    • Algorithm:
      1. Generate Word2Vec embeddings for all texts
      2. Calculate cosine distances between all text pairs
      3. Identify texts with average distance > 2 standard deviations from mean
    • Score Calculation: (Outlier texts) / (Total texts) × 100
    • Risk Level: High if outlier patterns significantly differ
  • Outlier Comparison: How outlier patterns compare between original and synthetic data

    • Algorithm: Compare outlier percentages and patterns between datasets

Anonymeter Re-identification Risks

Data Type: Structured data only

  • Singling Out Attack (Univariate): Risk of identifying unique individuals using single attributes

    • Algorithm: Anonymeter's SinglingOutEvaluator with univariate mode
    • Process:
      1. Find unique combinations of single attributes in synthetic data
      2. Check if these combinations exist in original data
      3. Calculate attack success rate vs. baseline random guessing
    • Score Calculation: Risk = (Attack rate - Baseline rate) / (1 - Baseline rate)
    • Risk Level: High if risk > 0.5, Low otherwise
  • Singling Out Attack (Multivariate): Risk of identifying unique individuals using attribute combinations

    • Algorithm: Anonymeter's SinglingOutEvaluator with multivariate mode
    • Process: Same as univariate but using combinations of up to 4 attributes
    • Score Calculation: Same as univariate
  • Linkability Attack: Risk of linking synthetic records to original records

    • Algorithm: Anonymeter's LinkabilityEvaluator
    • Process:
      1. Use auxiliary columns to find similar records
      2. Attempt to link synthetic records to original records
      3. Calculate success rate vs. baseline
    • Score Calculation: Same as singling out attacks
  • Inference Attack: Risk of inferring sensitive attributes from other attributes

    • Algorithm: Anonymeter's InferenceEvaluator
    • Process:
      1. For each column as "secret", use other columns as auxiliary
      2. Train model to predict secret from auxiliary columns
      3. Test on synthetic data and calculate inference success
    • Score Calculation: Same as other attacks
  • Overall Risk: Maximum risk score across all attack types

    • Algorithm: Maximum of all individual attack risks
    • Score Calculation: max(singling_out_uni, singling_out_multi, linkability, max_inference_risks)
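
The Anonymeter attacks can also be run on their own. Below is a minimal sketch assuming the anonymeter package's evaluator classes; the attack count and the auxiliary/secret columns (taken from the example metadata) are illustrative placeholders.

import pandas as pd
from anonymeter.evaluators import SinglingOutEvaluator, LinkabilityEvaluator, InferenceEvaluator

ori = pd.read_csv("original_data.csv")       # placeholder paths
syn = pd.read_csv("synthetic_data.csv")

singling_out = SinglingOutEvaluator(ori=ori, syn=syn, n_attacks=500)
singling_out.evaluate(mode="univariate")     # or mode="multivariate"
print(singling_out.risk())                   # risk relative to a random-guessing baseline

linkability = LinkabilityEvaluator(ori=ori, syn=syn, n_attacks=500,
                                   aux_cols=(["rating", "helpful_vote"], ["verified_purchase"]))
linkability.evaluate()
print(linkability.risk())

inference = InferenceEvaluator(ori=ori, syn=syn, n_attacks=500,
                               aux_cols=["rating", "helpful_vote"], secret="verified_purchase")
inference.evaluate()
print(inference.risk())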

Interpretation: Lower risk scores indicate better privacy protection:

  • Low Risk: Scores <0.3, good privacy protection
  • Medium Risk: Scores 0.3-0.7, moderate privacy concerns
  • High Risk: Scores >0.7, significant privacy vulnerabilities

Output Format

The evaluation results are returned in JSON format. Each evaluation dimension will have its own section with specific metrics and scores. Example output structure:

{
  "fidelity": {
    "diagnostic": {
      "Data Validity": 0.95,
      "Data Structure": 0.87,
      "Overall": {"score": 0.91}
    },
    "quality": {
      "Column Shapes": 0.89,
      "Column Pair Trends": 0.82,
      "Overall": {"score": 0.86}
    },
    "text": {
      "text_column": {
        "length_stats": {...},
        "word_count_stats": {...},
        "keyword_analysis": {...},
        "sentiment_analysis": {...}
      }
    }
  },
  "utility": {
    "task_type": "classification",
    "training_size": 10000,
    "test_size": 5000,
    "real_data_model": {...},
    "synthetic_data_model": {...}
  },
  "diversity": {
    "tabular_diversity": {
      "coverage": {...},
      "uniqueness": {...},
      "numerical_metrics": {...},
      "categorical_metrics": {...},
      "entropy_metrics": {...}
    },
    "text_diversity": {
      "synthetic": {...},
      "real": {...}
    }
  },
  "privacy": {
    "exact_matches": {...},
    "membership_inference": {...},
    "named_entities": {...},
    "nominal_mentions": {...},
    "stylistic_outliers": {...},
    "anonymeter": {...}
  }
}

Data Requirements

  1. Both synthetic and original data files must be in CSV format
  2. Column names in the data must match those specified in the metadata
  3. Data types should be consistent with the metadata specifications
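
A quick pre-flight check along these lines can catch mismatches before a long evaluation run (a sketch, not part of run.py):

import json
import pandas as pd

def check_inputs(synthetic_path, original_path, metadata_path):
    synthetic = pd.read_csv(synthetic_path)
    original = pd.read_csv(original_path)
    with open(metadata_path) as f:
        metadata = json.load(f)
    expected = set(metadata["columns"])
    for name, df in [("synthetic", synthetic), ("original", original)]:
        missing = expected - set(df.columns)
        if missing:
            raise ValueError(f"{name} data is missing columns declared in metadata: {sorted(missing)}")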

Error Handling

The framework performs several validation checks:

  1. Ensures all required files exist
  2. Validates that at least one evaluation type is selected
  3. Verifies that the metadata structure matches the data files

Additional Tools

Amazon Fashion Dataset - Named Entity Recognition Analysis

SynEval includes a specialized tool for analyzing large text datasets using Named Entity Recognition (NER). The amazon_fashion_ner_analysis_fast.py script performs comprehensive NER analysis on the Amazon Fashion dataset, processing 10k+ records efficiently.

Features

  • Large Dataset Processing: Efficiently handles 10k+ records with batch processing
  • Entity Density Analysis: Calculates and analyzes entity density for each text
  • Comprehensive Reporting: Generates multiple detailed report files
  • Top 200 High Entity Texts: Identifies and reports texts with the most entities
  • Visualizations: Creates charts and graphs for better understanding
  • Caching: Automatic caching for faster subsequent runs
  • Progress Tracking: Real-time progress bars for long-running operations

Usage

Basic Usage

Simply run the script to analyze your entire dataset:

python amazon_fashion_ner_analysis_fast.py

Configuration Options

You can modify the script to:

  • Use a subset of data for testing
  • Change the text column name
  • Adjust batch sizes for your hardware

Edit the configuration section in the main() function:

# Configuration
csv_file = 'Amazon_Fashion.csv'
text_column = 'text'
sample_size = None  # Set to 10000 to test with first 10K records

Output Files

The script generates several report files in the ./reports directory:

1. Main Analysis Report
  • File: amazon_fashion_ner_report_YYYYMMDD_HHMMSS.txt
  • Content:
    • Dataset statistics
    • Entity counts by type
    • Overall entity density analysis
    • Sample entities for each type
2. Top 200 High Entity Texts Report
  • File: top_200_high_entity_texts_YYYYMMDD_HHMMSS.txt
  • Content:
    • 200 texts with the highest entity counts
    • Entity density for each text
    • List of entities found in each text
3. Entity Density Analysis Report
  • File: entity_density_analysis_YYYYMMDD_HHMMSS.txt
  • Content:
    • Detailed density statistics (mean, median, percentiles)
    • Density distribution analysis
    • Top 50 texts by entity density
4. Visualizations
  • Files:
    • entity_distribution_YYYYMMDD_HHMMSS.png
    • entity_density_histogram_YYYYMMDD_HHMMSS.png
    • entity_count_vs_density_YYYYMMDD_HHMMSS.png

Entity Types Detected

The script identifies and categorizes entities into:

  • PER: Person names (e.g., "John Smith", "Dr. Emily Brown")
  • ORG: Organizations (e.g., "Nike", "Adidas", "Amazon")
  • LOC: Locations (e.g., "New York", "California", "Paris")
  • MISC: Miscellaneous entities that don't fit other categories

Entity Density Analysis

Entity density is calculated as:

Entity Density = Number of Entities / Number of Tokens

The analysis provides:

  • Statistical measures: Mean, median, standard deviation, percentiles
  • Distribution categories:
    • Low density (< 0.01): Minimal entity presence
    • Medium density (0.01-0.05): Moderate entity presence
    • High density (≥ 0.05): High entity presence

Performance Considerations

For Large Datasets (10k+ records)

  1. Memory Usage: The script processes data in batches to manage memory
  2. Processing Time: Expect 2-4 hours for full dataset analysis
  3. Caching: Results are cached for faster subsequent runs
  4. Hardware Requirements:
    • Minimum 8GB RAM
    • Multi-core CPU recommended
    • SSD storage for faster I/O

Optimization Tips

  1. Test with Subset: Set sample_size = 10000 to test with first 10K records
  2. Adjust Batch Size: Modify batch_size in _process_entities_batch() method
  3. CPU Threads: Adjust torch.set_num_threads() based on your CPU cores

Sample Output

Main Report Excerpt
================================================================================
AMAZON FASHION DATASET - NAMED ENTITY RECOGNITION ANALYSIS
================================================================================
Generated on: 2024-01-15 14:30:25

DATASET INFORMATION
----------------------------------------
Total texts analyzed: 2,500,000
Total tokens: 45,678,901
Total entities found: 1,234,567
Average entities per text: 0.49

ENTITY STATISTICS
----------------------------------------
Average entity density: 0.0270
Risk level: low

Entities by type:
  PER: 456,789
  ORG: 345,678
  LOC: 234,567
  MISC: 197,533

ENTITY DENSITY ANALYSIS
----------------------------------------
Mean density: 0.0270
Median density: 0.0150
Standard deviation: 0.0450
Min density: 0.0000
Max density: 0.5000

Density percentiles:
  25th percentile: 0.0050
  50th percentile: 0.0150
  75th percentile: 0.0350
  90th percentile: 0.0650
  95th percentile: 0.0950
  99th percentile: 0.1850

Density distribution:
  Low density (< 0.01): 1,250,000 texts
  Medium density (0.01-0.05): 875,000 texts
  High density (≥ 0.05): 375,000 texts

Top 200 Report Excerpt
================================================================================
TOP 200 TEXTS WITH HIGHEST ENTITY COUNTS
================================================================================
Generated on: 2024-01-15 14:30:25

  1. Entity Count: 15
     Entity Density: 0.2500
     Text: "Nike Air Max 90 shoes designed by John Smith in Portland, Oregon..."
     Entities: Nike (ORG), Air Max 90 (MISC), John Smith (PER), Portland (LOC), Oregon (LOC)
--------------------------------------------------------------------------------

  2. Entity Count: 12
     Entity Density: 0.2000
     Text: "Adidas Ultraboost running shoes from Germany, designed by Dr. Sarah Johnson..."
     Entities: Adidas (ORG), Ultraboost (MISC), Germany (LOC), Dr. Sarah Johnson (PER)
--------------------------------------------------------------------------------

Troubleshooting

Common Issues

  1. Memory Errors: Reduce batch size or use a subset of data
  2. Slow Processing: First run is slower due to model loading
  3. File Not Found: Ensure Amazon_Fashion.csv is in the correct directory
  4. Column Not Found: Check that the "text" column exists in your CSV

Error Messages

  • "Text column 'text' not found": Verify column name in your CSV file
  • "Failed to load Flair NER model": Check internet connection for model download
  • Memory errors: Reduce sample_size or batch_size

Advanced Usage

Custom Analysis

You can use the analyzer class directly for custom analysis:

from amazon_fashion_ner_analysis_fast import AmazonFashionNERAnalyzer

# Initialize analyzer
analyzer = AmazonFashionNERAnalyzer()

# Analyze with custom parameters
results = analyzer.analyze_dataset(
    csv_file='your_data.csv',
    text_column='your_text_column',
    sample_size=50000
)

# Generate custom reports
analyzer.generate_report(results, output_dir='./custom_reports')

Batch Processing for Very Large Datasets

For extremely large datasets, you can process in chunks:

import pandas as pd

# Process in chunks of 100K records
chunk_size = 100000
total_records = 2500000

for start_idx in range(0, total_records, chunk_size):
    end_idx = min(start_idx + chunk_size, total_records)
    # Read only rows [start_idx, end_idx); skiprows keeps the header row and skips earlier data rows
    chunk = pd.read_csv('Amazon_Fashion.csv', skiprows=range(1, start_idx + 1), nrows=chunk_size)
    # Process the chunk here (e.g. run the analyzer on it) and save intermediate results

Dependencies

  • pandas: Data manipulation and CSV reading
  • numpy: Numerical computations and statistics
  • torch: PyTorch for deep learning (Flair dependency)
  • flair: Flair NLP library for named entity recognition
  • tqdm: Progress bars for long-running operations
  • matplotlib: Plotting and visualization
  • seaborn: Enhanced plotting and statistical visualizations

Contributing

As we implement more evaluation metrics, this README will be updated with additional documentation for each component.

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 SynEval Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
