SynEval: Synthetic Data Evaluation Framework

SynEval is a comprehensive evaluation framework for assessing the quality of synthetic data generated by Large Language Models (LLMs). The framework provides quantitative scoring across four key dimensions:

  • Fidelity: Measures how well the synthetic data preserves the statistical properties and patterns of the original data
  • Utility: Evaluates the usefulness of synthetic data for downstream tasks
  • Diversity: Assesses the variety and uniqueness of the generated data
  • Privacy: Analyzes the privacy protection level of the synthetic data

Installation

  1. Clone the repository:
git clone https://github.com/SCU-TrustworthyAI/SynEval.git
cd SynEval
  2. (Optional) Create and activate a conda virtual environment:
conda create -n syneval python=3.10
conda activate syneval
  3. Prepare environment (one command):
python prepare_environment.py

This script will automatically:

  • Install all required Python packages from requirements.txt (with dependency conflict resolution)
  • Download required NLTK data packages (including punkt_tab)
  • Create necessary directories (plots/)
  • Test the installation

Note: You may see dependency conflict warnings during installation. This is normal in environments like Google Colab or when other packages are already installed. These conflicts won't affect SynEval functionality.

For a clean installation without conflicts, consider using a virtual environment:

conda create -n syneval python=3.10
conda activate syneval
python prepare_environment.py

Quick Start

Running SynEval Demo on Google Colab

The easiest way to get started with SynEval is to run the demo notebook on Google Colab:

  1. Open the Demo Notebook:

    • Navigate to SynEval_Demo.ipynb in the repository
    • Click "Open in Colab" or upload the notebook to Google Colab
  2. Install Dependencies:

    • The notebook includes a setup cell that automatically installs all required dependencies
    • Run the setup cell to prepare the environment
  3. Run the Demo:

    • The notebook provides step-by-step examples of running SynEval evaluations
    • Includes sample data and metadata for testing
    • Demonstrates all four evaluation dimensions (fidelity, utility, diversity, privacy)
  4. View Results:

    • Evaluation results are displayed inline with detailed explanations
    • Plots are generated and shown directly in the notebook
    • Results are also saved to files for further analysis

Command Line Usage

After installation, you can use SynEval from the command line:

python run.py \
    --synthetic synthetic_data.csv \
    --original original_data.csv \
    --metadata metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output results.json \
    --plot

Requirements

  • Python 3.10+
  • pandas
  • Additional dependencies will be installed automatically

Usage

Running Evaluations

The main entry point for the framework is run.py. This script allows you to evaluate synthetic data against original data using various metrics.

Basic Usage

Here's an example of running the evaluation framework:

python run.py \
    --synthetic ../Data/claude.csv \
    --original ../Data/real_10k.csv \
    --metadata ../Data/metadata.json \
    --dimensions fidelity \
    --utility-input text \
    --utility-output rating

The general command format is:

python run.py --synthetic <synthetic_data.csv> --original <original_data.csv> --metadata <metadata.json> [evaluation_flags] [--output <results.json>]

Required Arguments

  • --synthetic: Path to the synthetic data CSV file
  • --original: Path to the original data CSV file
  • --metadata: Path to the metadata JSON file

Evaluation Flags

You can select one or more evaluation dimensions to run:

  • --fidelity: Run fidelity evaluation
  • --utility: Run utility evaluation
  • --diversity: Run diversity evaluation
  • --privacy: Run privacy evaluation

Optional Arguments

  • --output: Path to save evaluation results in JSON format. If not specified, results will be printed to stdout.
  • --plot: Generate plots for all evaluation metrics and save them to the ./plots directory. Plots will visualize key metrics from fidelity, utility, diversity, and privacy evaluations.

General Example

python run.py \
    --synthetic synthetic_hotel_data.csv \
    --original original_hotel_data.csv \
    --metadata hotel_metadata.json \
    --fidelity \
    --utility \
    --output evaluation_results.json \
    --plot

Metadata Format

The metadata file should be a JSON file that describes the structure of your data. It should include:

  1. Column names and their types
  2. Dataset name
  3. Primary key information

Example metadata format:

{
  "columns": {
    "_id": {
      "sdtype": "numerical",
      "pii": false,
      "is_primary_key": true
    },
    "rating": {
      "sdtype": "categorical",
      "values": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    "title": {
      "sdtype": "text"
    },
    "text": {
      "sdtype": "text"
    },
    "asin": {
      "sdtype": "numerical",
      "pii": true
    },
    "parent_asin": {
      "sdtype": "numerical",
      "pii": true
    },
    "user_id": {
      "sdtype": "numerical",
      "pii": true
    },
    "timestamp": {
      "sdtype": "datetime"
    },
    "helpful_vote": {
      "sdtype": "numerical"
    },
    "verified_purchase": {
      "sdtype": "boolean",
      "values": [true, false]
    }
  },
  "text_columns": ["title", "text"],
  "utility": {
    "input_columns": ["text"],
    "output_columns": ["rating"],
    "task_type": "classification"
  }
}

Evaluation Results Description

Fidelity Evaluation (fidelity.py)

Fidelity measures how well the synthetic data preserves the statistical properties and patterns of the original data. This evaluation uses both SDV (Synthetic Data Vault) metrics and custom statistical analysis.

Diagnostic Metrics (SDV-based)

Data Type: Structured data only

  • Data Validity: Measures the percentage of valid data in the synthetic dataset (0-1 scale)

    • Algorithm: SDV's diagnostic evaluation checks for data type consistency, missing value patterns, and constraint violations
    • Score Calculation: Number of rows that pass all validity checks divided by the total number of rows
    • Interpretation: Higher scores indicate better data quality and adherence to original data constraints
  • Data Structure: Evaluates how well the synthetic data maintains the structural relationships of the original data (0-1 scale)

    • Algorithm: SDV analyzes primary key uniqueness, foreign key relationships, and referential integrity
    • Score Calculation: Weighted average of structural constraint compliance scores
    • Interpretation: Higher scores indicate better preservation of data relationships and constraints
  • Overall Score: Combined diagnostic score indicating overall data quality

    • Algorithm: Weighted average of Data Validity and Data Structure scores
    • Score Calculation: (Data Validity × 0.6) + (Data Structure × 0.4)
    • Interpretation: Comprehensive measure of basic data quality and structural integrity

Quality Metrics (SDV-based)

Data Type: Structured data only

  • Column Shapes: Measures how well the synthetic data preserves the distribution shapes of individual columns (0-1 scale)

    • Algorithm: SDV uses statistical tests (Kolmogorov-Smirnov for continuous, Chi-square for categorical) to compare distributions
    • Score Calculation: Average of distribution similarity scores across all columns, normalized to 0-1 scale
    • Interpretation: Higher scores indicate better preservation of individual column distributions
  • Column Pair Trends: Evaluates the preservation of relationships between column pairs (0-1 scale)

    • Algorithm: SDV analyzes correlation coefficients, mutual information, and conditional distributions between column pairs
    • Score Calculation: Average of pairwise relationship preservation scores across all column combinations
    • Interpretation: Higher scores indicate better preservation of inter-column relationships and correlations
  • Overall Quality Score: Combined quality score for statistical fidelity

    • Algorithm: Weighted average of Column Shapes and Column Pair Trends
    • Score Calculation: (Column Shapes × 0.7) + (Column Pair Trends × 0.3)
    • Interpretation: Comprehensive measure of statistical fidelity and relationship preservation
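
Both score groups above come from SDV's single-table evaluation API. The snippet below is a minimal standalone sketch of those calls (assuming SDV 1.x); the file paths are placeholders, and the metadata is auto-detected here rather than built from metadata.json. SynEval then combines the reported property scores using the weightings listed above.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.evaluation.single_table import run_diagnostic, evaluate_quality

original = pd.read_csv("original_data.csv")      # placeholder paths
synthetic = pd.read_csv("synthetic_data.csv")

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(original)

# Diagnostic report: Data Validity and Data Structure
diagnostic = run_diagnostic(real_data=original, synthetic_data=synthetic, metadata=metadata)
print(diagnostic.get_score())

# Quality report: Column Shapes and Column Pair Trends
quality = evaluate_quality(real_data=original, synthetic_data=synthetic, metadata=metadata)
print(quality.get_score())
print(quality.get_details(property_name="Column Shapes"))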

Numerical Statistics Analysis (Custom)

Data Type: Numerical columns only

  • Basic Statistics Comparison: Compares fundamental statistical measures between original and synthetic data

    • Measures: Mean, median, standard deviation, min/max, quartiles (Q25, Q75), skewness, kurtosis
    • Algorithm: Direct statistical calculation using pandas/numpy functions
    • Score Calculation: Relative differences calculated as |synthetic - original| / |original|
    • Interpretation: Lower relative differences indicate better statistical preservation
  • Range Coverage: Measures how much of the original data range is covered by synthetic data

    • Algorithm: Calculates overlap between original and synthetic value ranges
    • Score Calculation: Overlap range / Original range, where overlap = min(max_syn, max_orig) - max(min_syn, min_orig)
    • Interpretation: Higher coverage (closer to 1.0) indicates better range preservation
  • Distribution Similarity: Compares the shape and characteristics of data distributions

    • KL Divergence: Measures information loss between original and synthetic distributions
      • Algorithm: Kullback-Leibler divergence using histogram binning
      • Score Calculation: Σ p(x) * log(p(x)/q(x)) where p=original, q=synthetic
      • Interpretation: Lower values indicate more similar distributions (0 = identical)
    • Histogram Intersection: Measures overlap between distribution histograms
      • Algorithm: Calculates intersection of normalized histograms
      • Score Calculation: Σ min(hist_orig[i], hist_syn[i])
      • Interpretation: Higher values (closer to 1.0) indicate better distribution similarity
  • Overall Fidelity Score: Combined numerical fidelity metric

    • Algorithm: Weighted average of multiple preservation metrics
    • Score Calculation: Average of (mean preservation, std preservation, skewness preservation, range coverage, histogram similarity)
    • Interpretation:
      • 0.9-1.0: Excellent fidelity
      • 0.8-0.9: Good fidelity
      • 0.7-0.8: Fair fidelity
      • 0.6-0.7: Poor fidelity
      • <0.6: Very poor fidelity
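
A minimal sketch of the custom numerical comparisons above: range coverage, histogram-based KL divergence, and histogram intersection. The bin count and epsilon smoothing are illustrative choices, not necessarily the framework's defaults.

import numpy as np
import pandas as pd

def numerical_fidelity(orig: pd.Series, syn: pd.Series, bins: int = 20) -> dict:
    # Range coverage: overlap of the two value ranges relative to the original range
    overlap = min(orig.max(), syn.max()) - max(orig.min(), syn.min())
    range_coverage = max(overlap, 0.0) / (orig.max() - orig.min())

    # Shared bin edges so the two histograms are directly comparable
    edges = np.histogram_bin_edges(pd.concat([orig, syn]), bins=bins)
    p, _ = np.histogram(orig, bins=edges)
    q, _ = np.histogram(syn, bins=edges)
    p = p / p.sum()
    q = q / q.sum()

    # KL divergence sum p(x) * log(p(x)/q(x)), with smoothing to avoid log(0)
    eps = 1e-10
    kl_divergence = float(np.sum(p * np.log((p + eps) / (q + eps))))

    # Histogram intersection: sum of element-wise minima of the normalized histograms
    intersection = float(np.minimum(p, q).sum())

    return {"range_coverage": float(range_coverage),
            "kl_divergence": kl_divergence,
            "histogram_intersection": intersection}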

Text-Specific Metrics (for text columns)

Data Type: Text columns only

  • Length Statistics: Compares text length distributions between original and synthetic data

    • Algorithm: Character count and word count analysis using string operations
    • Measures: Mean and standard deviation of text lengths and word counts
    • Score Calculation: Direct statistical comparison of length distributions
    • Interpretation: Similar means and standard deviations indicate good text length preservation
  • Keyword Analysis: Compares the most important keywords (TF-IDF scores) between datasets

    • Algorithm: TF-IDF (Term Frequency-Inverse Document Frequency) vectorization
    • Score Calculation:
      1. Fit TF-IDF vectorizer on original data
      2. Transform both datasets using the same vectorizer
      3. Calculate mean TF-IDF scores for each term
      4. Rank terms by importance scores
    • Interpretation: Similar top keywords and scores indicate good content preservation
  • Sentiment Analysis: Compares sentiment distributions between datasets

    • Algorithm: TextBlob sentiment analysis using polarity scores (-1 to +1)
    • Score Calculation:
      1. Calculate sentiment polarity for each text
      2. Compute mean and standard deviation of sentiment scores
      3. Categorize into negative (<-0.1), neutral (-0.1 to 0.1), positive (>0.1)
    • Interpretation: Similar sentiment distributions indicate good emotional tone preservation
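
A short sketch of the keyword and sentiment comparisons above, using scikit-learn's TfidfVectorizer and TextBlob; the vectorizer settings and the top-k cut-off are illustrative assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob

def top_keywords(original_texts, synthetic_texts, k=10):
    # Fit TF-IDF on the original data, then transform both corpora with the same vectorizer
    vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
    orig_scores = np.asarray(vectorizer.fit_transform(original_texts).mean(axis=0)).ravel()
    syn_scores = np.asarray(vectorizer.transform(synthetic_texts).mean(axis=0)).ravel()
    terms = np.array(vectorizer.get_feature_names_out())
    top = orig_scores.argsort()[::-1][:k]
    return list(zip(terms[top], orig_scores[top], syn_scores[top]))

def sentiment_summary(texts):
    # TextBlob polarity lies in [-1, 1]; bucket into negative / neutral / positive
    scores = np.array([TextBlob(t).sentiment.polarity for t in texts])
    labels = np.where(scores < -0.1, "negative", np.where(scores > 0.1, "positive", "neutral"))
    values, counts = np.unique(labels, return_counts=True)
    return {"mean": float(scores.mean()), "std": float(scores.std()),
            "distribution": dict(zip(values.tolist(), counts.tolist()))}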

Interpretation: Higher scores (closer to 1.0) indicate better fidelity. Scores above 0.8 are considered good, while scores below 0.6 may indicate significant quality issues.

Utility Evaluation (utility.py)

Utility evaluates how useful the synthetic data is for downstream machine learning tasks using the Train on Synthetic, Test on Real (TSTR) methodology.

Task Information

Data Type: Both structured and text data

  • Task Type: Classification, regression, or text classification
    • Classification: For categorical target variables
    • Regression: For numerical target variables
    • Text Classification: For text input with categorical output
  • Input Columns: Features used for prediction
  • Output Columns: Target variables to predict
  • Training Size: Number of synthetic samples used for training
  • Test Size: Number of real samples used for testing

Model Performance Comparison

Data Type: Both structured and text data

  • Real Data Model: Performance metrics when training on real data
  • Synthetic Data Model: Performance metrics when training on synthetic data

Classification Metrics (for classification tasks)

Data Type: Both structured and text data

  • Accuracy: Overall prediction accuracy

    • Algorithm: sklearn.metrics.accuracy_score
    • Score Calculation: (Correct predictions) / (Total predictions)
    • Interpretation: Higher accuracy indicates better model performance
  • Precision: Precision for each class

    • Algorithm: sklearn.metrics.precision_score with per-class calculation
    • Score Calculation: True Positives / (True Positives + False Positives) for each class
    • Interpretation: Higher precision indicates fewer false positive predictions
  • Recall: Recall for each class

    • Algorithm: sklearn.metrics.recall_score with per-class calculation
    • Score Calculation: True Positives / (True Positives + False Negatives) for each class
    • Interpretation: Higher recall indicates fewer false negative predictions
  • F1-Score: Harmonic mean of precision and recall

    • Algorithm: sklearn.metrics.f1_score
    • Score Calculation: 2 × (Precision × Recall) / (Precision + Recall)
    • Interpretation: Balanced measure of precision and recall
  • Macro/Micro Averages: Overall performance across all classes

    • Macro: Average of per-class metrics (treats all classes equally)
    • Micro: Global metric calculated from total true/false positives/negatives

Regression Metrics (for regression tasks)

Data Type: Numerical target variables only

  • R² Score: Coefficient of determination

    • Algorithm: sklearn.metrics.r2_score
    • Score Calculation: 1 - (SS_res / SS_tot) where SS_res = sum of squared residuals, SS_tot = total sum of squares
    • Interpretation: Higher R² (closer to 1.0) indicates better model fit
  • Mean Squared Error: Average squared prediction error

    • Algorithm: sklearn.metrics.mean_squared_error
    • Score Calculation: Average of (predicted - actual)²
    • Interpretation: Lower MSE indicates better predictions
  • Root Mean Squared Error: Square root of mean squared error

    • Algorithm: Square root of MSE
    • Score Calculation: √(MSE)
    • Interpretation: Lower RMSE indicates better predictions (in same units as target)

Feature Processing

Data Type: Both structured and text data

  • Text Processing: For text input columns

    • Algorithm: TF-IDF vectorization with sklearn
    • Parameters: max_features=1000, min_df=2, max_df=0.95, stop_words='english'
    • Process: Convert text to numerical features using term frequency-inverse document frequency
  • Categorical Processing: For categorical input columns

    • Algorithm: One-hot encoding using pandas.get_dummies()
    • Process: Convert categorical variables to binary columns
  • Numerical Processing: For numerical input columns

    • Algorithm: Direct use with missing value imputation
    • Process: Fill NaN values with 0 or mean/median

Interpretation: The synthetic data model should perform comparably to the real data model. A performance gap of less than 10% is considered good utility, while gaps above 20% may indicate poor synthetic data quality.
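
The TSTR comparison can be sketched as follows for a text-classification task. This is a minimal illustration with scikit-learn; the LogisticRegression stand-in and the 50/50 real split are assumptions, not SynEval's exact model or split.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def train_and_test(train_df, real_test_df, input_col="text", output_col="rating"):
    # Train on the given data (real or synthetic), always evaluate on held-out real data
    vectorizer = TfidfVectorizer(max_features=1000, min_df=2, max_df=0.95, stop_words="english")
    X_train = vectorizer.fit_transform(train_df[input_col].fillna(""))
    X_test = vectorizer.transform(real_test_df[input_col].fillna(""))
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train_df[output_col])
    return classification_report(real_test_df[output_col], model.predict(X_test), output_dict=True)

real = pd.read_csv("original_data.csv")         # placeholder paths
synthetic = pd.read_csv("synthetic_data.csv")
real_train = real.sample(frac=0.5, random_state=0)
real_test = real.drop(real_train.index)

real_report = train_and_test(real_train, real_test)
synthetic_report = train_and_test(synthetic, real_test)
# The gap between the two accuracies is the utility signal
print(real_report["accuracy"], synthetic_report["accuracy"])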

Diversity Evaluation (diversity.py)

Diversity assesses the variety and uniqueness of the generated data across multiple dimensions.

Tabular Diversity Metrics

Data Type: Structured data only

Coverage Metrics

  • Column Coverage: Percentage of original data values/range covered by synthetic data
    • For numerical columns: Range overlap percentage
      • Algorithm: Calculate overlap between original and synthetic value ranges
      • Score Calculation: (Overlap range) / (Original range) × 100
    • For categorical columns: Category overlap percentage
      • Algorithm: Count common categories between original and synthetic data
      • Score Calculation: (Common categories) / (Original categories) × 100

Uniqueness Metrics

  • Synthetic Duplicate Ratio: Percentage of duplicate rows in synthetic data
    • Algorithm: pandas.drop_duplicates() to identify unique rows
    • Score Calculation: (Total rows - Unique rows) / Total rows × 100
  • Original Duplicate Ratio: Percentage of duplicate rows in original data
    • Algorithm: Same as synthetic duplicate ratio
  • Relative Duplication: How synthetic duplication compares to original duplication
    • Score Calculation: (Synthetic duplicate ratio) / (Original duplicate ratio) × 100

Numerical Metrics (for numerical columns)

  • Statistical Differences: Differences in mean, standard deviation, skewness, and kurtosis
    • Algorithm: Direct statistical calculation and comparison
    • Score Calculation: |Synthetic value - Original value| / |Original value| for relative differences
  • Range Coverage: Percentage of original value range covered by synthetic data
    • Algorithm: Calculate overlap between value ranges
    • Score Calculation: (Overlap range) / (Original range) × 100
  • Quartile Coverage: How well synthetic data covers the 25th, 50th, and 75th percentiles
    • Algorithm: Compare quartile values between datasets
    • Score Calculation: Relative difference for each quartile
  • Distribution Similarity: KL divergence and similarity score between distributions
    • Algorithm: Histogram-based KL divergence calculation
    • Score Calculation: KL divergence and similarity = exp(-KL_divergence)

Categorical Metrics (for categorical columns)

  • Category Coverage: Percentage of original categories present in synthetic data
    • Algorithm: Set intersection of category sets
    • Score Calculation: (Common categories) / (Original categories) × 100
  • Distribution Similarity: How well synthetic data preserves category frequency distributions
    • Algorithm: Compare normalized frequency distributions
    • Score Calculation: (1 - (Total absolute difference / 2)) × 100
  • Entropy: Information content comparison between original and synthetic data
    • Algorithm: Shannon entropy calculation: -Σ p(x) × log2(p(x))
    • Score Calculation: Entropy for each dataset and their difference
  • Top Categories Coverage: How well synthetic data covers the most common categories
    • Algorithm: Compare top N most frequent categories
    • Score Calculation: (Common top categories) / (Top N categories) × 100
  • Rare Categories Coverage: How well synthetic data preserves rare categories
    • Algorithm: Identify categories with frequency < 1% and check coverage
    • Score Calculation: (Common rare categories) / (Total rare categories) × 100

Entropy Metrics

  • Column Entropy: Information content for each column
    • Algorithm: Shannon entropy calculation for each column
    • Score Calculation: -Σ p(x) × log2(p(x)) where p(x) is probability of value x
  • Dataset Entropy: Overall information content comparison
    • Algorithm: Average entropy across all columns
    • Score Calculation: Mean of column entropies
  • Entropy Ratio: Synthetic entropy relative to original entropy
    • Score Calculation: (Synthetic entropy) / (Original entropy)
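
As a sketch, the entropy metrics above reduce to a few lines of pandas/NumPy; every column is treated as discrete here for illustration.

import numpy as np
import pandas as pd

def column_entropy(series: pd.Series) -> float:
    # Shannon entropy over observed value frequencies: -sum p(x) * log2 p(x)
    p = series.value_counts(normalize=True, dropna=True)
    return float(-(p * np.log2(p)).sum())

def entropy_ratio(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    orig_entropy = np.mean([column_entropy(original[c]) for c in original.columns])
    syn_entropy = np.mean([column_entropy(synthetic[c]) for c in synthetic.columns])
    return float(syn_entropy / orig_entropy)    # values near 1.0 indicate similar information content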

Text Diversity Metrics

Data Type: Text columns only

Lexical Diversity

  • N-gram Analysis: For n=1 to 5, measures:
    • Algorithm: NLTK n-gram generation and analysis
    • total: Total number of n-grams
    • unique: Number of unique n-grams
    • unique_ratio: Ratio of unique to total n-grams
    • entropy: Information content of n-gram distribution
      • Algorithm: Shannon entropy: -Σ p(x) × log2(p(x))
    • normalized_entropy: Entropy normalized by maximum possible entropy
      • Score Calculation: Entropy / log2(unique_count)
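
A sketch of the n-gram statistics above, assuming NLTK's word_tokenize (which relies on the punkt data downloaded by prepare_environment.py); the lowercasing choice is illustrative.

import math
from collections import Counter
from nltk import word_tokenize
from nltk.util import ngrams

def ngram_diversity(texts, n):
    counts = Counter()
    for text in texts:
        counts.update(ngrams(word_tokenize(text.lower()), n))
    total = sum(counts.values())
    if total == 0:
        return {"total": 0, "unique": 0, "unique_ratio": 0.0, "entropy": 0.0, "normalized_entropy": 0.0}
    unique = len(counts)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"total": total, "unique": unique, "unique_ratio": unique / total,
            "entropy": entropy,
            "normalized_entropy": entropy / math.log2(unique) if unique > 1 else 0.0}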

Semantic Diversity

  • Total MST Weight: Total weight of minimum spanning tree connecting text embeddings
    • Algorithm:
      1. Train Word2Vec model on text corpus
      2. Generate embeddings for each text
      3. Calculate cosine distances between all pairs
      4. Construct minimum spanning tree using Kruskal's algorithm
    • Score Calculation: Sum of edge weights in MST
  • Average Edge Weight: Average distance between semantically similar texts
    • Score Calculation: Total MST weight / Number of edges
  • Distinct Nodes: Number of unique semantic representations
    • Algorithm: Count unique embeddings after rounding to 6 decimal places
  • Distinct Ratio: Ratio of distinct semantic representations to total texts
    • Score Calculation: Distinct nodes / Total texts
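
A sketch of the embedding-and-MST computation above, using gensim's Word2Vec and SciPy's minimum spanning tree (a stand-in for the Kruskal step); vector size, whitespace tokenization, and training settings are illustrative.

import numpy as np
from gensim.models import Word2Vec
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def semantic_diversity(texts, vector_size=100):
    tokenized = [t.lower().split() for t in texts]
    model = Word2Vec(sentences=tokenized, vector_size=vector_size, min_count=1, seed=0)
    # Represent each text as the mean of its word vectors
    embeddings = np.array([
        np.mean([model.wv[w] for w in tokens], axis=0) if tokens else np.zeros(vector_size)
        for tokens in tokenized])
    # Pairwise cosine distances, then the MST over the complete distance graph
    distances = squareform(pdist(embeddings, metric="cosine"))
    mst = minimum_spanning_tree(distances)
    total_weight = float(mst.sum())
    n_edges = mst.nnz
    distinct = len({tuple(np.round(e, 6)) for e in embeddings})
    return {"total_mst_weight": total_weight,
            "average_edge_weight": total_weight / n_edges if n_edges else 0.0,
            "distinct_nodes": distinct,
            "distinct_ratio": distinct / len(texts)}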

Sentiment Diversity

  • Sentiment by Rating: Positive sentiment percentage for each rating level
    • Algorithm: Flair sentiment classifier applied to each text
    • Score Calculation: Percentage of positive sentiment texts for each rating
  • Ideal Sentiment: Expected sentiment distribution based on rating
    • Algorithm: Linear mapping from rating to expected positive sentiment
    • Score Calculation: (rating - 1) / 4 for 1-5 scale ratings
  • Sentiment Alignment Score: How well sentiment aligns with rating expectations
    • Algorithm: Calculate deviation from ideal sentiment distribution
    • Score Calculation: Average of (1 - |actual - ideal|) across all ratings

Interpretation: Higher diversity scores indicate more varied and unique synthetic data. Good diversity should show:

  • Coverage scores above 80%
  • Duplicate ratios below 5%
  • Entropy ratios close to 1.0
  • High lexical and semantic diversity scores

Privacy Evaluation (privacy.py)

Privacy analysis evaluates the protection level of sensitive information in synthetic data.

Exact Match Analysis

Data Type: Both structured and text data

  • Exact Match Percentage: Percentage of synthetic rows that exactly match original rows
    • Algorithm: Row-by-row comparison using pandas equality operations
    • Score Calculation: (Matching rows) / (Total synthetic rows) × 100
    • Risk Level: High if >5% exact matches, Low otherwise
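
As a sketch, the exact-match percentage is essentially an inner join of the two tables on all columns (assuming the column sets match):

import pandas as pd

def exact_match_percentage(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Count synthetic rows that appear verbatim in the original data
    matches = synthetic.merge(original.drop_duplicates(), how="inner", on=list(synthetic.columns))
    return 100.0 * len(matches) / len(synthetic)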

Membership Inference Attack

Data Type: Both structured and text data

  • MIA AUC Score: Area under ROC curve for membership inference classifier (0-1 scale)

    • Algorithm:
      1. Combine synthetic and original data with labels (1=synthetic, 0=original)
      2. Extract features using TF-IDF for text and one-hot encoding for categorical
      3. Train RandomForest classifier to distinguish between datasets
      4. Calculate ROC-AUC score
    • Score Calculation: sklearn.metrics.roc_auc_score
    • Risk Level: High if AUC >0.7, Low otherwise
  • Synthetic Confidence: Average confidence of classifier on synthetic data

    • Algorithm: Mean of classifier prediction probabilities for synthetic samples
  • Original Confidence: Average confidence of classifier on original data

    • Algorithm: Mean of classifier prediction probabilities for original samples
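
A compact sketch of the distinguishing-classifier step behind the MIA AUC score, for purely tabular columns (text columns would be TF-IDF-encoded rather than one-hot-encoded); the split ratio and forest size are illustrative.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def mia_auc(original: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    # Label rows by source (1 = synthetic, 0 = original) and one-hot encode the features
    data = pd.concat([synthetic.assign(_label=1), original.assign(_label=0)], ignore_index=True)
    y = data.pop("_label")
    X = pd.get_dummies(data).fillna(0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    # AUC near 0.5 means the classifier cannot tell the datasets apart (weaker privacy signal)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])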

Named Entity Recognition (for text data)

Data Type: Text columns only

  • Entity Statistics: Count and density of named entities (persons, organizations, locations)

    • Algorithm: Flair NER model (flair/ner-english-large)
    • Entity Types: PER (Person), ORG (Organization), LOC (Location), MISC (Miscellaneous)
    • Score Calculation:
      • Total entities = sum of all detected entities
      • Entity density = total entities / total tokens
    • Risk Level: High if entity density >0.1 or overlap >50%
  • Entity Overlap: Percentage of entities from original data found in synthetic data

    • Algorithm: Set intersection of detected entities
    • Score Calculation: (Common entities) / (Original entities) × 100
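
A short sketch of the entity-extraction step with Flair's ner-english-large model; the example texts are placeholders, and entity overlap is just a set intersection.

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-large")    # downloaded on first use

def extract_entities(texts):
    entities, total_entities, total_tokens = set(), 0, 0
    for text in texts:
        sentence = Sentence(text)
        tagger.predict(sentence)
        spans = sentence.get_spans("ner")
        total_entities += len(spans)
        total_tokens += len(sentence)
        entities.update(span.text for span in spans)
    density = total_entities / total_tokens if total_tokens else 0.0
    return entities, density

original_texts = ["Nike shoes shipped from Portland by John Smith."]        # placeholder examples
synthetic_texts = ["Adidas sneakers reviewed by a customer in Berlin."]
orig_entities, orig_density = extract_entities(original_texts)
syn_entities, syn_density = extract_entities(synthetic_texts)
overlap = 100.0 * len(orig_entities & syn_entities) / max(len(orig_entities), 1)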

Nominal Mentions Analysis (for text data)

Data Type: Text columns only

  • Nominal Statistics: Count and density of person/role/relationship mentions

    • Algorithm: spaCy NLP pipeline with custom filtering
    • Detection: Nouns and proper nouns in subject positions or matching role/relationship patterns
    • Score Calculation:
      • Total nominals = sum of detected nominal mentions
      • Nominal density = total nominals / total tokens
    • Risk Level: High if nominal density >0.15 or overlap >50%
  • Nominal Overlap: Percentage of nominal mentions from original data found in synthetic data

    • Algorithm: Set intersection of detected nominal mentions
    • Score Calculation: (Common nominals) / (Original nominals) × 100

Stylistic Outliers Analysis (for text data)

Data Type: Text columns only

  • Outlier Statistics: Number and percentage of stylistically unique texts

    • Algorithm:
      1. Generate Word2Vec embeddings for all texts
      2. Calculate cosine distances between all text pairs
      3. Identify texts with average distance > 2 standard deviations from mean
    • Score Calculation: (Outlier texts) / (Total texts) × 100
    • Risk Level: High if outlier patterns significantly differ
  • Outlier Comparison: How outlier patterns compare between original and synthetic data

    • Algorithm: Compare outlier percentages and patterns between datasets

Anonymeter Re-identification Risks

Data Type: Structured data only

  • Singling Out Attack (Univariate): Risk of identifying unique individuals using single attributes

    • Algorithm: Anonymeter's SinglingOutEvaluator with univariate mode
    • Process:
      1. Find unique combinations of single attributes in synthetic data
      2. Check if these combinations exist in original data
      3. Calculate attack success rate vs. baseline random guessing
    • Score Calculation: Risk = (Attack rate - Baseline rate) / (1 - Baseline rate)
    • Risk Level: High if risk > 0.5, Low otherwise
  • Singling Out Attack (Multivariate): Risk of identifying unique individuals using attribute combinations

    • Algorithm: Anonymeter's SinglingOutEvaluator with multivariate mode
    • Process: Same as univariate but using combinations of up to 4 attributes
    • Score Calculation: Same as univariate
  • Linkability Attack: Risk of linking synthetic records to original records

    • Algorithm: Anonymeter's LinkabilityEvaluator
    • Process:
      1. Use auxiliary columns to find similar records
      2. Attempt to link synthetic records to original records
      3. Calculate success rate vs. baseline
    • Score Calculation: Same as singling out attacks
  • Inference Attack: Risk of inferring sensitive attributes from other attributes

    • Algorithm: Anonymeter's InferenceEvaluator
    • Process:
      1. For each column as "secret", use other columns as auxiliary
      2. Train model to predict secret from auxiliary columns
      3. Test on synthetic data and calculate inference success
    • Score Calculation: Same as other attacks
  • Overall Risk: Maximum risk score across all attack types

    • Algorithm: Maximum of all individual attack risks
    • Score Calculation: max(singling_out_uni, singling_out_multi, linkability, max_inference_risks)
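
The Anonymeter attacks can also be run on their own. Below is a minimal sketch assuming the anonymeter package's evaluator classes; the attack count and the auxiliary/secret columns (taken from the example metadata) are illustrative placeholders.

import pandas as pd
from anonymeter.evaluators import SinglingOutEvaluator, LinkabilityEvaluator, InferenceEvaluator

ori = pd.read_csv("original_data.csv")       # placeholder paths
syn = pd.read_csv("synthetic_data.csv")

singling_out = SinglingOutEvaluator(ori=ori, syn=syn, n_attacks=500)
singling_out.evaluate(mode="univariate")     # or mode="multivariate"
print(singling_out.risk())                   # risk relative to a random-guessing baseline

linkability = LinkabilityEvaluator(ori=ori, syn=syn, n_attacks=500,
                                   aux_cols=(["rating", "helpful_vote"], ["verified_purchase"]))
linkability.evaluate()
print(linkability.risk())

inference = InferenceEvaluator(ori=ori, syn=syn, n_attacks=500,
                               aux_cols=["rating", "helpful_vote"], secret="verified_purchase")
inference.evaluate()
print(inference.risk())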

Interpretation: Lower risk scores indicate better privacy protection:

  • Low Risk: Scores <0.3, good privacy protection
  • Medium Risk: Scores 0.3-0.7, moderate privacy concerns
  • High Risk: Scores >0.7, significant privacy vulnerabilities

Output Format

The evaluation results are returned in JSON format. Each evaluation dimension will have its own section with specific metrics and scores. Example output structure:

{
  "fidelity": {
    "diagnostic": {
      "Data Validity": 0.95,
      "Data Structure": 0.87,
      "Overall": {"score": 0.91}
    },
    "quality": {
      "Column Shapes": 0.89,
      "Column Pair Trends": 0.82,
      "Overall": {"score": 0.86}
    },
    "text": {
      "text_column": {
        "length_stats": {...},
        "word_count_stats": {...},
        "keyword_analysis": {...},
        "sentiment_analysis": {...}
      }
    }
  },
  "utility": {
    "task_type": "classification",
    "training_size": 10000,
    "test_size": 5000,
    "real_data_model": {...},
    "synthetic_data_model": {...}
  },
  "diversity": {
    "tabular_diversity": {
      "coverage": {...},
      "uniqueness": {...},
      "numerical_metrics": {...},
      "categorical_metrics": {...},
      "entropy_metrics": {...}
    },
    "text_diversity": {
      "synthetic": {...},
      "real": {...}
    }
  },
  "privacy": {
    "exact_matches": {...},
    "membership_inference": {...},
    "named_entities": {...},
    "nominal_mentions": {...},
    "stylistic_outliers": {...},
    "anonymeter": {...}
  }
}

Data Requirements

  1. Both synthetic and original data files must be in CSV format
  2. Column names in the data must match those specified in the metadata
  3. Data types should be consistent with the metadata specifications
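
A quick pre-flight check along these lines can catch mismatches before a long evaluation run (a sketch, not part of run.py):

import json
import pandas as pd

def check_inputs(synthetic_path, original_path, metadata_path):
    synthetic = pd.read_csv(synthetic_path)
    original = pd.read_csv(original_path)
    with open(metadata_path) as f:
        metadata = json.load(f)
    expected = set(metadata["columns"])
    for name, df in [("synthetic", synthetic), ("original", original)]:
        missing = expected - set(df.columns)
        if missing:
            raise ValueError(f"{name} data is missing columns declared in metadata: {sorted(missing)}")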

Error Handling

The framework performs several validation checks:

  1. Ensures all required files exist
  2. Validates that at least one evaluation type is selected
  3. Verifies that the metadata structure matches the data files

Additional Tools

Amazon Fashion Dataset - Named Entity Recognition Analysis

SynEval includes a specialized tool for analyzing large text datasets using Named Entity Recognition (NER). The amazon_fashion_ner_analysis_fast.py script performs comprehensive NER analysis on the Amazon Fashion dataset, processing 10k+ records efficiently.

Features

  • Large Dataset Processing: Efficiently handles 10k+ records with batch processing
  • Entity Density Analysis: Calculates and analyzes entity density for each text
  • Comprehensive Reporting: Generates multiple detailed report files
  • Top 200 High Entity Texts: Identifies and reports texts with the most entities
  • Visualizations: Creates charts and graphs for better understanding
  • Caching: Automatic caching for faster subsequent runs
  • Progress Tracking: Real-time progress bars for long-running operations

Usage

Basic Usage

Simply run the script to analyze your entire dataset:

python amazon_fashion_ner_analysis_fast.py

Configuration Options

You can modify the script to:

  • Use a subset of data for testing
  • Change the text column name
  • Adjust batch sizes for your hardware

Edit the configuration section in the main() function:

# Configuration
csv_file = 'Amazon_Fashion.csv'
text_column = 'text'
sample_size = None  # Set to 10000 to test with first 10K records

Output Files

The script generates several report files in the ./reports directory:

1. Main Analysis Report
  • File: amazon_fashion_ner_report_YYYYMMDD_HHMMSS.txt
  • Content:
    • Dataset statistics
    • Entity counts by type
    • Overall entity density analysis
    • Sample entities for each type
2. Top 200 High Entity Texts Report
  • File: top_200_high_entity_texts_YYYYMMDD_HHMMSS.txt
  • Content:
    • 200 texts with the highest entity counts
    • Entity density for each text
    • List of entities found in each text
3. Entity Density Analysis Report
  • File: entity_density_analysis_YYYYMMDD_HHMMSS.txt
  • Content:
    • Detailed density statistics (mean, median, percentiles)
    • Density distribution analysis
    • Top 50 texts by entity density
4. Visualizations
  • Files:
    • entity_distribution_YYYYMMDD_HHMMSS.png
    • entity_density_histogram_YYYYMMDD_HHMMSS.png
    • entity_count_vs_density_YYYYMMDD_HHMMSS.png

Entity Types Detected

The script identifies and categorizes entities into:

  • PER: Person names (e.g., "John Smith", "Dr. Emily Brown")
  • ORG: Organizations (e.g., "Nike", "Adidas", "Amazon")
  • LOC: Locations (e.g., "New York", "California", "Paris")
  • MISC: Miscellaneous entities that don't fit other categories

Entity Density Analysis

Entity density is calculated as:

Entity Density = Number of Entities / Number of Tokens

The analysis provides:

  • Statistical measures: Mean, median, standard deviation, percentiles
  • Distribution categories:
    • Low density (< 0.01): Minimal entity presence
    • Medium density (0.01-0.05): Moderate entity presence
    • High density (≥ 0.05): High entity presence

Performance Considerations

For Large Datasets (10k+ records)

  1. Memory Usage: The script processes data in batches to manage memory
  2. Processing Time: Expect 2-4 hours for full dataset analysis
  3. Caching: Results are cached for faster subsequent runs
  4. Hardware Requirements:
    • Minimum 8GB RAM
    • Multi-core CPU recommended
    • SSD storage for faster I/O

Optimization Tips

  1. Test with Subset: Set sample_size = 10000 to test with first 10K records
  2. Adjust Batch Size: Modify batch_size in _process_entities_batch() method
  3. CPU Threads: Adjust torch.set_num_threads() based on your CPU cores

Sample Output

Main Report Excerpt
================================================================================
AMAZON FASHION DATASET - NAMED ENTITY RECOGNITION ANALYSIS
================================================================================
Generated on: 2024-01-15 14:30:25

DATASET INFORMATION
----------------------------------------
Total texts analyzed: 2,500,000
Total tokens: 45,678,901
Total entities found: 1,234,567
Average entities per text: 0.49

ENTITY STATISTICS
----------------------------------------
Average entity density: 0.0270
Risk level: low

Entities by type:
  PER: 456,789
  ORG: 345,678
  LOC: 234,567
  MISC: 197,533

ENTITY DENSITY ANALYSIS
----------------------------------------
Mean density: 0.0270
Median density: 0.0150
Standard deviation: 0.0450
Min density: 0.0000
Max density: 0.5000

Density percentiles:
  25th percentile: 0.0050
  50th percentile: 0.0150
  75th percentile: 0.0350
  90th percentile: 0.0650
  95th percentile: 0.0950
  99th percentile: 0.1850

Density distribution:
  Low density (< 0.01): 1,250,000 texts
  Medium density (0.01-0.05): 875,000 texts
  High density (≥ 0.05): 375,000 texts

Top 200 Report Excerpt
================================================================================
TOP 200 TEXTS WITH HIGHEST ENTITY COUNTS
================================================================================
Generated on: 2024-01-15 14:30:25

  1. Entity Count: 15
     Entity Density: 0.2500
     Text: "Nike Air Max 90 shoes designed by John Smith in Portland, Oregon..."
     Entities: Nike (ORG), Air Max 90 (MISC), John Smith (PER), Portland (LOC), Oregon (LOC)
--------------------------------------------------------------------------------

  2. Entity Count: 12
     Entity Density: 0.2000
     Text: "Adidas Ultraboost running shoes from Germany, designed by Dr. Sarah Johnson..."
     Entities: Adidas (ORG), Ultraboost (MISC), Germany (LOC), Dr. Sarah Johnson (PER)
--------------------------------------------------------------------------------

Troubleshooting

Common Issues

  1. Memory Errors: Reduce batch size or use a subset of data
  2. Slow Processing: First run is slower due to model loading
  3. File Not Found: Ensure Amazon_Fashion.csv is in the correct directory
  4. Column Not Found: Check that the "text" column exists in your CSV

Error Messages

  • "Text column 'text' not found": Verify column name in your CSV file
  • "Failed to load Flair NER model": Check internet connection for model download
  • Memory errors: Reduce sample_size or batch_size

Advanced Usage

Custom Analysis

You can use the analyzer class directly for custom analysis:

from amazon_fashion_ner_analysis_fast import AmazonFashionNERAnalyzer

# Initialize analyzer
analyzer = AmazonFashionNERAnalyzer()

# Analyze with custom parameters
results = analyzer.analyze_dataset(
    csv_file='your_data.csv',
    text_column='your_text_column',
    sample_size=50000
)

# Generate custom reports
analyzer.generate_report(results, output_dir='./custom_reports')

Batch Processing for Very Large Datasets

For extremely large datasets, you can process in chunks:

import pandas as pd

# Process in chunks of 100K records
chunk_size = 100000
total_records = 2500000

for start_idx in range(0, total_records, chunk_size):
    end_idx = min(start_idx + chunk_size, total_records)
    # Read only rows [start_idx, end_idx); skiprows keeps the header row and skips earlier data rows
    chunk = pd.read_csv('Amazon_Fashion.csv', skiprows=range(1, start_idx + 1), nrows=chunk_size)
    # Process the chunk here (e.g. run the analyzer on it) and save intermediate results

Dependencies

  • pandas: Data manipulation and CSV reading
  • numpy: Numerical computations and statistics
  • torch: PyTorch for deep learning (Flair dependency)
  • flair: Flair NLP library for named entity recognition
  • tqdm: Progress bars for long-running operations
  • matplotlib: Plotting and visualization
  • seaborn: Enhanced plotting and statistical visualizations

Contributing

As we implement more evaluation metrics, this README will be updated with additional documentation for each component.

License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 SynEval Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
