ODU-Storymodelers/RIF-Index

Relative Importance Factor (RIF) Analysis Framework

Overview

This repository contains a comprehensive framework for analyzing categorical data that follow power-law distributions using the Relative Importance Factor (RIF) methodology. The RIF index quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities, providing insights into the hierarchical structure of concepts within any domain that exhibits power-law behavior.

Repository Structure

rif-index/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── data/                        # Input data files
│   ├── papers.csv               # Research papers dataset
│   └── thesaurus_terms.txt      # Thesaurus for keyword normalization
├── utils/                       # Core utility modules
│   ├── keywords_extraction.py   # Keyword extraction and cleaning
│   ├── powerlaw_science.py      # Power-law distribution analysis
│   └── rif_index.py             # RIF calculation and visualization
├── example/                     # Example implementations
│   ├── case1-explanation.py     # Basic RIF analysis example
│   ├── case1-rif.py             # Single dataset RIF analysis
│   ├── case2-explanation.py     # Comparative RIF analysis example
│   ├── case2-rif.py             # Multi-group RIF analysis
│   └── *-output/                # Generated outputs from examples
└── social-resilience/           # Case study: Social resilience research
  ├── code/                    # Analysis pipeline scripts
  │   ├── 1_keywords-dataset.py    # Extract and clean keywords
  │   ├── 2_powerlaw-keywords.py   # Power-law analysis
  │   └── 3_rif-keywords.py        # RIF index calculation
  └── output/                  # Generated analysis results
    ├── keywords/            # Extracted keyword datasets
    ├── powerlaw/            # Power-law analysis results
    └── rif/                 # RIF analysis results and visualizations

Methodology

The RIF analysis framework follows a structured nine-step methodology for analyzing term distributions in various domains:

RIF Analysis Workflow

Figure 1: Workflow diagram of the Relative Importance Factor (RIF) analysis methodology showing the comprehensive analytical process.

The methodology systematically processes data through: (1) data collection, (2) variable selection with categorical validation, (3) frequency estimation, (4) total count calculation, (5) relative frequency computation, (6) ranking assignment, (7) power-law model fitting with statistical validation, (8) RIF index computation for valid models, and (9) visualization through matrices and networks. This ensures that only statistically valid power-law distributions proceed to RIF analysis, providing robust insights into the relative importance relationships between concepts.

1. Keyword Extraction and Cleaning (for Case Study)

The framework begins by extracting keywords from scientific literature and applying several cleaning steps:

  • Thesaurus normalization: Standardizes synonymous terms
  • Country name removal: Filters out geographical references
  • Frequency-based filtering: Removes low-frequency terms
  • Duplicate merging: Consolidates identical terms
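
The cleaning steps above can be sketched as follows. This is a minimal illustration, not the code in `utils/keywords_extraction.py`: the `THESAURUS` and `COUNTRIES` tables, the `clean_keywords` function, and the `min_count` threshold are all hypothetical names and values chosen for the example.

```python
from collections import Counter

# Hypothetical lookup tables for illustration only. The repository's real
# thesaurus lives in data/thesaurus_terms.txt, whose format is not shown here.
THESAURUS = {"resiliency": "resilience"}
COUNTRIES = {"china", "india", "united states"}


def clean_keywords(raw_keywords, min_count=2):
    """Normalize, filter, and count keywords; return {keyword: frequency}."""
    counts = Counter()
    for keyword in raw_keywords:
        term = keyword.strip().lower()
        term = THESAURUS.get(term, term)   # thesaurus normalization
        if not term or term in COUNTRIES:  # country name removal
            continue
        counts[term] += 1                  # identical terms are merged here
    # Frequency-based filtering: drop terms seen fewer than min_count times.
    return {t: c for t, c in counts.items() if c >= min_count}


raw = ["Resilience", "resiliency", "China", "adaptation", "Adaptation ", "vulnerability"]
cleaned = clean_keywords(raw, min_count=2)
print(cleaned)
```

Counting after normalization means synonyms and case variants are merged before the frequency filter is applied, so a term is not dropped merely because its spellings were split across variants.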

2. Power-Law Distribution Analysis

This step analyzes the frequency distribution of keywords to establish a theoretical baseline:

  • Parameter Estimation: Uses maximum likelihood estimation to fit power-law parameters (α, xmin)
  • Goodness-of-Fit Testing: Kolmogorov-Smirnov test with bootstrap significance testing
  • Log-Log Visualization: Generates rank-frequency plots for visual inspection
  • Statistical Validation: Computes p-values to assess power-law fit quality
  • Threshold Selection: Determines minimum frequency (xmin) for power-law regime
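
To give a feel for the parameter-estimation step, the sketch below estimates the exponent for a fixed threshold using only the standard library. It is not the procedure in `utils/powerlaw_science.py`: it assumes `xmin` is already known, uses the common closed-form approximation to the discrete maximum likelihood estimate, and omits the Kolmogorov-Smirnov and bootstrap steps. The function name `estimate_alpha` and the sample frequencies are made up for illustration.

```python
import math


def estimate_alpha(frequencies, xmin):
    """Approximate discrete power-law MLE for the exponent, given xmin.

    Uses alpha ~= 1 + n / sum(ln(x_i / (xmin - 0.5))) over the tail x_i >= xmin.
    Returns (alpha, n_tail).
    """
    if xmin < 1:
        raise ValueError("xmin must be at least 1 for count data")
    tail = [x for x in frequencies if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum, len(tail)


# Made-up keyword frequencies, for illustration only.
sample = [50, 20, 10, 5, 5, 3, 2, 2, 1, 1]
alpha, n_tail = estimate_alpha(sample, xmin=2)
print(f"alpha ~= {alpha:.3f} from {n_tail} tail observations")
```

In practice `xmin` is itself chosen by scanning candidate thresholds and keeping the one with the smallest Kolmogorov-Smirnov distance, which is the kind of search a dedicated fitting library performs.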

3. RIF Index Calculation

The Relative Importance Factor (RIF) quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities. The RIF index is calculated as:

RIF_i = P_theoretical,1 / P_theoretical,i

Where:

  • P_theoretical,i is the theoretical probability of keyword i under the fitted power-law distribution
  • P_theoretical,1 is the theoretical probability of the highest-ranked keyword (rank = 1)
  • The RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r shows relative importance between any two concepts
  • RIF values ≥ 1 indicate how many times more important concept s is than concept r
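
A short derivation may help here. Assuming the rank-based form P(r) = A × r^(-θ) used throughout this README, and reading s and r as ranks, the normalization constant cancels in every ratio:

```latex
\Phi(s, r)
  = \frac{P_{\text{theoretical},s}}{P_{\text{theoretical},r}}
  = \frac{A\, s^{-\theta}}{A\, r^{-\theta}}
  = \left(\frac{r}{s}\right)^{\theta},
\qquad
\mathrm{RIF}_i = \Phi(1, r_i) = r_i^{\theta}.
```

For example, with θ = 1 the keyword at rank 4 has RIF = 4, and with θ = 2 it has RIF = 16. Under this form the RIF values depend only on the ranks and the exponent, not on the number of keywords retained.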

Algorithm Overview:

  1. Frequency Filtering: Filter data based on minimum frequency threshold (xmin)
  2. Rank Assignment: Assign ranks 1 to n based on frequency ordering
  3. Theoretical Probabilities: Calculate P(r) = A × r^(-θ) where:
    • A = 1/Σ(r^(-θ)) is the normalization constant
    • θ is the power-law exponent from previous analysis
  4. RIF Calculation: Compute RIF_i = P_theoretical,1 / P_theoretical,i for each keyword
  5. Empirical Fitting: Fit log-log regression to validate: log(frequency) = log(δ) + β × log(rank)
  6. Parameter Relationships: Calculate α = θ/(-β) and additional normalization constants
  7. Matrix Generation: Create pairwise RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r for s ≤ r
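
Steps 1 to 4 and step 7 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the implementation in `utils/rif_index.py`: the function name `rif_table`, the alphabetical tie-breaking rule, the use of `None` for cells outside s ≤ r, and the sample frequencies are assumptions made for the example.

```python
def rif_table(frequencies, theta, xmin=1):
    """Compute ranks, theoretical probabilities, RIF values, and the RIF matrix.

    frequencies: dict mapping term -> count
    theta: power-law exponent (assumed to come from a prior fit)
    """
    # Step 1: keep only terms at or above the minimum frequency threshold.
    kept = {t: f for t, f in frequencies.items() if f >= xmin}
    if not kept:
        raise ValueError("no terms at or above xmin")
    # Step 2: rank 1 is the most frequent term (ties broken alphabetically,
    # which is an assumption of this sketch).
    ranked = sorted(kept, key=lambda t: (-kept[t], t))
    n = len(ranked)
    # Step 3: P(r) = A * r^(-theta), with A = 1 / sum(r^(-theta)).
    weights = [r ** -theta for r in range(1, n + 1)]
    A = 1.0 / sum(weights)
    probs = [A * w for w in weights]
    # Step 4: RIF_i = P_1 / P_i.
    rif = [probs[0] / p for p in probs]
    # Step 7: Phi(s, r) = P_s / P_r for s <= r; other cells are left as None.
    matrix = [
        [probs[s] / probs[r] if s <= r else None for r in range(n)]
        for s in range(n)
    ]
    return ranked, probs, rif, matrix


# Made-up frequencies, for illustration only.
freqs = {"resilience": 40, "adaptation": 20, "vulnerability": 10, "rare": 1}
ranked, probs, rif, matrix = rif_table(freqs, theta=1.0, xmin=2)
for term, p, value in zip(ranked, probs, rif):
    print(f"{term:15s} P={p:.4f} RIF={value:.2f}")
```

The probabilities sum to 1 by construction, and the first row of the matrix reproduces the RIF values, since RIF_i is the special case of Φ(s,r) with s at rank 1.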

4. Visualization and Analysis

The framework generates comprehensive outputs for interpretation:

  • RIF Matrix Heatmaps: Lower-triangular matrices showing pairwise RIF relationships
  • Network Graphs: Visualizations of keyword relationships and hierarchies
  • Comparative Analyses: Side-by-side comparisons of different datasets or groups
  • Statistical Summaries: Detailed tables with RIF indices, probabilities, and fit parameters

Mathematical Formulation

The RIF methodology follows a structured mathematical approach:

Power-Law Model

For a set of keywords with frequencies, the theoretical power-law distribution is:

P(r) = A × r^(-θ)

Where:

  • r is the rank (1, 2, 3, ...)
  • θ is the power-law exponent parameter
  • A is the normalization constant: A = 1/Σ(r^(-θ))

Empirical Fitting

The empirical data is fitted using log-log regression:

log(frequency) = log(δ) + β × log(rank)

Where:

  • δ is the scaling factor
  • β is the empirical slope (negative for power-law decay)
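
The regression can be sketched with ordinary least squares on the log-transformed values. The sketch below is a standard-library illustration, not the repository's code: it assumes natural logarithms, the function name `fit_loglog` is hypothetical, and the value of θ used at the end is a made-up placeholder for the exponent obtained in the power-law step.

```python
import math


def fit_loglog(frequencies):
    """Least-squares fit of log(frequency) = log(delta) + beta * log(rank).

    Returns (delta, beta). Frequencies must be positive; at least two are needed.
    """
    ordered = sorted(frequencies, reverse=True)  # rank 1 = highest frequency
    if len(ordered) < 2 or ordered[-1] <= 0:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    return math.exp(intercept), beta


# Synthetic data that follow frequency = 100 / rank exactly.
synthetic = [100 / r for r in range(1, 6)]
delta, beta = fit_loglog(synthetic)

theta = 1.2                # hypothetical exponent from the power-law step
alpha = theta / (-beta)    # alpha = theta / (-beta)
print(f"delta={delta:.2f} beta={beta:.2f} alpha={alpha:.2f}")
```

On real keyword counts the points will not lie exactly on a line, so the recovered β is only an empirical summary of the decay and should be read alongside the goodness-of-fit results.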

RIF Index Computation

The RIF index measures relative importance using theoretical power-law probabilities:

RIF_i = P_theoretical,1 / P_theoretical,i

Where theoretical probabilities are calculated as:

P_theoretical,i = A × r_i^(-θ)
A = 1 / Σ(r^(-θ))  (normalization constant)

For pairwise comparisons, the RIF matrix is:

Φ(s,r) = P_theoretical,s / P_theoretical,r  for s ≤ r

This creates a lower-triangular matrix whose entries are all ≥ 1; larger values indicate higher relative importance of concept s over concept r.

Additional parameters computed:

  • α = θ/(-β): Relationship between theoretical (θ) and empirical (β) exponents
  • δ = exp(intercept): Scaling factor from log-log regression
  • Normalization constants B and C: Additional scaling factors for comprehensive analysis

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Setup

  1. Clone the repository:
git clone <repository-url>
cd RIF-Index
  2. Install dependencies:
pip install -r requirements.txt

Key Dependencies

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • matplotlib: Plotting and visualization
  • scipy: Scientific computing
  • networkx: Network analysis
  • powerlaw: Power-law distribution fitting
  • bertopic: Topic modeling (for advanced analyses)
  • scikit-learn: Machine learning utilities

Usage

Synthetic Data Analysis

Basic RIF Analysis (Case 1)

# Run the basic example
python example/case1-rif.py

This demonstrates RIF analysis on a single dataset, generating:

  • RIF matrix visualization
  • Network graph of keyword relationships
  • Statistical summary table

Comparative RIF Analysis (Case 2)

# Run the comparative example
python example/case2-rif.py

This shows how to compare RIF patterns across different groups or time periods.

Case Study: Full Analysis Pipeline

For a complete analysis of your own dataset, follow these steps:

Step 1: Prepare Your Data

Ensure your dataset is in CSV format with columns for:

  • Author Keywords
  • Index Keywords
  • Other relevant metadata
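
Before running the pipeline, it can help to confirm that the keyword columns parse as expected. The sketch below reads a tiny in-memory CSV with the standard library. The sample rows are invented, the `read_keywords` helper is hypothetical, and the semicolon separator is an assumption about how multiple keywords are stored in one cell; adjust it to match your export.

```python
import csv
import io

# Invented sample data with the column names listed above.
SAMPLE_CSV = """Title,Author Keywords,Index Keywords
Paper A,resilience; adaptation,climate change; human
Paper B,Resilience; vulnerability,
"""


def read_keywords(handle, column, sep=";"):
    """Collect lower-cased keywords from one column of a CSV file handle."""
    keywords = []
    for row in csv.DictReader(handle):
        cell = row.get(column) or ""  # tolerate missing or empty cells
        keywords.extend(
            part.strip().lower() for part in cell.split(sep) if part.strip()
        )
    return keywords


author_keywords = read_keywords(io.StringIO(SAMPLE_CSV), "Author Keywords")
index_keywords = read_keywords(io.StringIO(SAMPLE_CSV), "Index Keywords")
print(author_keywords)
print(index_keywords)
```

For a file on disk, pass `open("data/papers.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.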

Step 2: Extract and Clean Keywords

python social-resilience/code/1_keywords-dataset.py

This script:

  • Extracts keywords from your dataset
  • Applies thesaurus normalization
  • Removes geographical terms
  • Saves cleaned keyword datasets

Step 3: Analyze Power-Law Distributions

python social-resilience/code/2_powerlaw-keywords.py

This performs:

  • Power-law parameter estimation
  • Goodness-of-fit testing
  • Log-log plot generation
  • Statistical significance testing

Step 4: Calculate RIF Indices

python social-resilience/code/3_rif-keywords.py

This generates:

  • RIF index calculations
  • Matrix and network visualizations
  • Comparative analysis reports
  • Publication-ready figures

Output Files

Keywords Analysis

  • author_keywords.csv: Cleaned author-assigned keywords
  • index_keywords.csv: Cleaned index terms
  • Frequency distributions and summary statistics

Power-Law Analysis

  • *_powerlaw_summary.csv: Statistical parameters and fit quality
  • *_loglog_plot.png: Log-log distribution plots
  • Bootstrap test results and p-values

RIF Analysis

  • *_rif.csv: RIF indices and statistical measures
  • *_rif_matrix.png: Heatmap visualizations
  • *_rif_network.png: Network graph visualizations
  • *_rif_plots.pdf: Comprehensive analysis reports

Contributing

We welcome contributions to improve the framework:

  1. Bug Reports: Submit issues with detailed descriptions
  2. Feature Requests: Propose new functionality or improvements
  3. Code Contributions: Submit pull requests with enhancements
  4. Documentation: Help improve documentation and examples

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, suggestions, or collaboration opportunities, please contact:

Primary Contact:

  • Brian Llinas - bllin001@odu.edu
    Computer Science Department & Virginia Modeling, Analysis, and Simulation Center (VMASC)
    Old Dominion University, Norfolk/Suffolk, Virginia, USA

Co-authors:

Acknowledgments

This research is funded by grant number N000141912624 from the Office of Naval Research through the Minerva Research Initiative and by grant number P116S210003 from the US Department of Education.
