Relative Importance Factor (RIF) Analysis Framework

Overview

This repository contains a comprehensive framework for analyzing categorical data that follow power-law distributions using the Relative Importance Factor (RIF) methodology. The RIF index quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities, providing insights into the hierarchical structure of concepts within any domain that exhibits power-law behavior.

Repository Structure

rif-index/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── data/                        # Input data files
│   ├── papers.csv               # Research papers dataset
│   └── thesaurus_terms.txt      # Thesaurus for keyword normalization
├── utils/                       # Core utility modules
│   ├── keywords_extraction.py   # Keyword extraction and cleaning
│   ├── powerlaw_science.py      # Power-law distribution analysis
│   └── rif_index.py             # RIF calculation and visualization
├── example/                     # Example implementations
│   ├── case1-explanation.py     # Basic RIF analysis example
│   ├── case1-rif.py             # Single dataset RIF analysis
│   ├── case2-explanation.py     # Comparative RIF analysis example
│   ├── case2-rif.py             # Multi-group RIF analysis
│   └── *-output/                # Generated outputs from examples
└── social-resilience/           # Case study: Social resilience research
    ├── code/                    # Analysis pipeline scripts
    │   ├── 1_keywords-dataset.py    # Extract and clean keywords
    │   ├── 2_powerlaw-keywords.py   # Power-law analysis
    │   └── 3_rif-keywords.py        # RIF index calculation
    └── output/                  # Generated analysis results
        ├── keywords/            # Extracted keyword datasets
        ├── powerlaw/            # Power-law analysis results
        └── rif/                 # RIF analysis results and visualizations

Methodology

The RIF analysis framework follows a structured nine-step methodology for analyzing term distributions in various domains:

Figure 1: Workflow diagram of the Relative Importance Factor (RIF) analysis methodology showing the comprehensive analytical process.

The methodology systematically processes data through: (1) data collection, (2) variable selection with categorical validation, (3) frequency estimation, (4) total count calculation, (5) relative frequency computation, (6) ranking assignment, (7) power-law model fitting with statistical validation, (8) RIF index computation for valid models, and (9) visualization through matrices and networks. This ensures that only statistically valid power-law distributions proceed to RIF analysis, providing robust insights into the relative importance relationships between concepts.
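Steps 3 to 6 (frequency estimation, total count, relative frequency, ranking) can be sketched with the standard library alone. The `observations` list is made-up data for illustration, and the table layout is an assumption rather than the format used by the scripts in this repository.

```python
from collections import Counter

# Hypothetical categorical observations (not taken from the repository data).
observations = ["a", "b", "a", "c", "a", "b"]

frequencies = Counter(observations)   # step 3: frequency of each category
total = sum(frequencies.values())     # step 4: total count

# Steps 5 and 6: relative frequency and rank (1 = most frequent).
table = [
    {
        "rank": rank,
        "category": category,
        "frequency": frequency,
        "relative_frequency": frequency / total,
    }
    for rank, (category, frequency) in enumerate(frequencies.most_common(), start=1)
]

for row in table:
    print(row)
```

Ties are ranked in first-seen order here; a real analysis may need an explicit tie-breaking rule.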

1. Keyword Extraction and Cleaning (for Case Study)

The framework begins by extracting keywords from scientific literature and applying several cleaning steps:

Thesaurus normalization: Standardizes synonymous terms
Country name removal: Filters out geographical references
Frequency-based filtering: Removes low-frequency terms
Duplicate merging: Consolidates identical terms
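The cleaning steps above might look roughly like the following. This is a minimal sketch, not the code in `utils/keywords_extraction.py`: the function name `clean_keywords`, the tiny thesaurus, the country list, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical lookup tables, for illustration only.
THESAURUS = {"resiliency": "resilience"}
COUNTRIES = {"colombia", "united states"}


def clean_keywords(raw_keywords, thesaurus, countries, min_count=2):
    """Return counts of cleaned keywords that occur at least min_count times."""
    counts = Counter()
    for keyword in raw_keywords:
        term = keyword.strip().lower()
        if not term:
            continue
        term = thesaurus.get(term, term)  # thesaurus normalization
        if term in countries:             # country name removal
            continue
        counts[term] += 1                 # identical terms merge into one count
    # Frequency-based filtering: drop terms below the threshold.
    return Counter({t: c for t, c in counts.items() if c >= min_count})


raw = ["Resilience", "resiliency", "Colombia", "migration", "Migration ", "adaptation"]
cleaned = clean_keywords(raw, THESAURUS, COUNTRIES)
print(cleaned)
```

Normalizing before counting means that synonyms contribute to a single frequency, which matters because the later power-law fit is sensitive to how the counts are split.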

2. Power-Law Distribution Analysis

This step analyzes the frequency distribution of keywords to establish a theoretical baseline:

Parameter Estimation: Uses maximum likelihood estimation to fit power-law parameters (α, xmin)
Goodness-of-Fit Testing: Kolmogorov-Smirnov test with bootstrap significance testing
Log-Log Visualization: Generates rank-frequency plots for visual inspection
Statistical Validation: Computes p-values to assess power-law fit quality
Threshold Selection: Determines minimum frequency (xmin) for power-law regime
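To make the estimation and goodness-of-fit steps concrete, here is a standard-library sketch. It is not the repository's implementation (which lists the `powerlaw` package as a dependency). It uses the well-known approximation α ≈ 1 + n / Σ ln(x / (xmin − 0.5)) for discrete data, treats the model as a continuous power law rounded to integers, takes `xmin` as given rather than selecting it, and omits the bootstrap p-value. The frequencies are invented.

```python
import math


def fit_alpha(values, xmin):
    """Approximate discrete MLE of the exponent for values >= xmin."""
    tail = [x for x in values if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum, tail


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the rounded power-law CDF."""
    n = len(tail)
    worst = 0.0
    for x in sorted(set(tail)):
        empirical = sum(1 for v in tail if v <= x) / n
        model = 1.0 - ((x + 0.5) / (xmin - 0.5)) ** (1.0 - alpha)
        worst = max(worst, abs(empirical - model))
    return worst


# Hypothetical keyword frequencies, for illustration only.
frequencies = [120, 64, 41, 30, 22, 18, 15, 12, 10, 9, 8, 7, 6, 6, 5, 5, 4, 3, 2, 1]
alpha, tail = fit_alpha(frequencies, xmin=5)
distance = ks_distance(tail, xmin=5, alpha=alpha)
print(f"alpha = {alpha:.3f}, KS distance = {distance:.3f}, tail size = {len(tail)}")
```

In a full analysis, `xmin` would typically be chosen by repeating this fit over candidate thresholds and keeping the one with the smallest KS distance.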

3. RIF Index Calculation

The Relative Importance Factor (RIF) quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities. The RIF index is calculated as:

RIF_i = P_theoretical,1 / P_theoretical,i

Where:

P_theoretical,i is the theoretical probability of keyword i under the fitted power-law distribution
P_theoretical,1 is the theoretical probability of the highest-ranked keyword (rank = 1)
The RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r shows relative importance between any two concepts
RIF values ≥ 1 indicate how many times more important concept s is than concept r

Algorithm Overview:

Frequency Filtering: Filter data based on minimum frequency threshold (xmin)
Rank Assignment: Assign ranks 1 to n based on frequency ordering
Theoretical Probabilities: Calculate P(r) = A × r^(-θ) where:
- A = 1/Σ(r^(-θ)) is the normalization constant
- θ is the power-law exponent from previous analysis
RIF Calculation: Compute RIF_i = P_theoretical,1 / P_theoretical,i for each keyword
Empirical Fitting: Fit log-log regression to validate: log(frequency) = log(δ) + β × log(rank)
Parameter Relationships: Calculate α = θ/(-β) and additional normalization constants
Matrix Generation: Create pairwise RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r for s ≤ r
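The core of the algorithm (filtering, ranking, theoretical probabilities, RIF indices, pairwise matrix) can be sketched as follows. The function names, keyword counts, `xmin`, and `theta` are hypothetical and are not taken from `utils/rif_index.py`. The regression step is left out.

```python
def theoretical_probabilities(n, theta):
    """P(r) = A * r**(-theta) for ranks 1..n, with A = 1 / sum(r**(-theta))."""
    weights = [r ** (-theta) for r in range(1, n + 1)]
    normalizer = 1.0 / sum(weights)  # the constant A
    return [normalizer * w for w in weights]


def rif_indices(probabilities):
    """RIF_i = P_1 / P_i for every rank i."""
    top = probabilities[0]
    return [top / p for p in probabilities]


def rif_matrix(probabilities):
    """Phi(s, r) = P_s / P_r for s <= r; all other cells are None.

    Cells are stored by (s, r) index; the plotting orientation used by the
    repository (described as lower-triangular) is not reproduced here.
    """
    n = len(probabilities)
    return [[probabilities[s] / probabilities[r] if s <= r else None
             for r in range(n)] for s in range(n)]


# Hypothetical keyword counts and exponent, for illustration only.
counts = {"resilience": 40, "community": 22, "migration": 15, "policy": 9, "health": 3}
xmin, theta = 5, 1.0

# Frequency filtering, then rank assignment by descending frequency.
kept = sorted((kw for kw, c in counts.items() if c >= xmin), key=lambda kw: -counts[kw])
probs = theoretical_probabilities(len(kept), theta)
rif = rif_indices(probs)
phi = rif_matrix(probs)

for rank, (kw, p, value) in enumerate(zip(kept, probs, rif), start=1):
    print(rank, kw, round(p, 4), round(value, 4))
```

Because the theoretical probabilities depend only on rank and θ, the observed counts influence the result only through the filtering and the ordering.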

4. Visualization and Analysis

The framework generates comprehensive outputs for interpretation:

RIF Matrix Heatmaps: Lower-triangular matrices showing pairwise RIF relationships
Network Graphs: Visualizations of keyword relationships and hierarchies
Comparative Analyses: Side-by-side comparisons of different datasets or groups
Statistical Summaries: Detailed tables with RIF indices, probabilities, and fit parameters

Mathematical Formulation

The RIF methodology follows a structured mathematical approach:

Power-Law Model

For a set of keywords with frequencies, the theoretical power-law distribution is:

P(r) = A × r^(-θ)

Where:

r is the rank (1, 2, 3, ...)
θ is the power-law exponent parameter
A is the normalization constant: A = 1/Σ(r^(-θ))

Empirical Fitting

The empirical data is fitted using log-log regression:

log(frequency) = log(δ) + β × log(rank)

Where:

δ is the scaling factor
β is the empirical slope (negative for power-law decay)
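The regression above is an ordinary least-squares fit in log-log space. The sketch below uses only the standard library; `loglog_fit` is a hypothetical helper, the frequencies are synthetic (generated exactly from a power law so the recovered slope is easy to check), and the value of `theta` is assumed purely to illustrate the ratio α = θ/(-β).

```python
import math


def loglog_fit(frequencies):
    """Least-squares fit of log(frequency) = log(delta) + beta * log(rank)."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean
    return math.exp(intercept), beta  # delta = exp(intercept), beta = slope


# Synthetic frequencies: exactly 100 * rank**(-1.5) for ranks 1..10.
frequencies = [100 * rank ** -1.5 for rank in range(1, 11)]
delta, beta = loglog_fit(frequencies)

theta = 1.5                # assumed theoretical exponent, for illustration
alpha = theta / (-beta)    # ratio of theoretical to empirical exponent
print(f"delta = {delta:.3f}, beta = {beta:.3f}, alpha = {alpha:.3f}")
```

With real data the points will not lie exactly on a line, so β will differ from -θ and α will move away from 1.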

RIF Index Computation

The RIF index measures relative importance using theoretical power-law probabilities:

RIF_i = P_theoretical,1 / P_theoretical,i

Where theoretical probabilities are calculated as:

P_theoretical,i = A × r_i^(-θ)
A = 1 / Σ(r^(-θ))  (normalization constant)

For pairwise comparisons, the RIF matrix is:

Φ(s,r) = P_theoretical,s / P_theoretical,r  for s ≤ r

Additional parameters computed:

α = θ/(-β): Relationship between theoretical (θ) and empirical (β) exponents
δ = exp(intercept): Scaling factor from log-log regression
Normalization constants B and C: Additional scaling factors for comprehensive analysis

The pairwise matrix Φ(s,r) is lower-triangular, and its values (≥ 1) indicate how many times more important concept s is than concept r.
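One consequence of the definitions above is worth spelling out: the normalization constant A cancels in every ratio, so both the RIF index and the matrix entries depend only on the ranks and θ. The derivation below uses only the symbols already defined in this section.

```latex
\begin{align*}
\mathrm{RIF}_i &= \frac{P_{\text{theoretical},1}}{P_{\text{theoretical},i}}
  = \frac{A \cdot 1^{-\theta}}{A \cdot r_i^{-\theta}}
  = r_i^{\theta}, \\
\Phi(s, r) &= \frac{P_{\text{theoretical},s}}{P_{\text{theoretical},r}}
  = \frac{A \cdot s^{-\theta}}{A \cdot r^{-\theta}}
  = \left(\frac{r}{s}\right)^{\theta}, \qquad s \le r.
\end{align*}
```

For example, with θ = 2 the keyword at rank 3 has RIF = 3² = 9, and Φ(2,4) = (4/2)² = 4. For θ > 0 and s ≤ r, the ratio r/s is at least 1, which is why every entry of the matrix is ≥ 1. The RIF index is the first row of the matrix: RIF_i = Φ(1, r_i).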

Installation

Prerequisites

Python 3.8 or higher
pip package manager

Setup

Clone the repository:

git clone <repository-url>
cd rif-index

Install dependencies:

pip install -r requirements.txt

Key Dependencies

pandas: Data manipulation and analysis
numpy: Numerical computing
matplotlib: Plotting and visualization
scipy: Scientific computing
networkx: Network analysis
powerlaw: Power-law distribution fitting
bertopic: Topic modeling (for advanced analyses)
scikit-learn: Machine learning utilities

Usage

Synthetic Data Analysis

Basic RIF Analysis (Case 1)

# Run the basic example
python example/case1-rif.py

This demonstrates RIF analysis on a single dataset, generating:

RIF matrix visualization
Network graph of keyword relationships
Statistical summary table

Comparative RIF Analysis (Case 2)

# Run the comparative example
python example/case2-rif.py

This shows how to compare RIF patterns across different groups or time periods.

Case Study: Full Analysis Pipeline

For a complete analysis of your own dataset, follow these steps:

Step 1: Prepare Your Data

Ensure your dataset is in CSV format with columns for:

Author Keywords
Index Keywords
Other relevant metadata
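As an illustration of the expected input, the sketch below counts keywords from one such column. The in-memory sample, the `count_keywords` helper, and the semicolon separator are assumptions; check the delimiter your export actually uses before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row export; a real run would open the CSV file instead.
SAMPLE = io.StringIO(
    "Title,Author Keywords,Index Keywords\n"
    "Paper A,resilience; migration,human; adaptation\n"
    "Paper B,Resilience; policy,\n"
)


def count_keywords(handle, column, separator=";"):
    """Count lower-cased keywords found in one column of a CSV file."""
    counts = Counter()
    for row in csv.DictReader(handle):
        for keyword in (row.get(column) or "").split(separator):
            keyword = keyword.strip().lower()
            if keyword:  # skip empty cells and trailing separators
                counts[keyword] += 1
    return counts


author_counts = count_keywords(SAMPLE, "Author Keywords")
print(author_counts.most_common())
```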

Step 2: Extract and Clean Keywords

python social-resilience/code/1_keywords-dataset.py

This script:

Extracts keywords from your dataset
Applies thesaurus normalization
Removes geographical terms
Saves cleaned keyword datasets

Step 3: Analyze Power-Law Distributions

python social-resilience/code/2_powerlaw-keywords.py

This performs:

Power-law parameter estimation
Goodness-of-fit testing
Log-log plot generation
Statistical significance testing

Step 4: Calculate RIF Indices

python social-resilience/code/3_rif-keywords.py

This generates:

RIF index calculations
Matrix and network visualizations
Comparative analysis reports
Publication-ready figures

Output Files

Keywords Analysis

author_keywords.csv: Cleaned author-assigned keywords
index_keywords.csv: Cleaned index terms
Frequency distributions and summary statistics

Power-Law Analysis

*_powerlaw_summary.csv: Statistical parameters and fit quality
*_loglog_plot.png: Log-log distribution plots
Bootstrap test results and p-values

RIF Analysis

*_rif.csv: RIF indices and statistical measures
*_rif_matrix.png: Heatmap visualizations
*_rif_network.png: Network graph visualizations
*_rif_plots.pdf: Comprehensive analysis reports

Contributing

We welcome contributions to improve the framework:

Bug Reports: Submit issues with detailed descriptions
Feature Requests: Propose new functionality or improvements
Code Contributions: Submit pull requests with enhancements
Documentation: Help improve documentation and examples

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For questions, suggestions, or collaboration opportunities, please contact:

Primary Contact:

Brian Llinas - bllin001@odu.edu
Computer Science Department & Virginia Modeling, Analysis, and Simulation Center (VMASC)
Old Dominion University, Norfolk/Suffolk, Virginia, USA

Co-authors:

Jose J. Padilla - jpadilla@odu.edu (VMASC, Old Dominion University)
Humberto Llinas - hllinas@uninorte.edu.co (Universidad del Norte, Colombia)
Erika F. Frydenlund - efrydenl@odu.edu (VMASC, Old Dominion University)
Katherine Palacio - kpalacio@uninorte.edu.co (Universidad del Norte, Colombia)

Acknowledgments

This research is funded by grant number N000141912624 from the Office of Naval Research through the Minerva Research Initiative and by grant number P116S210003 from the US Department of Education.
