This repository contains a comprehensive framework for analyzing categorical data that follow power-law distributions using the Relative Importance Factor (RIF) methodology. The RIF index quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities, providing insights into the hierarchical structure of concepts within any domain that exhibits power-law behavior.
rif-index/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── data/                          # Input data files
│   ├── papers.csv                 # Research papers dataset
│   └── thesaurus_terms.txt        # Thesaurus for keyword normalization
├── utils/                         # Core utility modules
│   ├── keywords_extraction.py     # Keyword extraction and cleaning
│   ├── powerlaw_science.py        # Power-law distribution analysis
│   └── rif_index.py               # RIF calculation and visualization
├── example/                       # Example implementations
│   ├── case1-explanation.py       # Basic RIF analysis example
│   ├── case1-rif.py               # Single dataset RIF analysis
│   ├── case2-explanation.py       # Comparative RIF analysis example
│   ├── case2-rif.py               # Multi-group RIF analysis
│   └── *-output/                  # Generated outputs from examples
└── social-resilience/             # Case study: Social resilience research
    ├── code/                      # Analysis pipeline scripts
    │   ├── 1_keywords-dataset.py  # Extract and clean keywords
    │   ├── 2_powerlaw-keywords.py # Power-law analysis
    │   └── 3_rif-keywords.py      # RIF index calculation
    └── output/                    # Generated analysis results
        ├── keywords/              # Extracted keyword datasets
        ├── powerlaw/              # Power-law analysis results
        └── rif/                   # RIF analysis results and visualizations
The RIF analysis framework follows a structured nine-step methodology for analyzing term distributions in various domains:
Figure 1: Workflow diagram of the Relative Importance Factor (RIF) analysis methodology showing the comprehensive analytical process.
The methodology systematically processes data through: (1) data collection, (2) variable selection with categorical validation, (3) frequency estimation, (4) total count calculation, (5) relative frequency computation, (6) ranking assignment, (7) power-law model fitting with statistical validation, (8) RIF index computation for valid models, and (9) visualization through matrices and networks. This ensures that only statistically valid power-law distributions proceed to RIF analysis, providing robust insights into the relative importance relationships between concepts.
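Steps 3 to 6 of the workflow (frequency estimation, total count, relative frequency, ranking) can be illustrated with a minimal sketch. The function name `rank_frequency_table`, the sample terms, and the alphabetical tie-breaking rule are assumptions made for this example, not part of the repository's modules.

```python
from collections import Counter

def rank_frequency_table(observations):
    """Return rows of (rank, category, frequency, relative_frequency)."""
    counts = Counter(observations)        # step 3: frequency of each category
    total = sum(counts.values())          # step 4: total count
    # Sort by descending frequency; ties are broken alphabetically so the
    # ranking is deterministic (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, term, freq, freq / total)  # steps 5 and 6: relative frequency, rank
        for rank, (term, freq) in enumerate(ordered, start=1)
    ]

# Hypothetical categorical observations
sample = ["resilience", "community", "resilience", "disaster",
          "resilience", "community"]
table = rank_frequency_table(sample)
for row in table:
    print(row)
```

The resulting rank-frequency table is the input that a power-law fit (step 7) would operate on.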
The framework begins by extracting keywords from scientific literature and applying several cleaning steps:
- Thesaurus normalization: Standardizes synonymous terms
- Country name removal: Filters out geographical references
- Frequency-based filtering: Removes low-frequency terms
- Duplicate merging: Consolidates identical terms
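The cleaning steps above can be sketched as follows. This is an illustrative outline only: the `THESAURUS` mapping, the `COUNTRIES` set, the `clean_keywords` function, and the `min_frequency` default are hypothetical stand-ins, and the actual logic in `utils/keywords_extraction.py` may differ.

```python
from collections import Counter

# Hypothetical inputs: the thesaurus maps variant forms to a preferred term,
# and the country set is a tiny stand-in for a full list of country names.
THESAURUS = {"resiliency": "resilience", "communities": "community"}
COUNTRIES = {"united states", "colombia"}

def clean_keywords(raw_keywords, min_frequency=2):
    """Return {term: frequency} after normalization and filtering."""
    normalized = []
    for keyword in raw_keywords:
        term = keyword.strip().lower()
        term = THESAURUS.get(term, term)   # thesaurus normalization
        if not term or term in COUNTRIES:  # country name removal
            continue
        normalized.append(term)
    counts = Counter(normalized)           # counting merges identical terms
    # frequency-based filtering: keep terms seen at least min_frequency times
    return {t: n for t, n in counts.items() if n >= min_frequency}

raw = ["Resilience", "resiliency", "Colombia",
       "community", "Communities", "disaster"]
cleaned = clean_keywords(raw)
print(cleaned)
```

In this sketch, merging happens before frequency filtering so that variants of the same term are counted together before the threshold is applied.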
The framework then analyzes the frequency distribution of keywords to establish a theoretical baseline:
- Parameter Estimation: Uses maximum likelihood estimation to fit power-law parameters (α, xmin)
- Goodness-of-Fit Testing: Kolmogorov-Smirnov test with bootstrap significance testing
- Log-Log Visualization: Generates rank-frequency plots for visual inspection
- Statistical Validation: Computes p-values to assess power-law fit quality
- Threshold Selection: Determines minimum frequency (xmin) for power-law regime
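As a rough illustration of the estimation and goodness-of-fit steps, the sketch below computes the maximum likelihood exponent and the Kolmogorov-Smirnov distance for a fixed `xmin`. It makes several simplifying assumptions: it uses the continuous power-law approximation, it takes `xmin` as given rather than selecting it, and it omits the bootstrap p-value. It uses only the standard library and is not the repository's implementation; the function name `fit_power_law` is hypothetical.

```python
import math

def fit_power_law(frequencies, xmin):
    """Fit the tail x >= xmin and return (alpha, ks_distance).

    Continuous approximation: alpha = 1 + n / sum(ln(x_i / xmin)).
    """
    tail = sorted(x for x in frequencies if x >= xmin)
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0:
        raise ValueError("need observations strictly above xmin")
    alpha = 1.0 + n / log_sum

    # KS distance: largest gap between the empirical CDF and the fitted
    # CDF F(x) = 1 - (x / xmin)^(1 - alpha), checked on both sides of each step.
    ks = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return alpha, ks

# Hypothetical keyword frequencies
sample = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20]
alpha, ks = fit_power_law(sample, xmin=1)
print(f"alpha = {alpha:.3f}, KS distance = {ks:.3f}")
```

A bootstrap test would repeat this fit on synthetic samples drawn from the fitted model and report the fraction whose KS distance exceeds the observed one.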
The Relative Importance Factor (RIF) quantifies the relative importance relationships between concepts by computing ratios of their theoretical power-law probabilities. The RIF index is calculated as:
RIF_i = P_theoretical,1 / P_theoretical,i
Where:
- P_theoretical,i is the theoretical probability of keyword i under the fitted power-law distribution
- P_theoretical,1 is the theoretical probability of the highest-ranked keyword (rank = 1)
- The RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r shows relative importance between any two concepts
- RIF values ≥ 1 indicate how many times more important concept s is compared to concept r
- Frequency Filtering: Filter data based on minimum frequency threshold (xmin)
- Rank Assignment: Assign ranks 1 to n based on frequency ordering
- Theoretical Probabilities: Calculate P(r) = A × r^(-θ) where:
- A = 1/Σ(r^(-θ)) is the normalization constant
- θ is the power-law exponent from previous analysis
- RIF Calculation: Compute RIF_i = P_theoretical,1 / P_theoretical,i for each keyword
- Empirical Fitting: Fit log-log regression to validate: log(frequency) = log(δ) + β × log(rank)
- Parameter Relationships: Calculate α = θ/(-β) and additional normalization constants
- Matrix Generation: Create pairwise RIF matrix Φ(s,r) = P_theoretical,s / P_theoretical,r for s ≤ r
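The probability, RIF, and matrix steps listed above can be sketched in a few lines. The function names and the value of θ are assumptions chosen for illustration; `utils/rif_index.py` may structure this differently. The matrix here stores an entry at position `[s][r]` only when s ≤ r and leaves the rest empty; which triangle that occupies in a plot depends on the axis layout, which this sketch does not settle.

```python
def theoretical_probabilities(n, theta):
    """P(r) = A * r^(-theta) for ranks 1..n, with A = 1 / sum(r^(-theta))."""
    weights = [r ** (-theta) for r in range(1, n + 1)]
    A = 1.0 / sum(weights)  # normalization constant
    return [A * w for w in weights]

def rif_index(probabilities):
    """RIF_i = P_theoretical,1 / P_theoretical,i for each rank i."""
    return [probabilities[0] / p for p in probabilities]

def rif_matrix(probabilities):
    """Phi(s, r) = P_s / P_r for s <= r; None where s > r."""
    n = len(probabilities)
    return [
        [probabilities[s] / probabilities[r] if s <= r else None
         for r in range(n)]
        for s in range(n)
    ]

theta = 1.0  # hypothetical exponent from a previous power-law fit
probs = theoretical_probabilities(4, theta)
rif = rif_index(probs)
phi = rif_matrix(probs)
print([round(v, 3) for v in rif])
```

With θ = 1 the RIF of the keyword at rank i is approximately i, so the fourth-ranked keyword is about four times less important than the first.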
The framework generates comprehensive outputs for interpretation:
- RIF Matrix Heatmaps: Lower-triangular matrices showing pairwise RIF relationships
- Network Graphs: Visualizations of keyword relationships and hierarchies
- Comparative Analyses: Side-by-side comparisons of different datasets or groups
- Statistical Summaries: Detailed tables with RIF indices, probabilities, and fit parameters
The RIF methodology follows a structured mathematical approach:
For a set of keywords with frequencies, the theoretical power-law distribution is:
P(r) = A × r^(-θ)
Where:
- r is the rank (1, 2, 3, ...)
- θ is the power-law exponent parameter
- A is the normalization constant: A = 1/Σ(r^(-θ))
The empirical data is fitted using log-log regression:
log(frequency) = log(δ) + β × log(rank)
Where:
- δ is the scaling factor
- β is the empirical slope (negative for power-law decay)
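The log-log regression can be illustrated with an ordinary least-squares fit written with the standard library. The function name `loglog_regression` and the synthetic frequencies are assumptions for this sketch; it expects frequencies already sorted in descending order so that position i corresponds to rank i + 1.

```python
import math

def loglog_regression(frequencies):
    """Fit log(frequency) = log(delta) + beta * log(rank); return (delta, beta)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # empirical slope
    intercept = mean_y - beta * mean_x    # equals log(delta)
    return math.exp(intercept), beta

# Synthetic data that follow frequency = 100 * rank^(-1.5) exactly
frequencies = [100.0 * rank ** -1.5 for rank in range(1, 6)]
delta, beta = loglog_regression(frequencies)
print(f"delta = {delta:.2f}, beta = {beta:.2f}")
```

On exact power-law data the fit recovers the generating scale and slope up to rounding; on real keyword counts the residuals indicate how far the data depart from a straight line in log-log space.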
The RIF index measures relative importance using theoretical power-law probabilities:
RIF_i = P_theoretical,1 / P_theoretical,i
Where theoretical probabilities are calculated as:
P_theoretical,i = A × r_i^(-θ)
A = 1 / Σ(r^(-θ)) (normalization constant)
For pairwise comparisons, the RIF matrix is:
Φ(s,r) = P_theoretical,s / P_theoretical,r for s ≤ r
Additional parameters computed:
- α = θ/(-β): Relationship between theoretical (θ) and empirical (β) exponents
- δ = exp(intercept): Scaling factor from log-log regression
- Normalization constants B and C: Additional scaling factors for comprehensive analysis
The pairwise RIF matrix Φ(s,r) is therefore lower-triangular, and values ≥ 1 indicate that concept s has higher relative importance than concept r.
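One consequence of these definitions is worth spelling out: the normalization constant A appears in both numerator and denominator, so it cancels and both quantities depend only on ranks and θ. The short derivation below follows directly from P_theoretical,i = A × r_i^(-θ), assuming θ > 0.

```latex
\mathrm{RIF}_i
  = \frac{P_{\text{theoretical},1}}{P_{\text{theoretical},i}}
  = \frac{A \cdot 1^{-\theta}}{A \cdot r_i^{-\theta}}
  = r_i^{\theta},
\qquad
\Phi(s,r)
  = \frac{A \, s^{-\theta}}{A \, r^{-\theta}}
  = \left(\frac{r}{s}\right)^{\theta} \ge 1
  \quad \text{for } s \le r .
```

For example, with an illustrative exponent θ = 1.2, the keyword at rank 5 has RIF = 5^1.2, or about 6.9.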
- Python 3.8 or higher
- pip package manager
- Clone the repository:
git clone <repository-url>
cd rif
- Install dependencies:
pip install -r requirements.txt
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- matplotlib: Plotting and visualization
- scipy: Scientific computing
- networkx: Network analysis
- powerlaw: Power-law distribution fitting
- bertopic: Topic modeling (for advanced analyses)
- scikit-learn: Machine learning utilities
# Run the basic example
python example/case1-rif.py
This demonstrates RIF analysis on a single dataset, generating:
- RIF matrix visualization
- Network graph of keyword relationships
- Statistical summary table
# Run the comparative example
python example/case2-rif.py
This shows how to compare RIF patterns across different groups or time periods.
For a complete analysis of your own dataset, follow these steps:
Ensure your dataset is in CSV format with columns for:
- Author Keywords
- Index Keywords
- Other relevant metadata
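A minimal sketch of reading such a file is shown below. The column names come from the list above; the semicolon separator, the `Title` column, the sample rows, and the `read_keywords` helper are assumptions for illustration, so adjust them to match your export format.

```python
import csv
import io

# Hypothetical file contents; a real dataset would be opened from disk instead.
SAMPLE_CSV = """Title,Author Keywords,Index Keywords
Paper A,resilience; community,disaster; risk
Paper B,Resilience; adaptation,
"""

def read_keywords(handle, column, separator=";"):
    """Collect every keyword found in one column of a CSV file."""
    keywords = []
    for row in csv.DictReader(handle):
        cell = row.get(column) or ""  # tolerate missing or empty cells
        keywords.extend(
            part.strip() for part in cell.split(separator) if part.strip()
        )
    return keywords

author_keywords = read_keywords(io.StringIO(SAMPLE_CSV), "Author Keywords")
index_keywords = read_keywords(io.StringIO(SAMPLE_CSV), "Index Keywords")
print(author_keywords)
print(index_keywords)
```

The raw keywords are returned without case folding or normalization, since those belong to the cleaning stage that follows.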
python social-resilience/code/1_keywords-dataset.py
This script:
- Extracts keywords from your dataset
- Applies thesaurus normalization
- Removes geographical terms
- Saves cleaned keyword datasets
python social-resilience/code/2_powerlaw-keywords.py
This performs:
- Power-law parameter estimation
- Goodness-of-fit testing
- Log-log plot generation
- Statistical significance testing
python social-resilience/code/3_rif-keywords.py
This generates:
- RIF index calculations
- Matrix and network visualizations
- Comparative analysis reports
- Publication-ready figures
- author_keywords.csv: Cleaned author-assigned keywords
- index_keywords.csv: Cleaned index terms
- Frequency distributions and summary statistics

- *_powerlaw_summary.csv: Statistical parameters and fit quality
- *_loglog_plot.png: Log-log distribution plots
- Bootstrap test results and p-values

- *_rif.csv: RIF indices and statistical measures
- *_rif_matrix.png: Heatmap visualizations
- *_rif_network.png: Network graph visualizations
- *_rif_plots.pdf: Comprehensive analysis reports
We welcome contributions to improve the framework:
- Bug Reports: Submit issues with detailed descriptions
- Feature Requests: Propose new functionality or improvements
- Code Contributions: Submit pull requests with enhancements
- Documentation: Help improve documentation and examples
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, suggestions, or collaboration opportunities, please contact:
Primary Contact:
- Brian Llinas - bllin001@odu.edu
Computer Science Department & Virginia Modeling, Analysis, and Simulation Center (VMASC)
Old Dominion University, Norfolk/Suffolk, Virginia, USA
Co-authors:
- Jose J. Padilla - jpadilla@odu.edu (VMASC, Old Dominion University)
- Humberto Llinas - hllinas@uninorte.edu.co (Universidad del Norte, Colombia)
- Erika F. Frydenlund - efrydenl@odu.edu (VMASC, Old Dominion University)
- Katherine Palacio - kpalacio@uninorte.edu.co (Universidad del Norte, Colombia)
This research is funded by grant number N000141912624 from the Office of Naval Research through the Minerva Research Initiative and by grant number P116S210003 from the US Department of Education.
