celsomsilva/statistical-model-explainer

Statistical Model Explainer

“A data scientist is not a button pusher.” — Prof. Luiz Paulo Fávero (USP)

Making GLM, Linear Models, and Multilevel Models easier to understand — without cutting corners

Why this project exists

Statistical models are powerful. But if you’ve ever stared at a regression table, a GLM summary, or a mixed-effects output, you already know the problem:

The math is right. The software works. But the interpretation is not obvious.

This project exists because good models are often misunderstood, miscommunicated, or oversimplified — especially once results leave the hands of the person who built them.

The goal here is straightforward:

Take real statistical model outputs and explain them clearly, correctly, and responsibly.

No shortcuts. No “AI magic”. No replacing statistical thinking with buzzwords.


What this project is (and what it is not)

What it is

  • A statistical interpretation assistant

  • Focused on:

    • Linear Regression
    • Generalized Linear Models (GLM)
    • Multilevel / Mixed-Effects Models (HLM)
  • Built around:

    • carefully written statistical knowledge
    • semantic retrieval (vector search)
  • Designed to help with:

    • learning
    • teaching
    • communicating results to others

What it is not

  • Not an automatic modeling tool
  • Not a black-box predictor
  • Not a “just ask GPT” wrapper
  • Not a replacement for statistical judgment

This system supports interpretation — it does not invent conclusions.


The core idea

The project is built around a very deliberate separation of responsibilities:

  1. Statistical knowledge lives in human-written documents
  2. Retrieval finds relevant concepts based on what appears in the model output
  3. (Later) LLMs may help turn technical explanations into readable text

At every step, statistics comes first.


How it works (conceptually)

  1. You provide an output from a statistical model (for example, a summary() from R or statsmodels)

  2. The system identifies key elements such as:

    • coefficients
    • standard errors
    • p-values
    • AIC / BIC
    • random effects
    • diagnostics
  3. A semantic search retrieves relevant statistical explanations

  4. That content is then used to generate:

    • line-by-line interpretations
    • contextual explanations
    • warnings and caveats when appropriate

The focus is always on correct interpretation, not storytelling.
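To make step 2 concrete, here is a minimal, hypothetical sketch of how key elements might be pulled out of a text summary. The regex patterns and the sample output below are illustrative assumptions, not the project's actual parsing logic:

```python
import re

# Illustrative sample: a fragment shaped like an R summary() output.
SUMMARY = """
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4210     0.1100   3.827  0.00013
promoYes      1.8500     0.5600   3.304  0.00100

AIC: 214.3
"""

def extract_elements(text):
    """Return coefficient estimates and AIC found in a summary-like text."""
    # Match rows of the form: name, estimate, std. error, z value, p-value.
    rows = re.findall(
        r"^(\S+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(\S+)\s*$",
        text,
        flags=re.MULTILINE,
    )
    aic = re.search(r"AIC:\s*(-?\d+\.?\d*)", text)
    return {
        "coefficients": {name: float(est) for name, est, *_ in rows},
        "aic": float(aic.group(1)) if aic else None,
    }

elements = extract_elements(SUMMARY)
print(elements["coefficients"]["promoYes"])  # 1.85
print(elements["aic"])                       # 214.3
```

Once the numbers are structured like this, each one can be matched against the knowledge base independently.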


Knowledge base philosophy

The knowledge base is intentionally:

  • small
  • modular
  • explicit
  • written by humans

Each file covers one statistical concept only:

  • coefficients
  • residuals
  • multicollinearity
  • AIC / BIC
  • random effects
  • and so on

This makes the system:

  • easier to audit
  • easier to extend
  • safer to use
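Because each concept lives in its own file, loading and auditing the knowledge base stays trivial. A minimal sketch (the directory name and loader are assumptions based on the structure shown later):

```python
from pathlib import Path

def load_knowledge_base(kb_dir="kb"):
    """Load every one-concept-per-file document into a dict keyed by concept name."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(kb_dir).glob("*.md"))
    }

# Only attempt the load if a kb/ directory is present.
if Path("kb").is_dir():
    kb = load_knowledge_base()
    print(sorted(kb))  # e.g. ['aic_bic', 'coefficients_interpretation', ...]
```

Auditing then amounts to reading one short Markdown file per concept.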

A simple example

Model output

Coefficient (Promo = Yes) = 1.85
p-value = 0.001

Interpretation

  • Promotion is associated with an increase in the outcome variable.
  • The sign of the coefficient indicates the direction of the association.
  • The low p-value suggests strong evidence against the null hypothesis.
  • This interpretation assumes all other variables are held constant.

No exaggeration. No causal claims. Just correct statistical language.
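A hypothetical sketch of how statements like the ones above could be produced from extracted values. The threshold, wording, and function name are illustrative assumptions, not the project's actual templates:

```python
def interpret_coefficient(name, estimate, p_value, alpha=0.05):
    """Return hedged, non-causal interpretation lines for one coefficient."""
    lines = []
    direction = "increase" if estimate > 0 else "decrease"
    lines.append(
        f"{name} is associated with a {direction} in the outcome variable "
        f"(coefficient = {estimate})."
    )
    # Describe evidence strength without overstating it.
    if p_value < alpha:
        lines.append(
            f"The p-value ({p_value}) suggests strong evidence "
            "against the null hypothesis of no association."
        )
    else:
        lines.append(
            f"The p-value ({p_value}) does not provide strong evidence "
            "against the null hypothesis."
        )
    lines.append("This interpretation assumes all other variables are held constant.")
    return lines

for line in interpret_coefficient("Promo = Yes", 1.85, 0.001):
    print("•", line)
```

Note that the template deliberately says "associated with", never "causes".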


Current status

This repository is intentionally minimal and focused.

  • No web interface (yet)
  • No production deployment
  • No unnecessary infrastructure

The priority right now is clarity and correctness, not scale.


Project structure

Below is the directory structure, shown in list format to reflect the conceptual separation of the project:

statistical-model-explainer/

  retrieval/
     __init__.py
     build_index.py
     search.py
     logging_config.py

  kb/
     aic_bic.md
     coefficients_interpretation.md
     diagnostics.md
     glm_basics.md
     goodness_of_fit.md
     linear_regression.md
     loglikelihood.md
     mixed_effects_models.md
     model_selection.md
     multicollinearity.md
     multilevel_interpretation_advanced.md
     nlme_lme_specific.md
     p_values.md
     residuals.md
     standard_error.md

  examples/
     glm_output_R.txt
     hlm_output_multilevel_R.txt

  tests/
     test_pipeline_minimal

  interpret_example.py
  interpret_example_multilevel.py
  pyproject.toml
  README.md
  LICENSE
  requirements.txt

Why this structure matters

Each directory has one clear responsibility:

  • kb/ : statistical knowledge (the core asset)
  • retrieval/ : semantic access to that knowledge
  • examples/ : real-world model outputs
  • tests/ : safety and correctness checks

This keeps the project:

  • easy to reason about
  • easy to extend
  • easy to explain to others

Running the example locally

This project is intentionally lightweight and can be run locally with a standard Python setup.

Requirements

  • Python 3.10+
  • No GPU required

Setup

Clone the repository and install the dependencies:

git clone https://github.com/celsomsilva/statistical-model-explainer.git
cd statistical-model-explainer
pip install -r requirements.txt
pip install -e .

Running tests

pytest

Build the knowledge index

The semantic index is built from the human-written statistical knowledge base:

python3 retrieval/build_index.py

This step creates a local vector index used for retrieval.
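Conceptually, building and querying the index is a build-then-search flow. The sketch below illustrates that flow with a plain bag-of-words cosine similarity over a few example kb/ file names; the actual project uses a semantic (dense) vector index, so treat this as an analogy, not the implementation:

```python
import math
from collections import Counter

# Toy stand-in for the kb/ documents (contents abbreviated and invented).
KB = {
    "p_values.md": "p-value evidence against the null hypothesis significance",
    "aic_bic.md": "AIC BIC model selection penalized likelihood comparison",
    "multicollinearity.md": "multicollinearity VIF correlated predictors standard errors",
}

def vectorize(text):
    """Bag-of-words term counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Build" step: vectorize every document once, up front.
index = {name: vectorize(doc) for name, doc in KB.items()}

# "Search" step: rank documents by similarity to the query.
def search(query, k=1):
    q = vectorize(query)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]

print(search("what does the p-value mean"))  # ['p_values.md']
```

The real index is persisted to disk so that interpretation scripts can query it without rebuilding.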

Run the interpretation example

To interpret a real model output using retrieval-augmented statistical knowledge:

python3 interpret_example.py

The script reads a sample model output from examples/ and produces a structured interpretation based on retrieved statistical concepts.


Future directions (intentionally postponed)

These are possible next steps, not current goals:

  • Add an LLM layer for natural language generation
  • Build a simple CLI or notebook interface
  • Export interpretations to Markdown or PDF
  • Publish the knowledge base as a standalone resource

None of these are required to validate the core idea.


Final note

This project is based on a simple belief:

Good statistical explanations matter as much as good models.

If you can’t explain your model clearly, you probably don’t understand it well enough.

This repository is an attempt to help close that gap.


About the Author

This project was developed by an engineer and data scientist with a background in:

  • Postgraduate degree in Data Science and Analytics (USP)
  • Bachelor of Science in Electrical and Computer Engineering (UERJ)
  • Special interest in statistical models, interpretability, and applied AI

Contact

About

Practical system for interpreting GLM and multilevel model outputs using retrieval-augmented statistical knowledge. Combines statistical rigor with explainable, context-aware AI for real-world decision support.
