celsomsilva/statistical-model-explainer

Statistical Model Explainer

“A data scientist is not a button pusher.” — Prof. Luiz Paulo Fávero (USP)

Making GLM, Linear Models, and Multilevel Models easier to understand — without cutting corners

Why this project exists

Statistical models are powerful. But if you’ve ever stared at a regression table, a GLM summary, or a mixed-effects output, you already know the problem:

The math is right. The software works. But the interpretation is not obvious.

This project exists because good models are often misunderstood, miscommunicated, or oversimplified — especially once results leave the hands of the person who built them.

The goal here is straightforward:

Take real statistical model outputs and explain them clearly, correctly, and responsibly.

No shortcuts. No “AI magic”. No replacing statistical thinking with buzzwords.


What this project is (and what it is not)

What it is

  • A statistical interpretation assistant

  • Focused on:

    • Linear Regression
    • Generalized Linear Models (GLM)
    • Multilevel / Mixed-Effects Models (HLM)
  • Built around:

    • carefully written statistical knowledge
    • semantic retrieval (vector search)
  • Designed to help with:

    • learning
    • teaching
    • communicating results to others

What it is not

  • Not an automatic modeling tool
  • Not a black-box predictor
  • Not a “just ask GPT” wrapper
  • Not a replacement for statistical judgment

This system supports interpretation — it does not invent conclusions.


The core idea

The project is built around a very deliberate separation of responsibilities:

  1. Statistical knowledge lives in human-written documents
  2. Retrieval finds relevant concepts based on what appears in the model output
  3. (Later) LLMs may help turn technical explanations into readable text

At every step, statistics comes first.


How it works (conceptually)

  1. You provide an output from a statistical model (for example, a summary() from R or statsmodels)

  2. The system identifies key elements such as:

    • coefficients
    • standard errors
    • p-values
    • AIC / BIC
    • random effects
    • diagnostics
  3. A semantic search retrieves relevant statistical explanations

  4. That content is then used to generate:

    • line-by-line interpretations
    • contextual explanations
    • warnings and caveats when appropriate

The focus is always on correct interpretation, not storytelling.
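To make step 2 concrete, here is a minimal, hypothetical sketch of how key elements might be pulled out of a text summary. The regex patterns and the sample output below are illustrative assumptions, not the project's actual parsing logic:

```python
import re

# Illustrative sample: a fragment shaped like an R summary() output.
SUMMARY = """
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.4210     0.1100   3.827  0.00013
promoYes      1.8500     0.5600   3.304  0.00100

AIC: 214.3
"""

def extract_elements(text):
    """Return coefficient estimates and AIC found in a summary-like text."""
    # Match rows of the form: name, estimate, std. error, z value, p-value.
    rows = re.findall(
        r"^(\S+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(\S+)\s*$",
        text,
        flags=re.MULTILINE,
    )
    aic = re.search(r"AIC:\s*(-?\d+\.?\d*)", text)
    return {
        "coefficients": {name: float(est) for name, est, *_ in rows},
        "aic": float(aic.group(1)) if aic else None,
    }

elements = extract_elements(SUMMARY)
print(elements["coefficients"]["promoYes"])  # 1.85
print(elements["aic"])                       # 214.3
```

Once the numbers are structured like this, each one can be matched against the knowledge base independently.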


Knowledge base philosophy

The knowledge base is intentionally:

  • small
  • modular
  • explicit
  • written by humans

Each file covers one statistical concept only:

  • coefficients
  • residuals
  • multicollinearity
  • AIC / BIC
  • random effects
  • and so on

This makes the system:

  • easier to audit
  • easier to extend
  • safer to use
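Because each concept lives in its own file, loading and auditing the knowledge base stays trivial. A minimal sketch (the directory name and loader are assumptions based on the structure shown later):

```python
from pathlib import Path

def load_knowledge_base(kb_dir="kb"):
    """Load every one-concept-per-file document into a dict keyed by concept name."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(kb_dir).glob("*.md"))
    }

# Only attempt the load if a kb/ directory is present.
if Path("kb").is_dir():
    kb = load_knowledge_base()
    print(sorted(kb))  # e.g. ['aic_bic', 'coefficients_interpretation', ...]
```

Auditing then amounts to reading one short Markdown file per concept.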

A simple example

Model output

Coefficient (Promo = Yes) = 1.85
p-value = 0.001

Interpretation

  • Promotion is associated with an increase in the outcome variable.
  • The sign of the coefficient indicates the direction of the association.
  • The low p-value suggests strong evidence against the null hypothesis.
  • This interpretation assumes all other variables are held constant.

No exaggeration. No causal claims. Just correct statistical language.
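A hypothetical sketch of how statements like the ones above could be produced from extracted values. The threshold, wording, and function name are illustrative assumptions, not the project's actual templates:

```python
def interpret_coefficient(name, estimate, p_value, alpha=0.05):
    """Return hedged, non-causal interpretation lines for one coefficient."""
    lines = []
    direction = "increase" if estimate > 0 else "decrease"
    lines.append(
        f"{name} is associated with a {direction} in the outcome variable "
        f"(coefficient = {estimate})."
    )
    # Describe evidence strength without overstating it.
    if p_value < alpha:
        lines.append(
            f"The p-value ({p_value}) suggests strong evidence "
            "against the null hypothesis of no association."
        )
    else:
        lines.append(
            f"The p-value ({p_value}) does not provide strong evidence "
            "against the null hypothesis."
        )
    lines.append("This interpretation assumes all other variables are held constant.")
    return lines

for line in interpret_coefficient("Promo = Yes", 1.85, 0.001):
    print("•", line)
```

Note that the template deliberately says "associated with", never "causes".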


Current status

This repository is intentionally minimal and focused.

  • No web interface (yet)
  • No production deployment
  • No unnecessary infrastructure

The priority right now is clarity and correctness, not scale.


Project structure

Below is the directory structure, shown in list format to reflect the conceptual separation of the project:

statistical-model-explainer/

  retrieval/
     __init__.py
     build_index.py
     search.py
     logging_config.py

  kb/
     aic_bic.md
     coefficients_interpretation.md
     diagnostics.md
     glm_basics.md
     goodness_of_fit.md
     linear_regression.md
     loglikelihood.md
     mixed_effects_models.md
     model_selection.md
     multicollinearity.md
     multilevel_interpretation_advanced.md
     nlme_lme_specific.md
     p_values.md
     residuals.md
     standard_error.md

  examples/
     glm_output_R.txt
     hlm_output_multilevel_R.txt

  tests/
     test_pipeline_minimal

  interpret_example.py
  interpret_example_multilevel.py
  pyproject.toml
  README.md
  LICENSE
  requirements.txt

Why this structure matters

Each directory has one clear responsibility:

  • kb/ : statistical knowledge (the core asset)
  • retrieval/ : semantic access to that knowledge
  • examples/ : real-world model outputs
  • tests/ : safety and correctness checks

This keeps the project:

  • easy to reason about
  • easy to extend
  • easy to explain to others

Running the example locally

This project is intentionally lightweight and can be run locally with a standard Python setup.

Requirements

  • Python 3.10+
  • No GPU required

Setup

Clone the repository and install the dependencies:

git clone https://github.com/celsomsilva/statistical-model-explainer.git
cd statistical-model-explainer
pip install -r requirements.txt
pip install -e .

Running tests

pytest

Build the knowledge index

The semantic index is built from the human-written statistical knowledge base:

python3 retrieval/build_index.py

This step creates a local vector index used for retrieval.
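Conceptually, building and querying the index is a build-then-search flow. The sketch below illustrates that flow with a plain bag-of-words cosine similarity over a few example kb/ file names; the actual project uses a semantic (dense) vector index, so treat this as an analogy, not the implementation:

```python
import math
from collections import Counter

# Toy stand-in for the kb/ documents (contents abbreviated and invented).
KB = {
    "p_values.md": "p-value evidence against the null hypothesis significance",
    "aic_bic.md": "AIC BIC model selection penalized likelihood comparison",
    "multicollinearity.md": "multicollinearity VIF correlated predictors standard errors",
}

def vectorize(text):
    """Bag-of-words term counts (a crude stand-in for an embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Build" step: vectorize every document once, up front.
index = {name: vectorize(doc) for name, doc in KB.items()}

# "Search" step: rank documents by similarity to the query.
def search(query, k=1):
    q = vectorize(query)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]

print(search("what does the p-value mean"))  # ['p_values.md']
```

The real index is persisted to disk so that interpretation scripts can query it without rebuilding.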

Run the interpretation example

To interpret a real model output using retrieval-augmented statistical knowledge:

python3 interpret_example.py

The script reads a sample model output from examples/ and produces a structured interpretation based on retrieved statistical concepts.


Future directions (intentionally postponed)

These are possible next steps, not current goals:

  • Add an LLM layer for natural language generation
  • Build a simple CLI or notebook interface
  • Export interpretations to Markdown or PDF
  • Publish the knowledge base as a standalone resource

None of these are required to validate the core idea.


Final note

This project is based on a simple belief:

Good statistical explanations matter as much as good models.

If you can’t explain your model clearly, you probably don’t understand it well enough.

This repository is an attempt to help close that gap.


About the Author

This project was developed by an engineer and data scientist with a background in:

  • Postgraduate degree in Data Science and Analytics (USP)
  • Bachelor of Science in Electrical and Computer Engineering (UERJ)
  • Special interest in statistical models, interpretability, and applied AI

Contact

About

Practical system for interpreting GLM and multilevel model outputs using retrieval-augmented statistical knowledge. Combines statistical rigor with explainable, context-aware AI for real-world decision support.
