Signal Recovery in the Presence of Background: Multi-dimensional Likelihood vs. sWeight Reconstruction
This repository compares the statistical power and performance of a multidimensional Extended Maximum Likelihood Estimate (MLE) with an ‘sWeighted’ fit, which isolates the signal distribution in the control variable using fits to the independent variables. It contains the package, its documentation, and the implementation required for the analysis.
This repository forms part of the submission for the MPhil in Data Intensive Science's S1 Statistics Course at the University of Cambridge.
This example is built upon four fundamental probability distributions, which are implemented as individual classes within the Base_Dist module. These distributions serve as building blocks: the classes that combine them inherit their functions and properties in order to describe more complex distributions.
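The base probability distributions are as follows:
- Crystal Ball Distribution
- Exponential Decay Distribution
- Uniform Distribution
- Normal Distribution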
Each of these distributions is encapsulated in its own class, providing methods for calculating probability density functions (PDFs), cumulative distribution functions (CDFs), and performing distribution fitting.
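As a rough illustration of what such a base class might look like, the sketch below implements a normal distribution normalised over a truncated domain, with `pdf` and `cdf` methods. The class name `TruncatedNormal`, its constructor arguments, and the parameter values are assumptions made for this example; they are not taken from the Base_Dist module, and fitting is left out to keep the sketch short.

```python
import math


class TruncatedNormal:
    """Normal distribution restricted to [lower, upper] (hypothetical class name)."""

    def __init__(self, mu, sigma, lower, upper):
        if sigma <= 0 or upper <= lower:
            raise ValueError("need sigma > 0 and upper > lower")
        self.mu, self.sigma = mu, sigma
        self.lower, self.upper = lower, upper
        # Probability mass of the untruncated normal that falls inside the window
        self._norm = self._raw_cdf(upper) - self._raw_cdf(lower)

    def _raw_cdf(self, x):
        """CDF of the untruncated normal distribution."""
        return 0.5 * (1.0 + math.erf((x - self.mu) / (self.sigma * math.sqrt(2.0))))

    def pdf(self, x):
        """Density, rescaled so that it integrates to one over the window."""
        if x < self.lower or x > self.upper:
            return 0.0
        z = (x - self.mu) / self.sigma
        raw = math.exp(-0.5 * z * z) / (self.sigma * math.sqrt(2.0 * math.pi))
        return raw / self._norm

    def cdf(self, x):
        """Cumulative probability, running from 0 at lower to 1 at upper."""
        if x <= self.lower:
            return 0.0
        if x >= self.upper:
            return 1.0
        return (self._raw_cdf(x) - self._raw_cdf(self.lower)) / self._norm


# Check the normalisation with a midpoint-rule integral (illustrative values only)
dist = TruncatedNormal(mu=0.0, sigma=2.5, lower=0.0, upper=10.0)
steps = 10_000
width = (dist.upper - dist.lower) / steps
total = sum(dist.pdf(dist.lower + (i + 0.5) * width) for i in range(steps)) * width
print(f"integral of pdf over the window: {total:.6f}")
```

Dividing by the mass inside the window is what keeps the density properly normalised once the domain is truncated.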
The compound distributions are two-dimensional (2D) probability distributions that combine properties of the base distributions. These are implemented as classes in the Compound_Dist module and inherit the behaviors of their constituent base distributions.
The Signal distribution represents the signal region in 2D space.
- Constituents:
  - Crystal Ball Distribution in the X-dimension.
  - Exponential Decay Distribution in the Y-dimension.
The Background distribution represents the background noise in 2D space.
- Constituents:
  - Uniform Distribution in the X-dimension.
  - Normal Distribution in the Y-dimension.
The total distribution is constructed from the Signal and Background distributions. This is implemented as a separate class in the Compound_Dist module and inherits the properties of the Signal and Background distributions, along with their respective base distributions.
- Constituents:
  - Signal Distribution
  - Background Distribution
By using inheritance, the total distribution integrates all of its constituent distributions in a modular way and can easily be adapted to different base distributions in other scenarios.
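The layering described above can be sketched in a few lines. This is a simplified illustration, not the Compound_Dist implementation: it uses composition with plain callables instead of the repository's inheritance hierarchy, and it substitutes simple uniform and truncated exponential shapes for the actual Crystal Ball and Normal components. All names, ranges, and parameter values here are hypothetical.

```python
import math


def uniform_pdf(lower, upper):
    """Return a 1D uniform density on [lower, upper]."""
    width = upper - lower
    return lambda x: 1.0 / width if lower <= x <= upper else 0.0


def truncated_exponential_pdf(lam, lower, upper):
    """Return a 1D exponential-decay density normalised on [lower, upper]."""
    norm = math.exp(-lam * lower) - math.exp(-lam * upper)
    return lambda y: lam * math.exp(-lam * y) / norm if lower <= y <= upper else 0.0


class Product2D:
    """2D density built as the product of independent X and Y densities."""

    def __init__(self, pdf_x, pdf_y):
        self.pdf_x, self.pdf_y = pdf_x, pdf_y

    def pdf(self, x, y):
        return self.pdf_x(x) * self.pdf_y(y)


class Mixture2D:
    """Total density: signal_fraction * signal + (1 - signal_fraction) * background."""

    def __init__(self, signal, background, signal_fraction):
        if not 0.0 <= signal_fraction <= 1.0:
            raise ValueError("signal_fraction must lie in [0, 1]")
        self.signal, self.background = signal, background
        self.signal_fraction = signal_fraction

    def pdf(self, x, y):
        f = self.signal_fraction
        return f * self.signal.pdf(x, y) + (1.0 - f) * self.background.pdf(x, y)


def integrate_2d(pdf, x_range, y_range, steps=200):
    """Midpoint-rule integral of pdf(x, y) over a rectangle."""
    dx = (x_range[1] - x_range[0]) / steps
    dy = (y_range[1] - y_range[0]) / steps
    total = 0.0
    for i in range(steps):
        x = x_range[0] + (i + 0.5) * dx
        for j in range(steps):
            total += pdf(x, y_range[0] + (j + 0.5) * dy)
    return total * dx * dy


# Stand-in shapes and ranges, chosen only for illustration
signal = Product2D(truncated_exponential_pdf(1.0, 0.0, 5.0),
                   truncated_exponential_pdf(0.3, 0.0, 10.0))
background = Product2D(uniform_pdf(0.0, 5.0), uniform_pdf(0.0, 10.0))
total = Mixture2D(signal, background, signal_fraction=0.6)
total_integral = integrate_2d(total.pdf, (0.0, 5.0), (0.0, 10.0))
print(f"integral of the total pdf: {total_integral:.4f}")
```

Because each component is normalised on its own, the mixture stays normalised for any signal fraction between 0 and 1, and swapping in a different base shape only changes the callables passed to `Product2D`.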
The notebooks in this repository serve as walkthroughs for the analysis performed. They include derivations of the mathematical implementations, explanations of key choices made, and present the main results. Five notebooks are provided:
| Notebook | Description |
|---|---|
| Notebook 1 | Introduces and implements the four base probability distributions and their combination into signal and background components. Verifies proper normalisation over the truncated domain. |
| Notebook 2 | Demonstrates the calculation and visualisation of marginal probability distributions in both X and Y, including how to implement it in the pipeline. |
| Notebook 3 | Gives an overview of the sampler (accept/reject algorithm) with automatic scaling, and of the recovery of model parameters using Extended Unbinned Maximum Likelihood fitting with iminuit. |
| Notebook 4 | Performs a full bootstrap simulation study, including generation of samples and analysing trends in bias and uncertainty as functions of sample size. |
| Notebook 5 | Explores the use of sWeights, an algorithm that fits the marginal distribution along one axis using an Extended Likelihood fit, assigns statistical weights to events, and reconstructs the signal distribution in an independent axis without requiring any model of the background distribution in that dimension. |
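The accept/reject sampling and sWeight steps summarised in the table can be illustrated with a small one-dimensional toy. This is a sketch under stated assumptions rather than the notebooks' code: the signal shape is a truncated Gaussian stand-in (not a Crystal Ball), the shapes are held fixed, and the yields are found with a simple fixed-point (EM-style) iteration in place of the iminuit fit. Every name and number below is hypothetical.

```python
import math
import random

X_LO, X_HI = 0.0, 5.0      # assumed window in the discriminating variable X
MU, SIGMA = 3.0, 0.3       # assumed signal peak (Gaussian stand-in)


def signal_pdf(x):
    """Gaussian normalised on [X_LO, X_HI]."""
    def raw_cdf(t):
        return 0.5 * (1.0 + math.erf((t - MU) / (SIGMA * math.sqrt(2.0))))
    norm = raw_cdf(X_HI) - raw_cdf(X_LO)
    z = (x - MU) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi) * norm)


def background_pdf(x):
    """Uniform density on [X_LO, X_HI]."""
    return 1.0 / (X_HI - X_LO)


def accept_reject(pdf, pdf_max, n, rng):
    """Draw n values from pdf on [X_LO, X_HI]; pdf_max must bound pdf from above."""
    out = []
    while len(out) < n:
        x = rng.uniform(X_LO, X_HI)
        if rng.uniform(0.0, pdf_max) < pdf(x):
            out.append(x)
    return out


def fit_yields(fs, fb, tol=1e-10, max_iter=5000):
    """Extended-ML yields for fixed shapes, found by fixed-point iteration."""
    n_total = len(fs)
    n_s = 0.5 * n_total
    for _ in range(max_iter):
        n_b = n_total - n_s
        updated = sum(n_s * s / (n_s * s + n_b * b) for s, b in zip(fs, fb))
        converged = abs(updated - n_s) < tol
        n_s = updated
        if converged:
            break
    return n_s, n_total - n_s


def sweights(fs, fb, n_s, n_b):
    """Signal sWeights built from the inverted 2x2 yield matrix."""
    a_ss = a_sb = a_bb = 0.0
    for s, b in zip(fs, fb):
        d2 = (n_s * s + n_b * b) ** 2
        a_ss += s * s / d2
        a_sb += s * b / d2
        a_bb += b * b / d2
    det = a_ss * a_bb - a_sb * a_sb
    v_ss, v_sb = a_bb / det, -a_sb / det   # first row of the inverse matrix
    return [(v_ss * s + v_sb * b) / (n_s * s + n_b * b) for s, b in zip(fs, fb)]


rng = random.Random(1)
x_values = (accept_reject(signal_pdf, signal_pdf(MU), 500, rng)
            + accept_reject(background_pdf, background_pdf(X_LO), 1000, rng))
fs = [signal_pdf(x) for x in x_values]
fb = [background_pdf(x) for x in x_values]
n_s, n_b = fit_yields(fs, fb)
weights = sweights(fs, fb, n_s, n_b)
print(f"fitted signal yield: {n_s:.1f}, sum of sWeights: {sum(weights):.1f}")
```

At the fitted yields the signal sWeights sum to the signal yield, which makes a convenient sanity check. In a full analysis these weights would then be applied to an independent variable (Y here) to reconstruct the signal distribution in that dimension.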
Documentation on Read the Docs
The pipeline uses a modular, inheritance-based class structure, described above, to make it adaptable to different probability distributions. Documentation has therefore been created to make the methods and implementation of each class and function easier to understand:
- Class and Function References: Includes detailed descriptions of all classes and functions used in the coursework.
- Source Code Links: Direct links to the source code for easy review.
- Notebook Integration: Hyperlinks throughout the notebooks provide direct access to relevant sections of the documentation.
To run the notebooks, please follow these steps:

Clone the repository from the remote repository to your local machine:

    git clone https://github.com/JacobTutt/stat_frequentist_analysis.git

Use a clean virtual environment to avoid dependency conflicts:

    python -m venv env
    source env/bin/activate   # For macOS/Linux
    env\Scripts\activate      # For Windows

Navigate to the repository’s root directory and install the package along with its dependencies:

    pip install -e .

To ensure the virtual environment is recognised within Jupyter notebooks, set up a kernel:

    python -m ipykernel install --user --name=env --display-name "Statistical Analysis"

Open the notebooks and select the created kernel (Statistical Analysis) to run the code.
- The associated project report can be found under Project Report.
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions, run into issues, or just want to discuss the project, feel free to:
- Open an issue on the GitHub Issues page.
- Reach out to me directly via email.
This project is maintained by Jacob Tutt.