GitHub - grahamWroberts/AutomatedSAS: A library of tools and tests for conducting an automated SAS analysis via a custom hierarchical ML classifier and a ML regression

#AutomatedSAS - V1.0.0

Author: Graham Roberts

Associated paper (preprint): Roberts G, Nieh M-P, Ma A, Yang Q. Automated Structure Analysis of Small Angle Scattering Data via Machine Learning. ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-ggnch

Summary: This repository contains a set of tools and tests for conducting automated SAS analysis via a custom hierarchical ML classifier and an ML regression model.

Installation

If using conda run

conda env create -f environment.yaml
conda activate autosas
python -m pip install -e .

If not using conda, you can use pip:

pip install --user numpy==1.24.3

The following packages and versions have been tested to work.

|module|version used|
|:---|:---|
|numpy|1.24.3|
|scikit-learn|1.2.2|
|pandas|2.0.2|
|argparse|1.1|
|mapie|0.9.2|
|matplotlib|3.7.1|
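To see how an existing environment compares with the tested versions listed here, a small check like the following may help. It is only a sketch: the helper name `compare_versions` is hypothetical and not part of this repository, and argparse is left out because it ships with the Python standard library.

```python
from importlib import metadata

# Versions listed as tested in this README (argparse is part of the stdlib).
TESTED_VERSIONS = {
    "numpy": "1.24.3",
    "scikit-learn": "1.2.2",
    "pandas": "2.0.2",
    "mapie": "0.9.2",
    "matplotlib": "3.7.1",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {package: (tested, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in compare_versions().items():
        print(f"{name}: tested {wanted}, found {installed}")
```

A different installed version is not necessarily a problem; the list only records what has been tested.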

Key files:

run_model.py - parses a set of arguments to construct a hierarchical classification model, and calculates its performance on test data
baselines.py - evaluates performance using any of the baseline models, either with k-fold cross-validation or on test data
hierarchical.py - set of functions for defining the objects for classifiers
sas_krr_reg.py - set of functions needed for the regression model
loaders.py - set of utility functions for loading and formatting data
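To give a feel for how a hierarchical classifier and a per-morphology regression can fit together, here is a toy sketch on synthetic 2-D points. It is not the code in hierarchical.py or sas_krr_reg.py: the class `TwoLevelClassifier`, the fixed two-level grouping (plain versus core-shell), the synthetic data, and the reading of "krr" as kernel ridge regression are all assumptions made for illustration. In the repository the hierarchy is read from a configuration file.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.svm import SVC


class TwoLevelClassifier:
    """Toy hierarchy: predict a coarse group first, then a label within it."""

    def __init__(self, group_of):
        self.group_of = group_of  # maps each fine label to its coarse group
        self.root = SVC()
        self.leaves = {}

    def fit(self, X, y):
        y = np.asarray(y)
        groups = np.array([self.group_of[label] for label in y])
        self.root.fit(X, groups)
        for g in np.unique(groups):
            mask = groups == g
            labels = np.unique(y[mask])
            if len(labels) == 1:
                self.leaves[g] = labels[0]  # nothing left to decide
            else:
                self.leaves[g] = SVC().fit(X[mask], y[mask])
        return self

    def predict(self, X):
        groups = self.root.predict(X)
        out = np.empty(len(X), dtype=object)
        for g in np.unique(groups):
            mask = groups == g
            leaf = self.leaves[g]
            out[mask] = leaf.predict(X[mask]) if isinstance(leaf, SVC) else leaf
        return out


# Synthetic stand-in data: one well-separated cluster per morphology.
rng = np.random.default_rng(0)
centers = {"sphere": (0, 0), "cylinder": (0, 5), "cs_sphere": (5, 0), "cs_cylinder": (5, 5)}
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centers.values()])
y = np.repeat(list(centers), 30)

group_of = {name: "core_shell" if name.startswith("cs_") else "plain" for name in centers}
clf = TwoLevelClassifier(group_of).fit(X, y)
pred = clf.predict(X)

# One regressor per morphology, trained only on that morphology's samples.
target = X.sum(axis=1)  # stand-in for a structural parameter
regressors = {
    name: KernelRidge(kernel="rbf", alpha=1e-3).fit(X[y == name], target[y == name])
    for name in centers
}
# Each sample is sent to the regressor of its predicted morphology.
estimates = np.array([regressors[p].predict(x[None, :])[0] for p, x in zip(pred, X)])
```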

run_model.py

This script is the main entry point for both the hierarchical classifier and the regression components described in the associated paper. It loads the data, trains the classification and regression portions, and then evaluates both. The included script "example_run.sh" executes it and provides an example of the call structure.

python3 src/run_model.py --datadir data --configdir configs --resultsdir results --evaluate_file ./data/experimental_spectra.csv

All arguments are keyword arguments, and can be passed in any order, but must include flags.

###arguments

targets: the list of space separated target morphologies, i.e.,

--targets cylinder disk sphere cs_cylinder cs_disk cs_sphere

datadir: A directory containing all the data. There should be a file called "TRAIN_[target].csv" and "TEST_[target].csv" for each target.
configdir: The directory containing the configuration files.
resultsdir: A directory to save results to.
hierarchy_file: A file containing the structure of the hierarchical model; it should be in the configdir directory.
reg_file: A file containing the hyperparameters and targets for the regression models; it should be in the configdir directory.
extrapolation: A flag for whether to limit the test data to aspect ratios and shell ratios outside the range of the training data.
evaluate_file: An optional path to a file containing curves to evaluate. This is where to point to new data of interest. Curves must have the same q values.
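The arguments above can be pictured with a small argparse sketch, together with a check for the data files that the datadir description implies. This is an illustration only, not the parser in run_model.py: the function names `build_parser` and `expected_data_files`, the default values, and the use of `store_true` for the extrapolation flag are assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    p = argparse.ArgumentParser(description="Sketch of the run_model.py interface")
    p.add_argument("--targets", nargs="+", default=["cylinder", "disk", "sphere"])
    p.add_argument("--datadir", default="data")
    p.add_argument("--configdir", default="configs")
    p.add_argument("--resultsdir", default="results")
    p.add_argument("--hierarchy_file", default=None)
    p.add_argument("--reg_file", default=None)
    p.add_argument("--extrapolation", action="store_true")
    p.add_argument("--evaluate_file", default=None)
    return p


def expected_data_files(datadir, targets):
    """List the TRAIN_/TEST_ CSV paths expected for each target."""
    return [
        Path(datadir) / f"{split}_{target}.csv"
        for target in targets
        for split in ("TRAIN", "TEST")
    ]


args = build_parser().parse_args(["--targets", "sphere", "cs_sphere", "--datadir", "data"])
missing = [path for path in expected_data_files(args.datadir, args.targets) if not path.exists()]
for path in missing:
    print(f"missing data file: {path}")
```

Checking for the files up front gives a clearer message than a failure part-way through loading.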

baselines.py

This script allows one to run all the baseline comparisons included in the paper. The script "example_baselines.sh" shows examples of running each of the included baselines. It loads the data, trains the off-the-shelf classifier, and evaluates the result. There are a variety of arguments; many are only applicable to a particular classifier.
The arguments are as follows:
###arguments

targets: The set of morphologies to include, same as above.
classifier: Which baseline model to use; choose from svc, knn, random-forest.
datadir: The directory containing the source data.
k_fold: A flag for whether or not to compare k_fold performance.
test: A flag for whether or not to evaluate performance on test data.
extrapolation: A flag for whether to test on all test data or only test data with aspect ratio or shell ratio outside the range of training data.
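One way to read the extrapolation flag is as a filter that keeps only test samples whose aspect ratio or shell ratio lies outside the range seen in training. The sketch below shows that idea with NumPy; the function name `outside_training_range` and the example values are hypothetical, and the repository may implement the selection differently.

```python
import numpy as np


def outside_training_range(train_values, test_values):
    """Boolean mask of test entries below the training min or above the training max."""
    train_values = np.asarray(train_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    return (test_values < train_values.min()) | (test_values > train_values.max())


# Made-up values for illustration only.
train_aspect = np.array([1.0, 2.0, 4.0])
test_aspect = np.array([0.5, 1.5, 4.0, 8.0])
train_shell = np.array([0.1, 0.3])
test_shell = np.array([0.2, 0.5, 0.2, 0.2])

# A test sample counts as extrapolation if either ratio is out of range.
mask = outside_training_range(train_aspect, test_aspect) | outside_training_range(
    train_shell, test_shell
)
extrapolation_indices = np.flatnonzero(mask)
```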

####parameters for svc

c: the c regularization parameter
degree: the degree of polynomial if using polynomial kernel
gamma: a kernel coefficient, normalized by 1/number of features
kernel: which kind of kernel to use, such as 'poly' for polynomial or 'rbf' for radial basis function.
coeff0: the coefficient for the intercept of the polynomial

####parameters for knn

k: the number of neighbors to use for classification
weight: one of ['uniform', 'distance']; whether or not to weight the votes from neighbors depending on distance

####parameters for random-forest

n_est: the number of estimators
rfcriterion: the criterion on which to split
max_depth: the maximum depth of the tree
min_samples: the minimum number of samples per split
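The parameters above correspond closely to scikit-learn estimator arguments. The sketch below shows one plausible mapping; it is not the code in baselines.py. The function name `make_baseline` and the pairing of each flag with a scikit-learn keyword (for example `coeff0` with `coef0`, `min_samples` with `min_samples_split`) are assumptions, and the fallback values are scikit-learn's own defaults.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def make_baseline(classifier, **params):
    """Build an off-the-shelf classifier from README-style parameter names."""
    if classifier == "svc":
        return SVC(
            C=params.get("c", 1.0),
            kernel=params.get("kernel", "rbf"),
            degree=params.get("degree", 3),
            gamma=params.get("gamma", "scale"),
            coef0=params.get("coeff0", 0.0),
        )
    if classifier == "knn":
        return KNeighborsClassifier(
            n_neighbors=params.get("k", 5),
            weights=params.get("weight", "uniform"),
        )
    if classifier == "random-forest":
        return RandomForestClassifier(
            n_estimators=params.get("n_est", 100),
            criterion=params.get("rfcriterion", "gini"),
            max_depth=params.get("max_depth", None),
            min_samples_split=params.get("min_samples", 2),
        )
    raise ValueError(f"unknown classifier: {classifier!r}")


model = make_baseline("svc", kernel="poly", degree=2, c=10.0)
```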

k_fold

As described in the accompanying paper, we opted to perform k-fold cross-validation with the usual roles of training and validation data inverted. The data is split into k folds. For each fold, the model is trained on that fold alone and evaluated on all the other folds. This leads to lower performance on validation, but selects a model that performs well on test data when trained on all the data. It is essentially a form of implicit regularization, favoring models that can still generalize moderately well from small data.
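The inverted split described here can be sketched with scikit-learn's `KFold` by swapping the two index arrays it yields. This is a minimal illustration under assumptions, not the implementation in baselines.py: the function name `inverted_kfold_scores`, the synthetic dataset, and the choice of a k-nearest-neighbors model are all placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier


def inverted_kfold_scores(model, X, y, k=5, seed=0):
    """Train on one fold, score on the other k-1 folds; return one score per fold."""
    scores = []
    splitter = KFold(n_splits=k, shuffle=True, random_state=seed)
    # KFold yields (indices of k-1 folds, indices of one fold).
    # Swapping their roles gives the inverted split: small train, large validation.
    for rest_idx, fold_idx in splitter.split(X):
        model.fit(X[fold_idx], y[fold_idx])
        scores.append(model.score(X[rest_idx], y[rest_idx]))
    return np.array(scores)


# Synthetic data stands in for the scattering curves.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = inverted_kfold_scores(KNeighborsClassifier(n_neighbors=3), X, y, k=5)
print(scores.mean())
```

With k=5 each model sees only 20% of the data during training, which is why validation scores are expected to be lower than with the conventional split.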

Citing

If using the methods described here, please cite as

@unpublished{
  author={Graham Roberts, Mu-Ping Nieh, Anson Ma, and Qian Yang},
  title={Automated Structure Analysis of Small Angle Scattering Data via Machine Learning},
  month={December},
  year={2024},
  DOI={10.26434/chemrxiv-2024-ggnch},
  publisher={ChemRxiv}
}

##Contributing

Please reach out by email if you would like to contribute to this effort, either at graham.roberts@uconn.edu (lead developer) or qyang@uconn.edu (principal investigator).

