This fork provides major improvements to the SMILES-X utilities for molecular data processing and classification.
The updates focus on increasing robustness, preserving data integrity, and enhancing uncertainty estimation for more reliable model evaluation.
This modified version of the SMILES-X suite refines data preprocessing, feature encoding, and classification evaluation steps commonly used in molecular machine learning workflows.
It introduces better input handling, flexible sequence encoding, and uncertainty-aware classification metrics.
- Enhanced validation to prevent misuse when a single SMILES string is passed instead of a list.
- Produces clear logging errors for incorrect input types.
- Ensures concatenation only occurs for lists or tuples of SMILES sequences using
'j'.
if isinstance(smiles, str): logging.error( "smiles_concat expected a list of SMILES per entry but got a STRING." ) logging.error("Wrap your SMILES into a list, e.g. ['CCO']")
text
✅ Result: Reduced runtime errors and improved robustness during SMILES batch processing.
- Replaced fixed-length truncation with dynamic padding to handle variable-length molecule sequences.
- Pads all SMILES to the maximum sequence length in the batch.
- Preserves critical information for longer SMILES strings.
- Supports
'unk'for unknown tokens and'pad'for padding.
pad_len = max_length - len(ismiles) ismiles_tmp = ismiles + ['pad'] * pad_len
text
✅ Advantages:
- Prevents data loss from truncation.
- Compatible with transformer-based models (e.g., SMILES-BERT).
- Future-ready for attention mask integration.
- Added Monte Carlo simulation for improved robustness in classification metrics.
- Injects Gaussian noise into predictions using predicted error estimates.
- Computes standard deviation across multiple stochastic runs for:
- Accuracy
- Precision
- Recall
- F1-score
- PR-AUC
pred_mc = pred + np.random.normal(0, err_pred, size=len(pred)) metrics_mat[i] = [acc, prec, rec, f1, pr_auc]
text
✅ Impact:
- Produces uncertainty-aware metrics.
- Enables statistically meaningful interpretation of model performance.
- Pruned unnecessary imports from
main.py. - Improved readability, modularity, and general code hygiene.
- Reduced clutter, making it easier to extend or integrate new models.
Safe concatenation of SMILES sequences smiles_concat(smiles_list)
Integer encoding with dynamic padding int_vec_encode(tokenized_smiles_list, vocab)
Monte Carlo uncertainty estimation for classification metrics sigma_classification_metrics(true, pred, err_pred, n_mc=1000)
text
| Improvement | Benefit |
|---|---|
Input validation in smiles_concat |
Prevents misuse and clarifies error handling |
Dynamic padding in int_vec_encode |
Preserves complete molecule information |
| Monte Carlo uncertainty metrics | Enables uncertainty quantification in evaluation |
| Code cleanup and reorganization | Enhances maintainability and readability |
- Integrate attention-mask generation for transformer compatibility.
- Extend uncertainty estimation to regression tasks.
- Add configuration utilities for dynamic hyperparameter control.
These updates were implemented as part of a research-oriented fork to improve reliability, interpretability, and extendability in molecular machine learning workflows.