Commit ed2f16d

v0.4.0: Major repository overhaul

- Restructured examples: replaced multiple demo files with single quickstart.py
- Added CLI module (themap/cli.py) with command-line interface support
- Added new data modules: converter.py, loader.py for improved data handling
- Added features module with protein/molecule feature extraction
- Added metalearning subpackage (data, eval, models, train)
- Added pipeline orchestrator for improved workflow management
- Added configuration module (themap/config.py) and pipeline config example
- Added protein representation generation script
- Updated documentation and tutorials
- Cleaned up test suite: removed obsolete tests, updated existing ones
- Added py.typed marker for PEP 561 type hints support
- Updated dependencies and project configuration

1 parent db4b5cc commit ed2f16d

72 files changed: +6635 -11290 lines changed

.github/dependabot.yml

Lines changed: 1 addition & 10 deletions
@@ -1,19 +1,10 @@
 version: 2
 updates:
-  - package-ecosystem: "pip"
-    directory: "/"
-    schedule:
-      interval: "weekly"
-    open-pull-requests-limit: 10
-    labels:
-      - "dependencies"
-      - "python"
-
   - package-ecosystem: "github-actions"
     directory: "/"
     schedule:
       interval: "weekly"
-    open-pull-requests-limit: 10
+    open-pull-requests-limit: 5
     labels:
       - "dependencies"
       - "github_actions"

.gitignore

Lines changed: 34 additions & 0 deletions
@@ -184,3 +184,37 @@ results/
 
 # ignore CLAUDE file
 CLAUDE.md
+
+# ignore .ruff_cache
+.ruff_cache/
+
+# ignore .coverage
+.coverage/
+
+# ignore .pytest_cache
+.pytest_cache/
+
+# ignore .mypy_cache
+.mypy_cache/
+
+# ignore .ipynb_checkpoints
+.ipynb_checkpoints/
+
+# ignore .pytest_cache
+.pytest_cache/
+
+# ignore all the output folders
+output/
+output_cache/
+output_results/
+output_cache/
+test_output/
+cli_output/
+
+# ignore all the cache folders
+task_distance_cache/
+
+# ignore embeddings and generated cache files
+datasets/embeddings/
+datasets/protein_features_cache.pkl
+datasets/processing_summary.json

.readthedocs.yaml

Lines changed: 13 additions & 16 deletions
@@ -1,23 +1,20 @@
-# .readthedocs.yml
+# .readthedocs.yaml
 # Read the Docs configuration file
 # See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
 
-# Required
 version: 2
 
-# Optionally set the version of Python and requirements required to build your docs
-python:
-  version: "3.8"
-  install:
-    - method: pip
-      path: .
-      extra_requirements:
-        - rtd
-    - requirements: docs/requirements.txt
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.10"
 
+mkdocs:
+  configuration: mkdocs.yml
 
-# Build documentation in the docs/ directory with Sphinx
-sphinx:
-  builder: html
-  configuration: docs/source/conf.py
-  fail_on_warning: true
+python:
+  install:
+    - method: pip
+      path: .
+      extra_requirements:
+        - docs

README.md

Lines changed: 72 additions & 94 deletions
@@ -19,7 +19,7 @@ A Python library for calculating distances between chemical datasets to enable i
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Usage Examples](#usage-examples)
-- [Use Cases](#use-cases)
+- [Reproducing FS-Mol Experiments](#reproducing-fs-mol-experiments)
 - [Documentation](#documentation)
 - [Contributing](#contributing)
 - [Citation](#citation)
@@ -92,109 +92,90 @@ pip install -e . --no-deps
 
 ## Quick Start
 
-### Basic Dataset Analysis
+### Compute Dataset Distances
+
+The simplest way to compute distances between molecular datasets:
 
 ```python
-import os
-from dpu_utils.utils.richpath import RichPath
-from themap.data.molecule_dataset import MoleculeDataset
-
-# Load datasets
-source_dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))
-source_dataset = MoleculeDataset.load_from_file(source_dataset_path)
-
-# Basic dataset analysis (works with minimal installation)
-print(f"Dataset size: {len(source_dataset)}")
-print(f"Positive ratio: {source_dataset.get_ratio}")
-print(f"Dataset statistics: {source_dataset.get_statistics()}")
-
-# Validate dataset integrity
-try:
-    source_dataset.validate_dataset_integrity()
-    print("✅ Dataset is valid")
-except ValueError as e:
-    print(f"❌ Dataset validation failed: {e}")
-```
+from themap import quick_distance
 
-### Molecular Embeddings
+results = quick_distance(
+    data_dir="datasets",  # Directory with train/ and test/ folders
+    output_dir="output",  # Where to save results
+    molecule_featurizer="ecfp",  # Fingerprint type (ecfp, maccs, etc.)
+    molecule_method="euclidean",  # Distance metric
+)
 
-```python
-# Only works with pip install -e ".[ml]" or higher
-from themap.data.molecule_dataset import MoleculeDataset
-dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))
-
-# Load dataset
-dataset = MoleculeDataset.load_from_file(dataset_path)
-
-# Calculate molecular embeddings (requires ML dependencies)
-try:
-    features = dataset.get_features("ecfp")
-    print(f"Features shape: {features.shape}")
-except ImportError:
-    print("❌ ML dependencies not installed. Use: pip install -e '.[ml]'")
+# Results saved to output/molecule_distances.csv
 ```
 
-### Distance Calculation
+### Using a Config File
+
+For reproducible experiments, use a YAML configuration:
 
 ```python
-# Only works with pip install -e ".[all]"
-from themap.data.tasks import Tasks, Task
-from themap.distance import MoleculeDatasetDistance, ProteinDatasetDistance, TaskDistance
-
-# Create Tasks collection from your datasets
-source_dataset_path = RichPath.create(os.path.join("datasets", "train", "CHEMBL1023359.jsonl.gz"))
-source_dataset = MoleculeDataset.load_from_file(source_dataset_path)
-target_dataset_path = RichPath.create(os.path.join("datasets", "test", "CHEMBL2219358.jsonl.gz"))
-target_dataset = MoleculeDataset.load_from_file(target_dataset_path)
-source_task = Task(task_id="CHEMBL1023359", molecule_dataset=source_dataset)
-target_task = Task(task_id="CHEMBL2219358", molecule_dataset=target_dataset)
-
-# Step 1: Create Tasks collection with train/test split
-tasks = Tasks(train_tasks=[source_task], test_tasks=[target_task])
-
-# Step 2: Compute molecule distance with method-specific configuration
-try:
-    # Use different methods for different data types
-    mol_dist = MoleculeDatasetDistance(
-        tasks=tasks,
-        molecule_method="otdd",  # OTDD for molecules
-    )
-    mol_dist._compute_features()
-    distance = mol_dist.get_distance()
-    print(distance)
-
-except ImportError:
-    print("❌ Distance calculation dependencies not installed. Use: pip install -e '.[all]'")
+from themap import run_pipeline
+
+results = run_pipeline("config.yaml")
 ```
 
+Example `config.yaml`:
+```yaml
+data:
+  directory: "datasets"
 
-## Usage Examples
+molecule:
+  enabled: true
+  featurizer: "ecfp"
+  method: "euclidean"
 
-### Transfer Learning Dataset Selection
-```python
-# Find the most similar training datasets for your target task
-candidate_datasets = ["CHEMBL1023359", "CHEMBL2219358", "CHEMBL1243967"]
-target_dataset = "my_target_assay"
+output:
+  directory: "output"
+  format: "csv"
+```
+
+### Data Format
+
+Organize your data in this structure:
 
-distances = calculate_all_distances(candidate_datasets, target_dataset)
-best_source = min(distances, key=distances.get)  # Closest dataset for transfer learning
+```
+datasets/
+├── train/  # Source datasets
+│   ├── CHEMBL123456.jsonl.gz
+│   └── ...
+└── test/  # Target datasets
+    ├── CHEMBL111111.jsonl.gz
+    └── ...
 ```
 
-### Domain Adaptation Assessment
-```python
-# Assess how much domain shift exists between datasets
-domain_gap = calculate_dataset_distance(source_domain, target_domain)
-if domain_gap < threshold:
-    print("Direct transfer likely to work well")
-else:
-    print("Domain adaptation strategies recommended")
+Each `.jsonl.gz` file contains molecules in JSON lines format:
+```json
+{"SMILES": "CCO", "Property": 1}
+{"SMILES": "CCCO", "Property": 0}
 ```
 
-### Task Hardness Prediction
+
+## Usage Examples
+
+### Analyzing Distance Results
+
 ```python
-# Predict task difficulty based on dataset characteristics
-hardness_score = estimate_task_hardness(dataset, reference_datasets)
-print(f"Predicted task difficulty: {hardness_score}")
+import pandas as pd
+
+# Load computed distances
+distances = pd.read_csv("output/molecule_distances.csv", index_col=0)
+
+# Find closest source for each target (transfer learning selection)
+for target in distances.columns:
+    closest = distances[target].idxmin()
+    dist = distances[target].min()
+    print(f"{target} <- {closest} (distance: {dist:.4f})")
+
+# Estimate task hardness (average distance to k-nearest sources)
+k = 3
+for target in distances.columns:
+    hardness = distances[target].nsmallest(k).mean()
+    print(f"Task hardness for {target}: {hardness:.4f}")
 ```
 
 ## Reproducing FS-Mol Experiments
@@ -204,7 +185,7 @@ Pre-computed molecular embeddings and distance matrices for the FS-Mol dataset a
 ### Setup
 1. Download data from [Zenodo](https://zenodo.org/records/10605093)
 2. Extract to `datasets/fsmol_hardness/`
-3. Run the provided Jupyter notebooks in the `notebooks/` directory
+3. See `examples/` directory for usage examples
 
 ## Documentation
 
@@ -261,11 +242,8 @@ If you use THEMAP in your research, please cite our paper:
 
 This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
-## 🤝 Support
-
-- 📖 [Documentation](https://hfooladi.github.io/THEMAP/)
-- 🐛 [Issue Tracker](https://github.com/HFooladi/THEMAP/issues)
-- 💬 [Discussions](https://github.com/HFooladi/THEMAP/discussions)
----
+## Support
 
-**Ready to optimize your chemical dataset selection for machine learning?** Start with THEMAP today! 🚀
+- [Documentation](https://hfooladi.github.io/THEMAP/)
+- [Issue Tracker](https://github.com/HFooladi/THEMAP/issues)
+- [Discussions](https://github.com/HFooladi/THEMAP/discussions)
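
Note on the README's new "Data Format" section: the layout can be exercised without THEMAP installed. The following is a minimal sketch, assuming only the `train/` folder layout and the `SMILES`/`Property` keys shown in the README diff. The helpers `write_task` and `read_task` are hypothetical names for illustration and are not part of the THEMAP API.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_task(path: Path, records: list[dict]) -> None:
    """Write records as gzip-compressed JSON lines (one JSON object per line)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def read_task(path: Path) -> list[dict]:
    """Read a .jsonl.gz file back into a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Build a throwaway datasets/train/ folder holding the two README records.
root = Path(tempfile.mkdtemp()) / "datasets"
task_file = root / "train" / "CHEMBL123456.jsonl.gz"
write_task(task_file, [
    {"SMILES": "CCO", "Property": 1},
    {"SMILES": "CCCO", "Property": 0},
])

molecules = read_task(task_file)
# Fraction of molecules labelled 1 (assumes Property is a 0/1 label).
positive_ratio = sum(m["Property"] for m in molecules) / len(molecules)
print(len(molecules), positive_ratio)
```

Writing one JSON object per line keeps each task file streamable, so a reader never has to hold more than one record's text in memory at a time.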

configs/pipeline_example.yaml

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,34 @@
1+
# THEMAP Pipeline Configuration Example
2+
# Run with: themap run configs/pipeline_example.yaml
3+
4+
data:
5+
directory: "datasets" # Path to dataset directory
6+
task_list: null # Optional: task list JSON file (auto-discover if null)
7+
8+
distances:
9+
molecule:
10+
enabled: true
11+
featurizer: "ecfp" # Options: ecfp, maccs, desc2D, mordred, ChemBERTa-77M-MLM, etc.
12+
method: "euclidean" # Options: euclidean, cosine, otdd
13+
14+
protein:
15+
enabled: false # Set to true if you have protein FASTA files
16+
featurizer: "esm2_t33_650M_UR50D"
17+
method: "cosine" # Options: euclidean, cosine, manhattan
18+
layer: null # Auto-detect based on model
19+
20+
combination:
21+
strategy: "weighted_average" # Options: average, weighted_average, separate
22+
weights:
23+
molecule: 0.7
24+
protein: 0.3
25+
26+
output:
27+
directory: "output/"
28+
format: "csv" # Options: csv, json, npz
29+
save_features: true # Cache features for reuse
30+
31+
compute:
32+
n_jobs: 8 # Parallel workers
33+
batch_size: 1000 # Batch size for featurization
34+
device: "auto" # Options: auto, cpu, cuda
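
Note on the `combination` block: how the `weighted_average` strategy merges the molecule and protein distance matrices is not shown in this diff. As a rough illustration only, the sketch below assumes the combined distance for each (source, target) pair is the weight-normalized sum of the two distances. The function `combine_weighted` and the toy values are hypothetical and do not reflect THEMAP's actual implementation.

```python
from typing import Dict

# Distances keyed as matrix[source_task][target_task]; toy values only.
Matrix = Dict[str, Dict[str, float]]


def combine_weighted(molecule: Matrix, protein: Matrix,
                     w_molecule: float = 0.7, w_protein: float = 0.3) -> Matrix:
    """Weight-normalized average of two distance matrices with identical keys."""
    total = w_molecule + w_protein
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        src: {
            tgt: (w_molecule * molecule[src][tgt]
                  + w_protein * protein[src][tgt]) / total
            for tgt in molecule[src]
        }
        for src in molecule
    }


molecule_dist = {"CHEMBL894522": {"CHEMBL2219236": 1.0, "CHEMBL2219358": 0.4}}
protein_dist = {"CHEMBL894522": {"CHEMBL2219236": 0.0, "CHEMBL2219358": 0.8}}

combined = combine_weighted(molecule_dist, protein_dist)
print(combined)
```

Dividing by the weight sum keeps the result on the same scale when the weights do not add up to 1. Euclidean and cosine distances live on different scales, so the two matrices may need rescaling before a weighted average is meaningful.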

datasets/sample_tasks_list.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-{"train": ["CHEMBL894522", "CHEMBL1023359", "CHEMBL2218944", "CHEMBL2219012", "CHEMBL3371729", "CHEMBL3705844", "CHEMBL3866221", "CHEMBL4224224"], "valid": [], "test": ["CHEMBL2219236", "CHEMBL2219358"]}
+{"train": ["CHEMBL894522", "CHEMBL1023359", "CHEMBL2218944", "CHEMBL2219012", "CHEMBL3371729", "CHEMBL3705844", "CHEMBL3866221", "CHEMBL4224224"], "valid": [], "test": ["CHEMBL2219236", "CHEMBL2219358", "CHEMBL1963831"]}
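
Note on the task list: the file maps split names to ChEMBL assay IDs, and this change adds a third test task. A minimal sketch of resolving such a list to dataset paths follows. It assumes each ID corresponds to `<split>/<ID>.jsonl.gz` under the data directory, which is inferred from the README's data layout and not confirmed by this diff. The `expected_files` helper is hypothetical, and the train list is shortened for brevity.

```python
import json
from pathlib import Path

# Shortened copy of the updated task list (train IDs truncated for brevity).
task_list_json = (
    '{"train": ["CHEMBL894522", "CHEMBL1023359"], "valid": [], '
    '"test": ["CHEMBL2219236", "CHEMBL2219358", "CHEMBL1963831"]}'
)


def expected_files(task_list: dict, data_dir: Path) -> dict:
    """Map each split to the dataset paths its task IDs would resolve to."""
    return {
        split: [data_dir / split / f"{task_id}.jsonl.gz" for task_id in task_ids]
        for split, task_ids in task_list.items()
    }


task_list = json.loads(task_list_json)
files = expected_files(task_list, Path("datasets"))

for split, paths in files.items():
    print(split, [p.as_posix() for p in paths])
```

Checking `Path.exists()` on each resolved path before running a pipeline would surface task IDs that have no matching dataset file.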

datasets/test/test_proteins.fasta

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@ MSLHFLYYCSEPTLDVKIAFCQGFDKQVDVSYIAKHYNMSKSKVDNQFYSVEVGDSTFTVLKRYQNLKPIGSGAQGIVCA
33
>sp|Q13177|PAK2_HUMAN
44
MSDNGELEDKPPAPPVRMSSTIFSTGGKDPLSANHSLKPLPSVPEEKKPRHKIISIFSGTEKGSKKKEKERPEISPPSDFEHTIHVGFDAVTGEFTGMPEQWARLLQTSNITKLEQKKNPQAVLDVLKFYDSNTVKQKYLSFTPPEKDGFPSGTPALNAKGTEAPAVVTEEEDDDEETAPPVIAPRPDHTKSIYTRSVIDPVPAPVGDSHVDGAAKSLDKQKKKTKMTDEEIMEKLRTIVSIGDPKKKYTRYEKIGQGASGTVFTATDVALGQEVAIKQINLQKQPKKELIINEILVMKELKNPNIVNFLDSYLVGDELFVVMEYLAGGSLTDVVTETCMDEAQIAAVCRECLQALEFLHANQVIHRDIKSDNVLLGMEGSVKLTDFGFCAQITPEQSKRSTMVGTPYWMAPEVVTRKAYGPKVDIWSLGIMAIEMVEGEPPYLNENPLRALYLIATNGTPELQNPEKLSPIFRDFLNRCLEMDVEKRGSAKELLQHPFLKLAKPLSSLTPLIMAAKEAMKSNR
55
>sp|P50750|CDK9_HUMAN
6-
MAKQYDSVECPFCDEVSKYEKLAKIGQGTFGEVFKARHRKTGQKVALKKVLMENEKEGFPITALREIKILQLLKHENVVNLIEICRTKASPYNRCKGSIYLVFDFCEHDLAGLLSNVLVKFTLSEIKRVMQMLLNGLYYIHRNKILHRDMKAANVLITRDGVLKLADFGLARAFSLAKNSQPNRYTNRVVTLWYRPPELLLGERDYGPPIDLWGAGCIMAEMWTRSPIMQGNTEQHQLALISQLCGSITPEVWPNVDNYELYEKLELVKGQKRKVKDRLKAYVRDPYALDLIDKLLVLDPAQRIDSDDALNHDFFWSDPMPSDLKGMLSTHLTSMFEYLAPPRRKGSQITQQSTNQSRNPATTNQTEFERVF
6+
MAKQYDSVECPFCDEVSKYEKLAKIGQGTFGEVFKARHRKTGQKVALKKVLMENEKEGFPITALREIKILQLLKHENVVNLIEICRTKASPYNRCKGSIYLVFDFCEHDLAGLLSNVLVKFTLSEIKRVMQMLLNGLYYIHRNKILHRDMKAANVLITRDGVLKLADFGLARAFSLAKNSQPNRYTNRVVTLWYRPPELLLGERDYGPPIDLWGAGCIMAEMWTRSPIMQGNTEQHQLALISQLCGSITPEVWPNVDNYELYEKLELVKGQKRKVKDRLKAYVRDPYALDLIDKLLVLDPAQRIDSDDALNHDFFWSDPMPSDLKGMLSTHLTSMFEYLAPPRRKGSQITQQSTNQSRNPATTNQTEFERVF

datasets/train/train_proteins.fasta

Lines changed: 1 addition & 1 deletion
@@ -17,4 +17,4 @@ MTMTLHTKASGMALLHQIQGNELEPLNRPQLKIPLERPLGEVYLDSSKPAVYNYPEGAAYEFNAAAAANAQVYGQTGLPY
1717
>SP|O75460|ERN1_HUMAN
1818
MPARRLLLLLTLLLPGLGIFGSTSTVTLPETLLFVSTLDGSLHAVSKRTGSIKWTLKEDPVLQVPTHVEEPAFLPDPNDGSLYTLGSKNNEGLTKLPFTIPELVQASPCRSSDGILYMGKKQDIWYVIDLLTGEKQQTLSSAFADSLCPSTSLLYLGRTEYTITMYDTKTRELRWNATYFDYAASLPEDDVDYKMSHFVSNGDGLVVTVDSESGDVLWIQNYASPVVAFYVWQREGLRKVMHINVAVETLRYLTFMSGEVGRITKWKYPFPKETEAKSKLTPTLYVGKYSTSLYASPSMVHEGVAVVPRGSTLPLLEGPQTDGVTIGDKGECVITPSTDVKFDPGLKSKNKLNYLRNYWLLIGHHETPLSASTKMLERFPNNLPKHRENVIPADSEKKSFEEVINLVDQTSENAPTTVSRDVEEKPAHAPARPEAPVDSMLKDMATIILSTFLLIGWVAFIITYPLSMHQQQQLQHQQFQKELEKIQLLQQQQQQLPFHPPGDTAQDGELLDTSGPYSESSGTSSPSTSPRASNHSLCSGSSASKAGSSPSLEQDDGDEETSVVIVGKISFCPKDVLGHGAEGTIVYRGMFDNRDVAVKRILPECFSFADREVQLLRESDEHPNVIRYFCTEKDRQFQYIAIELCAATLQEYVEQKDFAHLGLEPITLLQQTTSGLAHLHSLNIVHRDLKPHNILISMPNAHGKIKAMISDFGLCKKLAVGRHSFSRRSGVPGTEGWIAPEMLSEDCKENPTYTVDIFSAGCVFYYVISEGSHPFGKSLQRQANILLGACSLDCLHPEKHEDVIARELIEKMIAMDPQKRPSAKHVLKHPFFWSLEKQLQFFQDVSDRIEKESLDGPIVKQLERGGRAVVKMDWRENITVPLQTDLRKFRTYKGGSVRDLLRAMRNKKHHYRELPAEVRETLGSLPDDFVCYFTSRFPHLLAHTYRAMELCSHERLFQPYYFHEPPEPQPPVTPDAL
1919
>SP|Q16581|C3AR_HUMAN
20-
MASFSAETNSTDLLSQPWNEPPVILSMVILSLTFLLGLPGNGLVLWVAGLKMQRTVNTIWFLHLTLADLLCCLSLPFSLAHLALQGQWPYGRFLCKLIPSIIVLNMFASVFLLTAISLDRCLVVFKPIWCQNHRNVGMACSICGCIWVVAFVMCIPVFVYREIFTTDNHNRCGYKFGLSSSLDYPDFYGDPLENRSLENIVQPPGEMNDRLDPSSFQTNDHPWTVPTVFQPQTFQRPSADSLPRGSARLTSQNLYSNVFKPADVVSPKIPSGFPIEDHETSPLDNSDAFLSTHLKLFPSASSNSFYESELPQGFQDYYNLGQFTDDDQVPTPLVAITITRLVVGFLLPSVIMIACYSFIVFRMQRGRFAKSQSKTFRVAVVVVAVFLVCWTPYHIFGVLSLLTDPETPLGKTLMSWDHVCIALASANSCFNPFLYALLGKDFRKKARQSIQGILEAAFSEELTRSTHCPSNNVISERNSTTV
20+
MASFSAETNSTDLLSQPWNEPPVILSMVILSLTFLLGLPGNGLVLWVAGLKMQRTVNTIWFLHLTLADLLCCLSLPFSLAHLALQGQWPYGRFLCKLIPSIIVLNMFASVFLLTAISLDRCLVVFKPIWCQNHRNVGMACSICGCIWVVAFVMCIPVFVYREIFTTDNHNRCGYKFGLSSSLDYPDFYGDPLENRSLENIVQPPGEMNDRLDPSSFQTNDHPWTVPTVFQPQTFQRPSADSLPRGSARLTSQNLYSNVFKPADVVSPKIPSGFPIEDHETSPLDNSDAFLSTHLKLFPSASSNSFYESELPQGFQDYYNLGQFTDDDQVPTPLVAITITRLVVGFLLPSVIMIACYSFIVFRMQRGRFAKSQSKTFRVAVVVVAVFLVCWTPYHIFGVLSLLTDPETPLGKTLMSWDHVCIALASANSCFNPFLYALLGKDFRKKARQSIQGILEAAFSEELTRSTHCPSNNVISERNSTTV
