
Commit 089ede3

pipeline for provenance
1 parent b161820 commit 089ede3

21 files changed: +7306 −0 lines changed

mpcontribs-lux/mpcontribs/lux/projects/alab/pipelines/data/analyses/base_analyzer.py

Lines changed: 479 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# A-Lab Pipeline Configuration

Configuration system with three-layer priority:

```
1. Environment Variables (highest) → 2. YAML Files → 3. Code Defaults (fallback)
```
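
For illustration, here is a minimal sketch of how this precedence can be resolved. The `resolve` helper below is hypothetical (not the actual `config_loader` internals) and assumes PyYAML is available:

```python
import os

import yaml  # PyYAML, assumed available


def resolve(env_var, yaml_path, default, yaml_file="data/config/defaults.yaml"):
    """Hypothetical three-layer lookup: env var > YAML file > code default."""
    if env_var in os.environ:  # 1. environment variable (highest)
        return os.environ[env_var]
    try:
        with open(yaml_file) as f:
            node = yaml.safe_load(f) or {}
        for key in yaml_path:  # walk nested keys, e.g. ['mongodb', 'uri']
            node = node[key]
        return node  # 2. YAML file
    except (OSError, KeyError, TypeError):
        return default  # 3. code default (fallback)


uri = resolve("ALAB_MONGO_URI", ["mongodb", "uri"], "mongodb://localhost:27017/")
```
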
## Quick Start

### Using Environment Variables (Recommended for Production)

**Option 1: Copy example file**

```bash
# Copy the example .env file (with current defaults)
cp data/config/env.example .env

# Edit .env and uncomment/modify values
# Then source it before running
source .env
./update_data.sh
```

**Option 2: Export directly**

```bash
# Set MongoDB connection
export ALAB_MONGO_URI="mongodb://production-host:27017/"
export ALAB_MONGO_DB="production"

# Set S3 bucket
export ALAB_S3_BUCKET="my-custom-bucket"

# Run pipeline (uses env vars automatically)
./update_data.sh
```

### Using YAML Files (Recommended for Development)

Edit `data/config/defaults.yaml`:

```yaml
mongodb:
  uri: 'mongodb://localhost:27017/'
  database: 'my_database'
  collection: 'my_collection'
```

### View Current Configuration

```bash
python data/config/config_loader.py
```

Shows all loaded values and their sources (env vs. YAML vs. defaults).
## Configuration Files

| File                 | Purpose                         |
| -------------------- | ------------------------------- |
| **defaults.yaml**    | Global pipeline defaults        |
| **filters.yaml**     | Experiment filter presets       |
| **analyses.yaml**    | Analysis plugin documentation   |
| **config_loader.py** | Configuration loading system    |
| **env.example**      | Example env file (copy to .env) |

## Environment Variables

### MongoDB

```bash
ALAB_MONGO_URI=mongodb://localhost:27017/  # MongoDB connection URI
ALAB_MONGO_DB=temporary                    # Database name
ALAB_MONGO_COLLECTION=release              # Collection name
```
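
As a hedged sketch, a script can consume these settings with `pymongo`; the variable names and defaults match the block above, everything else is illustrative:

```python
import os

from pymongo import MongoClient

# Fall back to the documented defaults when the env vars are unset.
client = MongoClient(os.getenv("ALAB_MONGO_URI", "mongodb://localhost:27017/"))
db = client[os.getenv("ALAB_MONGO_DB", "temporary")]
collection = db[os.getenv("ALAB_MONGO_COLLECTION", "release")]
print(collection.estimated_document_count())
```
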
### S3 Upload

```bash
ALAB_S3_BUCKET=materialsproject-contribs  # S3 bucket name
ALAB_S3_PREFIX=alab_synthesis             # S3 prefix path
ALAB_S3_EXCLUDE_LARGE=true                # Exclude large files
ALAB_S3_LARGE_THRESHOLD_MB=50             # Large file threshold (MB)
```
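
For context, a minimal `boto3` sketch of how these four values could drive an upload; the file path is hypothetical and the pipeline's real upload code may differ:

```python
import os

import boto3

bucket = os.getenv("ALAB_S3_BUCKET", "materialsproject-contribs")
prefix = os.getenv("ALAB_S3_PREFIX", "alab_synthesis")
exclude_large = os.getenv("ALAB_S3_EXCLUDE_LARGE", "true").lower() == "true"
threshold_mb = float(os.getenv("ALAB_S3_LARGE_THRESHOLD_MB", "50"))

path = "experiments.parquet"  # hypothetical local file
size_mb = os.path.getsize(path) / (1024 * 1024)
if exclude_large and size_mb > threshold_mb:
    print(f"skipping {path}: {size_mb:.1f} MB exceeds {threshold_mb} MB")
else:
    boto3.client("s3").upload_file(path, bucket, f"{prefix}/{path}")
```
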
### Parquet Options

```bash
ALAB_SKIP_TEMP_LOGS=false        # Skip temperature logs
ALAB_SKIP_XRD_POINTS=false       # Skip XRD data points
ALAB_SKIP_WORKFLOW_TASKS=false   # Skip workflow tasks
ALAB_PARQUET_COMPRESSION=snappy  # Compression: snappy, gzip, brotli
ALAB_PARQUET_ENGINE=pyarrow      # Engine: pyarrow, fastparquet
```
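
The engine and compression options map directly onto standard pandas Parquet arguments; a small sketch (the DataFrame contents are a stand-in, the option wiring is the point):

```python
import os

import pandas as pd

engine = os.getenv("ALAB_PARQUET_ENGINE", "pyarrow")           # pyarrow or fastparquet
compression = os.getenv("ALAB_PARQUET_COMPRESSION", "snappy")  # snappy, gzip, brotli

df = pd.DataFrame({"experiment_name": ["exp_001"], "status": ["completed"]})
df.to_parquet("experiments.parquet", engine=engine, compression=compression)
```
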
### Materials Project API

```bash
ALAB_MP_API_KEY=your_api_key  # MP API key (for XRD analysis)
# OR
MP_API_KEY=your_api_key       # Alternative name
```
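
Since either variable name is accepted, client code presumably checks both; a sketch of that fallback, with the `ALAB_`-prefixed name assumed to take precedence:

```python
import os

# Check the ALAB_-prefixed name first, then the generic one (assumed order).
api_key = os.getenv("ALAB_MP_API_KEY") or os.getenv("MP_API_KEY")
if api_key is None:
    raise RuntimeError("Set ALAB_MP_API_KEY or MP_API_KEY to enable XRD analysis")
```
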
## Usage in Scripts

### Python

```python
from config_loader import get_config

# Get configuration
config = get_config()

# Access values
print(config.mongo_uri)   # mongodb://localhost:27017/
print(config.mongo_db)    # temporary
print(config.s3_bucket)   # materialsproject-contribs

# Or use convenience functions
from config_loader import get_mongo_uri, get_s3_bucket

uri = get_mongo_uri()  # Gets from env > yaml > default
bucket = get_s3_bucket()
```

### Shell Scripts

```bash
# Use environment variables directly
: ${ALAB_MONGO_URI:="mongodb://localhost:27017/"}

# Or source from .env file
if [ -f data/.env ]; then
  export $(grep -v '^#' data/.env | xargs)
fi
```

## Configuration Priority Examples

### Example 1: All from YAML

```bash
# No env vars set
$ python data/config/config_loader.py
MongoDB URI: mongodb://localhost:27017/ (from YAML)
```

### Example 2: Override with Env

```bash
# Set env var
$ export ALAB_MONGO_URI="mongodb://production:27017/"
$ python data/config/config_loader.py
MongoDB URI: mongodb://production:27017/ (from ENV) ✓
```

### Example 3: Mixed Sources

```bash
# Some from env, some from yaml
$ export ALAB_MONGO_URI="mongodb://prod:27017/"  # Custom URI
# Leave ALAB_MONGO_DB unset                      # Use YAML default
$ python data/config/config_loader.py
MongoDB URI: mongodb://prod:27017/ (from ENV) ✓
MongoDB DB: temporary (from YAML)
```

## Best Practices

1. **Development**: Use `defaults.yaml` for local development
2. **Production**: Use environment variables for sensitive values
3. **Testing**: Use env vars to point to test databases
4. **CI/CD**: Set env vars in your deployment pipeline
5. **Never commit** `.env` files (already in `.gitignore`)

## Troubleshooting

### Config not loading?

```bash
# Check current config
python data/config/config_loader.py

# Verify env vars are set
env | grep ALAB_
```

### Want to use .env file?

```bash
# Create from example
cp data/config/env.example .env

# Edit .env with your values (uncomment lines to override defaults)
nano .env

# Source it before running scripts
source .env
./update_data.sh
```

### Reset to defaults

```bash
# Unset all ALAB env vars
unset $(env | grep ALAB_ | cut -d= -f1)

# Now uses YAML/defaults only
./update_data.sh
```
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# =============================================================================
# A-Lab Analysis Registry
# =============================================================================
# Documentation for available analysis plugins.
# Analyses are auto-discovered from data/analyses/*.py
#
# This file serves as:
# 1. Documentation of available analyses
# 2. Default configuration for each analysis
# 3. Template for adding new analyses
# =============================================================================

# Built-in analyses (always available)
analyses:
  xrd_dara:
    description: 'XRD phase identification using DARA'
    class: XRDAnalyzer
    file: base_analyzer.py
    cli_flag: '--xrd'
    output_parquet: xrd_refinements.parquet, xrd_phases.parquet
    default_config:
      wmin: 10 # Minimum 2-theta angle
      wmax: 80 # Maximum 2-theta angle
      save_viz: false # Save visualization images
    outputs:
      - xrd_success: 'Whether analysis succeeded'
      - xrd_rwp: 'Weighted profile R-factor'
      - xrd_num_phases: 'Number of phases identified'
      - xrd_error: 'Error message if failed'
    requirements:
      - experiments.parquet (with xrd_sampleid_in_aeris)
      - xrd_data_points.parquet (optional, for patterns)
    notes: |
      Uses DARA (Deep Analysis for Rietveld Automation) for automated
      phase identification. Requires MP API key for CIF downloads.
      Results stored in data/xrd_creation/results/

  powder_statistics:
    description: 'Calculate powder dosing statistics'
    class: PowderStatisticsAnalyzer
    file: base_analyzer.py
    cli_flag: '--powder-stats'
    output_parquet: null # Results merged into main output
    default_config: {}
    outputs:
      - powder_avg_accuracy: 'Average dosing accuracy %'
      - powder_total_doses: 'Total number of doses'
      - powder_unique_count: 'Number of unique powders'
      - powder_total_mass_g: 'Total powder mass in grams'
    requirements:
      - experiments.parquet
      - powder_doses.parquet
    notes: |
      Calculates statistics about powder dosing accuracy.
      Fast analysis, recommended for all products.
# =============================================================================
# Adding a New Analysis
# =============================================================================
# To add a new analysis (e.g., SEM clustering):
#
# 1. Create the analyzer file:
#    data/analyses/sem_analyzer.py
#
# 2. Define the analyzer class:
#    ```python
#    import pandas as pd
#
#    from base_analyzer import BaseAnalyzer
#
#    class SEMAnalyzer(BaseAnalyzer):
#        name = "sem_clustering"
#        description = "Cluster SEM images by morphology"
#        cli_flag = "--sem"
#
#        def analyze(self, experiments_df, parquet_dir):
#            # Your analysis logic here
#            results = []
#            for _, exp in experiments_df.iterrows():
#                # Process each experiment
#                results.append({
#                    'experiment_name': exp['name'],
#                    'cluster_id': compute_cluster(exp),      # your helper
#                    'morphology_score': compute_score(exp),  # your helper
#                })
#            return pd.DataFrame(results)
#
#        def get_output_schema(self):
#            return {
#                'cluster_id': {'type': 'int', 'required': True},
#                'morphology_score': {'type': 'float', 'required': False},
#            }
#    ```
#
# 3. Document here (optional):
#    sem_clustering:
#      description: "Cluster SEM images by morphology"
#      class: SEMAnalyzer
#      file: sem_analyzer.py
#      ...
#
# 4. The analysis will be auto-discovered on next pipeline run
# =============================================================================
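
# For reference, a minimal sketch of such plugin auto-discovery (hypothetical,
# not the pipeline's actual loader; assumes every analyzer subclasses
# BaseAnalyzer and exposes a `name` attribute):
# ```python
# import importlib.util
# import inspect
# from pathlib import Path
#
# from base_analyzer import BaseAnalyzer
#
# def discover_analyzers(directory="data/analyses"):
#     analyzers = {}
#     for path in sorted(Path(directory).glob("*.py")):
#         spec = importlib.util.spec_from_file_location(path.stem, path)
#         module = importlib.util.module_from_spec(spec)
#         spec.loader.exec_module(module)  # run the plugin file
#         for _, cls in inspect.getmembers(module, inspect.isclass):
#             if issubclass(cls, BaseAnalyzer) and cls is not BaseAnalyzer:
#                 analyzers[cls.name] = cls  # keyed by the class's name attr
#     return analyzers
# ```
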
# Placeholder for future analyses
# Uncomment and modify when adding:

# sem_clustering:
#   description: "Cluster SEM images by morphology"
#   class: SEMAnalyzer
#   file: sem_analyzer.py
#   cli_flag: "--sem"
#   default_config:
#     num_clusters: 5
#     feature_extraction: "resnet50"
#   outputs:
#     - cluster_id: "Cluster assignment"
#     - morphology_score: "Morphology similarity score"
#   requirements:
#     - SEM images in experiments/*/SEM images/
#   notes: |
#     Uses computer vision to cluster SEM images.
#     Requires tensorflow/pytorch.

# heating_profile:
#   description: "Analyze heating profile characteristics"
#   class: HeatingProfileAnalyzer
#   file: heating_analyzer.py
#   cli_flag: "--heating"
#   default_config: {}
#   outputs:
#     - heating_rate_avg: "Average heating rate"
#     - overshoot_celsius: "Temperature overshoot"
#     - time_at_target: "Time at target temperature"
#   requirements:
#     - temperature_logs.parquet
