
Commit 089ede3

pipeline for provenance
1 parent b161820 commit 089ede3

21 files changed: +7306 −0 lines changed

mpcontribs-lux/mpcontribs/lux/projects/alab/pipelines/data/analyses/base_analyzer.py

Lines changed: 479 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# A-Lab Pipeline Configuration

Configuration system with three-layer priority:

```
1. Environment Variables (highest) → 2. YAML Files → 3. Code Defaults (fallback)
```
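
For illustration, here is a minimal sketch of how this precedence can be resolved. The `resolve` helper below is hypothetical (not the actual `config_loader` internals) and assumes PyYAML is available:

```python
import os

import yaml  # PyYAML, assumed available


def resolve(env_var, yaml_path, default, yaml_file="data/config/defaults.yaml"):
    """Hypothetical three-layer lookup: env var > YAML file > code default."""
    if env_var in os.environ:  # 1. environment variable (highest)
        return os.environ[env_var]
    try:
        with open(yaml_file) as f:
            node = yaml.safe_load(f) or {}
        for key in yaml_path:  # walk nested keys, e.g. ['mongodb', 'uri']
            node = node[key]
        return node  # 2. YAML file
    except (OSError, KeyError, TypeError):
        return default  # 3. code default (fallback)


uri = resolve("ALAB_MONGO_URI", ["mongodb", "uri"], "mongodb://localhost:27017/")
```
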
## Quick Start

### Using Environment Variables (Recommended for Production)

**Option 1: Copy example file**

```bash
# Copy the example .env file (with current defaults)
cp data/config/env.example .env

# Edit .env and uncomment/modify values
# Then source it before running
source .env
./update_data.sh
```

**Option 2: Export directly**

```bash
# Set MongoDB connection
export ALAB_MONGO_URI="mongodb://production-host:27017/"
export ALAB_MONGO_DB="production"

# Set S3 bucket
export ALAB_S3_BUCKET="my-custom-bucket"

# Run pipeline (uses env vars automatically)
./update_data.sh
```

### Using YAML Files (Recommended for Development)

Edit `data/config/defaults.yaml`:

```yaml
mongodb:
  uri: 'mongodb://localhost:27017/'
  database: 'my_database'
  collection: 'my_collection'
```

### View Current Configuration

```bash
python data/config/config_loader.py
```

Shows all loaded values and their sources (env vs. YAML vs. defaults).
## Configuration Files

| File                 | Purpose                         |
| -------------------- | ------------------------------- |
| **defaults.yaml**    | Global pipeline defaults        |
| **filters.yaml**     | Experiment filter presets       |
| **analyses.yaml**    | Analysis plugin documentation   |
| **config_loader.py** | Configuration loading system    |
| **env.example**      | Example env file (copy to .env) |

## Environment Variables

### MongoDB

```bash
ALAB_MONGO_URI=mongodb://localhost:27017/  # MongoDB connection URI
ALAB_MONGO_DB=temporary                    # Database name
ALAB_MONGO_COLLECTION=release              # Collection name
```
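
As a hedged sketch, a script can consume these settings with `pymongo`; the variable names and defaults match the block above, everything else is illustrative:

```python
import os

from pymongo import MongoClient

# Fall back to the documented defaults when the env vars are unset.
client = MongoClient(os.getenv("ALAB_MONGO_URI", "mongodb://localhost:27017/"))
db = client[os.getenv("ALAB_MONGO_DB", "temporary")]
collection = db[os.getenv("ALAB_MONGO_COLLECTION", "release")]
print(collection.estimated_document_count())
```
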
### S3 Upload

```bash
ALAB_S3_BUCKET=materialsproject-contribs  # S3 bucket name
ALAB_S3_PREFIX=alab_synthesis             # S3 prefix path
ALAB_S3_EXCLUDE_LARGE=true                # Exclude large files
ALAB_S3_LARGE_THRESHOLD_MB=50             # Large file threshold (MB)
```
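
For context, a minimal `boto3` sketch of how these four values could drive an upload; the file path is hypothetical and the pipeline's real upload code may differ:

```python
import os

import boto3

bucket = os.getenv("ALAB_S3_BUCKET", "materialsproject-contribs")
prefix = os.getenv("ALAB_S3_PREFIX", "alab_synthesis")
exclude_large = os.getenv("ALAB_S3_EXCLUDE_LARGE", "true").lower() == "true"
threshold_mb = float(os.getenv("ALAB_S3_LARGE_THRESHOLD_MB", "50"))

path = "experiments.parquet"  # hypothetical local file
size_mb = os.path.getsize(path) / (1024 * 1024)
if exclude_large and size_mb > threshold_mb:
    print(f"skipping {path}: {size_mb:.1f} MB exceeds {threshold_mb} MB")
else:
    boto3.client("s3").upload_file(path, bucket, f"{prefix}/{path}")
```
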
### Parquet Options

```bash
ALAB_SKIP_TEMP_LOGS=false        # Skip temperature logs
ALAB_SKIP_XRD_POINTS=false       # Skip XRD data points
ALAB_SKIP_WORKFLOW_TASKS=false   # Skip workflow tasks
ALAB_PARQUET_COMPRESSION=snappy  # Compression: snappy, gzip, brotli
ALAB_PARQUET_ENGINE=pyarrow      # Engine: pyarrow, fastparquet
```
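
The engine and compression options map directly onto standard pandas Parquet arguments; a small sketch (the DataFrame contents are a stand-in, the option wiring is the point):

```python
import os

import pandas as pd

engine = os.getenv("ALAB_PARQUET_ENGINE", "pyarrow")           # pyarrow or fastparquet
compression = os.getenv("ALAB_PARQUET_COMPRESSION", "snappy")  # snappy, gzip, brotli

df = pd.DataFrame({"experiment_name": ["exp_001"], "status": ["completed"]})
df.to_parquet("experiments.parquet", engine=engine, compression=compression)
```
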
### Materials Project API

```bash
ALAB_MP_API_KEY=your_api_key  # MP API key (for XRD analysis)
# OR
MP_API_KEY=your_api_key       # Alternative name
```
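
Since either variable name is accepted, client code presumably checks both; a sketch of that fallback, with the `ALAB_`-prefixed name assumed to take precedence:

```python
import os

# Check the ALAB_-prefixed name first, then the generic one (assumed order).
api_key = os.getenv("ALAB_MP_API_KEY") or os.getenv("MP_API_KEY")
if api_key is None:
    raise RuntimeError("Set ALAB_MP_API_KEY or MP_API_KEY to enable XRD analysis")
```
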
## Usage in Scripts

### Python

```python
from config_loader import get_config

# Get configuration
config = get_config()

# Access values
print(config.mongo_uri)   # mongodb://localhost:27017/
print(config.mongo_db)    # temporary
print(config.s3_bucket)   # materialsproject-contribs

# Or use convenience functions
from config_loader import get_mongo_uri, get_s3_bucket

uri = get_mongo_uri()  # Gets from env > yaml > default
bucket = get_s3_bucket()
```

### Shell Scripts

```bash
# Use environment variables directly
: ${ALAB_MONGO_URI:="mongodb://localhost:27017/"}

# Or source from .env file
if [ -f data/.env ]; then
  export $(grep -v '^#' data/.env | xargs)
fi
```

## Configuration Priority Examples

### Example 1: All from YAML

```bash
# No env vars set
$ python data/config/config_loader.py
MongoDB URI: mongodb://localhost:27017/ (from YAML)
```

### Example 2: Override with Env

```bash
# Set env var
$ export ALAB_MONGO_URI="mongodb://production:27017/"
$ python data/config/config_loader.py
MongoDB URI: mongodb://production:27017/ (from ENV) ✓
```

### Example 3: Mixed Sources

```bash
# Some from env, some from yaml
$ export ALAB_MONGO_URI="mongodb://prod:27017/"  # Custom URI
# Leave ALAB_MONGO_DB unset                      # Use YAML default
$ python data/config/config_loader.py
MongoDB URI: mongodb://prod:27017/ (from ENV) ✓
MongoDB DB: temporary (from YAML)
```

## Best Practices

1. **Development**: Use `defaults.yaml` for local development
2. **Production**: Use environment variables for sensitive values
3. **Testing**: Use env vars to point to test databases
4. **CI/CD**: Set env vars in your deployment pipeline
5. **Never commit** `.env` files (already in `.gitignore`)

## Troubleshooting

### Config not loading?

```bash
# Check current config
python data/config/config_loader.py

# Verify env vars are set
env | grep ALAB_
```

### Want to use .env file?

```bash
# Create from example
cp data/config/env.example .env

# Edit .env with your values (uncomment lines to override defaults)
nano .env

# Source it before running scripts
source .env
./update_data.sh
```

### Reset to defaults

```bash
# Unset all ALAB env vars
unset $(env | grep ALAB_ | cut -d= -f1)

# Now uses YAML/defaults only
./update_data.sh
```
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# =============================================================================
# A-Lab Analysis Registry
# =============================================================================
# Documentation for available analysis plugins.
# Analyses are auto-discovered from data/analyses/*.py
#
# This file serves as:
# 1. Documentation of available analyses
# 2. Default configuration for each analysis
# 3. Template for adding new analyses
# =============================================================================

# Built-in analyses (always available)
analyses:
  xrd_dara:
    description: 'XRD phase identification using DARA'
    class: XRDAnalyzer
    file: base_analyzer.py
    cli_flag: '--xrd'
    output_parquet: xrd_refinements.parquet, xrd_phases.parquet
    default_config:
      wmin: 10 # Minimum 2-theta angle
      wmax: 80 # Maximum 2-theta angle
      save_viz: false # Save visualization images
    outputs:
      - xrd_success: 'Whether analysis succeeded'
      - xrd_rwp: 'Weighted profile R-factor'
      - xrd_num_phases: 'Number of phases identified'
      - xrd_error: 'Error message if failed'
    requirements:
      - experiments.parquet (with xrd_sampleid_in_aeris)
      - xrd_data_points.parquet (optional, for patterns)
    notes: |
      Uses DARA (Deep Analysis for Rietveld Automation) for automated
      phase identification. Requires MP API key for CIF downloads.
      Results stored in data/xrd_creation/results/

  powder_statistics:
    description: 'Calculate powder dosing statistics'
    class: PowderStatisticsAnalyzer
    file: base_analyzer.py
    cli_flag: '--powder-stats'
    output_parquet: null # Results merged into main output
    default_config: {}
    outputs:
      - powder_avg_accuracy: 'Average dosing accuracy %'
      - powder_total_doses: 'Total number of doses'
      - powder_unique_count: 'Number of unique powders'
      - powder_total_mass_g: 'Total powder mass in grams'
    requirements:
      - experiments.parquet
      - powder_doses.parquet
    notes: |
      Calculates statistics about powder dosing accuracy.
      Fast analysis, recommended for all products.
# =============================================================================
# Adding a New Analysis
# =============================================================================
# To add a new analysis (e.g., SEM clustering):
#
# 1. Create the analyzer file:
#    data/analyses/sem_analyzer.py
#
# 2. Define the analyzer class:
#    ```python
#    import pandas as pd
#
#    from base_analyzer import BaseAnalyzer
#
#    class SEMAnalyzer(BaseAnalyzer):
#        name = "sem_clustering"
#        description = "Cluster SEM images by morphology"
#        cli_flag = "--sem"
#
#        def analyze(self, experiments_df, parquet_dir):
#            # Your analysis logic here
#            results = []
#            for _, exp in experiments_df.iterrows():
#                # Process each experiment
#                results.append({
#                    'experiment_name': exp['name'],
#                    'cluster_id': compute_cluster(exp),      # your helper
#                    'morphology_score': compute_score(exp),  # your helper
#                })
#            return pd.DataFrame(results)
#
#        def get_output_schema(self):
#            return {
#                'cluster_id': {'type': 'int', 'required': True},
#                'morphology_score': {'type': 'float', 'required': False},
#            }
#    ```
#
# 3. Document here (optional):
#    sem_clustering:
#      description: "Cluster SEM images by morphology"
#      class: SEMAnalyzer
#      file: sem_analyzer.py
#      ...
#
# 4. The analysis will be auto-discovered on next pipeline run
# =============================================================================
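
# For reference, a minimal sketch of such plugin auto-discovery (hypothetical,
# not the pipeline's actual loader; assumes every analyzer subclasses
# BaseAnalyzer and exposes a `name` attribute):
# ```python
# import importlib.util
# import inspect
# from pathlib import Path
#
# from base_analyzer import BaseAnalyzer
#
# def discover_analyzers(directory="data/analyses"):
#     analyzers = {}
#     for path in sorted(Path(directory).glob("*.py")):
#         spec = importlib.util.spec_from_file_location(path.stem, path)
#         module = importlib.util.module_from_spec(spec)
#         spec.loader.exec_module(module)  # run the plugin file
#         for _, cls in inspect.getmembers(module, inspect.isclass):
#             if issubclass(cls, BaseAnalyzer) and cls is not BaseAnalyzer:
#                 analyzers[cls.name] = cls  # keyed by the class's name attr
#     return analyzers
# ```
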
# Placeholder for future analyses
# Uncomment and modify when adding:

# sem_clustering:
#   description: "Cluster SEM images by morphology"
#   class: SEMAnalyzer
#   file: sem_analyzer.py
#   cli_flag: "--sem"
#   default_config:
#     num_clusters: 5
#     feature_extraction: "resnet50"
#   outputs:
#     - cluster_id: "Cluster assignment"
#     - morphology_score: "Morphology similarity score"
#   requirements:
#     - SEM images in experiments/*/SEM images/
#   notes: |
#     Uses computer vision to cluster SEM images.
#     Requires tensorflow/pytorch.

# heating_profile:
#   description: "Analyze heating profile characteristics"
#   class: HeatingProfileAnalyzer
#   file: heating_analyzer.py
#   cli_flag: "--heating"
#   default_config: {}
#   outputs:
#     - heating_rate_avg: "Average heating rate"
#     - overshoot_celsius: "Temperature overshoot"
#     - time_at_target: "Time at target temperature"
#   requirements:
#     - temperature_logs.parquet
