| 1 | +# EJ Data Processing Pipeline |
| 2 | + |
| 3 | +This pipeline processes NASA Common Metadata Repository (CMR) data and environmental justice (EJ) classifications to create standardized data dumps for the Science Discovery Engine (SDE). |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The pipeline consists of four components: |
| 8 | +- CMR data processing |
| 9 | +- Environmental justice classification processing |
| 10 | +- Threshold-based filtering |
| 11 | +- Data dump creation |
| 12 | + |
| 13 | +## Prerequisites |
| 14 | + |
| 15 | +- Access to CMR collection data |
| 16 | +- Access to the classification model predictions (contact Bishwas for access) |
| 17 | + |
| 18 | +## Setup |
| 19 | + |
| 20 | +1. Clone the repository |
| 21 | +2. Install dependencies |
| 22 | +3. Configure settings in `scripts/ej/config.py` |
| 23 | + |
| 24 | +## Input Files |
| 25 | + |
| 26 | +You need two main input files: |
| 27 | + |
| 28 | +1. **CMR Collections Data**: Generated using the `download_cmr.py` script, available at: |
| 29 | +```text |
| 30 | +github.com/NASA-IMPACT/llm-app-EJ-classifier/blob/develop/scripts/data_processing/download_cmr.py |
| 31 | +``` |
| 32 | + |
| 33 | +2. **Classification Predictions**: Provided by the classification model; contact Bishwas for access. |
| 34 | + |
| 35 | +## Configuration |
| 36 | + |
| 37 | +Edit `config.py` to customize: |
| 38 | + |
| 39 | +- Classification thresholds |
| 40 | +- Authorized classifications |
| 41 | +- Input/output filenames |
| 42 | +- Timestamp formats |
| 43 | + |
| 44 | +Example configuration: |
| 45 | +```python |
| 46 | +# Adjust thresholds for different indicators |
| 47 | +INDICATOR_THRESHOLDS = { |
| 48 | +    "Climate Change": 1.0, |
| 49 | +    "Disasters": 0.80, |
| 50 | +    # ... other thresholds |
| 51 | +} |
| 52 | + |
| 53 | +# Change filenames |
| 54 | +CMR_FILENAME = "your_cmr_file.json" |
| 55 | +INFERENCE_FILENAME = "your_predictions.json" |
| 56 | +``` |
| 57 | + |
| 58 | +## Usage |
| 59 | + |
| 60 | +### Basic Usage |
| 61 | + |
| 62 | +Run the pipeline on a local machine with the input files: |
| 63 | +```bash |
| 64 | +python create_ej_dump.py |
| 65 | +``` |
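
Before running the script, it can help to confirm that both input files are present. The following is a small pre-flight sketch; the helper `missing_inputs` is hypothetical, and it assumes the filenames configured in `config.py` are resolved relative to the working directory, which the pipeline itself may not do.

```python
from pathlib import Path


def missing_inputs(filenames, base_dir="."):
    """Return the filenames that do not exist as regular files under base_dir."""
    base = Path(base_dir)
    return [name for name in filenames if not (base / name).is_file()]


# Placeholder filenames from the example configuration; substitute your own.
missing = missing_inputs(["your_cmr_file.json", "your_predictions.json"])
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```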
| 66 | + |
| 67 | +## Output |
| 68 | + |
| 69 | +The pipeline generates a JSON file named `ej_dump_YYYYMMDD_HHMMSS.json` containing: |
| 70 | +- Processed CMR metadata |
| 71 | +- Environmental justice classifications |
| 72 | + |
| 73 | +## Server Deployment |
| 74 | + |
| 75 | +To deploy the output to the server: |
| 76 | +```bash |
| 77 | +# Copy to server |
| 78 | +scp ej_dump_YYYYMMDD_HHMMSS.json sde:/home/ec2-user/sde_indexing_helper/backups/ |
| 79 | + |
| 80 | +# Process on server using dm shell |
| 81 | +dmshell |
| 82 | + |
| 83 | +# Add your file name to cmr_to_models.py, |
| 84 | +# then paste and run its contents within the shell |
| 85 | +``` |