SciDataCopilot is a multi-agent system that turns natural-language research requests into executable scientific data workflows.
It is designed for end-to-end automation: requirement understanding, data discovery/acquisition, hybrid planning (tool + code), execution, and result integration.
- LLM-first intent understanding with a structured requirement split:
  - data requirements (what data is needed)
  - processing requirements (what analysis/transformation is needed)
- Hybrid plan execution with explicit step dependencies (see the sketch after this list):
  - tool steps (Tool Lake executors)
  - code steps (LLM-generated Python with an execute-repair loop)
- Multi-agent orchestration via a LangGraph state workflow.
- Knowledge-driven routing with Data Lake + Tool Lake + Case Lake.
- Built-in support for multiple scenarios:
  - tabular/POLAR-style data processing
  - EEG/MNE workflows
  - acquisition workflows (for example UniProt/PDF/JSON pipelines)
- Data health and quality tracking across baseline vs post-processing stages.
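The hybrid plan is easiest to see as data. The sketch below shows what two dependent steps (one tool step, one code step) could look like; the field names are illustrative assumptions, and the authoritative schema lives in `core/plan_schema.py`.

```python
# Hypothetical sketch of a hybrid plan with one tool step and one dependent
# code step. Field names are illustrative; the authoritative schema lives in
# core/plan_schema.py.
hybrid_plan = {
    "steps": [
        {
            "id": "step_1",
            "type": "tool",                      # executed by a Tool Lake executor
            "tool": "csv_loader",                # hypothetical tool name
            "inputs": {"path": "xlsx/sample.xlsx"},
            "depends_on": [],
        },
        {
            "id": "step_2",
            "type": "code",                      # LLM-generated Python, run in the execute-repair loop
            "goal": "compute daily averages from hourly values",
            "inputs": {"table": "step_1.output"},
            "depends_on": ["step_1"],
        },
    ]
}
```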
```text
User Prompt
    |
    v
DataAccessAgent
  - Resolve datasets (local path -> Data Lake -> acquisition tools)
    |
    v
IntentParsingAgent
  - Requirement split
  - Case retrieval/adaptation
  - Hybrid plan generation + validation
    |
    v
DataAccessAgent
  - Profile/inspect data
    |
    v
DataProcessingAgent
  - Execute hybrid steps in dependency order
  - Tool execution + code generation/debug
  - Artifact registry + run logs
    |
    v
DataIntegrationAgent
  - Consolidate outputs
  - Quality assessment
  - Analysis/recommendations
```
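As a rough illustration only, the agent sequence above could be wired as a LangGraph state workflow along these lines. The node functions, state fields, and the single `data_access` node are simplifications for the sketch, not the actual implementation in `core/workflow.py`.

```python
# Minimal sketch of the agent sequence wired as a LangGraph state workflow.
# Node functions and state fields are placeholders, not the real implementation.
from typing import TypedDict
from langgraph.graph import StateGraph, END


class CopilotState(TypedDict, total=False):
    prompt: str
    resolved_sources: list
    plan: dict
    results: dict


def data_access(state: CopilotState) -> CopilotState:
    # resolve datasets / profile data (placeholder)
    return state


def intent_parsing(state: CopilotState) -> CopilotState:
    # requirement split + hybrid plan generation (placeholder)
    return state


def data_processing(state: CopilotState) -> CopilotState:
    # execute hybrid steps in dependency order (placeholder)
    return state


def data_integration(state: CopilotState) -> CopilotState:
    # consolidate outputs + quality assessment (placeholder)
    return state


graph = StateGraph(CopilotState)
graph.add_node("data_access", data_access)
graph.add_node("intent_parsing", intent_parsing)
graph.add_node("data_processing", data_processing)
graph.add_node("data_integration", data_integration)

graph.set_entry_point("data_access")
graph.add_edge("data_access", "intent_parsing")
graph.add_edge("intent_parsing", "data_processing")
graph.add_edge("data_processing", "data_integration")
graph.add_edge("data_integration", END)

app = graph.compile()
# app.invoke({"prompt": "Your task here"})
```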
| Agent | File | Responsibility | Core Functions | Example Outputs |
|---|---|---|---|---|
| IntentParsingAgent | `agents/intent_parsing_agent.py` | Parse intent, split requirements, generate/repair hybrid plans | RequirementAnalyzer, CaseRetriever, PlanGenerator, StrategyReviewer | `processing_plan.json`, structured requirement dict |
| DataAccessAgent | `agents/data_access_agent.py` | Resolve and inspect data sources before execution | Data Lake search, tool-based acquisition, modality mapping, profiling | `data_profile.json`, `perception_report.txt`, resolved/unresolved sources |
| DataProcessingAgent | `agents/data_processing_agent.py` | Execute hybrid plan (tool + code), keep artifacts consistent | topological execution, ExecuteRepairLoop, fallback strategy, step logs | `hybrid_execution.json`, `step_<id>.py`, `step_<id>_result.json` |
| DataIntegrationAgent | `agents/data_integration_agent.py` | Integrate outputs and evaluate final quality | strategy analysis, output assembly, quality comparison | `integration_analysis.txt`, final output files, recommendations |
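The ExecuteRepairLoop used by DataProcessingAgent can be pictured as: run the generated script, and on failure feed the traceback back to the LLM for a repaired version, up to `max_iterations` attempts. A conceptual sketch only; the real loop in `core/execute_repair_loop.py` may differ.

```python
# Conceptual sketch of an execute-repair loop: run generated code, and on
# failure ask the LLM to repair it using the traceback. The actual
# implementation in core/execute_repair_loop.py may differ.
import subprocess
import sys


def execute_repair_loop(generate_fn, repair_fn, script_path: str, max_iterations: int = 5):
    """generate_fn() -> code str; repair_fn(code, error) -> repaired code str."""
    code = generate_fn()
    for attempt in range(max_iterations):
        with open(script_path, "w", encoding="utf-8") as f:
            f.write(code)
        proc = subprocess.run([sys.executable, script_path],
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return {"success": True, "attempts": attempt + 1, "stdout": proc.stdout}
        # Feed the error back to the LLM and try again with repaired code.
        code = repair_fn(code, proc.stderr)
    return {"success": False, "attempts": max_iterations}
```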
- `sci_data_copilot.py`: main entry point, state machine wiring, CLI.
- `core/plan_schema.py`: authoritative schema for structured requirements and hybrid plan steps.
- `tools/plan_generator.py`: LLM plan generation + schema validation/repair.
- `tools/strategy_reviewer.py`: tool I/O compatibility and dependency checks.
- `tools/tool_registry.py` + `tools/tool_lake.py`: tool descriptors, executors, and registry.
- `knowledge_base/`: Data Lake, Tool Lake, Case Lake persistence and retrieval.
```text
sci-data-copilot/
|-- agents/
|   |-- intent_parsing_agent.py
|   |-- data_access_agent.py
|   |-- data_processing_agent.py
|   `-- data_integration_agent.py
|-- core/
|   |-- plan_schema.py
|   |-- execute_repair_loop.py
|   `-- workflow.py
|-- tools/
|   |-- plan_generator.py
|   |-- requirement_analyzer.py
|   |-- strategy_reviewer.py
|   |-- tool_registry.py
|   `-- ...
|-- knowledge_base/
|   |-- data/
|   |   |-- data_lake.json
|   |   `-- case_lake.json
|   `-- *.py
|-- prompts/
|   |-- intent_prompts.py
|   |-- eeg_prompts.py
|   |-- polar_prompts.py
|   `-- ...
|-- config/
|   `-- config.yaml
|-- scripts/
|   `-- init_knowledge_base.py
|-- xlsx/
|   `-- sample tabular data
|-- tests/
|   `-- tool and integration tests
|-- requirements.txt
`-- sci_data_copilot.py
```
- Python 3.10+ (recommended)
- pip
```bash
pip install -r requirements.txt
```

Before running, configure both:

- `config/config.yaml` (LLM endpoint/key used by the main workflow)
- `tools/acquire_config.py` (acquisition-related API settings, if you run acquire workflows)
Minimal `config/config.yaml` example:

```yaml
model_name: "gpt-5.2"
openai_api_key: "YOUR_API_KEY"
openai_base_url: "YOUR_COMPATIBLE_API_BASE"
max_iterations: 5
save_dir: "./exp"
temp_code_path: "./exp/generated_code.py"
```

```bash
python scripts/init_knowledge_base.py
```

Run this once after dependency/config setup. It seeds the Case Lake/Data Lake defaults used by routing and planning.
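The seeded defaults end up in `knowledge_base/data/data_lake.json` and `case_lake.json`. If you need to point a seeded dataset at your own files, edit the corresponding section of `scripts/init_knowledge_base.py` and re-run it. As a purely illustrative sketch (the field names below are assumptions, not the actual schema), a Data Lake entry might look like:

```python
# Purely illustrative Data Lake entry; the actual field names are defined by
# scripts/init_knowledge_base.py and may differ.
polar_dataset = {
    "name": "polar_hourly_records",      # hypothetical dataset name
    "modality": "tabular",
    "path": "xlsx/polar_records.xlsx",   # point this at a valid local file
    "description": "Hourly POLAR records plus a separate header file",
}
```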
All commands below are run from the repository root.
Note: the first run may download MNE sample data and can take longer.
python sci_data_copilot.py -p "Please perform ocular artifact correction on EEG and MEG data using the MNE sample dataset. First, apply a 0.3 Hz high-pass filter to remove slow drifts. Then fit and apply an EOG regression model to remove eye movement artifacts from all EEG, magnetometer, and gradiometer channels. After correction, extract epochs for the event types visual/left and visual/right within the time window 0.1 s to 0.5 s, applying baseline correction between 0.1 s and 0 s. The expected outputs are a topographic map of the EOG regression weights showing how ocular activity projects to different sensors, and comparison plots of evoked responses before and after correction across EEG (59 channels), gradiometers (203 channels), and magnetometers (102 channels), demonstrating the reduction of eye movement artifacts."Before running this example, make sure the two POLAR files in Data Lake map to valid local paths. If needed, update the POLAR dataset section in scripts/init_knowledge_base.py, then re-run initialization.
python sci_data_copilot.py -p "Process polar tabular data: merge header and records, compute daily averages from hourly values, then split outputs by month."python sci_data_copilot.py -p "Download all P450 enzyme records from UniProt, including sequence information and catalytic reaction information."For tasks outside the three examples above, you can force general mode:
python sci_data_copilot.py -t general -p "Analyze taxi GPS logs and detect rush-hour anomalies"General mode is supported but currently weaker than the main tuned examples.
Each run writes an experiment directory under exp/, for example:
```text
exp/eeg_exp_YYYYMMDD_HHMMSS/
|-- processing_plan.json
|-- data_profile.json
|-- hybrid_execution.json
|-- step_<id>.py
|-- step_<id>_result.json
`-- ...
```
For acquire workflows, downloaded artifacts are now stored inside the same run directory (not a global top-level downloads/), for example:
```text
exp/acquire_exp_YYYYMMDD_HHMMSS/
|-- acquire_results.json
|-- processing_results.json
|-- downloads/
|   |-- *.csv
|   `-- *.pdf
`-- ...
```
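To inspect a run programmatically, something along these lines works; it assumes only the file layout shown above.

```python
# Sketch: locate the most recently modified run directory under exp/ and load
# its hybrid_execution.json, which records the executed steps and their status.
import json
from pathlib import Path

latest = max(Path("exp").glob("*_exp_*"), key=lambda p: p.stat().st_mtime)
print("Run directory:", latest)

execution = json.loads((latest / "hybrid_execution.json").read_text(encoding="utf-8"))
print(json.dumps(execution, indent=2)[:500])
```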
The project evaluates data quality along three weighted dimensions:
- intrinsic quality (completeness/consistency)
- distributional quality (statistical reasonableness)
- utility quality (fitness for target task)
Baseline and post-processing scores can be compared to quantify improvement.
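As a hedged sketch of how such a weighted score could be combined and compared between the baseline and post-processing stages (the weights and dimension names below are assumptions, not the project's actual values):

```python
# Illustrative weighted quality score; the actual weights and dimension
# names used by the project may differ.
WEIGHTS = {"intrinsic": 0.4, "distributional": 0.3, "utility": 0.3}  # assumed weights


def overall_quality(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-1) into a single weighted score."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


baseline = {"intrinsic": 0.62, "distributional": 0.70, "utility": 0.55}
post = {"intrinsic": 0.91, "distributional": 0.84, "utility": 0.88}

improvement = overall_quality(post) - overall_quality(baseline)
print(f"baseline={overall_quality(baseline):.2f} post={overall_quality(post):.2f} "
      f"improvement={improvement:+.2f}")
```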
- LLM-first reasoning, with deterministic schema validation.
- Explicit artifacts and dependencies for reproducibility.
- Robust execution through tool fallback and code repair loops.
- Keep backward compatibility when possible (`data_type` remains an optional override).
- Legacy: regex/rule routing dominates, weaker cross-domain generalization.
- Current: LLM-driven requirement split + hybrid planning, better for mixed tasks.
- Safety net: validation + repair loops + selective fallback paths.
- Do I need to pass `data_type`?
  - No. Auto routing is supported; `data_type` is an optional override.
- Where are logs and intermediate files?
  - In the run-specific folder under `exp/`.
- Can I provide a direct local data path?
  - Yes. Explicit path resolution is the first priority in DataAccess.
- What if data is missing?
  - DataAccess tries Data Lake search, then acquisition tool steps, then returns a user follow-up message (see the sketch after this list).
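Conceptually, that fallback order can be read as follows (a sketch only; the real logic in `agents/data_access_agent.py` is more involved):

```python
# Conceptual sketch of the DataAccess fallback order; the real logic in
# agents/data_access_agent.py is more involved.
from pathlib import Path


def resolve_source(request, data_lake_search, acquisition_tools):
    # 1. An explicit local path has first priority.
    if request.get("path") and Path(request["path"]).exists():
        return {"status": "resolved", "source": request["path"]}

    # 2. Otherwise search the Data Lake.
    hit = data_lake_search(request["description"])
    if hit:
        return {"status": "resolved", "source": hit}

    # 3. Otherwise try acquisition tool steps (e.g. a UniProt download).
    for tool in acquisition_tools:
        artifact = tool(request)
        if artifact:
            return {"status": "resolved", "source": artifact}

    # 4. Give up and ask the user for more information.
    return {"status": "unresolved", "message": "Please provide a dataset path or more detail."}
```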
If you are starting for the first time, add your API key to `config/config.yaml` and `tools/acquire_config.py`, then run these three commands in order:

```bash
pip install -r requirements.txt
python scripts/init_knowledge_base.py
python sci_data_copilot.py -p "Your task here"
```