Data Science Agent

This project converts a Jupyter notebook-based data science agent into a Python application with a Streamlit interface. The agent uses LangChain and OpenAI to generate and execute data analysis code.

Features

- Upload CSV or Excel files for analysis
- Configure analysis objectives and hardware specifications
- Generate optimized Python code for data analysis
- Run in Docker with automatic container management
- Configure maximum iterations for analysis
- Focus on data analysis results without code explanations
- Execute the generated code and display results
- Visualize data with automatically generated plots

Project Structure

ds_project/
├── app.py                  # Streamlit application
├── docker-compose.yml      # Docker Compose configuration
├── DOCKER.md               # Docker setup instructions
├── Dockerfile              # Docker configuration
├── requirements.txt        # Project dependencies
├── setup.py                # Setup script for easy installation
├── run_analysis.py         # Command-line interface
├── analyze.sh              # Analysis script for macOS/Linux
├── analyze.bat             # Analysis script for Windows
├── docker-run.sh           # Docker run script for macOS/Linux
├── docker-run.bat          # Docker run script for Windows
├── run.sh                  # Run script for macOS/Linux
├── run.bat                 # Run script for Windows
├── data/                   # Directory for data files
├── eda_plots/              # Directory for generated plots
├── src/                    # Source code
│   ├── __init__.py
│   ├── agent.py            # Data Science Agent implementation
│   ├── models.py           # Pydantic models
│   └── tools.py            # Agent tools
├── .dockerignore           # Docker ignore file
└── .gitignore              # Git ignore file

Quick Setup (Recommended)

The easiest way to set up the Data Science Agent is to use the provided installation scripts, which work on macOS, Windows, and Linux:

On macOS/Linux:

# Run the installation script
./install.sh

On Windows:

# Run the installation script
install.bat

These scripts will:

- Check your Python version (requires Python 3.8+)
- Install UV (a fast Python package installer)
- Create a virtual environment
- Install all dependencies using UV
- Create platform-specific run scripts
- Automatically launch the application

The application starts automatically once installation completes. To start it again later, use the generated run script:

- On Windows: run.bat
- On macOS/Linux: ./run.sh

Manual Setup with setup.py

You can also run the setup script directly:

# Run the setup script
python setup.py

Manual Setup (Alternative)

If you prefer to set up manually:

Create a virtual environment:
```
python -m venv venv
```
Activate the virtual environment:
- On Windows:
```
venv\Scripts\activate
```
- On macOS/Linux:
```
source venv/bin/activate
```

Install dependencies:

pip install -r requirements.txt

Or with UV (faster):

uv pip install -r requirements.txt

Run the Streamlit app:
```
streamlit run app.py
```

Docker Setup

The Data Science Agent can also be run using Docker, which provides an isolated environment with all dependencies pre-installed.

Quick Start with Docker

Use the provided scripts to run the application in Docker:
- On macOS/Linux:
```
./docker-run.sh
```
- On Windows:
```
docker-run.bat
```
Access the application in your browser at http://localhost:8502

Manual Docker Setup

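Alternatively, you can use Docker Compose directly. The export line below is for macOS/Linux shells; in Windows Command Prompt, use set instead of export: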

# Set your OpenAI API key
export OPENAI_API_KEY=your_openai_api_key_here

# Start the application
docker-compose up

For more detailed instructions on using Docker, see the DOCKER.md file.

Usage

- Enter your OpenAI API key in the sidebar
- Initialize the agent
- Upload a data file or provide a file path
- Configure analysis objectives and hardware specifications
- Run the analysis
- View the results and generated plots

Advanced Configuration

The Data Science Agent includes several advanced configuration options:

- Max Iterations: Control the maximum number of iterations the agent can perform. Increase this value if the agent stops before completing the analysis.
- Model Selection: Choose from the OpenAI models available to your API key.
- Analysis Objective: Customize the analysis objective to focus on specific aspects of your data.
- Hardware Specifications: Provide information about your hardware so the generated code can be optimized for it.
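
As a rough illustration of how these four options fit together, the sketch below groups them in one validated object. It is not taken from the project: the class name AnalysisConfig, its field names, the default values, and the placeholder model name are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AnalysisConfig:
    """Hypothetical container for the four options listed above."""

    objective: str = "Perform exploratory data analysis"
    model: str = "gpt-4o-mini"  # placeholder; use a model your API key can access
    max_iterations: int = 10  # assumed default, not taken from the project
    hardware: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject values that could never produce a useful run.
        if self.max_iterations < 1:
            raise ValueError("max_iterations must be at least 1")
        if not self.objective.strip():
            raise ValueError("objective must not be empty")


config = AnalysisConfig(
    objective="Analyze survival patterns",
    max_iterations=25,
    hardware={"cpu_cores": "8", "ram_gb": "16"},
)
print(config)
```

Validating the values up front means a bad setting fails immediately rather than partway through an analysis run.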

How It Works

The Data Science Agent uses a combination of LangChain, OpenAI's GPT models, and custom tools to:

- Read and analyze data files: The agent examines the structure, data types, and content of uploaded CSV or Excel files
- Generate optimized Python code: Based on the analysis objectives and hardware specifications, the agent creates custom data analysis code
- Execute the generated code: The code is executed in a controlled environment, with results and visualizations captured
- Provide comprehensive analysis: Results include an analysis summary, key insights, and recommendations
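
The "execute the generated code" step could look roughly like the sketch below. It is not the project's implementation (the contents of src/tools.py are not shown here); the function name run_generated_code, the temporary working directory, and the 60-second timeout are assumptions chosen for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_generated_code(code: str, timeout_seconds: int = 60) -> dict:
    """Run a code string in a separate Python process and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "generated_analysis.py"
        script.write_text(code, encoding="utf-8")
        try:
            completed = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,  # files the script writes land in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timed out"}
    return {
        "ok": completed.returncode == 0,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


result = run_generated_code("print(sum(range(10)))")
print(result["ok"], result["stdout"].strip())  # prints: True 45
```

A separate process keeps a crash in the generated code from taking down the caller and allows a timeout, but it is not a security sandbox; the Docker setup gives stronger isolation.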

The agent follows a structured workflow:

- Data ingestion and preprocessing
- Exploratory data analysis
- Visualization generation
- Insight extraction
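
To make the first stage of this workflow concrete, here is a minimal sketch of profiling a CSV file: inferring a type for each column and counting missing values. The project lists pandas for this kind of work; the sketch stays in the standard library so it runs without extra packages. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable


def infer_type(values: Iterable[str]) -> str:
    """Classify non-empty values as 'int', 'float', 'text', or 'empty'."""
    kinds = set()
    for value in values:
        try:
            int(value)
            kinds.add("int")
        except ValueError:
            try:
                float(value)
                kinds.add("float")
            except ValueError:
                return "text"  # one non-numeric value makes the column text
    if not kinds:
        return "empty"
    return "float" if "float" in kinds else "int"


def profile_csv(handle) -> Dict[str, dict]:
    """Return the inferred type and missing-value count for each column."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    profile = {}
    for name in reader.fieldnames or []:
        column = [row[name] for row in rows]
        present = [v for v in column if v not in ("", None)]
        profile[name] = {
            "type": infer_type(present),
            "missing": len(column) - len(present),
        }
    return profile


# Made-up sample rows; the second row has no age value.
sample = io.StringIO("age,fare,name\n22,7.25,A\n,71.28,B\n26,7.92,C\n")
print(profile_csv(sample))
```

A summary like this is the sort of structural information that can be passed to the language model before it writes analysis code.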

Command-Line Interface

In addition to the Streamlit web interface, the project includes a command-line tool for batch processing.

Using the CLI Scripts (Recommended)

If you installed with the setup script, you can use the provided CLI wrapper scripts:

On Windows:

analyze.bat --file data/train.csv --objective "Analyze survival patterns" --api-key YOUR_API_KEY

On macOS/Linux:

./analyze.sh --file data/train.csv --objective "Analyze survival patterns" --api-key YOUR_API_KEY

Manual CLI Usage

Alternatively, you can run the analysis script directly:

python run_analysis.py --file data/train.csv --objective "Analyze survival patterns" --api-key YOUR_API_KEY
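
For readers who want to see how the three flags above might be parsed, here is a minimal argparse sketch. The flag names come from the commands above; everything else is an assumption, including the default objective and the fallback to the OPENAI_API_KEY environment variable. The actual run_analysis.py may behave differently.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Run a data analysis from the command line"
    )
    parser.add_argument("--file", required=True, help="Path to a CSV or Excel file")
    parser.add_argument(
        "--objective",
        default="Perform exploratory data analysis",  # assumed default
        help="What the analysis should focus on",
    )
    parser.add_argument(
        "--api-key",
        default=None,
        help="OpenAI API key (assumed fallback: OPENAI_API_KEY)",
    )
    return parser


def resolve_api_key(cli_value, environ) -> str:
    """Prefer the flag, then the environment variable; exit if neither is set."""
    key = cli_value or environ.get("OPENAI_API_KEY")
    if not key:
        raise SystemExit("No API key: pass --api-key or set OPENAI_API_KEY")
    return key


# Parse the same arguments used in the commands above.
args = build_parser().parse_args(
    ["--file", "data/train.csv",
     "--objective", "Analyze survival patterns",
     "--api-key", "YOUR_API_KEY"]
)
api_key = resolve_api_key(args.api_key, os.environ)
print(args.file, "|", args.objective)
```

An environment-variable fallback keeps the key out of shell history, which is worth considering for batch jobs.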

Technologies Used

- LangChain: Framework for developing applications powered by language models
- OpenAI GPT Models: Language models used for code generation and analysis
- Streamlit: Web application framework for data applications
- Pandas & NumPy: Data manipulation and analysis
- Matplotlib & Seaborn: Data visualization
- Scikit-learn: Machine learning tools
- Docker: Containerization for consistent environments

Requirements

- Python 3.8+
- OpenAI API key
- Dependencies listed in requirements.txt

The setup script installs UV (a fast Python package installer) and uses it to install all dependencies automatically.

Example Analyses

The Data Science Agent can perform various types of analyses, including:

- Exploratory data analysis with automatic visualization
- Feature correlation and importance analysis
- Pattern and trend identification
- Statistical hypothesis testing
- Basic predictive modeling
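
As one small example of what "feature correlation" means in practice, the sketch below computes a Pearson correlation coefficient in plain Python. Code generated by the agent would more likely use pandas or NumPy for this; the function name and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of centered products and squares; the 1/n factors cancel out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Made-up sample columns.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [52.0, 55.0, 61.0, 64.0, 70.0]
print(round(pearson(hours, score), 3))
```

Values near 1 or -1 indicate a strong linear relationship between two columns; values near 0 indicate little or none.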

Recent Updates

- Simplified Directory Structure: Consolidated scripts by moving them from platform-specific directories (unix/ and win/) to the project root for easier access
- Docker Support: Added Docker configuration for easy deployment and consistent environments
- Iteration Configuration: Added ability to configure maximum iterations for analysis
- Focus on Data Analysis: Updated AI prompts to focus on data analysis results without code explanations
- Container Management: Added automatic container cleanup to prevent conflicts
- Performance Improvements: Removed multiprocessing suggestions to avoid execution errors
- Project Cleanup: Removed unnecessary files and updated .gitignore
- Original Jupyter Notebook: Included the original DataSci_Agent.ipynb notebook for reference

Disclaimer

IMPORTANT: Please read this disclaimer carefully before using the Data Science Agent.

This application is capable of generating and executing code for data analysis, including resource-intensive operations such as machine learning, linear regression, and other computational tasks. When prompted to perform such operations on large datasets, the application will attempt to execute them using the available system resources.

Potential Risks:

- System Performance: Executing resource-intensive operations on large datasets may cause your system to become unresponsive or slow.
- Hardware Stress: Prolonged execution of computationally intensive tasks may cause excessive CPU/GPU usage, leading to overheating or increased wear on hardware components.
- Memory Usage: Large datasets may consume significant amounts of RAM, potentially causing system instability.
- Storage Impact: Generated files and intermediate data may consume substantial disk space.

Liability Disclaimer:

The creator and contributors of this software are not liable for any damage, data loss, or hardware issues that may result from using this application. By using this software, you acknowledge that:

- You understand the potential risks associated with executing AI-generated code on your system.
- You accept full responsibility for monitoring system resources during execution.
- You will use appropriate caution when processing large datasets or requesting resource-intensive analyses.
- You will implement proper safeguards (such as setting resource limits or using dedicated environments) when necessary.

It is recommended to test the application with small datasets first and gradually increase the size as you become familiar with its performance characteristics on your specific hardware.
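
One possible safeguard of the "resource limits" kind mentioned above is sketched below, using Python's standard resource module to cap memory and CPU time for a child process. This is an illustration only, not something the project ships: the function names and the default limits are assumptions, the resource module is unavailable on Windows, and RLIMIT_AS behaves differently on macOS than on Linux.

```python
import resource  # POSIX only; not available on Windows
import subprocess
import sys


def _tighten(which: int, wanted: int) -> None:
    """Set the soft limit for one resource, never loosening an existing limit."""
    soft, hard = resource.getrlimit(which)
    if soft != resource.RLIM_INFINITY:
        wanted = min(wanted, soft)
    resource.setrlimit(which, (wanted, hard))


def run_with_limits(
    code: str,
    max_memory_bytes: int = 2 * 1024**3,  # assumed cap: 2 GiB of address space
    max_cpu_seconds: int = 30,  # assumed cap: 30 seconds of CPU time
) -> subprocess.CompletedProcess:
    """Run a code string in a child Python process with capped resources."""

    def apply_limits() -> None:
        # Runs in the child just before the interpreter starts.
        _tighten(resource.RLIMIT_AS, max_memory_bytes)
        _tighten(resource.RLIMIT_CPU, max_cpu_seconds)

    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
    )


result = run_with_limits("print('ok')")
print(result.returncode, result.stdout.strip())
```

Only soft limits are set here, so a process could raise them again up to the hard limit; running inside the Docker setup described earlier is a stricter alternative.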

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
