diff --git a/.gitignore b/.gitignore index 9fb0f31..bba95ec 100644 --- a/.gitignore +++ b/.gitignore @@ -41,4 +41,5 @@ data/ *.patch *.diff /docs/build -/site \ No newline at end of file +/site +.quarto \ No newline at end of file diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..9d3cb51 --- /dev/null +++ b/docs/README.md @@ -0,0 +1,48 @@ +# DS4CG Job Analytics Documentation + +This directory contains the documentation for the DS4CG Job Analytics project. + +## Overview +The documentation provides detailed information about the data pipeline, analysis scripts, reporting tools, and usage instructions for the DS4CG Job Analytics platform. It is intended for users, contributors, and administrators who want to understand or extend the analytics and reporting capabilities. + +## How to Build and View the Documentation + +The documentation is built using [MkDocs](https://www.mkdocs.org/) and [Quarto](https://quarto.org/) for interactive reports and notebooks. + +### MkDocs +- To serve the documentation locally: + ```sh + mkdocs serve + ``` + This will start a local server (usually at http://127.0.0.1:8000/) where you can browse the docs. + +- To build the static site: + ```sh + mkdocs build + ``` + The output will be in the `site/` directory. + +### Quarto +- Quarto is used for rendering interactive reports and notebooks (e.g., `.qmd` files). +- To render a Quarto report: + ```sh + quarto render path/to/report.qmd + ``` + +## Structure +- `index.md`: Main landing page for the documentation site. +- `about.md`: Project background and team information. +- `preprocess.md`: Data preprocessing details. +- `analysis/`, `visualization/`, `mvp_scripts/`: Subsections for specific topics and scripts. +- `notebooks/`: Example notebooks and interactive analysis. + +## Requirements +- Python 3.10+ +- MkDocs (`pip install mkdocs`) +- Quarto (see https://quarto.org/docs/get-started/ for installation) + +## Contributing +Contributions to the documentation are welcome! Edit or add Markdown files in this directory and submit a pull request. + +--- +For more details, see the main project README or contact the maintainers. diff --git a/docs/analysis/frequency_analysis.md b/docs/analysis/frequency_analysis.md new file mode 100644 index 0000000..d817020 --- /dev/null +++ b/docs/analysis/frequency_analysis.md @@ -0,0 +1,6 @@ +--- +title: Frequency Analysis +--- + + + diff --git a/docs/contact.md b/docs/contact.md new file mode 100644 index 0000000..e1d297a --- /dev/null +++ b/docs/contact.md @@ -0,0 +1,106 @@ +# Contact and Support + +If you encounter issues or need help using the DS4CG Unity Job Analytics project, here are the best ways to get support. + +## GitHub Issues + +For technical problems, bug reports, or feature requests, please create a GitHub issue: + +**🐛 Bug Reports** +- Provide a clear description of the problem +- Include steps to reproduce the issue +- Share error messages and stack traces +- Mention your environment (Python version, OS, etc.) + +**💡 Feature Requests** +- Describe the desired functionality +- Explain the use case and benefits +- Suggest possible implementation approaches + +**📚 Documentation Issues** +- Point out unclear or missing documentation +- Suggest improvements or additions +- Request examples for specific use cases + +[**Create a GitHub Issue →**](https://github.com/your-org/ds4cg-job-analytics/issues) + +## Response Time + +The development team will review and respond to GitHub issues periodically. 
Please allow: +- **Critical bugs**: 1-2 business days +- **General issues**: 3-5 business days +- **Feature requests**: 1-2 weeks +- **Documentation updates**: 1 week + +## Community Guidelines + +When seeking help, please: + +✅ **Do:** + +- Search existing issues first +- Provide minimal reproducible examples +- Use clear, descriptive titles +- Be respectful and patient +- Share relevant context and details + +❌ **Don't:** + +- Post duplicate issues +- Share sensitive data or credentials +- Expect immediate responses +- Use issues for general questions about Slurm or Unity + +## Unity Slack + +For urgent questions related to Unity cluster operations or data access, you can reach out via the Unity Slack workspace. However, for project-specific issues, GitHub issues are preferred. + +## Contributing + +Interested in contributing to the project? We welcome: + +- **Code contributions**: Bug fixes, new features, optimizations +- **Documentation**: Improvements, examples, tutorials +- **Testing**: Additional test cases, bug reports +- **Feedback**: User experience insights, suggestions + +See our contributing guidelines in the repository for detailed information about: + +- Development setup +- Code style requirements +- Pull request process +- Testing procedures + +## Academic Collaboration + +This project is part of the Data Science for the Common Good (DS4CG) program. For academic collaborations or research partnerships, consider reaching out through: + +- **DS4CG Program**: [DS4CG Website](https://ds.cs.umass.edu/programs/ds4cg) +- **Unity HPC Team**: For cluster-related inquiries + +## Project Maintainers + +- **Project Lead**: Christopher Odoom +- **Contributors**: DS4CG Summer 2025 Internship Team + +## Additional Resources + +Before reaching out for support, please check: + +1. **[FAQ](faq.md)** - Common questions and solutions +2. **[Getting Started](getting-started.md)** - Setup and basic usage +3. **[Demo](demo.md)** - Working examples and code samples +4. **Jupyter Notebooks** - Interactive examples in `notebooks/` directory +5. **API Documentation** - Detailed function/class documentation + +## Reporting Security Issues + +If you discover a security vulnerability, please **do not** create a public GitHub issue. Instead: + +1. Contact the project maintainers directly +2. Provide a detailed description of the vulnerability +3. Allow time for the issue to be addressed before public disclosure + +--- + +**Remember**: The team volunteers their time to maintain this project. Clear, detailed, and respectful communication helps everyone get the help they need more efficiently. Thank you for using the DS4CG Unity Job Analytics project! diff --git a/docs/data-and-metrics.md b/docs/data-and-metrics.md new file mode 100644 index 0000000..6836ef2 --- /dev/null +++ b/docs/data-and-metrics.md @@ -0,0 +1,164 @@ +# Data and Efficiency Metrics + +This page provides comprehensive documentation about the data structure and efficiency metrics available in the DS4CG Unity Job Analytics project. + +## Data Structure + +The project works with job data from the Unity cluster's Slurm scheduler. After preprocessing, the data contains the following key attributes: + +### Job Identification +- **JobID** – Unique identifier for each job. +- **ArrayID** – Array job identifier (`-1` for non-array jobs). +- **User** – Username of the job submitter. +- **Account** – Account/group associated with the job. + +### Time Attributes +- **StartTime** – When the job started execution (datetime). 
+- **SubmitTime** – When the job was submitted (datetime). +- **Elapsed** – Total runtime duration (timedelta). +- **TimeLimit** – Maximum allowed runtime (timedelta). + +### Resource Allocation +- **GPUs** – Number of GPUs allocated. +- **GPUType** – Type of GPU allocated (e.g., `"v100"`, `"a100"`, or `NA` for CPU-only jobs). +- **Nodes** – Number of nodes allocated. +- **CPUs** – Number of CPU cores allocated. +- **ReqMem** – Requested memory. + +### Job Status +- **Status** – Final job status (`"COMPLETED"`, `"FAILED"`, `"CANCELLED"`, etc.). +- **ExitCode** – Job exit code. +- **QOS** – Quality of Service level. +- **Partition** – Cluster partition used. + +### Resource Usage +- **CPUTime** – Total CPU time used. +- **CPUTimeRAW** – Raw CPU time measurement. + +### Constraints and Configuration +- **Constraints** – Hardware constraints specified. +- **Interactive** – Whether the job was interactive (`"interactive"` or `"non-interactive"`). + +--- + +## Efficiency and Resource Metrics + +### GPU and VRAM Metrics + +- **GPU Count** (`gpu_count`) + Number of GPUs allocated to the job. + +- **Job Hours** (`job_hours`) + $$ + \text{job\_hours} = \frac{\text{Elapsed (seconds)}}{3600} \times \text{gpu\_count} + $$ + +- **VRAM Constraint** (`vram_constraint`) + VRAM requested via constraints, in GiB. Defaults are applied if not explicitly requested. + +- **Partition Constraint** (`partition_constraint`) + VRAM derived from selecting a GPU partition, in GiB. + +- **Requested VRAM** (`requested_vram`) + $$ + \text{requested\_vram} = + \begin{cases} + \text{partition\_constraint}, & \text{if available} \\ + \text{vram\_constraint}, & \text{otherwise} + \end{cases} + $$ + +- **Used VRAM** (`used_vram_gib`) + Sum of peak VRAM used on all allocated GPUs (GiB). + +- **Approximate Allocated VRAM** (`allocated_vram`) + Estimated VRAM based on GPU model(s) and job node allocation. + +- **Total VRAM-Hours** (`vram_hours`) + $$ + \text{vram\_hours} = \text{allocated\_vram} \times \text{job\_hours} + $$ + +- **Allocated VRAM Efficiency** (`alloc_vram_efficiency`) + $$ + \text{alloc\_vram\_efficiency} = \frac{\text{used\_vram\_gib}}{\text{allocated\_vram}} + $$ + +- **VRAM Constraint Efficiency** (`vram_constraint_efficiency`) + $$ + \text{vram\_constraint\_efficiency} = + \frac{\text{used\_vram\_gib}}{\text{vram\_constraint}} + $$ + +- **Allocated VRAM Efficiency Score** (`alloc_vram_efficiency_score`) + $$ + \text{alloc\_vram\_efficiency\_score} = + \ln(\text{alloc\_vram\_efficiency}) \times \text{vram\_hours} + $$ + Penalizes long jobs with low VRAM efficiency. + +- **VRAM Constraint Efficiency Score** (`vram_constraint_efficiency_score`) + $$ + \text{vram\_constraint\_efficiency\_score} = + \ln(\text{vram\_constraint\_efficiency}) \times \text{vram\_hours} + $$ + +### CPU Memory Metrics +- **Used CPU Memory** (`used_cpu_mem_gib`) – Peak CPU RAM usage in GiB. +- **Allocated CPU Memory** (`allocated_cpu_mem_gib`) – Requested CPU RAM in GiB. +- **CPU Memory Efficiency** (`cpu_mem_efficiency`) + $$ + \text{cpu\_mem\_efficiency} = \frac{\text{used\_cpu\_mem\_gib}}{\text{allocated\_cpu\_mem\_gib}} + $$ + +--- + +## User-Level Metrics + +- **Job Count** (`job_count`) – Number of jobs submitted by the user. +- **Total Job Hours** (`user_job_hours`) – Sum of job hours for all jobs of the user. +- **Average Allocated VRAM Efficiency Score** (`avg_alloc_vram_efficiency_score`). +- **Average VRAM Constraint Efficiency Score** (`avg_vram_constraint_efficiency_score`). 
+ +- **Weighted Average Allocated VRAM Efficiency** + $$ + \text{expected\_value\_alloc\_vram\_efficiency} = + \frac{\sum (\text{alloc\_vram\_efficiency} \times \text{vram\_hours})} + {\sum \text{vram\_hours}} + $$ + +- **Weighted Average VRAM Constraint Efficiency** + $$ + \text{expected\_value\_vram\_constraint\_efficiency} = + \frac{\sum (\text{vram\_constraint\_efficiency} \times \text{vram\_hours})} + {\sum \text{vram\_hours}} + $$ + +- **Weighted Average GPU Count** + $$ + \text{expected\_value\_gpu\_count} = + \frac{\sum (\text{gpu\_count} \times \text{vram\_hours})} + {\sum \text{vram\_hours}} + $$ + +- **Total VRAM-Hours** – Sum of allocated_vram × job_hours across all jobs of the user. + +--- + +## Group-Level Metrics + +For a group of users (e.g., PI group): + +- **Job Count** – Total number of jobs across the group. +- **PI Group Job Hours** (`pi_acc_job_hours`). +- **PI Group VRAM Hours** (`pi_ac_vram_hours`). +- **User Count**. +- Group averages and weighted averages of efficiency metrics (similar formulas as above). + +--- + +## Efficiency Categories +- **High**: > 70% +- **Medium**: 30–70% +- **Low**: 10–30% +- **Very Low**: < 10% diff --git a/docs/demo.md b/docs/demo.md new file mode 100644 index 0000000..f5d9ffb --- /dev/null +++ b/docs/demo.md @@ -0,0 +1,134 @@ +# Demo + +This page showcases the DS4CG Unity Job Analytics project in action with interactive examples and demonstrations. + +## Complete Workflow Notebooks + +Explore our comprehensive Jupyter notebooks that demonstrate the full capabilities: + +### 📊 [Frequency Analysis Demo](notebooks/Frequency Analysis/) +**Complete end-to-end workflow** showing: + +- Database connection and preprocessing +- Efficiency analysis setup and filtering +- Time series data preparation +- Interactive visualizations +- Best/worst user identification + +### 📈 [Basic Visualization](notebooks/Basic%20Visualization/) +**Column statistics and exploratory analysis** including: + +- Data loading and preprocessing +- Column-level statistical visualizations +- Distribution analysis +- Data quality assessment + +### 🔍 [Efficiency Analysis](notebooks/Efficiency%20Analysis/) +**Advanced efficiency analysis techniques** covering: + +- Job filtering and metrics calculation +- User and PI group analysis +- Inefficiency identification +- Performance comparison workflows + +### 🎯 [Clustering Analysis](notebooks/clustering_analysis/) +**User behavior clustering and pattern analysis** + +### 📊 [Frequency Analysis](notebooks/Frequency%20Analysis/) +**Time series frequency analysis and patterns** + +--- + +## Quick Start Examples + +For quick reference, here are the key workflow patterns: + +### Database → Preprocessing → Analysis +```python +# See complete implementation in: VRAM Efficiency Analysis Demo notebook +db = DatabaseConnection("../slurm_data_new.db") +gpu_df = db.fetch_query("SELECT * FROM Jobs WHERE GPUs > 0") +processed_df = preprocess_data(gpu_df, min_elapsed_seconds=0) +``` + +### Efficiency Analysis Workflow +```python +# See complete implementation in: Efficiency Analysis notebook +efficiency_analyzer = EfficiencyAnalysis(jobs_df=processed_df) +filtered_jobs = efficiency_analyzer.filter_jobs_for_analysis(...) 
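+# NOTE: the "..." above is a placeholder for the filter keyword arguments
+# (gpu_count_filter, vram_constraint_filter, allocated_vram_filter,
+# gpu_mem_usage_filter); see Getting Started for a fully spelled-out call.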
+job_metrics = efficiency_analyzer.calculate_job_efficiency_metrics(filtered_jobs) +``` + +### Interactive Visualizations +```python +# See complete implementation in: VRAM Efficiency Analysis Demo notebook +time_series_visualizer = TimeSeriesVisualizer(time_series_data) +fig = time_series_visualizer.plot_vram_efficiency_interactive(users=users_to_analyze) +``` + +--- + +## Notebook Features + +### VRAM Efficiency Analysis Demo Features + +- Complete efficiency analysis setup +- Time series data preparation and visualization +- Interactive plot generation +- Best/worst user identification +- Custom date range analysis + +### Basic Visualization Features + +- Database connection and data loading +- Column statistics generation +- Individual column visualizations + +### Efficiency Analysis Features + +- Job filtering and metrics calculation +- User efficiency analysis +- PI group analysis + +--- + +## Performance Tips from Notebooks + +Based on our notebook implementations: + +### For Large Datasets +```python +# From VRAM Efficiency Analysis Demo notebook +filtered_jobs = efficiency_analyzer.filter_jobs_for_analysis( + gpu_count_filter=1, + allocated_vram_filter={"min": 0, "max": np.inf, "inclusive": False}, + gpu_mem_usage_filter={"min": 0, "max": np.inf, "inclusive": False} +) +``` + +### For Interactive Plots +```python +# From VRAM Efficiency Analysis Demo notebook +fig = time_series_visualizer.plot_vram_efficiency_per_job_dot_interactive( + users=["user1", "user2"], + efficiency_metric="alloc_vram_efficiency", + max_points=500, # Limit points for performance + exclude_fields=["Exit Code"] +) +``` + +--- + +## Running the Notebooks + +The notebooks are now integrated directly into this documentation! You can: + +1. **View in Documentation**: Click on any notebook link above to view it rendered in the documentation +2. **Download and Run Locally**: + ```bash + cd notebooks/ + jupyter lab + ``` +3. **Interactive Execution**: The notebooks contain complete, tested implementations with real data and interactive outputs + +The integrated notebooks provide full access to working examples while keeping everything in one place! diff --git a/docs/faq.md b/docs/faq.md new file mode 100644 index 0000000..4a78315 --- /dev/null +++ b/docs/faq.md @@ -0,0 +1,231 @@ +# Frequently Asked Questions (FAQ) + +This page addresses common questions and technical issues encountered when using the DS4CG Unity Job Analytics project. + +## Installation and Setup + +### Q: When I try to install the requirements, I get dependency conflicts. How do I resolve this? + +**A:** This usually happens due to conflicting package versions. Try these steps: + +1. Create a fresh virtual environment: +```bash +python -m venv fresh_env +source fresh_env/bin/activate # Linux/Mac +# or +fresh_env\Scripts\activate # Windows +``` + +2. Update pip and install requirements: +```bash +pip install --upgrade pip +pip install -r requirements.txt +pip install -r dev-requirements.txt +``` + +3. If conflicts persist, try installing packages individually: +```bash +pip install pandas plotly matplotlib duckdb pydantic +``` + +### Q: I'm getting a "Python version not supported" error. What Python version should I use? + +**A:** The project requires Python 3.10 or higher. Check your Python version with: +```bash +python --version +``` + +If you have an older version, install Python 3.10+ from [python.org](https://python.org) or use a version manager like pyenv. 
+ +## Performance and Memory Issues + +### Q: When I run my code on my computer, it crashes or runs very slowly. + +**A:** This happens because the job data can be quite large and memory-intensive. Consider these solutions: + +1. **Run on Unity cluster**: The data is designed to be processed on Unity where more computational resources are available. + +2. **Use data sampling**: Limit the dataset size for testing: +```python +# Sample 10% of the data for testing +sample_df = full_df.sample(frac=0.1, random_state=42) +``` + +3. **Limit visualization points**: +```python +# Limit interactive plots to avoid memory issues +fig = visualizer.plot_vram_efficiency_per_job_dot_interactive( + users=users, + efficiency_metric="alloc_vram_efficiency", + max_points=500 # Reduce from default 1000 +) +``` + +4. **Process in chunks**: +```python +chunk_size = 5000 +for chunk in pd.read_sql(query, connection, chunksize=chunk_size): + process_chunk(chunk) +``` + +### Q: The interactive plots are not loading or are very slow. What can I do? + +**A:** Interactive Plotly visualizations can be resource-intensive. Try these optimizations: + +1. **Reduce data points**: Use the `max_points` parameter +2. **Exclude unnecessary fields**: Use `exclude_fields` to reduce hover text complexity +3. **Use static plots for large datasets**: Switch to matplotlib versions for better performance +4. **Filter users**: Analyze fewer users at once + +## Database and Data Issues + +### Q: I'm getting a "database not found" error. Where should the database file be located? + +**A:** The Slurm database files are typically located on the Unity cluster. Common locations: +- `slurm_data.db` +- `slurm_data_new.db` +- Check the `data/` directory in your project folder + +If working locally, ensure you've copied the database file from Unity. + +### Q: My analysis shows no data or empty results. What's wrong? + +**A:** This usually happens due to filtering issues. Check these common causes: + +1. **User filtering**: Ensure the users you're analyzing actually exist in the dataset: +```python +print("Available users:", df["User"].unique()) +``` + +2. **Date range**: Check if your data covers the expected time period: +```python +print("Date range:", df["StartTime"].min(), "to", df["StartTime"].max()) +``` + +3. **Preprocessing filters**: The preprocessing might be removing your data: +```python +# Check data before and after preprocessing +print("Before preprocessing:", len(raw_df)) +processed_df = preprocess_jobs(raw_df, min_elapsed_seconds=60) +print("After preprocessing:", len(processed_df)) +``` + +### Q: I'm getting "KeyError" when trying to access certain columns. What's happening? + +**A:** This usually means the column doesn't exist in your dataset. Common issues: + +1. **Case sensitivity**: Column names are case-sensitive (`"User"` vs `"user"`) +2. **Column not in dataset**: Check available columns: +```python +print("Available columns:", df.columns.tolist()) +``` +3. **Preprocessing changes**: Some columns might be renamed or removed during preprocessing + +## Visualization Issues + +### Q: The plots are not displaying or showing empty charts. + +**A:** Several possible causes: + +1. **Empty filtered data**: Check if your user/time filters are too restrictive +2. **Zero values**: Try setting `remove_zero_values=False` +3. 
**Jupyter notebook issues**: Ensure you have the right backend: +```python +%matplotlib inline +import plotly.io as pio +pio.renderers.default = "notebook" +``` + +### Q: The legend in my plots is cut off or overlapping. + +**A:** Adjust the figure layout: +```python +# For matplotlib plots +plt.tight_layout() +plt.subplots_adjust(right=0.8) # Make room for legend + +# For Plotly plots +fig.update_layout( + width=1200, # Increase width + margin=dict(r=200) # Add right margin for legend +) +``` + +## Analysis and Metrics + +### Q: What's the difference between "alloc_vram_efficiency" and "avail_vram_efficiency"? + +**A:** +- **alloc_vram_efficiency**: Measures efficiency against allocated memory per GPU +- **avail_vram_efficiency**: Measures efficiency against total available memory per GPU + +Use allocated efficiency for analyzing how well users utilize their requested resources, and available efficiency for understanding overall cluster utilization. + +### Q: Why are some efficiency values over 100%? + +**A:** This can happen when: +1. Memory usage (`MaxRSS`) exceeds the baseline calculation +2. Shared memory or system overhead affects measurements +3. Multiple processes share GPU memory + +Values slightly over 100% are normal; significantly higher values may indicate measurement issues. + +### Q: How do I interpret the efficiency categories (Excellent, Good, Fair, etc.)? + +**A:** The categories are defined as: +- **Excellent**: >80% - Very efficient resource usage +- **Good**: 60-80% - Acceptable efficiency +- **Fair**: 40-60% - Room for improvement +- **Poor**: 20-40% - Significant waste of resources +- **Very Poor**: <20% - Major inefficiency + +## Development and Contributing + +### Q: How do I run the tests? + +**A:** Run the test suite using pytest: +```bash +# Run all tests +pytest + +# Run specific test file +pytest tests/test_efficiency_analysis.py + +# Run with coverage +pytest --cov=src tests/ +``` + +### Q: I want to add a new visualization. How do I structure the code? + +**A:** Follow these guidelines: + +1. Add visualization classes to `src/visualization/` +2. Inherit from `DataVisualizer` base class +3. Use Pydantic models for parameter validation +4. Add both static (matplotlib) and interactive (Plotly) versions when possible +5. Include comprehensive docstrings and type hints + +### Q: How do I contribute documentation changes? + +**A:** +1. Edit the markdown files in the `docs/` directory +2. Test locally with: `mkdocs serve` +3. Submit a pull request with your changes + +## Getting Help + +### Q: I found a bug or want to request a feature. What should I do? + +**A:** Please create a GitHub issue with: +1. Clear description of the problem/feature request +2. Steps to reproduce (for bugs) +3. Expected vs actual behavior +4. Your environment details (Python version, OS, etc.) + +### Q: The documentation doesn't cover my use case. Where can I get help? + +**A:** +1. Check the [Demo](demo.md) page for examples +2. Look at the Jupyter notebooks in the `notebooks/` directory +3. Create a GitHub issue for documentation improvements +4. Reach out via Unity Slack for urgent questions diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..80e465e --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,253 @@ +# Getting Started + +This guide will help you set up and start using the DS4CG Unity Job Analytics project. + +## Getting the Libraries + +To get started with the project, clone the repository. 
Since the data is stored on the Unity cluster, we recommend cloning directly on Unity for best performance: + +```bash +git clone https://github.com/Unity-HPC/ds4cg-job-analytics.git +cd ds4cg-job-analytics +``` + +## Dependencies + +This project is compatible with **Python 3.10+**. We recommend first installing Python and then setting up a virtual environment for the project. + +### Setting Up Virtual Environment + +To set up a virtual environment, run the following commands: + +```bash +# Create virtual environment +python -m venv duckdb + +# Activate virtual environment +# On Linux/Mac: +source duckdb/bin/activate +# On Windows: +duckdb\Scripts\activate + +# Install required libraries +pip install -r requirements.txt +pip install -r dev-requirements.txt +``` + +### Required Libraries + +The main dependencies include: + +- pandas for data manipulation +- plotly and matplotlib for visualization +- duckdb for database operations +- pydantic for data validation +- mkdocs for documentation + +## Data Retrieval and Preprocessing + +The project provides streamlined functions to connect to the database and preprocess data: + +### Database Connection + +```python +from src.database.database_connection import DatabaseConnection + +# Connect to the Slurm database +db = DatabaseConnection("path/to/slurm_data.db") + +# Query GPU jobs +gpu_df = db.fetch_query("SELECT * FROM Jobs WHERE GPUs > 0") +``` + +### Data Preprocessing + +The preprocessing pipeline handles data cleaning, type conversion, and filtering: + +```python +from src.preprocess.preprocess import preprocess_data + +# Preprocess raw job data +processed_df = preprocess_data( + gpu_df, + min_elapsed_seconds=600, + include_failed_cancelled_jobs=False, + include_cpu_only_jobs=True +) +``` + +For detailed preprocessing criteria, see the [Data and Efficiency Metrics](data-and-metrics.md) section. + +## Getting Efficiency Metrics + +The analysis workflow follows a specific order as demonstrated in our notebooks. 
Here's the complete process: + +### Step 1: Initialize the Efficiency Analyzer + +```python +from src.analysis.efficiency_analysis import EfficiencyAnalysis + +# Initialize efficiency analyzer +efficiency_analyzer = EfficiencyAnalysis(jobs_df=processed_df) +``` + +### Step 2: Filter Jobs for Analysis + +```python +import numpy as np + +# Filter jobs based on specific criteria +filtered_jobs = efficiency_analyzer.filter_jobs_for_analysis( + gpu_count_filter=1, + vram_constraint_filter=None, + allocated_vram_filter={"min": 0, "max": np.inf, "inclusive": False}, + gpu_mem_usage_filter={"min": 0, "max": np.inf, "inclusive": False} +) +``` + +### Step 3: Calculate Metrics + +```python +# Calculate job-level efficiency metrics +job_metrics = efficiency_analyzer.calculate_job_efficiency_metrics(filtered_jobs=filtered_jobs) + +# Calculate user-level efficiency metrics +user_metrics = efficiency_analyzer.calculate_user_efficiency_metrics() + +# Find inefficient users +inefficient_users = efficiency_analyzer.find_inefficient_users_by_alloc_vram_efficiency( + alloc_vram_efficiency_filter={"min": 0, "max": 0.3, "inclusive": False}, + min_jobs=5 +) +``` + +### Step 4: Prepare Time Series Data + +```python +from src.analysis.frequency_analysis import FrequencyAnalysis + +# Initialize frequency analyzer +frequency_analyzer = FrequencyAnalysis(job_metrics) + +# Prepare time series data for visualization +time_series_data = frequency_analyzer.prepare_time_series_data( + users=inefficient_users["User"].tolist(), + time_unit="Months", + metric="alloc_vram_efficiency_score", + remove_zero_values=False +) +``` + +**📚 Complete Example**: See [Frequency Analysis Demo](../notebooks/Frequency Analysis/) for a full walkthrough. + +For detailed information about available metrics, see [Efficiency Metrics](visualization/efficiency_metrics.md). 
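+
+As a quick sanity check before moving on to visualization, you can recompute a user-level figure by hand from the job-level metrics. The sketch below assumes `job_metrics` contains the `User`, `alloc_vram_efficiency`, and `vram_hours` columns described in [Data and Efficiency Metrics](data-and-metrics.md); it reproduces the VRAM-hours-weighted average allocated VRAM efficiency per user:
+
+```python
+# Weighted average allocated VRAM efficiency per user, weighting each job by its
+# vram_hours (the same formula as expected_value_alloc_vram_efficiency).
+weights = job_metrics["vram_hours"]
+weighted_eff = (
+    (job_metrics["alloc_vram_efficiency"] * weights).groupby(job_metrics["User"]).sum()
+    / weights.groupby(job_metrics["User"]).sum()
+)
+print(weighted_eff.sort_values().head(10))
+```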
+ +## Visualizing Job Analysis + +The project offers both static and interactive visualization capabilities with a specific workflow: + +### Step 1: Initialize Time Series Visualizer + +```python +from src.visualization.time_series import TimeSeriesVisualizer + +# Create time series visualizer with your time series data +visualizer = TimeSeriesVisualizer(time_series_data) +``` + +### Step 2: Static Time Series Plots + +```python +# Static VRAM efficiency plot +visualizer.plot_vram_efficiency( + users=["user1", "user2"], + annotation_style="none", + show_secondary_y=False +) + +# Static VRAM hours plot +visualizer.plot_vram_hours( + users=["user1", "user2"], + show_secondary_y=False +) +``` + +### Step 3: Interactive Visualizations + +```python +# Interactive VRAM efficiency plot +fig = visualizer.plot_vram_efficiency_interactive( + users=["user1", "user2"], + max_points=100, + job_count_trace=True +) + +# Interactive per-job dot plot +fig = visualizer.plot_vram_efficiency_per_job_dot_interactive( + users=["user1", "user2"], + efficiency_metric="alloc_vram_efficiency", + vram_metric="job_hours", + max_points=500, + exclude_fields=["Exit Code"] +) +``` + +### Step 4: Per-Job Analysis + +```python +# Initialize with job-level data for individual job analysis +job_visualizer = TimeSeriesVisualizer(job_metrics) + +# Static per-job dot plot +job_visualizer.plot_vram_efficiency_per_job_dot( + users=["user1"], + efficiency_metric="alloc_vram_efficiency", + vram_metric="job_hours" +) +``` + +### Column Statistics + +```python +from src.visualization.columns import ColumnStatsVisualizer + +# Visualize column statistics +col_visualizer = ColumnStatsVisualizer(processed_df) +col_visualizer.visualize_all_columns() +``` + +**📚 Complete Examples**: + +- [Basic Visualization](../notebooks/Basic%20Visualization/) - Column statistics and basic plots +- [Efficiency Analysis](../notebooks/Efficiency%20Analysis/) - Advanced efficiency analysis workflows + +For more visualization options, see [Visualization](visualization/visualization.md). + +## Typical Analysis Workflow Order + +Based on our notebooks, here's the recommended order for conducting analysis: + +1. **Data Setup** → Load database → Preprocess data +2. **Initialize Analyzers** → EfficiencyAnalysis → FrequencyAnalysis +3. **Filter & Calculate** → Filter jobs → Calculate metrics +4. **Identify Users** → Find inefficient/efficient users +5. **Prepare Visualizations** → Time series data → Initialize visualizers +6. **Generate Plots** → Static plots → Interactive plots → Per-job analysis + +## Optional Scripts (MVP Scripts) + +The project includes several standalone scripts for quick analysis: + +- **CPU Metrics**: Analyze CPU usage patterns +- **GPU Metrics**: Analyze GPU utilization and efficiency +- **Zero GPU Usage**: Identify jobs with zero GPU usage + +See [MVP Scripts](mvp_scripts/cpu_metrics.md) for detailed usage instructions. + +--- + +**Next Steps:** + +- Follow the complete workflows in our [Demo Notebooks](demo.md) +- Explore the [Data and Efficiency Metrics](data-and-metrics.md) page for detailed metric definitions +- Visit [FAQ](faq.md) if you encounter any issues \ No newline at end of file diff --git a/docs/index.md b/docs/index.md index 65b16f7..f7e5e1c 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,15 +1,37 @@ # DS4CG Unity Job Analytics -Welcome to the documentation for the DS4CG Unity Job Analytics project. +Welcome to the documentation for the Unity Job Analytics project. 
This documentation exists to: + - Help new users and contributors understand the purpose and structure of the project. - Provide clear instructions for setup, usage, and contribution. - Serve as a reference for the available scripts, modules, and data analysis tools. - Document best practices and project standards for maintainability and collaboration. -**Project Purpose:** +## Project Purpose + +The Unity Job Analytics project provides tools and documentation for analyzing job data from the Unity cluster. It aims to help researchers and administrators gain insights into job performance, resource utilization, and efficiency, and to support reproducible, collaborative data science workflows. + +## Project Background + +The DS4CG Unity Job Analytics project was initiated as part of the DS4CG 2025 summer internship program in collaboration with the Unity HPC cluster at UMass. The goal is to provide robust tools and documentation for analyzing job data, improving resource utilization, and supporting research and operations on the Unity cluster. + +## Team & Contributors + +- Project Lead: Christopher Odoom +- Contributors: DS4CG Summer 2025 Internship Team + +## Acknowledgments +This project is supported by the Unity HPC team at UMass and the Data Science for the Common Good (DS4CG) program. Special thanks to all contributors and users who help improve the project. + +## Further Information + +- [Unity Documentation](https://docs.unity.rc.umass.edu/) +- [DS4CG Program](https://ds.cs.umass.edu/programs/ds4cg) + +For questions or support, please reach out via the Unity Slack or contact the project lead. -The DS4CG Unity Job Analytics project provides tools and documentation for analyzing job data from the Unity cluster. It aims to help researchers and administrators gain insights into job performance, resource utilization, and efficiency, and to support reproducible, collaborative data science workflows. +--- -Use the navigation on the left to explore detailed guides, module documentation, and contributor resources. +Use the navigation on the left to explore detailed guides, module documentation, and contributor resources. 
\ No newline at end of file diff --git a/docs/notebooks/Basic Visualization.ipynb b/docs/notebooks/Basic Visualization.ipynb new file mode 100644 index 0000000..e9485c6 --- /dev/null +++ b/docs/notebooks/Basic Visualization.ipynb @@ -0,0 +1,129 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "id": "0", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "Jupyter server should be run at the notebook directory, so the output of the following cell would be the project root:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2", + "metadata": {}, + "outputs": [], + "source": [ + "project_root = str(Path.cwd().resolve().parent)\n", + "print(f\"Project root: {project_root}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "if project_root not in sys.path:\n", + " sys.path.insert(0, project_root)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "# Reload all modules imported with %aimport every time before executing the Python code typed.\n", + "%autoreload 1\n", + "%aimport src.visualization.columns, src.database.database_connection, \\\n", + " src.visualization.models, src.preprocess.preprocess" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5", + "metadata": {}, + "outputs": [], + "source": [ + "from src.visualization import ColumnVisualizer\n", + "from src.preprocess import preprocess_data\n", + "from src.database import DatabaseConnection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "db_connection = DatabaseConnection(\"../data/slurm_data.db\")\n", + "jobs_df = db_connection.fetch_all_jobs()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "clean_jobs_df = preprocess_data(jobs_df, min_elapsed_seconds=600)\n", + "clean_jobs_df" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8", + "metadata": {}, + "outputs": [], + "source": [ + "visualizer = ColumnVisualizer(clean_jobs_df.sample(10000, random_state=42))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9", + "metadata": {}, + "outputs": [], + "source": [ + "visualizer.visualize(\n", + " output_dir_path=Path(\"../data/visualizations\"),\n", + " columns=None,\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "duckdb", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/notebooks/Efficiency Analysis.ipynb b/docs/notebooks/Efficiency Analysis.ipynb new file mode 100644 index 0000000..1d07964 --- /dev/null +++ b/docs/notebooks/Efficiency Analysis.ipynb @@ -0,0 +1,614 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0", + "metadata": {}, + "source": [ + "# [Efficiency Analysis](#toc0_)\n", + "This notebook demonstrates the use of `EfficiencyAnalysis` class in `src/analysis/efficiency_analysis.py` for analyzing the efficiency of jobs, users, and PI groups." 
+ ] + }, + { + "cell_type": "markdown", + "id": "1", + "metadata": {}, + "source": [ + "**Table of contents** \n", + "- [Efficiency Analysis](#toc1_) \n", + " - [Setup](#toc1_1_) \n", + " - [Example: Analyze workload efficiency of GPU users who set no VRAM constraints and used 0 GB of VRAM](#toc1_2_) \n", + " - [Job Efficiency Metrics](#toc1_2_1_) \n", + " - [Find most inefficient jobs with no VRAM constraints based on `vram_hours`](#toc1_2_1_1_) \n", + " - [User Efficiency Metrics](#toc1_2_2_) \n", + " - [Find Inefficient Users based on `expected_value_alloc_vram_efficiency`](#toc1_2_2_1_) \n", + " - [Find Inefficient Users based on `vram_hours`](#toc1_2_2_2_) \n", + " - [PI Group Efficiency Metrics](#toc1_2_3_) \n", + " - [Find Inefficient PIs based on `vram_hours`](#toc1_2_3_1_) \n", + " - [Example: Analyze all jobs with no VRAM constraints](#toc1_3_) \n", + " - [Job Efficiency Metrics](#toc1_3_1_) \n", + " - [Problem with duplicate JobIDs](#toc1_3_1_1_) \n", + " - [Top users with most number of jobs that have no VRAM constraints](#toc1_3_1_2_) \n", + " - [Find inefficient jobs with no VRAM Constraints based on `alloc_vram_efficiency_score`](#toc1_3_1_3_) \n", + "\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "2", + "metadata": {}, + "source": [ + "## [Setup](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3", + "metadata": {}, + "outputs": [], + "source": [ + "# Import required modules\n", + "import sys\n", + "from pathlib import Path\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "id": "4", + "metadata": {}, + "source": [ + "Jupyter server should be run at the notebook directory, so the output of the following cell would be the project root:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5", + "metadata": {}, + "outputs": [], + "source": [ + "project_root = str(Path.cwd().resolve().parent)\n", + "print(f\"Project root: {project_root}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6", + "metadata": {}, + "outputs": [], + "source": [ + "# Add project root to sys.path for module imports\n", + "if project_root not in sys.path:\n", + " sys.path.insert(0, project_root)\n", + "\n", + "from src.analysis import efficiency_analysis as ea\n", + "from src.visualization import JobsWithMetricsVisualizer, UsersWithMetricsVisualizer\n", + "\n", + "# Automatically reload modules before executing code\n", + "# This is useful for development to see changes without restarting the kernel.\n", + "%load_ext autoreload\n", + "# Reload all modules imported with %aimport every time before executing the Python code typed.\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7", + "metadata": {}, + "outputs": [], + "source": [ + "# Load the jobs DataFrame from DuckDB\n", + "preprocessed_jobs_df = ea.load_preprocessed_jobs_dataframe_from_duckdb(\n", + " db_path=\"../data/slurm_data.db\",\n", + " table_name=\"Jobs\",\n", + ")\n", + "display(preprocessed_jobs_df.head(10))\n", + "print(preprocessed_jobs_df.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "8", + "metadata": {}, + "source": [ + "## [Example: Analyze workload efficiency of GPU users who set no VRAM constraints and used 0 GB of VRAM](#toc0_)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9", + "metadata": {}, + "outputs": [], + "source": [ + "efficiency_analysis = 
ea.EfficiencyAnalysis(jobs_df=preprocessed_jobs_df)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "10", + "metadata": {}, + "outputs": [], + "source": [ + "filtered_jobs = efficiency_analysis.filter_jobs_for_analysis(\n", + " vram_constraint_filter=pd.NA, # No VRAM constraints\n", + " gpu_mem_usage_filter=0, # Used 0 GB of VRAM\n", + ")\n", + "filtered_jobs" + ] + }, + { + "cell_type": "markdown", + "id": "11", + "metadata": {}, + "source": [ + "Generate all metrics:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12", + "metadata": {}, + "outputs": [], + "source": [ + "metrics_dict = efficiency_analysis.calculate_all_efficiency_metrics(filtered_jobs)\n", + "\n", + "jobs_with_metrics = metrics_dict[\"jobs_with_efficiency_metrics\"]\n", + "users_with_metrics = metrics_dict[\"users_with_efficiency_metrics\"]\n", + "pi_accounts_with_metrics = metrics_dict[\"pi_accounts_with_efficiency_metrics\"]" + ] + }, + { + "cell_type": "markdown", + "id": "13", + "metadata": {}, + "source": [ + "### [Job Efficiency Metrics](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "14", + "metadata": {}, + "outputs": [], + "source": [ + "# Set option to display all columns\n", + "pd.set_option(\"display.max_columns\", None)\n", + "# Display the DataFrame\n", + "display(jobs_with_metrics.head(10))\n", + "# To revert to default settings (optional)\n", + "pd.reset_option(\"display.max_columns\")\n", + "\n", + "print(f\"Jobs found: {len(jobs_with_metrics)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "15", + "metadata": {}, + "source": [ + "#### [Find most inefficient jobs with no VRAM constraints based on `vram_hours`](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16", + "metadata": {}, + "outputs": [], + "source": [ + "inefficient_jobs_vram_hours = efficiency_analysis.sort_and_filter_records_with_metrics(\n", + " metrics_df_name_enum=ea.MetricsDataFrameNameEnum.JOBS,\n", + " sorting_key=\"vram_hours\",\n", + " ascending=False, # Sort by vram_hours in descending order\n", + " filter_criteria={\n", + " \"vram_hours\": {\"min\": 80 * 24, \"inclusive\": True}, # VRAM-hours threshold for identifying inefficient jobs\n", + " },\n", + ")\n", + "# Display top inefficient users by VRAM-hours\n", + "print(\"\\nTop inefficient Jobs by VRAM-hours:\")\n", + "display(inefficient_jobs_vram_hours.head(10))\n", + "\n", + "# Plot top inefficient jobs by VRAM-hours, with VRAM-hours as labels\n", + "jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(inefficient_jobs_vram_hours.head(20))\n", + "jobs_with_metrics_visualizer.visualize(\n", + " column=\"vram_hours\",\n", + " bar_label_columns=[\"vram_hours\", \"job_hours\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "17", + "metadata": {}, + "source": [ + "### [User Efficiency Metrics](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18", + "metadata": {}, + "outputs": [], + "source": [ + "users_with_metrics" + ] + }, + { + "cell_type": "markdown", + "id": "19", + "metadata": {}, + "source": [ + "#### [Find Inefficient Users based on `expected_value_alloc_vram_efficiency`](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20", + "metadata": {}, + "outputs": [], + "source": [ + "inefficient_users_alloc_vram_eff = efficiency_analysis.sort_and_filter_records_with_metrics(\n", + " metrics_df_name_enum=ea.MetricsDataFrameNameEnum.USERS,\n", + " 
sorting_key=\"expected_value_alloc_vram_efficiency\",\n", + " ascending=True, # we want to find users with low efficiency\n", + " filter_criteria={\n", + " \"expected_value_alloc_vram_efficiency\": {\"max\": 0.3, \"inclusive\": True},\n", + " \"job_count\": {\"min\": 5, \"inclusive\": True}, # Minimum number of jobs to consider a user\n", + " },\n", + ")\n", + "print(\"\\nTop inefficient users by allocated vram efficiency:\")\n", + "display(inefficient_users_alloc_vram_eff.head(20))\n", + "\n", + "# Plot top inefficient users by allocated vram efficiency, with allocated vram efficiency as labels\n", + "users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_alloc_vram_eff.head(20))\n", + "users_with_metrics_visualizer.visualize(\n", + " column=\"expected_value_alloc_vram_efficiency\",\n", + " bar_label_columns=[\"expected_value_alloc_vram_efficiency\", \"user_job_hours\"],\n", + " figsize=(8, 10),\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "21", + "metadata": {}, + "outputs": [], + "source": [ + "inefficient_users = efficiency_analysis.sort_and_filter_records_with_metrics(\n", + " metrics_df_name_enum=ea.MetricsDataFrameNameEnum.USERS,\n", + " sorting_key=\"expected_value_alloc_vram_efficiency\",\n", + " ascending=True, # we want to find users with low efficiency\n", + " filter_criteria={\n", + " \"expected_value_alloc_vram_efficiency\": {\"max\": 0.3, \"inclusive\": True},\n", + " \"job_count\": {\"min\": 5, \"inclusive\": True}, # Minimum number of jobs to consider a user\n", + " },\n", + ")\n", + "\n", + "# Display top inefficient users by job count\n", + "print(\"\\nTop inefficient users by allocated vram efficiency:\")\n", + "display(inefficient_users.head(10))\n", + "\n", + "\n", + "# Plot top inefficient users by GPU hours, with efficiency as labels\n", + "top_users = inefficient_users.head(10)\n", + "\n", + "plt.figure(figsize=(8, 5))\n", + "barplot = sns.barplot(y=top_users[\"User\"], x=top_users[\"user_job_hours\"], orient=\"h\")\n", + "plt.xlabel(\"Job Hours\")\n", + "plt.ylabel(\"User\")\n", + "plt.title(\"Top 10 Inefficient Users by Allocated VRAM Efficiency Contribution\")\n", + "\n", + "# Annotate bars with expected_value_alloc_vram_efficiency, keeping text fully inside the plot's right spine\n", + "ax = barplot\n", + "xmax = top_users[\"user_job_hours\"].max()\n", + "# Add headroom for annotation space (20% extra)\n", + "xlim = xmax * 1.20 if xmax > 0 else 1\n", + "ax.set_xlim(0, xlim)\n", + "\n", + "# Calculate annotation x-position: place at 98% of xlim or just left of the right spine, whichever is smaller\n", + "for i, (job_hours, efficiency) in enumerate(\n", + " zip(\n", + " top_users[\"user_job_hours\"],\n", + " top_users[\"expected_value_alloc_vram_efficiency\"],\n", + " strict=True,\n", + " )\n", + "):\n", + " # Place annotation at min(job_hours + 2% of xlim, 98% of xlim)\n", + " xpos = min(job_hours + xlim * 0.02, xlim * 0.98)\n", + " # If bar is very close to right spine, nudge annotation left to avoid overlap\n", + " if xpos > xlim * 0.96:\n", + " xpos = xlim * 0.96\n", + " ax.text(xpos, i, f\"Eff: {efficiency:.2f}\", va=\"center\", ha=\"left\", fontsize=10, color=\"black\", clip_on=True)\n", + "\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "22", + "metadata": {}, + "source": [ + "#### [Find Inefficient Users based on `vram_hours`](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23", + "metadata": {}, + "outputs": [], + 
"source": [ + "inefficient_users_vram_hours = efficiency_analysis.find_inefficient_users_by_vram_hours(\n", + " vram_hours_filter={\"min\": 200, \"inclusive\": True}, # VRAM-hours threshold for identifying inefficient users\n", + " min_jobs=5, # Minimum number of jobs to consider a user\n", + ")\n", + "# Display top inefficient users by VRAM-hours\n", + "print(\"\\nTop inefficient users by VRAM-hours:\")\n", + "display(inefficient_users_vram_hours.head(20))\n", + "\n", + "\n", + "# Plot top inefficient users by VRAM-hours, with VRAM-hours as labels\n", + "users_with_metrics_visualizer = UsersWithMetricsVisualizer(inefficient_users_vram_hours.head(20))\n", + "users_with_metrics_visualizer.visualize(\n", + " column=\"vram_hours\", bar_label_columns=[\"vram_hours\", \"user_job_hours\"], figsize=(8, 10)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "24", + "metadata": {}, + "source": [ + "### [PI Group Efficiency Metrics](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25", + "metadata": {}, + "outputs": [], + "source": [ + "pi_accounts_with_metrics" + ] + }, + { + "cell_type": "markdown", + "id": "26", + "metadata": {}, + "source": [ + "#### [Find Inefficient PIs based on `vram_hours`](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27", + "metadata": {}, + "outputs": [], + "source": [ + "inefficient_pis_vram_hours = efficiency_analysis.sort_and_filter_records_with_metrics(\n", + " metrics_df_name_enum=ea.MetricsDataFrameNameEnum.PI_GROUPS,\n", + " sorting_key=\"pi_acc_vram_hours\",\n", + " ascending=False,\n", + " filter_criteria={\n", + " \"pi_acc_vram_hours\": {\"min\": 200, \"inclusive\": True}, # VRAM-hours threshold for identifying inefficient users\n", + " \"job_count\": {\"min\": 5, \"inclusive\": True}, # Minimum number of jobs to consider a PI account\n", + " },\n", + ")\n", + "# Display top inefficient users by VRAM-hours\n", + "print(\"\\nTop inefficient PI Groups by VRAM-hours:\")\n", + "display(inefficient_pis_vram_hours.head(20))\n", + "\n", + "top_pi_accounts = inefficient_pis_vram_hours.head(20)\n", + "\n", + "# Plot top inefficient users by VRAM-hours, with VRAM-hours as labels\n", + "plt.figure(figsize=(8, 8))\n", + "barplot = sns.barplot(\n", + " y=top_pi_accounts[\"pi_account\"],\n", + " x=top_pi_accounts[\"pi_acc_vram_hours\"],\n", + " order=top_pi_accounts[\"pi_account\"].tolist(), # Only show present values\n", + " orient=\"h\",\n", + ")\n", + "plt.xlabel(\"VRAM-Hours\")\n", + "plt.ylabel(\"PI Account\")\n", + "plt.title(\"Top Inefficient PI Accounts by VRAM-Hours\")\n", + "# Annotate bars with gpu_hours, keeping text fully inside the plot's right spine\n", + "ax = barplot\n", + "xmax = top_pi_accounts[\"pi_acc_vram_hours\"].max()\n", + "# Add headroom for annotation space (20% extra)\n", + "xlim = xmax * 1.6 if xmax > 0 else 1\n", + "ax.set_xlim(0, xlim)\n", + "# Calculate annotation x-position: place at 98% of xlim or just left of the right spine, whichever is smaller\n", + "for i, (vram_hours, pi_acc_job_hours) in enumerate(\n", + " zip(\n", + " top_pi_accounts[\"pi_acc_vram_hours\"],\n", + " top_pi_accounts[\"pi_acc_job_hours\"],\n", + " strict=True,\n", + " )\n", + "):\n", + " # Place annotation at min(vram_hours + 2% of xlim, 98% of xlim)\n", + " xpos = min(vram_hours + xlim * 0.02, xlim * 0.98)\n", + " ax.text(\n", + " xpos,\n", + " i,\n", + " f\"VRAM-Hours: {vram_hours:.2f}\\n Job Hours: {pi_acc_job_hours:.2f}\",\n", + " va=\"center\",\n", + " ha=\"left\",\n", + " fontsize=10,\n", + " 
color=\"black\",\n", + " clip_on=True,\n", + " )\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "28", + "metadata": {}, + "source": [ + "## [Example: Analyze all jobs with no VRAM constraints](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "29", + "metadata": {}, + "outputs": [], + "source": [ + "# Filter jobs where no VRAM constraint was set but a GPU was allocated\n", + "no_vram_constraint_efficiency_analysis = ea.EfficiencyAnalysis(jobs_df=preprocessed_jobs_df)\n", + "all_no_vram_constraint_jobs = no_vram_constraint_efficiency_analysis.filter_jobs_for_analysis(\n", + " vram_constraint_filter={\"min\": 0, \"inclusive\": False}, # No VRAM constraints\n", + " gpu_count_filter={\"min\": 1, \"inclusive\": True}, # At least one GPU allocated\n", + " gpu_mem_usage_filter={\"min\": 0, \"inclusive\": False}, # Used more than 0 GiB of VRAM\n", + ")\n", + "\n", + "display(all_no_vram_constraint_jobs.head(10))\n", + "print(all_no_vram_constraint_jobs.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "30", + "metadata": {}, + "source": [ + "### [Job Efficiency Metrics](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31", + "metadata": {}, + "outputs": [], + "source": [ + "no_vram_constraint_jobs_with_metrics = no_vram_constraint_efficiency_analysis.calculate_job_efficiency_metrics(\n", + " all_no_vram_constraint_jobs\n", + ")\n", + "\n", + "# Set option to display all columns\n", + "pd.set_option(\"display.max_columns\", None)\n", + "# Display the DataFrame\n", + "display(no_vram_constraint_jobs_with_metrics.head(10))\n", + "# To revert to default settings (optional)\n", + "pd.reset_option(\"display.max_columns\")\n", + "print(f\"Jobs found: {len(no_vram_constraint_jobs_with_metrics)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "32", + "metadata": {}, + "source": [ + "#### [Problem with duplicate JobIDs](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "33", + "metadata": {}, + "outputs": [], + "source": [ + "# select jobs with specific job id\n", + "pd.set_option(\"display.max_columns\", None)\n", + "# Display the DataFrame\n", + "display(no_vram_constraint_jobs_with_metrics[no_vram_constraint_jobs_with_metrics[\"JobID\"] == 24374463])\n", + "pd.reset_option(\"display.max_columns\")" + ] + }, + { + "cell_type": "markdown", + "id": "34", + "metadata": {}, + "source": [ + "#### [Top users with most number of jobs that have no VRAM constraints](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "35", + "metadata": {}, + "outputs": [], + "source": [ + "# Plot top users by number of jobs with no VRAM constraints\n", + "if not all_no_vram_constraint_jobs.empty:\n", + " plt.figure(figsize=(10, 5))\n", + " user_counts = all_no_vram_constraint_jobs[\"User\"].value_counts().head(20)\n", + " sns.barplot(x=user_counts.values, y=user_counts.index, orient=\"h\")\n", + " plt.xlabel(\"Number of Jobs\")\n", + " plt.ylabel(\"User\")\n", + " plt.title(\"Top 20 Users: Jobs with no VRAM Constraints\")\n", + " plt.tight_layout()\n", + " plt.show()\n", + "else:\n", + " print(\"No jobs found without VRAM constraints.\")" + ] + }, + { + "cell_type": "markdown", + "id": "36", + "metadata": {}, + "source": [ + "#### [Find inefficient jobs with no VRAM Constraints based on `alloc_vram_efficiency_score`](#toc0_)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37", + "metadata": {}, + "outputs": [], + "source": [ + 
"low_alloc_vram_score_jobs = no_vram_constraint_efficiency_analysis.sort_and_filter_records_with_metrics(\n", + " metrics_df_name_enum=ea.MetricsDataFrameNameEnum.JOBS,\n", + " sorting_key=\"alloc_vram_efficiency_score\",\n", + " ascending=True, # Sort by alloc_vram_efficiency_score in ascending order\n", + " filter_criteria={\n", + " \"alloc_vram_efficiency_score\": {\"max\": -10, \"inclusive\": True}, # score threshold\n", + " },\n", + ")\n", + "# Display top inefficient users by alloc_vram_efficiency_score\n", + "print(\"\\nTop inefficient Jobs by allocated VRAM efficiency score:\")\n", + "\n", + "display(low_alloc_vram_score_jobs.head(20))\n", + "\n", + "jobs_with_metrics_visualizer = JobsWithMetricsVisualizer(low_alloc_vram_score_jobs.head(20))\n", + "jobs_with_metrics_visualizer.visualize(\n", + " column=\"alloc_vram_efficiency_score\",\n", + " bar_label_columns=[\"alloc_vram_efficiency_score\", \"job_hours\"],\n", + " figsize=(10, 12),\n", + ")" + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/mkdocs.yml b/mkdocs.yml index a2e07ba..74ccf10 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,9 +2,31 @@ site_name: "Unity Job Analytics" theme: name: "material" + features: + - navigation.tabs # Enables horizontal top nav tabs + - navigation.tabs.sticky # Keeps tabs visible when scrolling + - navigation.sections # Groups subsections under tabs + - toc.integrate # Puts table of contents in main pane + palette: + # Palette toggle for light mode + - scheme: default + toggle: + icon: material/brightness-7 + name: Switch to dark mode + + # Palette toggle for dark mode + - scheme: slate + toggle: + icon: material/brightness-4 + name: Switch to light mode plugins: - search + - mkdocs-jupyter: + include_source: true + allow_errors: true + ignore_h1_titles: true + - mkdocstrings: handlers: python: @@ -21,20 +43,47 @@ plugins: # render Parameters/Returns as a table docstring_section_style: list +# Configure markdown extensions +markdown_extensions: + - attr_list + - md_in_html + - pymdownx.arithmatex: + generic: true + +extra_javascript: + # - javascripts/mathjax.js + - https://unpkg.com/mathjax@3/es5/tex-mml-chtml.js + +# Suppress warnings for external links to notebooks +validation: + omitted_files: warn + absolute_links: warn + unrecognized_links: warn + +docs_dir: docs + nav: - Home: 'index.md' - - About: 'about.md' - - Data: - - 'preprocess.md' - - Analysis: - - 'analysis/efficiency_analysis.md' - - Visualization: - - 'visualization/visualization.md' - - 'visualization/columns.md' - - 'visualization/efficiency_metrics.md' - - 'visualization/models.md' - - MVP Scripts: - - 'mvp_scripts/cpu_metrics.md' - - 'mvp_scripts/gpu_metrics.md' - - 'mvp_scripts/zero_gpu_usage.md' - + - Getting Started: 'getting-started.md' + - Data and Efficiency Metrics: 'data-and-metrics.md' + - Demo: 'demo.md' + - Notebooks: + - Efficiency Analysis: 'notebooks/Efficiency Analysis.ipynb' + - Visualization: 'notebooks/Basic Visualization.ipynb' + # - Frequency Analysis: 'notebooks/Frequency Analysis.ipynb' + - Technical Documentation: + - Data Processing: 'preprocess.md' + - Analysis: + - 'analysis/efficiency_analysis.md' + - 'analysis/frequency_analysis.md' + - Visualization: + - 'visualization/visualization.md' + - 'visualization/columns.md' + - 'visualization/efficiency_metrics.md' + - 'visualization/models.md' + - MVP Scripts: + - 'mvp_scripts/cpu_metrics.md' + - 'mvp_scripts/gpu_metrics.md' + - 'mvp_scripts/zero_gpu_usage.md' + - FAQ: 'faq.md' + - Contact: 'contact.md' \ No newline 
at end of file diff --git a/mvp_scripts/gpu_metrics.py b/mvp_scripts/gpu_metrics.py index 7485453..9a65204 100644 --- a/mvp_scripts/gpu_metrics.py +++ b/mvp_scripts/gpu_metrics.py @@ -27,7 +27,7 @@ vram_labels = [0] + vram_cutoffs[2:] -def get_requested_vram(constraints): +def get_requested_vram(constraints) -> int: """Get the minimum requested VRAM from job constraints. Args: diff --git a/mvp_scripts/zero_gpu_usage_list.py b/mvp_scripts/zero_gpu_usage_list.py index 424b02d..f062cda 100644 --- a/mvp_scripts/zero_gpu_usage_list.py +++ b/mvp_scripts/zero_gpu_usage_list.py @@ -23,7 +23,7 @@ HOURS = "{hours} unused GPU hours. The most recent jobs are the following:" -def get_job_type_breakdown(interactive, jobs): +def get_job_type_breakdown(interactive, jobs) -> str: """Generate a summary string describing the breakdown of interactive and batch jobs. Args: @@ -45,7 +45,7 @@ def get_job_type_breakdown(interactive, jobs): ) -def pi_report(account, days_back=60): +def pi_report(account, days_back=60) -> None: """Create an efficiency report for a given PI group, summarizing GPU usage and waste. Args: @@ -74,7 +74,7 @@ def pi_report(account, days_back=60): def main( dbfile="./modules/admin-resources/reporting/slurm_data.db", userlist="./users.csv", sendEmail=False, days_back=60 -): +) -> None: """Print out a list of users who habitually waste GPU hours, and optionally email them a report. Args: diff --git a/notebooks/.gitattributes b/notebooks/.gitattributes index 886e7e0..5a4e934 100644 --- a/notebooks/.gitattributes +++ b/notebooks/.gitattributes @@ -1,3 +1,4 @@ -*.ipynb filter=strip-notebook-output -# keep the output of the following notebooks when committing -SlurmGPU.ipynb !filter=strip-notebook-output \ No newline at end of file +# *.ipynb filter=strip-notebook-output +# # keep the output of the following notebooks when committing +# SlurmGPU.ipynb -filter=strip-notebook-output +# notebooks/SlurmGPU.ipynb -filter=strip-notebook-output \ No newline at end of file diff --git a/notebooks/SlurmGPU.ipynb b/notebooks/SlurmGPU.ipynb index 1842a93..a25eec5 100644 --- a/notebooks/SlurmGPU.ipynb +++ b/notebooks/SlurmGPU.ipynb @@ -19,7 +19,25 @@ { "cell_type": "code", "execution_count": null, - "id": "9d1e8bea-5de8-430c-be4a-5f9c268cdc45", + "id": "0", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "from pathlib import Path\n", + "\n", + "project_root = str(Path.cwd().resolve().parent)\n", + "print(f\"Project root: {project_root}\")\n", + "\n", + "if project_root not in sys.path:\n", + " sys.path.append(project_root)\n", + " print(f\"Added project root to sys.path: {project_root}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1", "metadata": {}, "outputs": [], "source": [ @@ -42,7 +60,7 @@ }, { "cell_type": "markdown", - "id": "cad34b63-0a18-4f3c-9787-7223355a42b8", + "id": "2", "metadata": {}, "source": [ "First we take a look at average and median queue wait times for jobs, based on how much GPU VRam they request." @@ -156,7 +174,7 @@ }, { "cell_type": "markdown", - "id": "bc5ce0eb-358d-4ae2-8b40-9be99da18535", + "id": "7", "metadata": {}, "source": [ "Next we examine VRAM usage levels for all jobs, jobs with no specific VRAM request, and for jobs that request the largest GPU possible (80G) of VRAM." 
@@ -265,7 +283,7 @@ { "cell_type": "code", "execution_count": null, - "id": "30120541-d6e4-4e3f-97ea-83d54a315f7e", + "id": "11", "metadata": {}, "outputs": [], "source": [] diff --git a/src/preprocess/preprocess.py b/src/preprocess/preprocess.py index d271179..ff0f380 100644 --- a/src/preprocess/preprocess.py +++ b/src/preprocess/preprocess.py @@ -484,4 +484,4 @@ def preprocess_data( processing_error_logs.clear() error_indices.clear() - return data + return data \ No newline at end of file