RAID (Rust Analysis and Informative Debugger) is a comprehensive system health checker that leverages AI to analyze your system for potential issues, security concerns, and performance problems.
This project is currently "vibe coded" - meaning it was built quickly with a focus on getting things working rather than production-quality polish. While it functions and can be useful, please understand that there may be more bugs than you'd typically expect from a finished project.
If people find this tool useful and there's community interest, we may invest more effort into making it more robust and production-ready in the future. For now, please use it with the understanding that it's more of a proof-of-concept/experimental tool.
We know this is hard to believe, but yes, there really might be more bugs than usual!
- The application should NEVER attempt to read, parse, or source the `~/.chatgpt` file - API keys must be provided via environment variables or command line arguments only
- This is a security and design principle - the application should not make assumptions about user file structures
- Multi-AI Provider Support: OpenAI, Anthropic, and Local models
- Comprehensive System Analysis: Kubernetes, containers, systemd, journal, cgroups
- Historical Data Storage: SQLite database for tracking changes over time
- Modular Architecture: Easy to extend with new AI providers and system checks
- Async Operations: Fast, non-blocking AI analysis
- Flexible Configuration: Support for both environment variables and command line arguments
- Clean CLI: Built with clap derive for excellent user experience
- Smart Defaults: Pre-configured with optimal models for immediate use
Just provide your OpenAI API key and you're ready to go!
# Set your API key
export AI_API_KEY=your_openai_api_key
# Run a full system check (uses GPT-4o-mini by default)
cargo run
# Or check specific components
cargo run -- check system
cargo run -- check containers
cargo install --path .
The tool supports both environment variables and command line arguments for AI configuration. Command line arguments take precedence over environment variables.
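Conceptually, that precedence rule reduces to a "CLI first, then environment" lookup. The sketch below is illustrative rather than the project's actual code; `resolve_api_key` and its parameter are hypothetical names.

```rust
use std::env;

// Illustrative sketch of CLI-over-environment precedence for the API key.
// `cli_api_key` stands in for the value parsed from --ai-api-key.
fn resolve_api_key(cli_api_key: Option<String>) -> Option<String> {
    // The command line argument wins; otherwise fall back to the AI_API_KEY
    // environment variable. No file (such as ~/.chatgpt) is ever consulted.
    cli_api_key.or_else(|| env::var("AI_API_KEY").ok())
}
```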
- OpenAI: `gpt-4o-mini` (fast, cost-effective, powerful)
- Anthropic: `claude-3-5-sonnet-20241022` (latest Claude model)
- Local: `llama2` (for Ollama users)
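Those defaults amount to a per-provider lookup along these lines (a sketch mirroring the list above; the real selection logic lives in the configuration code and may differ):

```rust
// Sketch of per-provider default model selection, mirroring the defaults above.
fn default_model(provider: &str) -> &'static str {
    match provider {
        "anthropic" => "claude-3-5-sonnet-20241022",
        "local" => "llama2",
        _ => "gpt-4o-mini", // open-ai is the default provider
    }
}
```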
# Global AI options (can be used with any subcommand)
-p, --ai-provider <PROVIDER> AI provider to use (open-ai, anthropic, local) [default: open-ai]
-k, --ai-api-key <KEY> API key for the AI provider
-m, --ai-model <MODEL> AI model to use
--ai-base-url <URL> Base URL for AI provider (for custom endpoints)
--ai-max-tokens <TOKENS> Maximum tokens for AI response
--ai-temperature <TEMP> Temperature for AI response (0.0-1.0)
--dry-run Run without AI analysis (just collect and display system info)
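These global flags correspond to a clap derive struct roughly like the following. This is a hedged sketch: field names, defaults, and attributes are assumptions based on the help text above, not the actual contents of `src/cli.rs`.

```rust
use clap::Parser;

// Illustrative sketch of the global AI options as clap derive arguments.
#[derive(Parser)]
#[command(name = "raid", version, about = "AI-assisted system health checker")]
struct Cli {
    /// AI provider to use (open-ai, anthropic, local)
    #[arg(short = 'p', long, default_value = "open-ai")]
    ai_provider: String,

    /// API key for the AI provider
    #[arg(short = 'k', long)]
    ai_api_key: Option<String>,

    /// AI model to use
    #[arg(short = 'm', long)]
    ai_model: Option<String>,

    /// Temperature for AI response (0.0-1.0)
    #[arg(long, default_value_t = 0.7)]
    ai_temperature: f32,

    /// Run without AI analysis (just collect and display system info)
    #[arg(long)]
    dry_run: bool,
}
```

The same settings can also be supplied through environment variables, as the examples that follow show.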
# OpenAI (default provider)
export AI_PROVIDER=open-ai # Optional, this is the default
export AI_API_KEY=your_openai_api_key
export AI_MODEL=gpt-4o-mini # Optional, this is the default
export AI_MAX_TOKENS=1000
export AI_TEMPERATURE=0.7

# Anthropic
export AI_PROVIDER=anthropic
export AI_API_KEY=your_anthropic_api_key
export AI_MODEL=claude-3-5-sonnet-20241022 # Optional, this is the default
export AI_MAX_TOKENS=1000
export AI_TEMPERATURE=0.7

# Local models (Ollama)
export AI_PROVIDER=local
export AI_MODEL=llama2 # Optional, this is the default
export AI_BASE_URL=http://localhost:11434
export AI_MAX_TOKENS=1000
export AI_TEMPERATURE=0.7
For self-hosted or custom endpoints:
export AI_BASE_URL=https://your-custom-endpoint.com
cargo run
# or
cargo run -- check all
# Using command line arguments
cargo run -- --ai-provider open-ai --ai-api-key your_key check all
# Using environment variables (recommended for production)
export AI_API_KEY=your_key
cargo run -- check all
# Mix of both (command line takes precedence)
export AI_PROVIDER=anthropic
cargo run -- --ai-provider open-ai --ai-api-key your_key check system
# Local model with custom settings
cargo run -- --ai-provider local --ai-model llama2 --ai-temperature 0.8 check containers
Perfect for testing system information collection without making expensive AI API calls:
# Check everything without AI analysis
cargo run -- --dry-run
# Check specific components without AI analysis
cargo run -- --dry-run check system
cargo run -- --dry-run check containers
cargo run -- --dry-run check kubernetes
cargo run -- --dry-run check cgroups
cargo run -- --dry-run check systemd
cargo run -- --dry-run check journal
# Combine with other options (AI options are ignored in dry-run mode)
cargo run -- --dry-run --ai-provider open-ai --ai-api-key your_key check system
Benefits of Dry Run Mode:
- ✅ No API costs - perfect for testing and development
- ✅ Fast execution - no network calls to AI providers
- ✅ Full system information collection
- ✅ Same output format (without AI analysis section)
- ✅ Great for debugging system information collection
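Functionally, dry-run mode short-circuits before the AI step. The sketch below shows that idea with placeholder closures standing in for the real collection and analysis functions; it is not the project's actual control flow.

```rust
// Dry-run control flow: always collect and print system info, but skip the
// AI analysis step entirely when dry_run is set.
fn run_check(
    dry_run: bool,
    collect: impl Fn() -> String,
    analyze: impl Fn(&str) -> String,
) {
    let report = collect();           // system information is always gathered
    println!("{report}");

    if dry_run {
        return;                       // no network call, no API cost
    }

    println!("{}", analyze(&report)); // only reached when AI analysis is enabled
}
```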
# Basic system information
cargo run -- check system
# Container information with OpenAI (uses GPT-4o-mini by default)
cargo run -- --ai-api-key your_key check containers
# Kubernetes environment with Anthropic
cargo run -- --ai-provider anthropic --ai-api-key your_key check kubernetes
# Cgroup configuration with local model
cargo run -- --ai-provider local --ai-model llama2 check cgroups
# Systemd services
cargo run -- check systemd
# System journal
cargo run -- check journal
# Show help
cargo run -- --help
# Show help for check subcommand
cargo run -- check --help
# Show version
cargo run -- --version
System check:
- Operating system and CPU details
- Basic hardware information

Kubernetes check:
- Namespace, pod name, node name
- Service account details
- Kubernetes environment detection

Container check:
- Docker containers (if available)
- Containerd containers (if available)
- Container status, images, and ports

Cgroup check:
- Cgroup version (v1/v2)
- Memory and CPU limits
- Cgroup controllers and paths

Systemd check:
- System status
- Failed units
- Important service status (docker, containerd, kubelet, etc.)

Journal check:
- Recent errors and warnings
- Boot errors
- System log analysis
The tool stores all comprehensive checks in a SQLite database (`system_checks.db`) for historical analysis. Only the "all" command stores data in the database.
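As a rough illustration of that persistence step, the sketch below assumes the `rusqlite` crate and an invented table layout; the actual schema in `src/database.rs` may differ.

```rust
use rusqlite::{params, Connection, Result};

// Store one comprehensive check run in system_checks.db (illustrative schema).
fn store_check(summary: &str) -> Result<()> {
    let conn = Connection::open("system_checks.db")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks (
             id        INTEGER PRIMARY KEY,
             timestamp TEXT NOT NULL,
             summary   TEXT NOT NULL
         )",
        [],
    )?;
    conn.execute(
        "INSERT INTO checks (timestamp, summary) VALUES (datetime('now'), ?1)",
        params![summary],
    )?;
    Ok(())
}
```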
The tool is built with a modular architecture:
- CLI Module (`src/cli.rs`): Command line interface using clap derive
- AI Module (`src/ai.rs`): Abstract AI provider interface
- System Info Module (`src/sysinfo.rs`): System information collection
- Database Module (`src/database.rs`): Data persistence
- UI Module (`src/ui.rs`): Output formatting
Implement the `AIProvider` trait:

use async_trait::async_trait;

#[async_trait]
impl AIProvider for YourProvider {
    async fn analyze(&self, input: &str) -> Result<String, AIError> {
        // Your implementation
        todo!()
    }

    fn name(&self) -> &str {
        "YourProvider"
    }
}
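For reference, the trait you are implementing presumably looks something like the following. This is a sketch inferred from the example above; the actual definition (and the `AIError` variants) in `src/ai.rs` may differ.

```rust
use async_trait::async_trait;

// Illustrative error type; the real AIError will have its own variants.
#[derive(Debug)]
pub enum AIError {
    Network(String),
    Api(String),
}

#[async_trait]
pub trait AIProvider: Send + Sync {
    /// Send collected system information to the provider and return its analysis.
    async fn analyze(&self, input: &str) -> Result<String, AIError>;

    /// Human-readable provider name used in output.
    fn name(&self) -> &str;
}
```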
Extend the `SystemInfo` struct and add collection functions in `src/sysinfo.rs`.
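For example, a hypothetical disk-usage check might look like the sketch below. The type and function names are invented for illustration; the approach (shelling out to a system tool and degrading gracefully when it is missing) mirrors how this README describes the existing checks.

```rust
use std::process::Command;

// Hypothetical data type for a new disk-usage check (illustrative only).
#[derive(Debug, Default)]
pub struct DiskInfo {
    pub filesystem_summary: String,
}

// Collection function in the style of sysinfo.rs: gather data, but never fail
// hard if the underlying tool is unavailable.
pub fn collect_disk_info() -> DiskInfo {
    match Command::new("df").args(["-h", "/"]).output() {
        Ok(out) if out.status.success() => DiskInfo {
            filesystem_summary: String::from_utf8_lossy(&out.stdout).to_string(),
        },
        _ => DiskInfo::default(), // df missing or failed: return empty info
    }
}
```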
Add new variants to the `Commands` enum in `src/cli.rs` and handle them in the main function.
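Continuing the hypothetical disk-usage example, the new subcommand might be wired up like this sketch (the variant name and dispatch line are assumptions, not the project's real code):

```rust
use clap::Subcommand;

// Illustrative subcommand enum with the new variant added.
#[derive(Subcommand)]
pub enum Commands {
    /// Check disk usage (hypothetical new check)
    Disk,
    // ... existing variants such as System, Containers, Kubernetes, ...
}

// In main(), the new variant is then matched and dispatched, e.g.:
// Commands::Disk => print_report(collect_disk_info()),
```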
The tool gracefully handles:
- Missing AI API keys (falls back to DummyAI)
- Network connectivity issues
- Missing system tools (docker, systemctl, etc.)
- Database errors
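The missing-key case amounts to something like the sketch below. `DummyAI` is named in this README, but the types and selection logic here are assumptions for illustration.

```rust
// Illustrative fallback: with no API key configured, analysis degrades to a
// DummyAI-style placeholder instead of failing.
enum Analyzer {
    OpenAI { api_key: String },
    Dummy,
}

impl Analyzer {
    fn from_key(api_key: Option<String>) -> Self {
        match api_key {
            Some(api_key) => Analyzer::OpenAI { api_key },
            None => {
                eprintln!("No AI API key found; falling back to DummyAI");
                Analyzer::Dummy
            }
        }
    }

    fn analyze(&self, input: &str) -> String {
        match self {
            // A real build would send `input` to the provider's API here.
            Analyzer::OpenAI { api_key: _ } => format!("(analysis of {} bytes would go here)", input.len()),
            Analyzer::Dummy => "AI analysis skipped: no API key configured".to_string(),
        }
    }
}
```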
# Check everything with OpenAI (GPT-4o-mini)
export AI_API_KEY=your_key
cargo run
# Or using command line arguments
cargo run -- --ai-api-key your_key
# Test system information collection without AI costs
cargo run -- --dry-run
# Test specific components
cargo run -- --dry-run check system
cargo run -- --dry-run check containers
# Only check containers with specific model
cargo run -- --ai-api-key your_key --ai-model gpt-4 check containers
# Use local Ollama model
cargo run -- --ai-provider local --ai-model llama2 check systemd
# Use custom OpenAI-compatible endpoint
cargo run -- --ai-api-key your_key --ai-base-url https://your-endpoint.com check all
- Check your API keys are set correctly
- Verify network connectivity
- For local models, ensure Ollama is running
- Use `--ai-provider local` for testing without API keys
- Some checks require root access (systemd, journal)
- Run with `sudo` if needed for full system access
- Install Docker for container checks
- Install systemd for service checks
- Install journalctl for log analysis
- Fork the repository
- Create a feature branch
- Add your changes
- Add tests if applicable
- Submit a pull request
MIT License - see LICENSE file for details.
RAID now supports an advanced AI Agent Mode that allows the AI to iteratively call diagnostic tools and make multiple rounds of analysis to solve complex problems.
- Iterative Problem Solving: The AI can call multiple diagnostic tools in sequence
- Configurable Tool Limits: Set maximum number of tool calls (default: 50)
- Pause and Continue: AI can pause to ask for user clarification
- Interactive Sessions: Continue analysis after hitting limits
- Comprehensive Logging: Track all tool calls and decisions
# Enable AI agent mode for complex problem solving
cargo run -- "my pod is stuck in crash loop backoff" --ai-agent-mode
# With custom tool call limit
cargo run -- "system is running slow" --ai-agent-mode --ai-max-tool-calls 100
# With specific AI provider
cargo run -- "kubernetes deployment failing" --ai-agent-mode --ai-provider anthropic
🤖 AI Agent Mode - Iterative Problem Solving
Problem: my pod is stuck in crash loop backoff
Max tool calls: 50
Starting analysis...
🔍 AI Agent is calling diagnostic tools...
🔧 Tool 1/50: kubectl_get_pods --namespace default
✅ Found pod in CrashLoopBackOff state
🔧 Tool 2/50: kubectl_describe_pod my-pod --namespace default
✅ Found exit code 1, checking logs...
🔧 Tool 3/50: kubectl_logs my-pod --namespace default --lines 50
✅ Found connection refused error
🔧 Tool 4/50: kubectl_get_services --namespace default
✅ Service configuration looks correct
🎯 Final Analysis (after 4 tool calls):
The pod is failing because it cannot connect to the database service.
The connection is being refused, indicating either:
1. Database service is not running
2. Wrong service name in pod configuration
3. Network policy blocking connection
Recommended next steps:
- Check if database pod is running: kubectl get pods | grep database
- Verify service endpoints: kubectl get endpoints
- Check pod environment variables for correct service names
When the AI needs more information:
⏸️ AI Agent paused after 15 tool calls
Reason: I need more information about your database configuration.
Your response (or 'quit' to exit): The database is running in namespace "database"
🔄 Continuing analysis with additional context...
When hitting tool call limits:
⚠️ Tool call limit reached after 50 calls
Partial analysis: Found multiple potential issues with networking configuration...
Continue with 50 more tool calls? (y/n): y
Continuing with 50 more tool calls...
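Under the hood, agent mode is essentially a bounded tool-calling loop: ask the AI for its next step, run the requested diagnostic, feed the output back, and stop at the limit or when a final answer (or a question for the user) comes back. The sketch below is illustrative only; the enum and the closure parameters are hypothetical stand-ins for the real implementation.

```rust
// What the AI asks for on each iteration (illustrative).
enum NextAction {
    CallTool(String), // e.g. "kubectl_get_pods --namespace default"
    AskUser(String),  // pause and request clarification
    Final(String),    // final analysis text
}

// Bounded agent loop: `ask_ai` and `run_tool` stand in for the real AI call
// and diagnostic-tool runner.
fn agent_loop(
    problem: &str,
    max_tool_calls: usize,
    ask_ai: impl Fn(&str) -> NextAction,
    run_tool: impl Fn(&str) -> String,
) {
    let mut transcript = format!("Problem: {problem}\n");

    for call in 1..=max_tool_calls {
        match ask_ai(&transcript) {
            NextAction::CallTool(cmd) => {
                println!("Tool {call}/{max_tool_calls}: {cmd}");
                transcript.push_str(&run_tool(&cmd)); // feed results back to the AI
            }
            NextAction::AskUser(question) => {
                println!("Paused: {question}"); // resumes once the user replies
                return;
            }
            NextAction::Final(analysis) => {
                println!("Final analysis:\n{analysis}");
                return;
            }
        }
    }

    println!("Tool call limit reached; offer to continue with more calls.");
}
```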
| Option | Environment Variable | Default | Description |
|---|---|---|---|
| `--ai-agent-mode` | - | false | Enable iterative AI agent mode |
| `--ai-max-tool-calls` | `AI_MAX_TOOL_CALLS` | 50 | Maximum tool calls per session |
| `--ai-provider` | `AI_PROVIDER` | open-ai | AI provider (open-ai, anthropic, local) |
| `--ai-api-key` | `AI_API_KEY` | - | API key for AI provider |
| `--ai-model` | `AI_MODEL` | auto | Specific model to use |
| Feature | Standard Mode | AI Agent Mode |
|---|---|---|
| Tool calls | Single analysis | Multiple iterative calls |
| Interaction | One-shot | Interactive with pauses |
| Problem solving | Basic | Advanced multi-step |
| Tool selection | Pre-defined | AI-driven selection |
| Continuation | No | Yes, after limits/questions |
Standard Mode is best for:
- Quick system health checks
- Simple diagnostic questions
- Automated health monitoring
- Batch processing
AI Agent Mode is best for:
- Complex troubleshooting scenarios
- Multi-component system issues
- Interactive debugging sessions
- Learning system investigation techniques
Set these for consistent AI agent behavior:
export AI_PROVIDER=anthropic
export AI_API_KEY=your-api-key-here
export AI_MAX_TOOL_CALLS=75
export AI_MODEL=claude-3-5-sonnet-20241022
Kubernetes Troubleshooting:
cargo run -- "deployment is failing with ImagePullBackOff" --ai-agent-mode --ai-max-tool-calls 30
Performance Investigation:
cargo run -- "system is slow and users are complaining" --ai-agent-mode --verbose
Network Debugging:
cargo run -- "pods cannot reach external services" --ai-agent-mode --ai-provider anthropic