# InfiniLM Service Babysitter

This directory contains scripts that automatically restart the InfiniLM service when it crashes or panics.

## Files

- `babysitter.py` - Python script that monitors and restarts the service
- `start_service.sh` - Shell script wrapper for easy service startup
- `infinilm.service` - Systemd service file for production deployment

## Quick Start

### Method 1: Using the shell script (Recommended)

```bash
# Start with default settings (port 5000, service.toml)
./start_service.sh

# Start on a different port
./start_service.sh -p 8080 service.toml

# Allow more restart attempts
./start_service.sh --max-restarts 20 service.toml

# Custom restart delay
./start_service.sh --restart-delay 10 service.toml
```

### Method 2: Using the Python script directly

```bash
# Basic usage
python3 babysitter.py service.toml

# With custom options
python3 babysitter.py --port 8080 --max-restarts 15 --restart-delay 10 service.toml
```

### Method 3: Systemd service (Production)

```bash
# Copy the service file
sudo cp infinilm.service /etc/systemd/system/

# Reload systemd
sudo systemctl daemon-reload

# Enable and start the service
sudo systemctl enable infinilm.service
sudo systemctl start infinilm.service

# Check status
sudo systemctl status infinilm.service

# View logs
sudo journalctl -u infinilm.service -f
```

## Features

- **Automatic Restart**: Restarts the service when it crashes or panics
- **Configurable Limits**: Set the maximum number of restart attempts and the delay between restarts
- **Logging**: Logs to both a file (`babysitter.log`) and the console
- **Graceful Shutdown**: Handles `SIGINT` and `SIGTERM` for a clean shutdown
- **Real-time Output**: Streams the service output as it is produced
- **Error Handling**: Robust error handling and recovery
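
The restart behaviour listed above can be sketched roughly as follows. This is a minimal illustration and not the code in `babysitter.py`: the function name `supervise`, its parameters, and the choice to treat exit code 0 as "do not restart" are all assumptions made for the example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=10, restart_delay=5.0, sleep=time.sleep):
    """Run cmd and restart it after a non-zero exit, at most max_restarts times.

    Returns the exit code of the final run. `sleep` is injectable so the
    delay can be skipped in tests (hypothetical parameter).
    """
    restarts = 0
    while True:
        # Blocks until the child exits; the child inherits our stdout/stderr,
        # so its output appears in real time.
        returncode = subprocess.call(cmd)
        if returncode == 0:
            return 0  # assumption: a clean exit is not restarted
        if restarts >= max_restarts:
            return returncode  # limit reached: give up and report the failure
        restarts += 1
        print(
            f"exit code {returncode}; restart {restarts}/{max_restarts} "
            f"in {restart_delay}s",
            file=sys.stderr,
        )
        sleep(restart_delay)
```

For example, `supervise(["cargo", "run", "service"], max_restarts=10, restart_delay=5)` would retry a failing command up to ten times with five seconds between attempts; the command shown here is only a placeholder.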

## Configuration Options

### Babysitter Options

- `--port`: Port to run the service on (default: 5000)
- `--max-restarts`: Maximum number of restart attempts (default: 10)
- `--restart-delay`: Delay between restarts in seconds (default: 5)
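
As a rough sketch, the options above could be declared with `argparse` like this. The flag names and defaults come from the list; the function name `build_parser`, the help strings, and the value types (`int` for the port and restart count, `float` for the delay) are assumptions, and the real `babysitter.py` may define them differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented babysitter options."""
    parser = argparse.ArgumentParser(
        description="Restart the InfiniLM service when it crashes"
    )
    parser.add_argument("config", help="service config file, e.g. service.toml")
    parser.add_argument("--port", type=int, default=5000,
                        help="port to run the service on")
    parser.add_argument("--max-restarts", type=int, default=10,
                        help="maximum number of restart attempts")
    parser.add_argument("--restart-delay", type=float, default=5.0,
                        help="delay between restarts in seconds")
    return parser


# Parse an explicit argument list instead of sys.argv to show the defaults.
args = build_parser().parse_args(["--port", "8080", "service.toml"])
print(args.port, args.max_restarts, args.restart_delay)  # 8080 10 5.0
```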

### Service Configuration

The service uses the same configuration as the original `cargo run service` command:

- Model paths and GPU assignments
- Sampling parameters (temperature, top-p, etc.)
- Token limits
- Blacklist settings
- **New**: `max-sessions` parameter to limit concurrent connections

## Logging

The babysitter creates a `babysitter.log` file in the current directory with detailed logs including:

- Service start/stop events
- Restart attempts and reasons
- Service output
- Error messages
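
Logging to both a file and the console, as described above, can be set up with the standard `logging` module. The sketch below is one possible arrangement; the logger name, the message format, and the helper name `make_logger` are assumptions and are not taken from `babysitter.py`.

```python
import logging
import sys


def make_logger(log_path="babysitter.log"):
    """Return a logger that writes each record to log_path and to stdout."""
    logger = logging.getLogger("babysitter")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A call such as `make_logger().info("service started")` then appends a timestamped line to `babysitter.log` and prints the same line to the console.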

## Troubleshooting

### Service won't start

1. Check if the config file exists and is valid
2. Verify that the `xtask` directory exists
3. Ensure you have the required dependencies (Python 3, Rust, CUDA)
4. Check the `babysitter.log` file for detailed error messages

### Service keeps restarting

1. Check the service logs for the root cause of crashes
2. Verify GPU memory availability
3. Check if the model files are accessible
4. Review the `max-sessions` setting if you're hitting connection limits

### Performance issues

1. Adjust the `max-sessions` parameter in your config file
2. Monitor GPU memory usage
3. Consider reducing batch sizes or model parameters

## Example Configuration

Add the `max-sessions` parameter to your `service.toml`:

```toml
[FM9G-7B]
path = "/root/zenghua/fm9g-7B-sft-v0.0-F16.gguf"
gpus = [1]
max-tokens = 32768
temperature = 0.7
top-p = 0.7
repetition-penalty = 1.02
max-sessions = 5  # Limit to 5 concurrent sessions

[Qwen3-32B]
path = "/root/zenghua/Qwen3-32B-F16.gguf"
gpus = [4, 5, 6, 7]
max-tokens = 32768
temperature = 0.6
top-p = 0.95
top-k = 20
repetition-penalty = 1.02
think = false
max-sessions = 3  # Limit to 3 concurrent sessions for larger model
```

## Monitoring

### Check service status

```bash
# If using systemd
sudo systemctl status infinilm.service

# Check babysitter logs
tail -f babysitter.log

# Check service logs
sudo journalctl -u infinilm.service -f
```

### Monitor resource usage

```bash
# Check GPU usage
nvidia-smi

# Check memory usage
free -h

# Check process status (the [x] keeps grep from matching its own process)
ps aux | grep '[x]task'
```
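
For scripted checks of GPU memory, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output into a dictionary; the helper names are hypothetical, and it assumes every field is numeric (some devices report `[N/A]`, which this simple parser does not handle).

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn rows like '0, 1024, 81920' into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage


def gpu_memory():
    """Run nvidia-smi and parse its output; raises if the tool is missing or fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `gpu_memory()` on a machine with NVIDIA drivers installed returns the used and total memory per GPU in MiB, which can be compared against the GPUs listed under `gpus` in `service.toml`.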

## Security Considerations

- The systemd service runs as root (required for GPU access)
- Consider using a dedicated user for production deployments
- Review and adjust the security settings in `infinilm.service`
- Monitor logs for any suspicious activity