A Lightweight Inference Server with Resource-Aware Scheduling
AutoScaleAI is a modular Python-based microservice that performs real-time machine learning inference while monitoring system performance. It simulates intelligent auto-scaling behavior by tracking latency, CPU usage, and memory consumption per request — mimicking how infrastructure-aware systems like Nutanix dynamically adapt to workload pressure.
- ✅ FastAPI-based RESTful server for ML inference
- ✅ Pre-trained RandomForest model on the digits dataset
- ✅ System monitoring using `psutil` to track CPU and memory usage
- ✅ Latency analysis for every prediction
- ✅ Auto-scaling logic simulated via conditional triggers in `autoscaler.py`
- ✅ Modular architecture with clear separation of concerns
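The scaling triggers in `autoscaler.py` are not shown in this README; a minimal sketch of what latency-based conditional triggers might look like follows, where the threshold values and the function name are illustrative assumptions, not the project's actual code:

```python
# Hypothetical sketch of latency-based scaling triggers, in the spirit of
# autoscaler.py. Thresholds and names are illustrative assumptions.
LATENCY_HIGH = 0.050   # seconds: above this, suggest scaling up
LATENCY_LOW = 0.005    # seconds: below this, consider scaling down
CPU_HIGH = 80.0        # percent CPU that also triggers scale-up

def scaling_decision(latency: float, cpu: float) -> str:
    """Return a simulated scaling action based on the last request's metrics."""
    if latency > LATENCY_HIGH or cpu > CPU_HIGH:
        return "scale_up"
    if latency < LATENCY_LOW and cpu < CPU_HIGH / 2:
        return "scale_down"
    return "hold"

print(scaling_decision(0.120, 30.0))  # high latency -> "scale_up"
```

Because the decision is a pure function of the last request's metrics, it can be unit-tested without a running server.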
- Backend: Python 3.11, FastAPI, Uvicorn
- ML Model: scikit-learn `RandomForestClassifier`
- Monitoring: `psutil` (CPU + Memory)
- Packaging: `joblib`
- Serving: REST API via Uvicorn
```
AutoScaleAI/
├── app/
│   ├── main.py           # FastAPI API endpoints
│   ├── model.py          # Model loading and prediction
│   ├── autoscaler.py     # Latency-based scaling logic
│   ├── utils.py          # Helper functions (monitoring, formatting)
│   └── train_model.py    # One-time model training script
├── requirements.txt
└── README.md
```
```bash
git clone https://github.com/shubh-garg/AutoScaleAI.git
cd AutoScaleAI
pip install -r requirements.txt
python app/train_model.py
uvicorn app.main:app --reload
```
Visit http://127.0.0.1:8000/docs for the Swagger UI.
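The one-time `python app/train_model.py` step above fits in a few lines; a sketch of what it might contain, assuming the model is persisted as `model.joblib` (the filename is an assumption):

```python
# Hypothetical sketch of app/train_model.py: train a RandomForest on the
# scikit-learn digits dataset and persist it with joblib.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
import joblib

digits = load_digits()               # 1797 samples, 64 features (8x8 images)
X, y = digits.data, digits.target

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

joblib.dump(model, "model.joblib")   # loaded later by app/model.py
```

Persisting with `joblib` keeps startup cheap: the API process only loads the serialized model instead of retraining on every boot.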
Endpoint: `POST /predict/`
Sample Payload (the digits model expects 64 feature values):

```json
{
  "features": [0.0, 0.0, 10.0, 5.0, 8.0, 3.0, ..., 0.0]
}
```
Sample Response:

```json
{
  "prediction": 4,
  "latency": 0.0075,
  "cpu": 13.6,
  "memory": 83.6
}
```
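The `latency`, `cpu`, and `memory` fields in the response can be collected with `time.perf_counter` and `psutil`; a rough sketch, where `predict_with_telemetry` and the `run_inference` placeholder are hypothetical names standing in for the project's actual model call:

```python
import time
import psutil

def predict_with_telemetry(run_inference, features):
    """Run an inference callable and attach latency/CPU/memory telemetry."""
    start = time.perf_counter()
    prediction = run_inference(features)
    latency = time.perf_counter() - start   # seconds for this request

    return {
        "prediction": prediction,
        "latency": round(latency, 4),
        "cpu": psutil.cpu_percent(interval=None),    # system-wide CPU %
        "memory": psutil.virtual_memory().percent,   # system-wide RAM %
    }

# Dummy "model" for illustration: sum of features mod 10.
result = predict_with_telemetry(lambda f: sum(f) % 10, [0.0, 10.0, 5.0])
print(result)
```

Note that `psutil.cpu_percent(interval=None)` reports usage since its previous call, so the first reading in a process may be 0.0.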
- Developed a modular FastAPI microservice for AI inference that dynamically reports latency and system usage to simulate resource-aware scheduling.
- Integrated lightweight model serving (`RandomForestClassifier`) with live system telemetry via `psutil`, achieving <10 ms latency on local CPU with scaling-logic triggers.
- Add async queueing / worker threads
- Deploy with Docker and Gunicorn
- Integrate GPU usage metrics (if available)
- Replace RF with ONNX or quantized Transformer model
- Live dashboard with Plotly or Grafana
MIT © 2025 Shubh Garg