Troubleshooting Guide

Common Issues

1. Pods Not Starting

Symptoms:

  • Pods stuck in Pending or CrashLoopBackOff
  • kubectl get pods shows errors

Diagnosis:

kubectl describe pod <pod-name> -n fl-system
kubectl logs <pod-name> -n fl-system

Common Causes:

Insufficient Resources

# Check node resources
kubectl top nodes

# Solution: Scale down or add nodes
kubectl scale deployment fl-client --replicas=2

Image Pull Errors

# Check image
kubectl describe pod <pod-name> | grep -A 5 Events

# Solution: Verify image exists
docker images | grep fl-platform

Configuration Errors

# Check ConfigMap
kubectl get configmap fl-config -o yaml

# Solution: Update ConfigMap
kubectl edit configmap fl-config
kubectl rollout restart deployment/fl-server

2. Training Not Converging

Symptoms:

  • Loss not decreasing
  • Accuracy stuck at the random-guess baseline

Diagnosis:

# Check training logs
kubectl logs deployment/fl-client -n fl-system | grep "loss"

# Check Grafana dashboard
kubectl port-forward svc/grafana 3000:3000

Solutions:

Learning Rate Too High/Low

# Update values.yaml
config:
  learningRate: 0.001  # Try different values

Data Distribution Issues

# Check partition statistics
# metadata_path points at the NIH metadata CSV; partitions is the output of
# the partitioning step performed earlier in the pipeline
from fl.datasets.nih_chest_xray import PatientBasedPartitioner
partitioner = PatientBasedPartitioner(metadata_path, num_hospitals=10)
stats = partitioner.get_partition_stats(partitions)
print(stats)
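
If the stats report is not detailed enough, a quick per-hospital label count makes skew obvious. This is a minimal sketch that assumes partitions maps each hospital ID to a list of records exposing a "label" field; adjust the field access to the actual partition format.

from collections import Counter

# Assumed shape: partitions = {hospital_id: [record, ...]},
# where each record carries a "label" entry.
for hospital_id, records in partitions.items():
    label_counts = Counter(r["label"] for r in records)
    print(hospital_id, dict(label_counts))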

Compression Too Aggressive

# Reduce compression
config:
  compression:
    sparsity: 0.5  # Less aggressive
    quantization_bits: 16  # More bits

3. High Memory Usage

Symptoms:

  • OOMKilled pods
  • Slow training

Diagnosis:

kubectl top pods -n fl-system
kubectl describe pod <pod-name> | grep -A 5 "Limits"

Solutions:

Increase Memory Limits

# values.yaml
client:
  resources:
    limits:
      memory: "4Gi"  # Increase

Reduce Batch Size

config:
  batchSize: 16  # Smaller batches

Enable Gradient Checkpointing

# In model definition
model.gradient_checkpointing_enable()
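
If the model exposes gradient_checkpointing_enable() (as Hugging Face transformers models do), the call above is all that is needed. For a plain torch.nn.Module, the same memory-for-compute trade can be made explicitly with torch.utils.checkpoint, as in this purely illustrative sketch (use_reentrant=False requires a reasonably recent PyTorch).

from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Activations of this block are recomputed during backward
        # instead of being kept in memory.
        self.expensive_block = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.head = nn.Linear(512, 10)

    def forward(self, x):
        x = checkpoint(self.expensive_block, x, use_reentrant=False)
        return self.head(x)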

4. Byzantine Detection False Positives

Symptoms:

  • Too many clients flagged as Byzantine
  • Training unstable

Diagnosis:

# Check Grafana Byzantine dashboard
# Look at anomaly scores

Solutions:

Adjust Detection Threshold

# In robust_aggregation.py
byzantine = aggregator.detect_byzantine(
    client_updates,
    threshold=5.0  # Increase threshold
)
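
The effect of the threshold is easiest to see with a toy score. The sketch below flags clients whose update norm deviates from the median by more than threshold robust standard deviations; this is one common scheme, not necessarily the scoring used in robust_aggregation.py, but it shows why a higher threshold tolerates honest-but-heterogeneous clients.

import numpy as np

def flag_byzantine(update_norms, threshold=5.0):
    """Toy detector: flag clients whose update norm is a robust outlier."""
    norms = np.asarray(update_norms, dtype=float)
    median = np.median(norms)
    mad = np.median(np.abs(norms - median)) + 1e-12  # robust spread estimate
    scores = np.abs(norms - median) / mad
    return scores > threshold

norms = [1.0, 1.2, 0.9, 1.1, 1.5]           # honest clients, non-IID data
print(flag_byzantine(norms, threshold=2.0))  # flags the 1.5 client
print(flag_byzantine(norms, threshold=5.0))  # flags nobody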

Use Different Aggregation Method

config:
  security:
    aggregation_method: "multi_krum"  # Try different method

5. Privacy Budget Exhausted

Symptoms:

  • DP training stops early
  • Privacy errors

Diagnosis:

# Check privacy metrics
kubectl logs deployment/fl-client | grep "epsilon"

Solutions:

Increase Privacy Budget

config:
  privacy:
    target_epsilon: 10.0  # Higher budget

Increase Noise Multiplier

config:
  privacy:
    noise_multiplier: 1.5  # More noise per round means less budget spent per round

Train for Fewer Rounds

config:
  num_server_rounds: 5  # Fewer rounds
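
These knobs interact: the budget, the per-round privacy cost, and the number of rounds must be mutually consistent. The sketch below uses basic composition, where per-round epsilons simply add; it is a loose upper bound, and an RDP/moments accountant will typically report a smaller total. epsilon_per_round is a placeholder for whatever your accountant reports per round.

def rounds_within_budget(epsilon_per_round: float, target_epsilon: float) -> int:
    """How many rounds fit in the budget under basic composition."""
    return int(target_epsilon // epsilon_per_round)

# Example: if each round spends about 1.2 epsilon against a budget of 10.0,
# only 8 rounds complete before the budget is exhausted.
print(rounds_within_budget(epsilon_per_round=1.2, target_epsilon=10.0))  # 8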

6. Slow Communication

Symptoms:

  • Long round times
  • High network latency

Diagnosis:

# Check network metrics
kubectl exec -it <pod-name> -- ping fl-server

# Check compression ratio
# View Grafana Communication dashboard

Solutions:

Increase Compression

config:
  compression:
    sparsity: 0.95  # More aggressive
    quantization_bits: 4  # Fewer bits
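
In concrete terms, sparsity 0.95 keeps only the largest 5% of gradient values, and 4-bit quantization maps the survivors onto 16 levels. The sketch below is illustrative only, not the platform's actual compressor, but it shows where the bandwidth savings come from.

import torch

def compress(grad: torch.Tensor, sparsity: float = 0.95, bits: int = 4):
    """Toy top-k sparsification followed by uniform quantization."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * (1 - sparsity)))   # number of values to keep
    _, idx = torch.topk(flat.abs(), k)               # largest-magnitude entries
    values = flat[idx]
    levels = 2 ** bits - 1                           # 4 bits -> 16 levels
    vmin, vmax = values.min(), values.max()
    scale = (vmax - vmin) / levels if vmax > vmin else torch.tensor(1.0)
    quantized = torch.round((values - vmin) / scale)
    return idx, quantized, vmin, scale               # enough to reconstruct approximately

grad = torch.randn(1_000_000)
idx, q, vmin, scale = compress(grad, sparsity=0.95, bits=4)
print(idx.numel(), "values kept out of", grad.numel())  # 50000 out of 1000000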

Enable Async Aggregation

# Implement async aggregation
# (Currently not implemented)

7. MLflow Not Accessible

Symptoms:

  • Cannot access MLflow UI
  • Experiments not logged

Diagnosis:

kubectl get svc mlflow -n fl-system
kubectl logs statefulset/mlflow -n fl-system

Solutions:

Port Forward

kubectl port-forward svc/mlflow 5000:5000 -n fl-system

Check Persistent Volume

kubectl get pvc -n fl-system
kubectl describe pvc mlflow-data-mlflow-0

Restart MLflow

kubectl rollout restart statefulset/mlflow -n fl-system

8. Grafana Dashboards Not Loading

Symptoms:

  • Empty dashboards
  • No data points

Diagnosis:

# Check Prometheus
kubectl port-forward svc/prometheus 9090:9090
# Visit http://localhost:9090

# Check metrics
curl 'http://localhost:9090/api/v1/query?query=fl_training_loss'

Solutions:

Verify Metrics Export

# In training code
from fl.metrics import metrics_collector
metrics_collector.record_training_loss(client_id, round_num, loss)
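
If the collector itself is in doubt, the round trip can be confirmed with a bare-bones exporter. The snippet below uses the prometheus_client library directly (an assumption; the platform may wire its exporter differently) to publish a gauge under the same name the dashboard queries; the client ID, round, value, and port are illustrative.

from prometheus_client import Gauge, start_http_server

training_loss = Gauge(
    "fl_training_loss", "Training loss per client and round",
    ["client_id", "round"],
)

start_http_server(8000)  # Prometheus scrapes http://<pod>:8000/metrics
training_loss.labels(client_id="hospital_3", round="12").set(0.42)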

Check Prometheus Config

kubectl get configmap prometheus-config -o yaml

Restart Prometheus

kubectl rollout restart deployment/prometheus

Performance Tuning

Optimize Training Speed

  1. Use GPU

client:
  resources:
    limits:
      nvidia.com/gpu: 1

  2. Increase Workers

from torch.utils.data import DataLoader
DataLoader(dataset, num_workers=4)

  3. Mixed Precision (see the full sketch after this list)

from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
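
A complete mixed-precision step ties autocast and GradScaler together. This is a generic PyTorch AMP loop, not lifted from the platform's trainer; model, optimizer, criterion, and dataloader are assumed to already exist.

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch, target in dataloader:
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in mixed precision
        loss = criterion(model(batch), target)
    scaler.scale(loss).backward()          # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
    scaler.update()                        # adjust the scale factor for the next step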

Optimize Communication

  1. Aggressive Compression

compression:
  sparsity: 0.99
  quantization_bits: 2

  2. Reduce Model Size

# Use smaller model
from fl.models import TinyNet

Optimize Memory

  1. Gradient Accumulation

for i, batch in enumerate(dataloader):
    loss = train_step(batch)  # assumes train_step returns the loss without calling backward
    (loss / accumulation_steps).backward()  # accumulate scaled gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

  2. Clear Cache

import torch
torch.cuda.empty_cache()

Debugging Commands

# Get all resources
kubectl get all -n fl-system

# Describe deployment
kubectl describe deployment fl-server -n fl-system

# View logs (last 100 lines)
kubectl logs --tail=100 deployment/fl-client -n fl-system

# Follow logs
kubectl logs -f deployment/fl-server -n fl-system

# Execute command in pod
kubectl exec -it <pod-name> -n fl-system -- bash

# Check events
kubectl get events -n fl-system --sort-by='.lastTimestamp'

# Resource usage
kubectl top pods -n fl-system
kubectl top nodes

# Port forward multiple services
kubectl port-forward svc/mlflow 5000:5000 &
kubectl port-forward svc/grafana 3000:3000 &
kubectl port-forward svc/prometheus 9090:9090 &

Getting Help

  1. Check Logs First

    • Server logs: kubectl logs deployment/fl-server
    • Client logs: kubectl logs deployment/fl-client
    • MLflow logs: kubectl logs statefulset/mlflow
  2. Check Metrics

    • Grafana dashboards
    • Prometheus queries
    • MLflow experiments
  3. Check Configuration

    • ConfigMaps
    • Secrets
    • Helm values
  4. GitHub Issues

    • Search existing issues
    • Create new issue with logs and config