A complete system for submitting, monitoring, and managing machine learning training jobs on Kubernetes clusters.
KubeJob is a distributed system consisting of three main components:
- kubejob-cli: Go-based command-line interface for job management
- kubejob-server: Flask-based REST API server for Kubernetes job orchestration
- ml-trainer: Example machine learning training container with Prometheus metrics
```
┌─────────────────┐    HTTP API    ┌─────────────────┐    K8s API    ┌─────────────────┐
│   kubejob-cli   │ ─────────────► │ kubejob-server  │ ────────────► │   Kubernetes    │
│    (Go CLI)     │                │ (Flask Python)  │               │   Jobs & Pods   │
└─────────────────┘                └─────────────────┘               └─────────────────┘
         │                                                                    │
         ▼                                                                    ▼
┌─────────────────┐                                                 ┌─────────────────┐
│  Port Forward   │                                                 │   ml-trainer    │
│ localhost:6000  │                                                 │    Container    │
└─────────────────┘                                                 └─────────────────┘
```
Prerequisites:

- Docker Desktop with Kubernetes enabled
- Go 1.19+ (for building kubejob-cli)
- Python 3.9+ (for development)
- kubectl configured for your cluster
Setup:

1. Clone and navigate to the project:

   ```bash
   git clone <repository-url>
   cd adviser
   ```

2. Build the ml-trainer image:

   ```bash
   cd ml-trainer
   docker build -t ml-trainer:latest .
   cd ..
   ```

3. Build the kubejob-server image:

   ```bash
   cd kubejob-server
   docker build -t kubejob-server:latest .
   cd ..
   ```

4. Build the kubejob CLI:

   ```bash
   cd kubejob-cli
   go build -o kubejob .
   cd ..
   ```

5. Deploy to Kubernetes:

   ```bash
   cd kubejob-server
   kubectl apply -f k8s/rbac.yaml    # Create service account and RBAC
   kubectl apply -f k8s/server.yaml  # Deploy the server
   ```

6. Set up port forwarding:

   ```bash
   kubectl port-forward svc/kubejob-server 6000:80
   ```
Basic Usage:

```bash
# Submit a training job
cd kubejob-cli
./kubejob submit --image ml-trainer:latest --args "--epochs 20 --batch-size 64"

# Check job status
./kubejob status <job-name>

# View job logs
./kubejob logs <job-name>

# Get help
./kubejob --help
```

kubejob-cli

Technology: Go with Cobra CLI framework
Purpose: User-friendly command-line interface for job management
- Submit Kubernetes Jobs with custom arguments
- Monitor job status (active/succeeded/failed)
- Retrieve job logs with error handling
- Configurable server endpoints
File Structure:
```
kubejob-cli/
├── cmd/
│   ├── submit.go   # Job submission logic
│   ├── status.go   # Job status queries
│   └── logs.go     # Log retrieval
├── go.mod
├── go.sum
└── main.go         # CLI entry point
```
Key Features:
- Default Server URL: Automatically defaults to `http://localhost:6000`
- Environment Variables: Supports the `SERVER_URL` environment variable
- Error Handling: Comprehensive HTTP error reporting
- Argument Parsing: Flexible command-line argument processing
Usage Examples:
```bash
# Basic job submission
./kubejob submit --image ml-trainer:latest --args "--epochs 10"

# Custom server URL
./kubejob submit --server http://custom-server:8080 --image ml-trainer:latest --args "--epochs 5"

# Environment variable
export SERVER_URL=http://production-server:6000
./kubejob submit --image ml-trainer:latest --args "--epochs 100"
```

kubejob-server

Technology: Python Flask with Kubernetes Python client
Purpose: REST API server for Kubernetes job orchestration
- RESTful API for job lifecycle management
- Kubernetes RBAC integration with service accounts
- Graceful error handling and status reporting
- Development mode for local testing
- Health monitoring endpoint
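As a point of reference, here is a minimal sketch of what the health endpoint could look like in Flask; the response shape mirrors the example shown in the Troubleshooting section, though the actual app.py may differ:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/health")
def health():
    # Response shape matches the /health example in Troubleshooting below;
    # the real handler would report the actually detected Kubernetes mode.
    return jsonify(status="healthy", mode="production", kubernetes_available=True)
```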
File Structure:
```
kubejob-server/
├── app.py              # Main Flask application
├── requirements.txt    # Python dependencies
├── Dockerfile          # Container image definition
└── k8s/
    ├── rbac.yaml       # Service account and RBAC configuration
    └── server.yaml     # Kubernetes deployment manifests
```
API Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/hello` | GET | Basic health check |
| `/health` | GET | Detailed system status |
| `/jobs` | POST | Submit new job |
| `/jobs/<name>` | GET | Get job status |
| `/jobs/<name>/logs` | GET | Retrieve job logs |
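The API can also be exercised without the CLI. Below is a minimal sketch using Python's requests library; the `job_name` response field is an assumption about the server's JSON shape, not confirmed here:

```python
import requests

BASE = "http://localhost:6000"  # assumes port forwarding is active

# Submit a job via POST /jobs
resp = requests.post(f"{BASE}/jobs", json={
    "image": "ml-trainer:latest",
    "args": "--epochs 10 --batch-size 32",
})
resp.raise_for_status()
job_name = resp.json()["job_name"]  # assumed field name in the response

# Query status, then fetch logs
print(requests.get(f"{BASE}/jobs/{job_name}").json())
print(requests.get(f"{BASE}/jobs/{job_name}/logs").text)
```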
Configuration Features:
- Multi-Mode Operation: Automatically detects in-cluster vs. local kubectl config (see the sketch after this list)
- Development Mode: Graceful degradation when Kubernetes is unavailable
- Image Pull Policy: Configured for local images with `imagePullPolicy: Never`
- Error Handling: Comprehensive pod status checking and error reporting
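The in-cluster vs. local detection typically follows the Kubernetes Python client's standard pattern; a minimal sketch, assuming app.py does something similar:

```python
from kubernetes import config

def detect_mode() -> str:
    """Load in-cluster credentials if available, else the local kubeconfig."""
    try:
        config.load_incluster_config()  # works only when running inside a pod
        return "production"
    except config.ConfigException:
        config.load_kube_config()       # falls back to ~/.kube/config
        return "development"
```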
Job Creation Logic:

```python
container = client.V1Container(
    name="trainer",
    image=data["image"],
    args=data["args"].split(),
    image_pull_policy="Never"  # Use local images only
)
```

Kubernetes Integration:
- Service Account: `kubejob-sa` with appropriate RBAC permissions
- Namespace: Operates in the `default` namespace
- Backoff Limit: Jobs retry up to 2 times on failure
- Restart Policy: `Never` for training job pods
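For illustration, a hedged sketch of how the container from the snippet above can be wrapped into a Job carrying these settings, using the Kubernetes Python client (the `build_job` helper and its structure are assumptions, not the server's verbatim code):

```python
from kubernetes import client

def build_job(job_name: str, container: client.V1Container) -> client.V1Job:
    """Wrap the trainer container in a Job with the settings listed above."""
    return client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name, namespace="default"),
        spec=client.V1JobSpec(
            backoff_limit=2,  # retry failed pods up to 2 times
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    containers=[container],
                    restart_policy="Never",  # training pods are never restarted in place
                )
            ),
        ),
    )

# client.BatchV1Api().create_namespaced_job(namespace="default", body=job) submits it
```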
ml-trainer

Technology: Python with scikit-learn and Prometheus metrics
Purpose: Example ML training container with monitoring capabilities
- Configurable training parameters (epochs, batch-size)
- Prometheus metrics export on port 8000
- Iris dataset classification with Logistic Regression
- Real-time training progress logging
- Proper Docker entrypoint configuration
File Structure:
```
ml-trainer/
├── train.py           # Training script with metrics
├── requirements.txt   # Python ML dependencies
└── Dockerfile         # Optimized container image
```
Training Parameters:
- `--epochs`: Number of training iterations (default: 5)
- `--batch-size`: Training batch size (default: 32)
- Prometheus metrics on port 8000
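These flags suggest a standard argparse setup; a minimal sketch consistent with the stated defaults (not necessarily the exact code in train.py):

```python
import argparse

parser = argparse.ArgumentParser(description="Train a classifier on the Iris dataset")
parser.add_argument("--epochs", type=int, default=5, help="number of training iterations")
parser.add_argument("--batch-size", type=int, default=32, help="training batch size")
args = parser.parse_args()  # use as args.epochs, args.batch_size
```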
Dockerfile Configuration:
```dockerfile
# Proper entrypoint for argument passing
ENTRYPOINT ["python", "train.py"]
CMD ["--epochs", "5", "--batch-size", "32"]
```

Metrics Exported:

- `training_loss`: Loss value per epoch
- `training_accuracy`: Accuracy score per epoch
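Gauges with these names are the usual prometheus_client pattern; a minimal sketch of how train.py could export them (the `report_epoch` helper is illustrative, not confirmed from the source):

```python
from prometheus_client import Gauge, start_http_server

# Gauge names match the metrics listed above
training_loss = Gauge("training_loss", "Loss value for the current epoch")
training_accuracy = Gauge("training_accuracy", "Accuracy score for the current epoch")

start_http_server(8000)  # serve /metrics on port 8000

def report_epoch(loss: float, accuracy: float) -> None:
    """Record the latest epoch's values on the gauges."""
    training_loss.set(loss)
    training_accuracy.set(accuracy)
```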
Usage in Jobs:
```bash
# Custom parameters
./kubejob submit --image ml-trainer:latest --args "--epochs 50 --batch-size 128"

# Default parameters
./kubejob submit --image ml-trainer:latest --args ""
```

Local Development:

1. Python Virtual Environment:

   ```bash
   cd kubejob-server
   python -m venv venv
   source venv/bin/activate    # Linux/Mac
   # or venv\Scripts\activate  # Windows
   pip install -r requirements.txt
   ```

2. Run kubejob-server locally:

   ```bash
   cd kubejob-server
   python app.py
   ```

3. Test ml-trainer locally:

   ```bash
   cd ml-trainer
   docker run --rm ml-trainer:latest --epochs 3 --batch-size 16
   ```
Building Components:

kubejob-cli:

```bash
cd kubejob-cli
go mod tidy
go build -o kubejob .
```

kubejob-server:

```bash
cd kubejob-server
docker build -t kubejob-server:latest .
```

ml-trainer:

```bash
cd ml-trainer
docker build -t ml-trainer:latest .
```

Troubleshooting:

1. ImagePullBackOff Errors
   ```bash
   # Solution: Ensure local images are built
   docker images | grep -E "(kubejob-server|ml-trainer)"

   # Rebuild if missing
   docker build -t ml-trainer:latest ./ml-trainer
   docker build -t kubejob-server:latest ./kubejob-server
   ```

2. Port Forwarding Connection Lost
   ```bash
   # Check if kubejob-server pod is running
   kubectl get pods -l app=kubejob-server

   # Restart port forwarding
   kubectl port-forward svc/kubejob-server 6000:80
   ```

3. "Connection Refused" from CLI
   ```bash
   # Verify port forwarding is active
   curl http://localhost:6000/health
   # Should return: {"kubernetes_available":true,"mode":"production","status":"healthy"}
   ```

4. Job Container Cannot Run
   ```bash
   # Check pod details
   kubectl describe pod <pod-name>

   # Common cause: Image needs proper entrypoint
   # Solution: Use ENTRYPOINT + CMD in Dockerfile
   ```

Debugging Commands:

```bash
# Check all jobs
kubectl get jobs

# Check job details
kubectl describe job <job-name>

# Check pod logs directly
kubectl logs <pod-name>

# Check kubejob-server logs
kubectl logs -l app=kubejob-server

# Test API directly
curl http://localhost:6000/health
curl -X POST http://localhost:6000/jobs -H "Content-Type: application/json" \
  -d '{"image": "busybox", "args": "echo hello"}'
```

Security Considerations:

- RBAC: kubejob-server runs with minimal required permissions via `kubejob-sa`
- Network Isolation: Uses Kubernetes service mesh for internal communication
- Image Security: Local images only (no external registry pulls)
- Resource Limits: Consider adding resource quotas for production deployments
For production use:
1. Resource Limits:

   ```yaml
   resources:
     limits:
       cpu: "1"
       memory: "1Gi"
     requests:
       cpu: "100m"
       memory: "128Mi"
   ```

2. Persistent Storage: Add volumes for training data and model outputs

3. Monitoring: Integrate with Prometheus/Grafana for metrics collection

4. Load Balancing: Use Ingress controllers instead of port forwarding

5. Security: Implement proper authentication and authorization
License: [Add your license information here]

Contributing: [Add contribution guidelines here]