KubeJob: Kubernetes Job Management System

A complete system for submitting, monitoring, and managing machine learning training jobs on Kubernetes clusters.

📋 Overview

KubeJob is a distributed system consisting of three main components:

  • kubejob-cli: Go-based command-line interface for job management
  • kubejob-server: Flask-based REST API server for Kubernetes job orchestration
  • ml-trainer: Example machine learning training container with Prometheus metrics

🏗️ Architecture

┌─────────────────┐    HTTP API     ┌─────────────────┐    K8s API    ┌─────────────────┐
│   kubejob-cli   │ ──────────────► │ kubejob-server  │ ─────────────► │   Kubernetes    │
│   (Go CLI)      │                 │ (Flask Python)  │                │   Jobs & Pods   │
└─────────────────┘                 └─────────────────┘                └─────────────────┘
                                            │                                    │
                                            │                                    │
                                            ▼                                    ▼
                                    ┌─────────────────┐                ┌─────────────────┐
                                    │  Port Forward   │                │   ml-trainer    │
                                    │  localhost:6000 │                │   Container     │
                                    └─────────────────┘                └─────────────────┘

🚀 Quick Start

Prerequisites

  • Docker Desktop with Kubernetes enabled
  • Go 1.19+ (for building kubejob-cli)
  • Python 3.9+ (for development)
  • kubectl configured for your cluster

Setup

  1. Clone and navigate to the project:

    git clone <repository-url>
    cd adviser
  2. Build the ml-trainer image:

    cd ml-trainer
    docker build -t ml-trainer:latest .
    cd ..
  3. Build the kubejob-server image:

    cd kubejob-server
    docker build -t kubejob-server:latest .
    cd ..
  4. Build the kubejob CLI:

    cd kubejob-cli
    go build -o kubejob .
    cd ..
  5. Deploy to Kubernetes:

    cd kubejob-server
    kubectl apply -f k8s/rbac.yaml    # Create service account and RBAC
    kubectl apply -f k8s/server.yaml  # Deploy the server
  6. Set up port forwarding:

    kubectl port-forward svc/kubejob-server 6000:80

Usage

# Submit a training job
cd kubejob-cli
./kubejob submit --image ml-trainer:latest --args "--epochs 20 --batch-size 64"

# Check job status  
./kubejob status <job-name>

# View job logs
./kubejob logs <job-name>

# Get help
./kubejob --help

📦 Components

kubejob-cli

Technology: Go with Cobra CLI framework
Purpose: User-friendly command-line interface for job management

Features

  • Submit Kubernetes Jobs with custom arguments
  • Monitor job status (active/succeeded/failed)
  • Retrieve job logs with error handling
  • Configurable server endpoints

Implementation Details

File Structure:

kubejob-cli/
├── cmd/
│   ├── submit.go    # Job submission logic
│   ├── status.go    # Job status queries
│   └── logs.go      # Log retrieval
├── go.mod
├── go.sum
└── main.go         # CLI entry point

Key Features:

  • Default Server URL: Defaults to http://localhost:6000 when no --server flag or SERVER_URL is set
  • Environment Variables: Supports SERVER_URL environment variable
  • Error Handling: Comprehensive HTTP error reporting
  • Argument Parsing: Flexible command-line argument processing

Usage Examples:

# Basic job submission
./kubejob submit --image ml-trainer:latest --args "--epochs 10"

# Custom server URL
./kubejob submit --server http://custom-server:8080 --image ml-trainer:latest --args "--epochs 5"

# Environment variable
export SERVER_URL=http://production-server:6000
./kubejob submit --image ml-trainer:latest --args "--epochs 100"

kubejob-server

Technology: Python Flask with Kubernetes Python client
Purpose: REST API server for Kubernetes job orchestration

Features

  • RESTful API for job lifecycle management
  • Kubernetes RBAC integration with service accounts
  • Graceful error handling and status reporting
  • Development mode for local testing
  • Health monitoring endpoint

Implementation Details

File Structure:

kubejob-server/
├── app.py              # Main Flask application
├── requirements.txt    # Python dependencies
├── Dockerfile         # Container image definition
└── k8s/
    ├── rbac.yaml      # Service account and RBAC configuration
    └── server.yaml    # Kubernetes deployment manifests

API Endpoints:

Endpoint            Method   Description
/hello              GET      Basic health check
/health             GET      Detailed system status
/jobs               POST     Submit new job
/jobs/<name>        GET      Get job status
/jobs/<name>/logs   GET      Retrieve job logs
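
The endpoints can also be exercised directly, without the CLI. The snippet below is a minimal sketch using Python's requests library; it assumes the default port-forward on localhost:6000 and the same JSON payload shape as the curl example under Troubleshooting, while the job name and exact response formats depend on app.py.

import requests

SERVER = "http://localhost:6000"  # default port-forward address

# Submit a new job (POST /jobs); payload shape matches the curl example
resp = requests.post(
    f"{SERVER}/jobs",
    json={"image": "ml-trainer:latest", "args": "--epochs 10 --batch-size 32"},
)
print(resp.status_code, resp.text)

# Check status and fetch logs for a job (placeholder name shown here)
job_name = "example-job"
print(requests.get(f"{SERVER}/jobs/{job_name}").text)
print(requests.get(f"{SERVER}/jobs/{job_name}/logs").text)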

Configuration Features:

  • Multi-Mode Operation: Automatically detects in-cluster vs local kubectl config (see the sketch below)
  • Development Mode: Graceful degradation when Kubernetes is unavailable
  • Image Pull Policy: Configured for local images with imagePullPolicy: Never
  • Error Handling: Comprehensive pod status checking and error reporting
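
The multi-mode detection mentioned above typically follows the standard Kubernetes Python client fallback pattern. The sketch below is an assumption about how app.py selects its configuration, not a copy of it:

from kubernetes import config

def kubernetes_available() -> bool:
    """Try in-cluster config first, then fall back to local kubectl config."""
    try:
        config.load_incluster_config()   # running inside the cluster
        return True
    except config.ConfigException:
        try:
            config.load_kube_config()    # local development against kubectl's config
            return True
        except config.ConfigException:
            return False                 # development mode: no Kubernetes available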

Job Creation Logic:

container = client.V1Container(
    name="trainer", 
    image=data["image"], 
    args=data["args"].split(),
    image_pull_policy="Never"  # Use local images only
)

Kubernetes Integration:

  • Service Account: kubejob-sa with appropriate RBAC permissions
  • Namespace: Operates in default namespace
  • Backoff Limit: Jobs retry up to 2 times on failure
  • Restart Policy: Never for training job pods
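
Putting these settings together with the container definition above, the Job object is presumably assembled along these lines (a sketch with job_name as a hypothetical variable, not the literal app.py code):

job = client.V1Job(
    metadata=client.V1ObjectMeta(name=job_name),
    spec=client.V1JobSpec(
        backoff_limit=2,                    # retry up to 2 times on failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                containers=[container],     # the V1Container shown above
                restart_policy="Never",     # training pods are never restarted
            )
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)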

ml-trainer

Technology: Python with scikit-learn and Prometheus metrics
Purpose: Example ML training container with monitoring capabilities

Features

  • Configurable training parameters (epochs, batch-size)
  • Prometheus metrics export on port 8000
  • Iris dataset classification with Logistic Regression
  • Real-time training progress logging
  • Proper Docker entrypoint configuration

Implementation Details

File Structure:

ml-trainer/
├── train.py           # Training script with metrics
├── requirements.txt   # Python ML dependencies  
└── Dockerfile        # Optimized container image

Training Parameters:

  • --epochs: Number of training iterations (default: 5)
  • --batch-size: Training batch size (default: 32)
  • Prometheus metrics on port 8000
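
A plausible argument-parsing setup in train.py for these parameters (an illustrative sketch, not the verbatim script):

import argparse

parser = argparse.ArgumentParser(description="Train a classifier on the Iris dataset")
parser.add_argument("--epochs", type=int, default=5, help="number of training iterations")
parser.add_argument("--batch-size", type=int, default=32, help="training batch size")
args = parser.parse_args()  # exposes args.epochs and args.batch_size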

Dockerfile Configuration:

# Proper entrypoint for argument passing
ENTRYPOINT ["python", "train.py"]
CMD ["--epochs", "5", "--batch-size", "32"]

Metrics Exported:

  • training_loss: Loss value per epoch
  • training_accuracy: Accuracy score per epoch
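
These gauges map directly onto the prometheus_client library. The loop below illustrates the pattern, with run_epoch() as a hypothetical helper standing in for the actual scikit-learn training step:

from prometheus_client import Gauge, start_http_server

training_loss = Gauge("training_loss", "Loss value per epoch")
training_accuracy = Gauge("training_accuracy", "Accuracy score per epoch")

start_http_server(8000)               # expose metrics on port 8000

for epoch in range(args.epochs):      # args from the parser sketched earlier
    loss, accuracy = run_epoch()      # hypothetical helper returning epoch results
    training_loss.set(loss)
    training_accuracy.set(accuracy)
    print(f"epoch {epoch + 1}: loss={loss:.4f} accuracy={accuracy:.4f}")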

Usage in Jobs:

# Custom parameters
./kubejob submit --image ml-trainer:latest --args "--epochs 50 --batch-size 128"

# Default parameters  
./kubejob submit --image ml-trainer:latest --args ""

🔧 Development

Local Development Setup

  1. Python Virtual Environment:

    cd kubejob-server
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # or venv\Scripts\activate  # Windows
    pip install -r requirements.txt
  2. Run kubejob-server locally:

    cd kubejob-server
    python app.py
  3. Test ml-trainer locally:

    cd ml-trainer
    docker run --rm ml-trainer:latest --epochs 3 --batch-size 16

Building Components

kubejob-cli:

cd kubejob-cli
go mod tidy
go build -o kubejob .

kubejob-server:

cd kubejob-server
docker build -t kubejob-server:latest .

ml-trainer:

cd ml-trainer  
docker build -t ml-trainer:latest .

🐛 Troubleshooting

Common Issues

1. ImagePullBackOff Errors

# Solution: Ensure local images are built
docker images | grep -E "(kubejob-server|ml-trainer)"

# Rebuild if missing
docker build -t ml-trainer:latest ./ml-trainer
docker build -t kubejob-server:latest ./kubejob-server

2. Port Forwarding Connection Lost

# Check if kubejob-server pod is running
kubectl get pods -l app=kubejob-server

# Restart port forwarding
kubectl port-forward svc/kubejob-server 6000:80

3. "Connection Refused" from CLI

# Verify port forwarding is active
curl http://localhost:6000/health

# Should return: {"kubernetes_available":true,"mode":"production","status":"healthy"}

4. Job Container Cannot Run

# Check pod details
kubectl describe pod <pod-name>

# Common cause: Image needs proper entrypoint
# Solution: Use ENTRYPOINT + CMD in Dockerfile

Debugging Commands

# Check all jobs
kubectl get jobs

# Check job details  
kubectl describe job <job-name>

# Check pod logs directly
kubectl logs <pod-name>

# Check kubejob-server logs
kubectl logs -l app=kubejob-server

# Test API directly
curl http://localhost:6000/health
curl -X POST http://localhost:6000/jobs -H "Content-Type: application/json" \
  -d '{"image": "busybox", "args": "echo hello"}'

🔒 Security Considerations

  • RBAC: kubejob-server runs with minimal required permissions via kubejob-sa
  • Network Isolation: Internal traffic goes through a Kubernetes Service; external access is limited to kubectl port-forward
  • Image Security: Local images only (no external registry pulls)
  • Resource Limits: Consider adding resource quotas for production deployments

📈 Production Deployment

For production use:

  1. Resource Limits:

    resources:
      limits:
        cpu: "1"
        memory: "1Gi"
      requests:
        cpu: "100m" 
        memory: "128Mi"
  2. Persistent Storage: Add volumes for training data and model outputs

  3. Monitoring: Integrate with Prometheus/Grafana for metrics collection

  4. Load Balancing: Use Ingress controllers instead of port forwarding

  5. Security: Implement proper authentication and authorization

📝 License

[Add your license information here]

🤝 Contributing

[Add contribution guidelines here]
