diff --git a/docs/faq/index.md b/docs/faq/index.md index ab38d5502..3f3c2f2d6 100644 --- a/docs/faq/index.md +++ b/docs/faq/index.md @@ -1,11 +1,211 @@ -# FAQ +# Frequently Asked Questions (FAQ) +This page provides answers to common questions about Arena. Browse by category or search for your issue. -## **Installation** +## Installation -* [FAQ: Failed To Install Arena on Mac](./installation/failed-install-arena.md). +- [Failed to Install Arena on Mac](./installation/failed-install-arena.md) - Troubleshoot macOS installation issues +## Training Jobs -## **Training Job** +- [No matches for kind "TFJob"](./training/operator-not-deploy.md) - Operator not deployed or not found -* [FAQ: No matches for kind "TFJob" version "kubeflow.org/v1alpha2"](./training/operator-not-deploy.md) +## Serving Jobs + +### Model Deployment Issues + +**Q: My serving job is stuck in PENDING status** +A: Check cluster resources with `arena top node`. Serving jobs require sufficient CPU, memory, and GPU. Increase replicas or adjust resource requests. + +**Q: Model fails to load** +A: Verify model path exists: `kubectl exec -- ls -la `. Check pod logs: `arena serve logs ` + +## Common Issues + +### Resource Constraints + +**Q: "No resources available" error when submitting jobs** +A: Use `arena top node` to check available resources. Either: + +- Wait for other jobs to complete +- Use fewer GPUs in your submission +- Submit to a different namespace with dedicated resources + +### Kubeconfig and Authentication + +**Q: Arena cannot connect to cluster** +A: + +1. Verify kubeconfig: `kubectl cluster-info` +2. Set explicit path: `arena --config=/path/to/kubeconfig list` +3. Check KUBECONFIG env: `echo $KUBECONFIG` + +**Q: "Permission denied" when submitting jobs** +A: Your kubeconfig user doesn't have required permissions. Contact cluster admin to grant: + +- `create tfjobs`, `pytorchjobs`, etc. +- `get pods` and `logs` +- `create services` and `deployments` + +### Job Status Issues + +**Q: Job status shows FAILED immediately** +A: + +1. Check logs: `arena logs ` +2. Check pod events: `kubectl describe pod ` +3. Common causes: bad image, invalid command, missing data volumes + +**Q: Job stuck in PENDING for long time** +A: + +1. Check node status: `kubectl get nodes` +2. Check pod events: `kubectl describe pod -n ` +3. Might be waiting for: + - GPU availability + - Image pull + - Scheduler decisions + +### Data and Volume Issues + +**Q: Cannot mount data volume** +A: + +1. Verify volume exists: `kubectl get pvc` +2. Check access mode: needs ReadWriteOnce or ReadWriteMany +3. Verify data path: `kubectl get pvc -o yaml` + +**Q: "No space left on device" error** +A: Volume is full. Either: + +- Increase volume size +- Clean up old data +- Use a different volume + +### GPU Issues + +**Q: GPU not detected in job** +A: + +1. Check GPU driver: `kubectl get nodes -L nvidia.com/gpu` +2. Verify node has GPU: `kubectl describe node ` +3. Check GPU allocation: `arena top node` + +**Q: GPU out of memory** +A: + +1. Reduce batch size in training script +2. Reduce model size +3. Enable gradient checkpointing +4. Use multiple GPUs with distributed training + +## Model Management + +**Q: Cannot connect to MLflow tracking server** +A: + +1. Verify server running: `curl http://:5000/api/2.0/mlflow/version` +2. Set MLFLOW_TRACKING_URI: `export MLFLOW_TRACKING_URI=http://server:5000` +3. Check network connectivity from Arena pod + +**Q: Model not appearing in registry** +A: + +1. Verify job completed: `arena get ` +2. Check model submission flags: `--model-name` and `--model-source` +3. Check model directory permissions and contents + +## Monitoring and Logging + +**Q: Cannot see job logs** +A: + +1. Job might not have started: `arena get ` +2. Check pod status: `kubectl get pods` +3. Use `--follow` flag: `arena logs --follow` + +**Q: "arena top" shows no GPU usage** +A: + +1. Prometheus might not be running: `kubectl get pods -n arena-system` +2. GPU operator might not be installed +3. Jobs might be waiting for resources + +## CLI Issues + +**Q: "command not found: arena"** +A: + +1. Verify installation: `which arena` +2. Check PATH: `echo $PATH` +3. Reinstall: `cp /usr/local/bin/arena to PATH` + +**Q: Autocompletion not working** +A: + +1. Source completion: `source <(arena completion bash)` +2. Add to ~/.bashrc: `echo "source <(arena completion bash)" >> ~/.bashrc` +3. Reload shell: `source ~/.bashrc` + +## Performance and Optimization + +**Q: Training is very slow** +A: + +1. Check GPU utilization: `arena top job ` +2. Check network: distributed training requires good network +3. Use faster data loading: prefetch, cache, smaller batch + +**Q: High latency in serving** +A: + +1. Check pod load: `arena top job ` +2. Increase replicas: `arena serve delete` + resubmit with more replicas +3. Check model complexity and input size + +## Troubleshooting Workflow + +When something goes wrong, follow this checklist: + +1. **Check Job Status**: `arena get ` +2. **View Logs**: `arena logs --tail 100` +3. **Inspect Pod**: `kubectl describe pod ` +4. **Check Resources**: `arena top node` and `arena top job ` +5. **Verify Configuration**: `arena get ` shows full config +6. **Check Events**: `kubectl get events -n ` +7. **Review Operator Logs**: `kubectl logs -n arena-system deployment/` + +## Getting More Help + +- **Search Documentation**: Check [Training Guide](../training/index.md), [Serving Guide](../serving/index.md) +- **Check CLI Help**: `arena --help` +- **View Examples**: Check [Training Examples](../training/tfjob/standalone.md) +- **GitHub Issues**: [Arena Issues](https://github.com/kubeflow/arena/issues) +- **GitHub Discussions**: [Arena Discussions](https://github.com/kubeflow/arena/discussions) + +## Performance Tuning + +### Training Performance + +1. **Enable GPU**: Always use `--gpus` flag +2. **Increase Batch Size**: Larger batches = better GPU utilization +3. **Distributed Training**: Use multiple nodes/GPUs for large models +4. **Data Pipeline**: Optimize data loading with prefetching +5. **Mixed Precision**: Reduce memory and accelerate training + +### Inference Performance + +1. **Model Quantization**: Reduce model size with INT8/FP16 +2. **Batch Inference**: Process multiple requests together +3. **Caching**: Cache model predictions when possible +4. **Load Balancing**: Distribute traffic across replicas +5. **GPU Memory**: Monitor and optimize GPU allocation + +## Related Resources + +- [Installation Guide](../installation/index.md) +- [Training Jobs Guide](../training/index.md) +- [Model Serving Guide](../serving/index.md) +- [CLI Reference](../cli/arena.md) +- [Kubernetes Troubleshooting](https://kubernetes.io/docs/tasks/debug-application-cluster/) +- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)