Description
What happened:
Hi there, thanks a lot for your work; it's impressive. I was trying to deploy it on a local Minikube and on a local DIND cluster, but neither worked properly. I've been stuck on this issue for a few days, so I'd like to ask you for help. By chance I found something similar to my issue in your docs, but under a different condition. Specifically:
- my local minikube ran into the issue that was recorded for DIND training. All pods worked as expected:
alertmanager-7bd87d99cc-jhp2b 1/1 Running 0 6h
etcd0 1/1 Running 0 6h
ffdl-lcm-8d555c7bf-dqqhg 1/1 Running 0 6h
ffdl-restapi-7f5c57c77d-k67pm 1/1 Running 0 6h
ffdl-trainer-6777dd5756-xkk65 1/1 Running 0 6h
ffdl-trainingdata-696b99ff5c-tvbtc 1/1 Running 0 6h
ffdl-ui-95d6464c7-bv2sn 1/1 Running 0 6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz 1/1 Running 0 1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0 0/1 ContainerCreating 0 1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc 2/2 Running 0 1h
mongo-0 1/1 Running 4 6h
prometheus-67fb854b59-c884p 2/2 Running 0 6h
pushgateway-5665768d5c-jdlnl 2/2 Running 0 6h
storage-0 1/1 Running 0 6h
except for the learner pod, which was stuck pending forever because of the following warning:
Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]
and here are the details of the learner pod:
Name: learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: minikube/10.0.2.15
Start Time: Thu, 06 Dec 2018 17:05:52 +0100
Labels: controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
service=dlaas-learner
statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
training_id=training-bFEXXGPmR
user_id=test-user
Annotations: scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status: Pending
IP:
Controlled By: StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
learner:
Container ID:
Image: tensorflow/tensorflow:1.5.0-py3
Image ID:
Ports: 22/TCP, 2222/TCP
Host Ports: 0/TCP, 0/TCP
Command:
bash
-c
export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
if [ ! -f /job/load-model.exit ]; then
while [ ! -f /job/load-model.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;
echo "Starting Training $TRAINING_ID"
mkdir -p "$MODEL_DIR" ;
python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR ;
echo $? > /job/load-model.exit ;
fi
echo "Done load-model" ;
if [ ! -f /job/learner.exit ]; then
while [ ! -f /job/learner.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/learner.start_time ;
for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
echo $? > /job/learner.exit ;
fi
echo "Done learner" ;
if [ ! -f /job/store-logs.exit ]; then
while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;
echo Calling copy logs.
mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
ERROR_CODE=$? ;
echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
bash -c 'exit $ERROR_CODE' ;
echo $? > /job/store-logs.exit ;
fi
echo "Done store-logs" ;
while true; do sleep 2; done ;
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Requests:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Environment:
LOG_DIR: /job/logs
GPU_COUNT: 0.000000
TRAINING_COMMAND: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 2000
TRAINING_ID: training-bFEXXGPmR
DATA_DIR: /mnt/data/tf_training_data
MODEL_DIR: /job/model-code
RESULT_DIR: /mnt/results/tf_trained_model/training-bFEXXGPmR
DOWNWARD_API_POD_NAME: learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
DOWNWARD_API_POD_NAMESPACE: default (v1:metadata.namespace)
LEARNER_NAME_PREFIX: learner-0d296791-2adc-4336-4f01-b280090460c3
TRAINING_ID: training-bFEXXGPmR
NUM_LEARNERS: 1
JOB_STATE_DIR: /job
CHECKPOINT_DIR: /mnt/results/tf_trained_model/_wml_checkpoints
RESULT_BUCKET_DIR: /mnt/results/tf_trained_model
Mounts:
/entrypoint-files from learner-entrypoint-files (rw)
/job from jobdata (rw)
/mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
/mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
learner-entrypoint-files:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: learner-entrypoint-files
Optional: false
jobdata:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: dedicated=gpu-task:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 1m (x40 over 1h) kubelet, minikube Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]
- my local DIND ran into a FAILED error with no useful hints while training. All the pods were running, but there were no jobmonitor, learner, or lhelper pods.
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK
What you expected to happen:
FfDL should work properly on either local DIND or Minikube.
Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
How to reproduce it (as minimally and precisely as possible):
I was just following README.md and ran these make commands:
make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit
Anything else we need to know?:
In situation 2, I followed exactly the steps above.
In situation 1, an NFS error was reported at first, and I remembered reading in one of the Minikube docs that, for persistent volumes, Minikube only supports the hostPath type, so I created a PV and a PVC manually. Here are the details:
$ kubectl describe pv hostpathtest
Name: hostpathtest
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass:
Status: Bound
Claim: default/static-volume-1
Reclaim Policy: Retain
Access Modes: RWO
Capacity: 20Gi
Node Affinity: <none>
Message:
Source:
Type: HostPath (bare host directory volume)
Path: /data/hostpath_test
HostPathType:
Events: <none>
$ kubectl describe pvc learner-1
Name: learner-1
Namespace: default
StorageClass:
Status: Bound
Volume: hostpathtest-learner
Labels: type=dlaas-static-volume
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
volume.beta.kubernetes.io/storage-class=
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
Events: <none>
Thanks in advance for any advice, and have a good day!
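For reference, the manifests I applied looked roughly like the sketch below. This is reconstructed from the `kubectl describe` output above: the PV name, hostPath path, capacity, access mode, and the PVC's label and empty storage-class annotation match my setup, while the remaining fields are my best-guess defaults.

```yaml
# Hypothetical reconstruction of the hostPath PV and the matching PVC.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpathtest
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/hostpath_test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: learner-1
  labels:
    type: dlaas-static-volume
  annotations:
    # empty storage class so the claim binds to the statically created PV
    volume.beta.kubernetes.io/storage-class: ""
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```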