Unable to mount volumes for pod Learner #152

@JunFugithub

Description

What happened:
Hi there, and thanks a lot for your work; it's impressive. I tried to deploy FfDL on both a local minikube and a local DIND cluster, but neither worked properly. I have been stuck on this for a few days, so I'd like to ask for your help. I found something similar to my problem in your docs, but under different conditions:

  1. My local minikube hit the issue that your docs record under DIND-TRAING. All pods ran as expected:
alertmanager-7bd87d99cc-jhp2b                                     1/1       Running             0          6h
etcd0                                                             1/1       Running             0          6h
ffdl-lcm-8d555c7bf-dqqhg                                          1/1       Running             0          6h
ffdl-restapi-7f5c57c77d-k67pm                                     1/1       Running             0          6h
ffdl-trainer-6777dd5756-xkk65                                     1/1       Running             0          6h
ffdl-trainingdata-696b99ff5c-tvbtc                                1/1       Running             0          6h
ffdl-ui-95d6464c7-bv2sn                                           1/1       Running             0          6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz   1/1       Running             0          1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0                    0/1       ContainerCreating   0          1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc      2/2       Running             0          1h
mongo-0                                                           1/1       Running             4          6h
prometheus-67fb854b59-c884p                                       2/2       Running             0          6h
pushgateway-5665768d5c-jdlnl                                      2/2       Running             0          6h
storage-0                                                         1/1       Running             0          6h

except the learner pod, which stays in Pending (ContainerCreating) indefinitely because of the following warning:

Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]
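
To pull just the volume names out of a warning like the one above, a small helper along these lines may be useful. This is only a sketch: `unmounted_volumes` is a hypothetical name, and it assumes the message keeps the `list of unmounted volumes=[...]` wording shown here.

```shell
# Hypothetical helper: read kubelet FailedMount text on stdin and print
# the names from "list of unmounted volumes=[...]", one per line.
# Lines that do not contain that phrase produce no output.
unmounted_volumes() {
  sed -n 's/.*list of unmounted volumes=\[\([^]]*\)].*/\1/p' | tr ' ' '\n'
}
```

It can be fed from the pod description, for example `kubectl describe pod <learner-pod-name> | unmounted_volumes`.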

Here are the details of the learner pod:

Name:               learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               minikube/10.0.2.15
Start Time:         Thu, 06 Dec 2018 17:05:52 +0100
Labels:             controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
                    service=dlaas-learner
                    statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
                    training_id=training-bFEXXGPmR
                    user_id=test-user
Annotations:        scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
                    scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status:             Pending
IP:
Controlled By:      StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
  learner:
    Container ID:
    Image:         tensorflow/tensorflow:1.5.0-py3
    Image ID:
    Ports:         22/TCP, 2222/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      bash
      -c
      export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
                        if [ ! -f /job/load-model.exit ]; then
                          while [ ! -f /job/load-model.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;

                        echo "Starting Training $TRAINING_ID"
                        mkdir -p "$MODEL_DIR" ;
                        python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR  ;
                          echo $? > /job/load-model.exit ;
                        fi
                        echo "Done load-model" ;
                        if [ ! -f /job/learner.exit ]; then
                          while [ ! -f /job/learner.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/learner.start_time ;

                        for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
                        export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
                        mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
                        mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
                          echo $? > /job/learner.exit ;
                        fi
                        echo "Done learner" ;
                        if [ ! -f /job/store-logs.exit ]; then
                          while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
                          date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;

                        echo Calling copy logs.
                        mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
                        ERROR_CODE=$? ;
                        echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
                        bash -c 'exit $ERROR_CODE' ;
                          echo $? > /job/store-logs.exit ;
                        fi
                        echo "Done store-logs" ;
                      while true; do sleep 2; done ;
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Requests:
      cpu:             500m
      memory:          1048576k
      nvidia.com/gpu:  0
    Environment:
      LOG_DIR:                     /job/logs
      GPU_COUNT:                   0.000000
      TRAINING_COMMAND:            python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz   --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz   --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001   --trainingIters 2000
      TRAINING_ID:                 training-bFEXXGPmR
      DATA_DIR:                    /mnt/data/tf_training_data
      MODEL_DIR:                   /job/model-code
      RESULT_DIR:                  /mnt/results/tf_trained_model/training-bFEXXGPmR
      DOWNWARD_API_POD_NAME:       learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
      DOWNWARD_API_POD_NAMESPACE:  default (v1:metadata.namespace)
      LEARNER_NAME_PREFIX:         learner-0d296791-2adc-4336-4f01-b280090460c3
      TRAINING_ID:                 training-bFEXXGPmR
      NUM_LEARNERS:                1
      JOB_STATE_DIR:               /job
      CHECKPOINT_DIR:              /mnt/results/tf_trained_model/_wml_checkpoints
      RESULT_BUCKET_DIR:           /mnt/results/tf_trained_model
    Mounts:
      /entrypoint-files from learner-entrypoint-files (rw)
      /job from jobdata (rw)
      /mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
      /mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
  cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
    Type:       FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
    Driver:     ibm/ibmc-s3fs
    FSType:
    SecretRef:  &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
    ReadOnly:   false
    Options:    map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
  learner-entrypoint-files:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      learner-entrypoint-files
    Optional:  false
  jobdata:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     dedicated=gpu-task:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age               From               Message
  ----     ------       ----              ----               -------
  Warning  FailedMount  1m (x40 over 1h)  kubelet, minikube  Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]
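
Both unmounted volumes are FlexVolumes that use the `ibm/ibmc-s3fs` driver, so one thing worth checking is whether that driver's executable exists on the node. The sketch below builds the path where kubelet would look for it, following the FlexVolume layout `<plugin-dir>/<vendor>~<driver>/<driver>`. The function name `flexvolume_driver_path` is hypothetical, and the default plugin directory used here is an assumption: kubelet can be started with a different one, and minikube may do so.

```shell
# Hypothetical helper: print the path where kubelet expects a FlexVolume
# driver executable, given a driver name of the form "vendor/driver".
# The second argument overrides the plugin directory; the default below
# is the usual kubelet default and is an assumption for minikube.
flexvolume_driver_path() {
  driver=$1
  plugin_dir=${2:-/usr/libexec/kubernetes/kubelet-plugins/volume/exec}
  vendor=${driver%%/*}   # text before the first "/"
  name=${driver#*/}      # text after the first "/"
  printf '%s/%s~%s/%s\n' "$plugin_dir" "$vendor" "$name" "$name"
}
```

On minikube the result could then be checked with something like `minikube ssh "ls -l $(flexvolume_driver_path ibm/ibmc-s3fs)"`; a missing file would suggest that the plugin deployment did not reach the node.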
  2. My local DIND failed during training with an uninformative FAILED error. All the pods were running, but there were no jobmonitor, learner, or lhelper pods.
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK

What you expected to happen:
FfDL should work properly on either local DIND or minikube.

Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

How to reproduce it (as minimally and precisely as possible):

I just followed README.md and ran these make targets:

make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit
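
To make it obvious which step breaks first, the targets above can be run in order with a helper that stops at the first failure. This is a sketch only: `run_make_targets` and the `MAKE` override are hypothetical conveniences and not part of the FfDL Makefile.

```shell
# Hypothetical helper: run each make target in order and stop at the
# first one that fails, so the failing step is easy to identify.
# Set MAKE to use a different make binary (defaults to "make").
run_make_targets() {
  for target in "$@"; do
    echo "==> make $target"
    if ! "${MAKE:-make}" "$target"; then
      echo "make $target failed; stopping" >&2
      return 1
    fi
  done
}
```

For the sequence in this report, the call would be `run_make_targets deploy-plugin quickstart-deploy test-push-data-s3 test-job-submit`.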

Anything else we need to know?:

In situation 2, I followed the steps above exactly.
In situation 1, the deployment first reported an NFS error. I remembered reading in one of the docs that minikube supports only hostPath-type persistent volumes, so I created a PV and a PVC. Here are the details:

$ kubectl describe pv hostpathtest
Name:            hostpathtest
Labels:          <none>
Annotations:     pv.kubernetes.io/bound-by-controller=yes
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:
Status:          Bound
Claim:           default/static-volume-1
Reclaim Policy:  Retain
Access Modes:    RWO
Capacity:        20Gi
Node Affinity:   <none>
Message:
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /data/hostpath_test
    HostPathType:
Events:            <none>
$ kubectl describe pvc learner-1
Name:          learner-1
Namespace:     default
StorageClass:
Status:        Bound
Volume:        hostpathtest-learner
Labels:        type=dlaas-static-volume
Annotations:   kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
               pv.kubernetes.io/bind-completed=yes
               pv.kubernetes.io/bound-by-controller=yes
               volume.beta.kubernetes.io/storage-class=
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
Events:        <none>

Thanks in advance for any advice, and have a good day.
