Description
What happened:
Hi there, thanks a lot for your work; it's impressive. I was trying to deploy it on a local Minikube and on a local DIND cluster, but neither worked properly. I've been stuck on this issue for a few days, so I'd like to ask you for help. By chance I found something similar to my issue in your docs, but under a different condition. Specifically:
- my local minikube ran into the issue that was recorded for DIND training. All pods worked as expected:
alertmanager-7bd87d99cc-jhp2b 1/1 Running 0 6h
etcd0 1/1 Running 0 6h
ffdl-lcm-8d555c7bf-dqqhg 1/1 Running 0 6h
ffdl-restapi-7f5c57c77d-k67pm 1/1 Running 0 6h
ffdl-trainer-6777dd5756-xkk65 1/1 Running 0 6h
ffdl-trainingdata-696b99ff5c-tvbtc 1/1 Running 0 6h
ffdl-ui-95d6464c7-bv2sn 1/1 Running 0 6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz 1/1 Running 0 1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0 0/1 ContainerCreating 0 1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc 2/2 Running 0 1h
mongo-0 1/1 Running 4 6h
prometheus-67fb854b59-c884p 2/2 Running 0 6h
pushgateway-5665768d5c-jdlnl 2/2 Running 0 6h
storage-0 1/1 Running 0 6h
except for the learner pod, which was stuck pending forever because of the following warning:
Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]
and here are the details of the learner pod:
Name: learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: minikube/10.0.2.15
Start Time: Thu, 06 Dec 2018 17:05:52 +0100
Labels: controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
service=dlaas-learner
statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
training_id=training-bFEXXGPmR
user_id=test-user
Annotations: scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status: Pending
IP:
Controlled By: StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
learner:
Container ID:
Image: tensorflow/tensorflow:1.5.0-py3
Image ID:
Ports: 22/TCP, 2222/TCP
Host Ports: 0/TCP, 0/TCP
Command:
bash
-c
export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
if [ ! -f /job/load-model.exit ]; then
while [ ! -f /job/load-model.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;
echo "Starting Training $TRAINING_ID"
mkdir -p "$MODEL_DIR" ;
python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR ;
echo $? > /job/load-model.exit ;
fi
echo "Done load-model" ;
if [ ! -f /job/learner.exit ]; then
while [ ! -f /job/learner.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/learner.start_time ;
for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
echo $? > /job/learner.exit ;
fi
echo "Done learner" ;
if [ ! -f /job/store-logs.exit ]; then
while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;
echo Calling copy logs.
mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
ERROR_CODE=$? ;
echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
bash -c 'exit $ERROR_CODE' ;
echo $? > /job/store-logs.exit ;
fi
echo "Done store-logs" ;
while true; do sleep 2; done ;
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Requests:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Environment:
LOG_DIR: /job/logs
GPU_COUNT: 0.000000
TRAINING_COMMAND: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 2000
TRAINING_ID: training-bFEXXGPmR
DATA_DIR: /mnt/data/tf_training_data
MODEL_DIR: /job/model-code
RESULT_DIR: /mnt/results/tf_trained_model/training-bFEXXGPmR
DOWNWARD_API_POD_NAME: learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
DOWNWARD_API_POD_NAMESPACE: default (v1:metadata.namespace)
LEARNER_NAME_PREFIX: learner-0d296791-2adc-4336-4f01-b280090460c3
TRAINING_ID: training-bFEXXGPmR
NUM_LEARNERS: 1
JOB_STATE_DIR: /job
CHECKPOINT_DIR: /mnt/results/tf_trained_model/_wml_checkpoints
RESULT_BUCKET_DIR: /mnt/results/tf_trained_model
Mounts:
/entrypoint-files from learner-entrypoint-files (rw)
/job from jobdata (rw)
/mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
/mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
learner-entrypoint-files:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: learner-entrypoint-files
Optional: false
jobdata:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: dedicated=gpu-task:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 1m (x40 over 1h) kubelet, minikube Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]
- my local DIND ran into a FAILED error with no useful hints while training. All the pods were running, but there were no jobmonitor, learner, or lhelper pods.
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK
What you expected to happen:
FfDL should work properly on either local DIND or Minikube.
Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
How to reproduce it (as minimally and precisely as possible):
I was just following README.md and ran these make commands:
make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit
Anything else we need to know?:
In situation 2, I followed exactly the steps above.
In situation 1, an NFS error was reported at first, and I remembered reading in one of the Minikube docs that, for persistent volumes, Minikube only supports the hostPath type, so I created a PV and a PVC manually. Here are the details:
$ kubectl describe pv hostpathtest
Name: hostpathtest
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass:
Status: Bound
Claim: default/static-volume-1
Reclaim Policy: Retain
Access Modes: RWO
Capacity: 20Gi
Node Affinity: <none>
Message:
Source:
Type: HostPath (bare host directory volume)
Path: /data/hostpath_test
HostPathType:
Events: <none>
$ kubectl describe pvc learner-1
Name: learner-1
Namespace: default
StorageClass:
Status: Bound
Volume: hostpathtest-learner
Labels: type=dlaas-static-volume
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
volume.beta.kubernetes.io/storage-class=
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
Events: <none>
Thanks in advance for any advice, and have a good day!
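For reference, the manifests I applied looked roughly like the sketch below. This is reconstructed from the `kubectl describe` output above: the PV name, hostPath path, capacity, access mode, and the PVC's label and empty storage-class annotation match my setup, while the remaining fields are my best-guess defaults.

```yaml
# Hypothetical reconstruction of the hostPath PV and the matching PVC.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpathtest
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/hostpath_test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: learner-1
  labels:
    type: dlaas-static-volume
  annotations:
    # empty storage class so the claim binds to the statically created PV
    volume.beta.kubernetes.io/storage-class: ""
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```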