Common commands
Commands to install docker/nvidia-docker on Ubuntu (e.g. Ubuntu 16.04 Xenial on AWS)
# Pre-requisites:
1. Install docker first by doing:
sudo su
curl -fsSL https://get.docker.com/ | sh
Add the 'ubuntu' user to the docker group (so docker can be run without sudo):
sudo usermod -aG docker $USER
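A quick sanity check of the install (two standard docker commands; log out and back in first so the group change takes effect):
docker --version
docker run --rm hello-world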
# Then do the following
2. Then cd to the directory where the Dockerfile is kept
3. Run:
docker build -t custom-tf-docker-image-inference .
4. Install nvidia-docker and nvidia-docker-plugin
apt install -y nvidia-cuda-toolkit
apt install -y nvidia-modprobe
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
service nvidia-docker status
service nvidia-docker start
nvidia-smi
# the following command may not be needed
# apt-get install linux-headers-$(uname -r)
5. Test nvidia-smi   # this command gave a driver mismatch error which I could not solve at the time
nvidia-docker run --rm nvidia/cuda nvidia-smi
6. reboot
7. nvidia-docker run -p 8888:8888 -p 6006:6006 -p 8090:8090 -p 9007:9007 \
--name custy-agshift-gpu-inference-new \
-it -v /home/ubuntu/tf_files:/tf_files custom-tf-docker-image-inference
8. cd /
9. nohup ./run_jupyter.sh --allow-root > tf_files/nohup.out 2>&1 < /dev/null &
# To stop a jupyter server
jupyter notebook stop <port_number like 8888 or 8889>
Ref: docker command line:
https://docs.docker.com/engine/reference/commandline/run/#capture-container-id-cidfile
https://github.com/NVIDIA/nvidia-docker
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
10. To get into a docker container without corrupting the display
docker exec -it f27eed929330 bash -c "export COLUMNS=`tput cols`; export LINES=`tput lines`; exec bash"
Commands on docker images
References:
https://www.digitalocean.com/community/tutorials/how-to-remove-docker-images-containers-and-volumes
# To see all images
docker images -a
# To check all docker processes
docker ps -a
# To find all dangling docker images
docker images -f dangling=true
# To remove dangling docker images
docker rmi $(docker images -f dangling=true -q)
# To remove all images
docker rmi $(docker images -a -q)
# Remove processes followed by images by force
docker rm -f $(docker ps -a -q)
docker rmi -f $(docker images -q)
# To remove a particular image - first remove any dangling images
docker rmi custom-tf-docker-image
# To build a docker image (custom)
1. First create a file 'Dockerfile'
2. Next build the image
docker build -t custom-tf-docker-image .
3. Next run it (note: without the -it option the container will exit immediately)
docker run --name custy-tf -it custom-tf-docker-image
# To run bash
docker exec -it docker_id bash -c "export COLUMNS=`tput cols`; export LINES=`tput lines`; exec bash"
Difference between EXPOSE and PUBLISH (ports)
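In short: EXPOSE in a Dockerfile only documents which ports the container listens on; it does not make them reachable from the host. Publishing with -p/--publish at run time is what actually binds a container port to a host port. A minimal illustration (my-image is just a placeholder name):
# In the Dockerfile: documentation/metadata only
EXPOSE 8888
# At run time: actually map container port 8888 to host port 8888
docker run -p 8888:8888 my-image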
Commands on docker containers
# To list all exited containers
docker ps -a -f status=exited
# To remove all exited containers
docker rm $(docker ps -a -f status=exited -q)
# To remove a container by its container id
docker ps -a | grep "<container_id>" | awk '{print $1}' | xargs docker rm
eg: docker ps -a | grep "2adc7cfe583d" | awk '{print $1}' | xargs docker rm
# Stop and remove all containers
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
# Sequence of commands to do a complete cleanup of all docker images
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
docker rmi $(docker images -f dangling=true -q)
docker rmi $(docker images -a -q)
# To get a new bash shell on an existing docker container (avoid clash with co-user)
docker exec -it f5e8 bash
# To copy a docker image from one machine to another
1. first tar the image:
sudo docker save -o <image.tar> <image name>
Then copy the tar file to the new system with a regular file-transfer tool such as scp.
After that, load the image into docker on the new machine:
2. sudo docker load -i <path to image tar file>
(sudo may be needed for both the save and the load)
In one shot this can be done as:
sudo docker save <image> | bzip2 | pv | \
ssh user@host 'bunzip2 | docker load'
Communicating between docker containers
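The usual approach (a brief sketch added here; <server-image> and <client-image> are placeholders) is a user-defined bridge network: containers attached to the same network can reach each other by container name. See the DOCKER NETWORKING section at the end for a concrete client/server example.
docker network create my-net
docker run -d --name server --network my-net <server-image>
docker run -it --name client --network my-net <client-image>
# inside 'client', the hostname 'server' now resolves to the server container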
Command to launch docker using nvidia-docker wrapper
nvidia-docker run -p 8888:8888 -p 6006:6006 --name custy-gpu-new \
-it -v /home/ubuntu/tf_files:/tf_files custom-docker-image-gpu-new
If GPU stops working for a docker container
=========================================
Option A: Play a trick
=========================================
Check this command:
docker volume ls
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_375.66
nvidia-docker nvidia_driver_384.90
Here our container may still be using the volume nvidia_driver_375.66 while the host system silently
upgraded the nvidia driver to 384.90 (nvidia_driver_384.90).
So, if we try to execute GPU code, or even nvidia-smi, inside the container we will get the following error:
> nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Trick played to solve this:
============================
Go to the host and do:
sudo su
docker inspect <container_id (fbd211fc1edb)>
=> Go to the Mounts section and see which nvidia-driver you are using
In our case it was:
"Source": "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66",
"Destination": "/usr/local/nvidia"
This means the source (host) location "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66"
is mounted in the docker container as "/usr/local/nvidia"
We can find the same thing by running the following command as well:
docker inspect nvidia_driver_375.66
=> This will also give us the host directory where this driver is located
(*) Now do the following:
---------------------------
cd /var/lib/nvidia-docker/volumes/nvidia_driver/
mv 375.66 375.66.ORIG
ln -s 384.90 375.66
Also run ldconfig to find where the shared libraries are:
ldconfig -p | grep -E 'nvidia'
=> This will show something like this:
/usr/lib/nvidia-384/...
(*) Now do the following:
--------------------------
cd /usr/lib/
mv nvidia-375 nvidia-375-ORIG
ln -s nvidia-384 nvidia-375
(*) Restart and execute docker container:
-----------------------------------------
docker restart fbd211fc1edb (restart may not be needed)
docker exec -it fbd211fc1edb bash
(*) When the host system later upgraded the driver from 384.90 to 384.111, it was even trickier:
We first created /var/lib/nvidia-docker/volumes/nvidia_driver/384.111
Then copied /usr/lib/nvidia-384/* into /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib
Then nvidia-smi inside the docker container started to work
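In command form this was roughly (run as root on the host; adjust the version numbers to your actual driver):
mkdir -p /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib
cp -a /usr/lib/nvidia-384/* /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib/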
That fixed nvidia-smi; a CUDA/cuDNN problem may still remain (see "Restore cuDNN" below).
=========================================
Option B: Just create another container
=========================================
Note: you can re-use the same image. We did that and the GPU started working in the newly
created container. Later, once you are done copying contents from one container to the other and
are confident that everything works as expected, you can kill the previous container.
In this example we re-used the image "custom-agshift-docker-image-gpu", but re-mapped the
container ports to different ports on the host machine:
nvidia-docker run -p 8889:8888 -p 6007:6006 --name custy-agshift-gpu-new1 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Note: here the container ports 8888 and 6006 (right-hand side) bind to host ports 8889 and 6007 respectively.
Open these ports in Amazon AWS (security group).
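To confirm the mappings from the host, docker's port command lists them for a container:
docker port custy-agshift-gpu-new1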
Use the --volume option to get rid of this problem for future docker runs
nvidia-docker run -p 8891:8888 -p 6009:6006 -p 8091:8090 -p 9008:9007 \
--volume=nvidia_driver_384.90:/usr/local/nvidia:ro --name custy-agshift-gpu-oct24 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Restore cuDNN
# Sometimes, this problem might come up, while running tensorflow:
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
This happens because /usr/local/cuda/lib64 somehow no longer contains libcudnn.so.6.
So go into the affected docker container and download and install cuDNN v6:
https://gist.github.com/mjdietzx/0ff77af5ae60622ce6ed8c4d9b419f45
CUDNN_TAR_FILE="cudnn-8.0-linux-x64-v6.0.tgz"
wget http://developer.download.nvidia.com/compute/redist/cudnn/v6.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}
sudo apt-get install libcupti-dev
# Remember to copy the extracted files to /usr/local/cuda/lib64
# Remember to create the soft-links.
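A sketch of those copy and soft-link steps (assuming the tar extracted into a ./cuda directory, as the cuDNN archives normally do):
cp -P cuda/include/cudnn.h /usr/local/cuda/include/
cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/   # -P preserves the libcudnn.so.6 soft-links
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
ldconfig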
# Also make sure the PATHS are like this:
root@fbd211fc1edb:/usr/local/cuda/lib64# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@fbd211fc1edb:/usr/local/cuda/lib64# echo $PATH
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Commands to launch docker
$ docker run -it -p 8888:8888 -p 6006:6006 -v /$(pwd)/tensorflow:/notebooks --name tf b.gcr.io/tensorflow/tensorflow
Information:
-p 8888:8888 -p 6006:6006 maps container ports to host ports; the first pair is for the Jupyter
notebook, the second for TensorBoard.
To attach the docker port to a different port of the host machine do this:
nvidia-docker run -p 8889:8888 -p 6007:6006 --name custy-agshift-gpu-new1 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Note: here the container ports 8888 and 6006 (right-hand side) bind to host ports 8889 and 6007 respectively.
Open these ports in Amazon AWS (security group).
To attach docker to a local file system and run it with a name (locally on my Mac)
======================================================================================
docker run -p 8888:8888 -p 6006:6006 --name tensorflow-agshift -it -v $HOME/amit_devel/proj_openCV/AgShift/AmazonAWS/MLDL/STRAWBERRY_ML_TEST/tf_files:/tf_files gcr.io/tensorflow/tensorflow:latest-devel
In Amazon AWS this would be:
=============================
docker run -p 8888:8888 -p 6006:6006 --name tensorflow-agshift -it -v /home/ubuntu/MLDL/STRAWBERRY_ML_TEST/tf_files:/tf_files gcr.io/tensorflow/tensorflow:latest-devel
To get jupyter:
================
cd /
./run_jupyter.sh
To run jupyter in the background:
==================================
cd /
nohup ./run_jupyter.sh > tf_files/nohup.out 2>&1 < /dev/null &
To get bash in docker:
=======================
Open 'terminal' (from the Jupyter UI); you will get a terminal inside the docker container. Type bash
connect to the jupyter notebook using chrome
==============================================
http://<>-<>.us-<>-<>.compute.amazonaws.com:8888/?token=1f67a65ac96957e4e2bce1f356f4cc7b329c592a0b3601c2
The token can change every time jupyter is launched afresh
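If the token is lost, it can be listed from inside the container; jupyter prints the running servers with their tokens:
jupyter notebook list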
# 34-208-32-97
Commands to investigate nvidia gpu and docker
#Find your graphics card model
# the AWS P2 machine uses the following
lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
# the AWS G2 machine uses the following
lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
# to find nvidia driver version
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
# to do volume inspection
docker volume inspect nvidia_driver_375.66
[
{
"Driver": "nvidia-docker",
"Labels": null,
"Mountpoint": "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66",
"Name": "nvidia_driver_375.66",
"Options": {},
"Scope": "local"
}
]
# or for G2 K520 machine, use the following command:
docker volume inspect nvidia_driver_384.66
# to find docker volume and version of driver
docker volume ls
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_375.66
# to check CUDA installation
nvcc -V
# the above command will give us the following output
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
# if nvcc is not installed, install using apt-get below
apt install -y nvidia-cuda-toolkit
------------------------------------------------------------
ERROR:
nvidia-docker Error: nvml: Driver/library version mismatch
At minimum, reboot the machine (see also "If GPU stops working for a docker container" above)
------------------------------------------------------------
# Install nvidia-docker and nvidia-docker-plugin
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi
lsmod | grep nvidia
ldconfig -p | grep -E 'nvidia|cuda'
# to inspect a particular nvidia driver
docker volume inspect nvidia_driver_<driver number>
docker volume ls
service nvidia-docker status
service nvidia-docker start
service nvidia-docker stop
systemctl status nvidia-docker.service
nvidia-smi
which nvidia-modprobe
nvidia-modprobe --version
find /lib/modules/ -name "*nvidia*"
ls -l /usr/src/nvidia* | grep uvm
ls -l /usr/src/nvidia* | grep nvidia
dpkg -l | grep nvidia
modprobe nvidia
Mounting s3 in ec2 instance
Mounting s3 in docker
This is tricky.
# create a shared mount point for S3 bucket in the host ec2 linux instance:
mkdir -p /home/ubuntu/tf_files/s3_bucket_mount
# bind the mount point
mount --bind /home/ubuntu/tf_files/s3_bucket_mount/ /home/ubuntu/tf_files/s3_bucket_mount/
# share the mount point
mount --make-shared /home/ubuntu/tf_files/s3_bucket_mount/
# check the target and propagation
findmnt -o TARGET,PROPAGATION /home/ubuntu/tf_files/s3_bucket_mount/
# edit the password for s3fs
vim /etc/passwd-s3fs
# This should contain the AWS_S3_ACCESS_KEY and AWS_S3_SECRET in the form:
<AWS_S3_ACCESS_KEY>:<AWS_S3_SECRET>
# make it readable only by the owner
chmod 400 /etc/passwd-s3fs
# On host ec2 instance change to a custom directory
cd /home/ubuntu/tf_files/DOCKER_COMPOSE
# clone this repo on host
git clone https://github.com/xueshanf/docker-s3fs.git
# This clone will also give the respective Dockerfile to use and build an image
# build an image with s3fs - name the image "docker-s3fs-image" or any other name
docker build -t docker-s3fs-image .
# change to the directory where the docker-compose.yml file resides
cd /home/ubuntu/tf_files/DOCKER_COMPOSE/docker-s3fs
# Edit the docker-compose.yml file. It should look like below.
# Mount both /home/ubuntu/tf_files and a mount point for Amazon S3 (/tf_files/s3_bucket_mount)
version: '2'
services:
  s3fs:
    #image: docker-s3fs-image:latest
    image: custom-agshift-docker-s3fs-gpu-image:v0.1
    environment:
      AWSACCESSKEYID: <AWS_S3_ACCESS_KEY>
      AWSSECRETACCESSKEY: <AWS_S3_SECRET>
    cap_add:
      - MKNOD
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    devices:
      - /dev/fuse
    volumes:
      - /home/ubuntu/tf_files:/tf_files
      - /home/ubuntu/tf_files/s3_bucket_mount:/tf_files/s3_bucket_mount:shared
    #command: /usr/bin/s3fs -f -o allow_other -o use_cache=/tmp agskore /tf_files/s3_bucket_mount/
    command: /usr/bin/s3fs -f -o allow_other,uid=0,gid=0,umask=022 -o use_cache=/tmp agskore /tf_files/s3_bucket_mount/
# finally run docker using docker-compose
docker-compose up -d
# get the container_id
docker ps -a
# Get in the docker container and you will find the s3 bucket mounted
docker exec -it <container_id> bash
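A quick check inside the container that the bucket really got mounted (agskore is the bucket named in the compose file above):
ls /tf_files/s3_bucket_mount/
mount | grep s3fs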
Command to launch tensorboard:
tensorboard --logdir /tmp/retrain_logs
DOCKER SETUP - UBUNTU
# pull a docker ubuntu 16.04 image https://hub.docker.com/_/ubuntu/
docker pull ubuntu:16.04
# check
docker images
# run
docker run -d -it -p 8090:8099 -v /home/ubuntu/tf_files:/tf_files ubuntu:16.04 /bin/bash
DOCKER NETWORKING - Running a simple client server (sockets) between 2 dockers
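A minimal sketch of the idea (using the stock python:3 image as a stand-in server and client, and demo-net as an arbitrary network name): put both containers on one user-defined network, start a toy server in one, and reach it from the other by container name. The same pattern works for raw sockets; only the server and client programs change.
docker network create demo-net
# "server": a trivial HTTP server listening on port 8000
docker run -d --rm --name web --network demo-net python:3 python -m http.server 8000
# "client": resolves the server by its container name 'web'
docker run --rm --network demo-net python:3 \
  python -c "import urllib.request; print(urllib.request.urlopen('http://web:8000').status)"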