Common commands
Commands to install docker/nvidia-docker on Ubuntu (e.g. Ubuntu 16.04 Xenial on AWS)
# Pre-requisites:
1. Install docker first by doing:
sudo su
curl -fsSL https://get.docker.com/ | sh
Add the 'ubuntu' user to the docker group (so docker can be run without sudo):
sudo usermod -aG docker $USER
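A quick sanity check of the install (two standard docker commands; log out and back in first so the group change takes effect):
docker --version
docker run --rm hello-world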
# Then do the following
2. Then cd to the directory where the Dockerfile is kept
3. Run:
docker build -t custom-tf-docker-image-inference .
4. Install nvidia-docker and nvidia-docker-plugin
apt install -y nvidia-cuda-toolkit
apt install -y nvidia-modprobe
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
service nvidia-docker status
service nvidia-docker start
nvidia-smi
# the following command may not be needed
# apt-get install linux-headers-$(uname -r)
5. Test nvidia-smi   # this command gave a driver mismatch error which I could not solve at the time
nvidia-docker run --rm nvidia/cuda nvidia-smi
6. reboot
7. nvidia-docker run -p 8888:8888 -p 6006:6006 -p 8090:8090 -p 9007:9007 \
--name custy-agshift-gpu-inference-new \
-it -v /home/ubuntu/tf_files:/tf_files custom-tf-docker-image-inference
8. cd /
9. nohup ./run_jupyter.sh --allow-root > tf_files/nohup.out 2>&1 < /dev/null &
# To stop a jupyter server
jupyter notebook stop <port_number like 8888 or 8889>
Ref: docker command line:
https://docs.docker.com/engine/reference/commandline/run/#capture-container-id-cidfile
https://github.com/NVIDIA/nvidia-docker
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
10. To get into a docker container without corrupting the display
docker exec -it f27eed929330 bash -c "export COLUMNS=`tput cols`; export LINES=`tput lines`; exec bash"
Commands on docker images
References:
https://www.digitalocean.com/community/tutorials/how-to-remove-docker-images-containers-and-volumes
# To see all images
docker images -a
# To check all docker processes
docker ps -a
# To find all dangling docker images
docker images -f dangling=true
# To remove dangling docker images
docker rmi $(docker images -f dangling=true -q)
# To remove all images
docker rmi $(docker images -a -q)
# Remove processes followed by images by force
docker rm -f $(docker ps -a -q)
docker rmi -f $(docker images -q)
# To remove a particular image - first remove any dangling images
docker rmi custom-tf-docker-image
# To build a docker image (custom)
1. First create a file 'Dockerfile'
2. Next build the image
docker build -t custom-tf-docker-image .
3. Next run it (note: without the -it option the container will exit immediately)
docker run --name custy-tf -it custom-tf-docker-image
# To run bash
docker exec -it docker_id bash -c "export COLUMNS=`tput cols`; export LINES=`tput lines`; exec bash"
Difference between EXPOSE and PUBLISH (ports)
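In short: EXPOSE in a Dockerfile only documents which ports the container listens on; it does not make them reachable from the host. Publishing with -p/--publish at run time is what actually binds a container port to a host port. A minimal illustration (my-image is just a placeholder name):
# In the Dockerfile: documentation/metadata only
EXPOSE 8888
# At run time: actually map container port 8888 to host port 8888
docker run -p 8888:8888 my-image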
Commands on docker containers
# To list all exited containers
docker ps -a -f status=exited
# To remove all exited containers
docker rm $(docker ps -a -f status=exited -q)
# To remove a container by its container id
docker ps -a | grep "<container_id>" | awk '{print $1}' | xargs docker rm
eg: docker ps -a | grep "2adc7cfe583d" | awk '{print $1}' | xargs docker rm
# Stop and remove all containers
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
# Sequence of commands to do a complete cleanup of all docker images
docker stop $(docker ps -a -q)
docker rm $(docker ps -a -q)
docker rmi $(docker images -f dangling=true -q)
docker rmi $(docker images -a -q)
# To get a new bash shell on an existing docker container (avoid clash with co-user)
docker exec -it f5e8 bash
# To copy a docker image from one machine to another
1. first tar the image:
sudo docker save -o <image.tar> <image name>
Then copy the tar file to the new system with a regular file-transfer tool such as scp.
After that, load the image into docker on the new machine:
2. sudo docker load -i <path to image tar file>
(sudo may be needed for both the save and the load)
In one shot this can be done as:
sudo docker save <image> | bzip2 | pv | \
ssh user@host 'bunzip2 | docker load'
Communicating between docker containers
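The usual approach (a brief sketch added here; <server-image> and <client-image> are placeholders) is a user-defined bridge network: containers attached to the same network can reach each other by container name. See the DOCKER NETWORKING section at the end for a concrete client/server example.
docker network create my-net
docker run -d --name server --network my-net <server-image>
docker run -it --name client --network my-net <client-image>
# inside 'client', the hostname 'server' now resolves to the server container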
Command to launch docker using nvidia-docker wrapper
nvidia-docker run -p 8888:8888 -p 6006:6006 --name custy-gpu-new \
-it -v /home/ubuntu/tf_files:/tf_files custom-docker-image-gpu-new
If GPU stops working for a docker container
=========================================
Option A: Play a trick
=========================================
Check this command:
docker volume ls
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_375.66
nvidia-docker nvidia_driver_384.90
Here our container may still be using the volume nvidia_driver_375.66 while the host system silently
upgraded the nvidia driver to 384.90 (nvidia_driver_384.90).
So, if we try to execute GPU code, or even nvidia-smi, inside the container we will get the following error:
> nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Trick played to solve this:
============================
Go to the host and do:
sudo su
docker inspect <container_id (fbd211fc1edb)>
=> Go to the Mounts section and see which nvidia-driver you are using
In our case it was:
"Source": "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66",
"Destination": "/usr/local/nvidia"
This means the source (host) location "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66"
is mounted in the docker container as "/usr/local/nvidia"
We can find the same thing by running the following command as well:
docker inspect nvidia_driver_375.66
=> This will also give us the host directory where this driver is located
(*) Now do the following:
---------------------------
cd /var/lib/nvidia-docker/volumes/nvidia_driver/
mv 375.66 375.66.ORIG
ln -s 384.90 375.66
Also run ldconfig to find where the shared libraries are:
ldconfig -p | grep -E 'nvidia'
=> This will show something like this:
/usr/lib/nvidia-384/...
(*) Now do the following:
--------------------------
cd /usr/lib/
mv nvidia-375 nvidia-375-ORIG
ln -s nvidia-384 nvidia-375
(*) Restart and execute docker container:
-----------------------------------------
docker restart fbd211fc1edb (restart may not be needed)
docker exec -it fbd211fc1edb bash
(*) When the host system later upgraded the driver from 384.90 to 384.111, it was even trickier:
We first created /var/lib/nvidia-docker/volumes/nvidia_driver/384.111
Then copied /usr/lib/nvidia-384/* into /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib
Then nvidia-smi inside the docker container started to work
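In command form this was roughly (run as root on the host; adjust the version numbers to your actual driver):
mkdir -p /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib
cp -a /usr/lib/nvidia-384/* /var/lib/nvidia-docker/volumes/nvidia_driver/384.111/lib/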
That fixed nvidia-smi; a CUDA/cuDNN problem may still remain (see "Restore cuDNN" below).
=========================================
Option B: Just create another container
=========================================
Note: you can re-use the same image. We did that and the GPU started working in the newly
created container. Later, once you are done copying contents from one container to the other and
are confident that everything works as expected, you can kill the previous container.
In this example we re-used the image "custom-agshift-docker-image-gpu", but re-mapped the
container ports to different ports on the host machine:
nvidia-docker run -p 8889:8888 -p 6007:6006 --name custy-agshift-gpu-new1 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Note: here the container ports 8888 and 6006 (right-hand side) bind to host ports 8889 and 6007 respectively.
Open these ports in Amazon AWS (security group).
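To confirm the mappings from the host, docker's port command lists them for a container:
docker port custy-agshift-gpu-new1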
Use the --volume option to get rid of this problem for future docker runs
nvidia-docker run -p 8891:8888 -p 6009:6006 -p 8091:8090 -p 9008:9007 \
--volume=nvidia_driver_384.90:/usr/local/nvidia:ro --name custy-agshift-gpu-oct24 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Restore cuDNN
# Sometimes, this problem might come up, while running tensorflow:
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
This happens because /usr/local/cuda/lib64 somehow no longer contains libcudnn.so.6.
So go into the affected docker container and download and install cuDNN v6:
https://gist.github.com/mjdietzx/0ff77af5ae60622ce6ed8c4d9b419f45
CUDNN_TAR_FILE="cudnn-8.0-linux-x64-v6.0.tgz"
wget http://developer.download.nvidia.com/compute/redist/cudnn/v6.0/${CUDNN_TAR_FILE}
tar -xzvf ${CUDNN_TAR_FILE}
sudo apt-get install libcupti-dev
# Remember to copy the extracted files to /usr/local/cuda/lib64
# Remember to create the soft-links.
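A sketch of those copy and soft-link steps (assuming the tar extracted into a ./cuda directory, as the cuDNN archives normally do):
cp -P cuda/include/cudnn.h /usr/local/cuda/include/
cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64/   # -P preserves the libcudnn.so.6 soft-links
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
ldconfig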
# Also make sure the PATHS are like this:
root@fbd211fc1edb:/usr/local/cuda/lib64# echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
root@fbd211fc1edb:/usr/local/cuda/lib64# echo $PATH
/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Commands to launch docker
$ docker run -it -p 8888:8888 -p 6006:6006 -v /$(pwd)/tensorflow:/notebooks --name tf b.gcr.io/tensorflow/tensorflow
Information:
-p 8888:8888 -p 6006:6006 maps container ports to host ports; the first pair is for the Jupyter
notebook, the second for TensorBoard.
To attach the docker port to a different port of the host machine do this:
nvidia-docker run -p 8889:8888 -p 6007:6006 --name custy-agshift-gpu-new1 \
-it -v /home/ubuntu/tf_files:/tf_files custom-agshift-docker-image-gpu
Note: here the container ports 8888 and 6006 (right-hand side) bind to host ports 8889 and 6007 respectively.
Open these ports in Amazon AWS (security group).
To attach docker to a local file system and run it with a name (locally on my Mac)
======================================================================================
docker run -p 8888:8888 -p 6006:6006 --name tensorflow-agshift -it -v $HOME/amit_devel/proj_openCV/AgShift/AmazonAWS/MLDL/STRAWBERRY_ML_TEST/tf_files:/tf_files gcr.io/tensorflow/tensorflow:latest-devel
In Amazon AWS this would be:
=============================
docker run -p 8888:8888 -p 6006:6006 --name tensorflow-agshift -it -v /home/ubuntu/MLDL/STRAWBERRY_ML_TEST/tf_files:/tf_files gcr.io/tensorflow/tensorflow:latest-devel
To get jupyter:
================
cd /
./run_jupyter.sh
To run jupyter in the background:
==================================
cd /
nohup ./run_jupyter.sh > tf_files/nohup.out 2>&1 < /dev/null &
To get bash in docker:
=======================
Open 'terminal' (from the Jupyter UI); you will get a terminal inside the docker container. Type bash
connect to the jupyter notebook using chrome
==============================================
http://<>-<>.us-<>-<>.compute.amazonaws.com:8888/?token=1f67a65ac96957e4e2bce1f356f4cc7b329c592a0b3601c2
The token can change every time jupyter is launched afresh
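If the token is lost, it can be listed from inside the container; jupyter prints the running servers with their tokens:
jupyter notebook list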
# 34-208-32-97
Commands to investigate nvidia gpu and docker
#Find your graphics card model
# the AWS P2 machine uses the following
lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
# the AWS G2 machine uses the following
lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
# to find nvidia driver version
cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
# to do volume inspection
docker volume inspect nvidia_driver_375.66
[
{
"Driver": "nvidia-docker",
"Labels": null,
"Mountpoint": "/var/lib/nvidia-docker/volumes/nvidia_driver/375.66",
"Name": "nvidia_driver_375.66",
"Options": {},
"Scope": "local"
}
]
# or for G2 K520 machine, use the following command:
docker volume inspect nvidia_driver_384.66
# to find docker volume and version of driver
docker volume ls
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_375.66
# to check CUDA installation
nvcc -V
# the above command will give us the following output
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
# if nvcc is not installed, install using apt-get below
apt install -y nvidia-cuda-toolkit
------------------------------------------------------------
ERROR:
nvidia-docker Error: nvml: Driver/library version mismatch
At minimum, reboot the machine (see also "If GPU stops working for a docker container" above)
------------------------------------------------------------
# Install nvidia-docker and nvidia-docker-plugin
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi
lsmod | grep nvidia
ldconfig -p | grep -E 'nvidia|cuda'
# to inspect a particular nvidia driver
docker volume inspect nvidia_driver_<driver number>
docker volume ls
service nvidia-docker status
service nvidia-docker start
service nvidia-docker stop
systemctl status nvidia-docker.service
nvidia-smi
which nvidia-modprobe
nvidia-modprobe --version
find /lib/modules/ -name "*nvidia*"
ls -l /usr/src/nvidia* | grep uvm
ls -l /usr/src/nvidia* | grep nvidia
dpkg -l | grep nvidia
modprobe nvidia
Mounting s3 in ec2 instance
Mounting s3 in docker
This is tricky.
# create a shared mount point for S3 bucket in the host ec2 linux instance:
mkdir -p /home/ubuntu/tf_files/s3_bucket_mount
# bind the mount point
mount --bind /home/ubuntu/tf_files/s3_bucket_mount/ /home/ubuntu/tf_files/s3_bucket_mount/
# share the mount point
mount --make-shared /home/ubuntu/tf_files/s3_bucket_mount/
# check the target and propagation
findmnt -o TARGET,PROPAGATION /home/ubuntu/tf_files/s3_bucket_mount/
# edit the password for s3fs
vim /etc/passwd-s3fs
# This should contain the AWS_S3_ACCESS_KEY and AWS_S3_SECRET in the form:
<AWS_S3_ACCESS_KEY>:<AWS_S3_SECRET>
# make it readable only by the owner
chmod 400 /etc/passwd-s3fs
# On host ec2 instance change to a custom directory
cd /home/ubuntu/tf_files/DOCKER_COMPOSE
# clone this repo on host
git clone https://github.com/xueshanf/docker-s3fs.git
# This clone will also give the respective Dockerfile to use and build an image
# build an image with s3fs - name the image "docker-s3fs-image" or any other name
docker build -t docker-s3fs-image .
# change to the directory where the docker-compose.yml file resides
cd /home/ubuntu/tf_files/DOCKER_COMPOSE/docker-s3fs
# Edit the docker-compose.yml file. It should look like below.
# Mount both /home/ubuntu/tf_files and a mount point for Amazon S3 (/tf_files/s3_bucket_mount)
version: '2'
services:
  s3fs:
    #image: docker-s3fs-image:latest
    image: custom-agshift-docker-s3fs-gpu-image:v0.1
    environment:
      AWSACCESSKEYID: <AWS_S3_ACCESS_KEY>
      AWSSECRETACCESSKEY: <AWS_S3_SECRET>
    cap_add:
      - MKNOD
      - SYS_ADMIN
    security_opt:
      - apparmor:unconfined
    devices:
      - /dev/fuse
    volumes:
      - /home/ubuntu/tf_files:/tf_files
      - /home/ubuntu/tf_files/s3_bucket_mount:/tf_files/s3_bucket_mount:shared
    #command: /usr/bin/s3fs -f -o allow_other -o use_cache=/tmp agskore /tf_files/s3_bucket_mount/
    command: /usr/bin/s3fs -f -o allow_other,uid=0,gid=0,umask=022 -o use_cache=/tmp agskore /tf_files/s3_bucket_mount/
# finally run docker using docker-compose
docker-compose up -d
# get the container_id
docker ps -a
# Get in the docker container and you will find the s3 bucket mounted
docker exec -it <container_id> bash
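A quick check inside the container that the bucket really got mounted (agskore is the bucket named in the compose file above):
ls /tf_files/s3_bucket_mount/
mount | grep s3fs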
Command to launch tensorboard:
tensorboard --logdir /tmp/retrain_logs
DOCKER SETUP - UBUNTU
# pull a docker ubuntu 16.04 image https://hub.docker.com/_/ubuntu/
docker pull ubuntu:16.04
# check
docker images
# run
docker run -d -it -p 8090:8099 -v /home/ubuntu/tf_files:/tf_files ubuntu:16.04 /bin/bash
DOCKER NETWORKING - Running a simple client server (sockets) between 2 dockers
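A minimal sketch of the idea (using the stock python:3 image as a stand-in server and client, and demo-net as an arbitrary network name): put both containers on one user-defined network, start a toy server in one, and reach it from the other by container name. The same pattern works for raw sockets; only the server and client programs change.
docker network create demo-net
# "server": a trivial HTTP server listening on port 8000
docker run -d --rm --name web --network demo-net python:3 python -m http.server 8000
# "client": resolves the server by its container name 'web'
docker run --rm --network demo-net python:3 \
  python -c "import urllib.request; print(urllib.request.urlopen('http://web:8000').status)"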