Deep Learning Workflow

From Earlham CS Department

Where to store data?

For your project to be successful, it is critical that you have the ability to iterate over the training process in a timely manner. One thing you should make sure of is that reading the data is not costly.

Your home directory is a very bad place to store your data for a simple reason: user directories are not stored on the local disk of every machine; they are mounted via NFS over the network. Transferring data over the network is much slower than reading it from a local disk (keep in mind that the file server also has to read the data from its own disk before it can send it over the network).

To avoid this pitfall, simply put the data on the local disk. /mounts is the standard directory where we have been putting data. If you don’t have access to this directory, just send an email to the sysadmins and we will help you out.
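As a minimal sketch of this step, the helper below copies a dataset directory from a network-mounted location to a local directory once and reuses the local copy afterwards. The function name stage_dataset and the example paths are hypothetical and not part of our setup; adjust them to wherever your data actually lives.

```python
import shutil
from pathlib import Path


def stage_dataset(source: Path, local_root: Path) -> Path:
    """Copy `source` into `local_root` once and return the path of the local copy."""
    target = local_root / source.name
    # Only copy when the local copy does not exist yet, so repeated
    # training runs do not pay the network transfer cost again.
    if not target.exists():
        shutil.copytree(source, target)
    return target


# Hypothetical usage (paths are placeholders):
# data_dir = stage_dataset(Path.home() / "datasets" / "digits", Path("/mounts/my-project"))
```

Your training script would then read from the returned path instead of from your home directory.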

GPU

Deep learning has been around for almost half a century, but only recently has it become a prevalent machine learning method. Its popularity has risen dramatically since the 2000s, due to two main factors:

  • Mass digitization provided computer scientists with the large amounts of data needed to train models that generalize well.
  • Computation power has become cheaper and more accessible. Especially with the arrival of GPUs, training models has become faster than ever.

You too have to leverage these two factors to make sure your project is successful.

GPU vs CPU

The majority of the work done by neural networks is just matrix multiplication. This operation is highly parallelizable and GPUs are designed for parallel processing of the data. Let's conduct an experiment to demonstrate how much faster GPUs are when it comes to training neural networks. I will train a simple CNN on digit classification.

import os
import time

# Set this before importing TensorFlow so it is in place when the GPU is initialized.
# As written, it hides all GPUs and training runs on the CPU; comment it out to run on the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
from tensorflow.keras import Sequential, layers
from tensorflow.keras.datasets import mnist

# Load the MNIST digit images and labels
(train_X, train_y), (test_X, test_y) = mnist.load_data()
height = train_X.shape[1]
width = train_X.shape[2]
num_classes = 10

# Simple CNN: three convolution + pooling stages followed by two dense layers
model = Sequential([
  layers.experimental.preprocessing.Rescaling(1./255, input_shape=(height, width, 1)),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(128, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Flatten(),
  layers.Dense(256, activation='relu'),
  layers.Dense(num_classes)  # raw logits; the loss below applies softmax
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
epochs = 10

# Time only the training loop
start = time.time()
history = model.fit(
  train_X,
  train_y,
  epochs=epochs,
  batch_size=64
)
end = time.time()
print("Time taken:", end - start)

Results

CPU with 16 cores: 188.98 s

GPU: 54.0 s

As you can see, I achieved roughly a 3.5x speedup using the GPU, even though the CPU run was using all 16 cores.
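The speedup figure is simply the ratio of the two wall-clock times reported above. A quick check in Python, using only those two numbers:

```python
cpu_seconds = 188.98  # CPU run (16 cores), from the results above
gpu_seconds = 54.0    # GPU run, from the results above

speedup = cpu_seconds / gpu_seconds
print(f"Speedup: {speedup:.2f}x")  # prints "Speedup: 3.50x"
```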


Setting up environment for GPU

Layout is a good place to do machine learning projects. We will be adding GPUs to other machines soon, so this information might change. One way to check the specs of the GPU is to run the command:

 $ lshw -C display 

On layout this outputs:

description: VGA compatible controller
product: GK110 [GeForce GTX 780]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:03:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:181 memory:de000000-deffffff memory:d0000000-d7ffffff memory:d8000000-d9ffffff
ioport:8000(size=128) memory:df000000-df07ffff


Depending on the make and model of the GPU, other commands might also be available. For example, for NVIDIA GPUs you can display information about the current state of the GPU using the command:

 $ nvidia-smi

(Note: you can use this command to see whether or not the GPU is busy.)
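If you want to run this check from a script before launching a long experiment, here is a minimal sketch that calls nvidia-smi through Python's subprocess module. The function name gpu_usage is hypothetical, and the selection of queried fields (name, memory.used, memory.total) is just one reasonable choice; the function returns None on machines where nvidia-smi is not installed or fails.

```python
import shutil
import subprocess


def gpu_usage():
    """Return one line per GPU (name, memory used, memory total), or None if unavailable."""
    # nvidia-smi only exists on machines with the NVIDIA driver installed
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip().splitlines()


print(gpu_usage())
```

High memory usage on a GPU usually means someone else's job is already running on it.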

The way we manage different software versions and environments is through Modules. You can display the available modules via the command:

 $ module avail

If you run this command on layout, you will see different versions of the python, conda and cuda modules. Different python versions have different tensorflow versions installed.

You can see tensorflow’s compatibility chart with python, cuda and cudnn here: https://www.tensorflow.org/install/source#gpu

At the time of writing, the latest version of tensorflow (2.3.1) is available on python/3.7 and it’s compatible with cuda/10.1.

If you are planning to use this version of tensorflow, then simply run these two commands:

 $ module load python/3.7
 $ module load cuda/10.1

Jupyter

It’s very convenient to run DL/ML projects in Python notebooks for multiple reasons: it’s easier to visualize the data, you can make changes on the fly, and you can make sure everything is set up correctly before you start running lengthy experiments.

We have set up a JupyterHub instance on layout that is designed to support DL projects (link: https://lo0.cluster.earlham.edu/jupyterhub). It comes prepackaged with CUDA 10.1, TensorFlow 2.3.1 and Python 3.7, and it automatically runs tensorflow projects on the available GPUs. To make sure everything is set up correctly, run this Python snippet in a notebook cell:

import tensorflow as tf
tf.config.list_physical_devices('GPU')  # a non-empty list means TensorFlow can see a GPU