Deep Learning Workflow
Where to store data?
For your project to be successful, it is critical that you can iterate on the training process quickly. One thing to make sure of is that reading the data is not costly.
Your home directory is a very bad place to store your data for a simple reason: user directories are not stored on the local disk of every machine; they are mounted over the network via NFS. Transferring data over the network is much, much slower than reading it from a local disk (keep in mind that the data also has to be read from a disk first before it can be sent over the network).
To avoid this pitfall, simply put the data on the local disk. /mounts is the standard directory where we have been putting data. If you don't have access to this directory, just send an email to the sysadmins and we will help you out.
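As a rough illustration, here is a minimal sketch of staging a dataset from your home directory onto the local disk once, and then reading it from there during training. The paths are hypothetical placeholders; substitute your own user name and dataset.

import shutil
import time

# Hypothetical paths -- replace with your own user name and dataset.
nfs_path = "/home/username/datasets/my_dataset"   # home directory, mounted over NFS
local_path = "/mounts/my_dataset"                 # local disk on the training machine

# One-time copy from the NFS mount to the local disk.
start = time.time()
shutil.copytree(nfs_path, local_path)
print("Staged data in", time.time() - start, "seconds")

# From here on, point your training code at local_path instead of nfs_path.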
GPU
Deep learning has been around for almost half a century, but only recently has it become a prevalent machine learning method. Its popularity has risen dramatically since the 2000s, due to two main factors:
- Mass digitization provided computer scientists with the large amounts of data necessary to train models that generalize well.
- Computation has become cheaper and more accessible. Especially with the arrival of GPUs, training models has become faster than ever.
You too have to leverage these two factors to make sure your project is successful.
GPU vs CPU
The majority of the work done by neural networks is just matrix multiplication. This operation is highly parallelizable, and GPUs are designed for parallel processing. Let's conduct an experiment to demonstrate how much faster GPUs are when it comes to training neural networks. I will train a simple CNN on digit classification.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # comment this line out to run on the GPU; leave it in to force CPU

import time
import tensorflow as tf
from tensorflow.keras import Sequential, layers
from tensorflow.keras.datasets import mnist

# Load the MNIST digit data.
(train_X, train_y), (test_X, test_y) = mnist.load_data()
height = train_X.shape[1]
width = train_X.shape[2]
num_classes = 10

# Simple CNN: three conv/pool blocks followed by two dense layers.
model = Sequential([
    layers.experimental.preprocessing.Rescaling(1./255, input_shape=(height, width, 1)),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# Train for 10 epochs and time it.
epochs = 10
start = time.time()
history = model.fit(train_X, train_y, epochs=epochs, batch_size=64)
end = time.time()
print("Time taken:", end - start)
Results
CPU with 16 cores: 188.98 s
GPU: 54.0 s
As you can see, the GPU gave a 3.5x speedup, even though I was using all 16 CPU cores.
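If you want to confirm where each operation actually runs, a generic TensorFlow check (not specific to our setup) is to turn on device placement logging:

import tensorflow as tf

# Log the device (CPU or GPU) that executes each operation.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1000, 1000))
b = tf.random.uniform((1000, 1000))
c = tf.matmul(a, b)  # should report something like /device:GPU:0 when a GPU is visible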
Setting up environment for GPU
Layout is a good place to do machine learning projects. We will be adding GPUs to other machines soon, so this info might change. One way to check the specs of the GPU is to run the command:
$ lshw -C display
On layout this outputs:
- description: VGA compatible controller
- product: GK110 [GeForce GTX 780]
- vendor: NVIDIA Corporation
- physical id: 0
- bus info: pci@0000:03:00.0
- version: a1
- width: 64 bits
- clock: 33MHz
- capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
- configuration: driver=nvidia latency=0
- resources: irq:181 memory:de000000-deffffff memory:d0000000-d7ffffff memory:d8000000-d9ffffff ioport:8000(size=128) memory:df000000-df07ffff
Depending on the make and model of the GPU, other commands might also be available. For example, for NVIDIA GPUs you can display info about the current state of the GPU using the command:
$ nvidia-smi
(Note: you can use this command to see whether or not the GPU is busy.)
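If you prefer to check this from Python before launching a job, here is a small sketch that shells out to nvidia-smi (assuming it is installed and on your PATH) and prints the current utilization and memory use:

import subprocess

# Query GPU utilization and memory via nvidia-smi.
result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=utilization.gpu,memory.used,memory.total",
     "--format=csv"],
    capture_output=True, text=True
)
print(result.stdout)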
The way we manage different software versions and environments is through Modules. You can display the available modules via the command:
$ module avail
If you run this command on layout, you will see different versions of python, conda and cuda modules. Different python versions have different tensorflow versions installed.
You can see tensorflow’s compatibility chart with python, cuda and cudnn here: https://www.tensorflow.org/install/source#gpu
The latest version of tensorflow (2.3.1) is available on python/3.7 and it’s compatible with cuda/10.1.
If you are planning to use this version of tensorflow, then simply run these two commands:
$ module load python/3.7
$ module load cuda/10.1
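After loading the modules, a quick sanity check (a generic snippet, not specific to layout) is to confirm that the interpreter and tensorflow versions you get are the ones you expect:

import sys
import tensorflow as tf

print(sys.version)                    # should report Python 3.7.x
print(tf.__version__)                 # should report 2.3.1
print(tf.test.is_built_with_cuda())   # True if this build can use CUDA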
Jupyter
It's very convenient to run DL/ML projects in Python notebooks for multiple reasons: it's easier to visualize the data, you can make changes on the fly, and it helps you make sure everything is set up correctly before you start running lengthy experiments.
We have set up a JupyterHub instance on layout that is designed to support DL projects (link: https://lo0.cluster.earlham.edu/jupyterhub). It comes prepackaged with cuda 10.1, tensorflow 2.3.1 and python 3.7, and it automatically runs tensorflow projects on the available GPUs. Just to make sure everything is set up correctly, use this python snippet:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
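Since the GPU on layout is shared, TensorFlow will by default reserve nearly all of its memory. One way to be a better neighbor (this is a standard TensorFlow option, not something specific to our JupyterHub) is to enable memory growth so your notebook only takes what it needs:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)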
Tested and working as of 2022.