TensorFlow CPUs and GPUs Configuration

Li Yin
Mar 19, 2017


I want to load two neural networks in TensorFlow and fully utilize the power of the GPUs. However, my GPUs only have 8 GB of memory each, which is quite small, so I need to use GPUs and CPUs at the same time. This article is mainly about how to resolve this problem.

First of all, TensorFlow uses tf.ConfigProto() to configure the session:

config = tf.ConfigProto()

You can also control which GPUs a task sees by setting the environment variable CUDA_VISIBLE_DEVICES when launching it. Setting it to -1 hides all GPUs; otherwise the IDs map to the ones shown by the nvidia-smi command.

CUDA_VISIBLE_DEVICES=1 python task.py

Another way is to use export so that the setting applies to the whole shell environment:

export CUDA_VISIBLE_DEVICES=1

Now, back to configuration inside TensorFlow.

1. Default Mode

TensorFlow's default mode is to initialize all visible GPUs. Given the following code:

import tensorflow as tf
config = tf.ConfigProto()
sess = tf.Session(config=config)

The output is as follows:

2019-03-25 13:59:00.021347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0, 1, 2
2019-03-25 13:59:00.828308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-25 13:59:00.828355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 1 2
2019-03-25 13:59:00.828366: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N Y Y
2019-03-25 13:59:00.828373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   Y N Y
2019-03-25 13:59:00.828379: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   Y Y N
2019-03-25 13:59:00.829254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9188 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2019-03-25 13:59:00.974303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8421 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2019-03-25 13:59:01.082848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10403 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:0b:00.0, compute capability: 6.1)

To better analyze how TensorFlow assigns resources, and to find out which devices your operations and tensors are placed on, create the session with the log_device_placement configuration option set to True.
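For example, a minimal TF 1.x snippet (just the configuration, no graph):

import tensorflow as tf

# Log which device each operation and tensor is placed on.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)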

2. Decide Using GPU or CPU

To avoid using the GPU, a good solution is to prevent TensorFlow from seeing any GPUs by setting the environment variable CUDA_VISIBLE_DEVICES before importing it.

def no_gpu():
    import os
    # Hide all GPUs before importing TensorFlow.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    import tensorflow as tf
    # Creates a graph.
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    config = tf.ConfigProto(log_device_placement=True)
    sess = tf.Session(config=config)
    # Runs the op.
    print(sess.run(c))

The output is as follows:

Device mapping: no known devices.
2019-03-25 14:30:49.257887: I tensorflow/core/common_runtime/direct_session.cc:288] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258671: I tensorflow/core/common_runtime/placer.cc:935] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
MatMul_1: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258711: I tensorflow/core/common_runtime/placer.cc:935] MatMul_1: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
MatMul_2: (MatMul): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258733: I tensorflow/core/common_runtime/placer.cc:935] MatMul_2: (MatMul)/job:localhost/replica:0/task:0/device:CPU:0
a: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258755: I tensorflow/core/common_runtime/placer.cc:935] a: (Const)/job:localhost/replica:0/task:0/device:CPU:0
b: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258773: I tensorflow/core/common_runtime/placer.cc:935] b: (Const)/job:localhost/replica:0/task:0/device:CPU:0
a_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258791: I tensorflow/core/common_runtime/placer.cc:935] a_1: (Const)/job:localhost/replica:0/task:0/device:CPU:0
b_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258809: I tensorflow/core/common_runtime/placer.cc:935] b_1: (Const)/job:localhost/replica:0/task:0/device:CPU:0
a_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258827: I tensorflow/core/common_runtime/placer.cc:935] a_2: (Const)/job:localhost/replica:0/task:0/device:CPU:0
b_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
2019-03-25 14:30:49.258845: I tensorflow/core/common_runtime/placer.cc:935] b_2: (Const)/job:localhost/replica:0/task:0/device:CPU:0
[[22. 28.]
 [49. 64.]]

Alternatively, the parameter device_count takes a dictionary specifying the maximum number of devices of each type to use. For example, the following code makes TensorFlow use no GPU resources at all:

config = tf.ConfigProto(device_count={'GPU': 0})
sess = tf.Session(config=config)

The following code sets device counts for both CPU and GPU:

config = tf.ConfigProto(device_count={'GPU':0, 'CPU':4})

Configuring CPUs

To run TensorFlow on a single CPU thread:

session_conf = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=session_conf)

Note that device_count limits the number of CPU devices being used, not the number of cores or threads.
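As a rough sketch of the distinction (the thread counts here are arbitrary example values), device_count caps the number of CPU devices, while the parallelism options control threading within a device:

config = tf.ConfigProto(
    device_count={'CPU': 1},          # at most one CPU device
    intra_op_parallelism_threads=2,   # threads used inside a single op
    inter_op_parallelism_threads=2)   # threads used across independent ops
sess = tf.Session(config=config)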

3. Configuring GPUs

In ConfigProto(), gpu_options is used to configure GPUs.

Visible Devices

First, check your machine's available GPUs with the nvidia-smi command. When exposing multiple GPUs, separate their IDs with commas:

config.gpu_options.visible_device_list = '1'      # only see GPU 1
config.gpu_options.visible_device_list = '0,1,2'  # see GPUs 0, 1, and 2

GPU Growth

The first option is allow_growth, which attempts to allocate only as much GPU memory as runtime allocations require: it starts out allocating very little memory, and as sessions run and more GPU memory is needed, the GPU memory region used by the TensorFlow process is extended. The second option, per_process_gpu_memory_fraction, caps the fraction of each visible GPU's total memory that the process may allocate.

config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.9
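Putting these GPU options together, here is a minimal sketch of a full session configuration (the device list and memory fraction are example values):

config = tf.ConfigProto()
config.gpu_options.visible_device_list = '0,1'            # expose only GPUs 0 and 1
config.gpu_options.allow_growth = True                    # start small, grow as needed
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # cap at 90% of each visible GPU
sess = tf.Session(config=config)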

4. Manual device placement

If you would like a particular operation to run on a device of your choice instead of what’s automatically selected for you, you can use with tf.device to create a device context such that all the operations within that context will have the same device assignment.

# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

You will see that now a and b are assigned to cpu:0.

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus
id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/cpu:0
a: /job:localhost/replica:0/task:0/cpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
[[ 22. 28.]
[ 49. 64.]]

5. Using a single GPU on a multi-GPU system

If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default. If you would like to run on a different GPU, you will need to specify the preference explicitly:

# Creates a graph.
with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

If the device you have specified does not exist, you will get InvalidArgumentError:

InvalidArgumentError: Invalid argument: Cannot assign a device to node 'b':
Could not satisfy explicit device specification '/gpu:2'
[[Node: b = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [3,2]
values: 1 2 3...>, _device="/gpu:2"]()]]

If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn’t exist, you can set allow_soft_placement to True in the configuration option when creating the session.

# Creates a graph.
with tf.device('/gpu:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with allow_soft_placement and log_device_placement set
# to True.
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
# Runs the op.
print(sess.run(c))

6. Using multiple GPUs

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example:

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))

You will see the following output.

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20m, pci bus
id: 0000:02:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20m, pci bus
id: 0000:03:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla K20m, pci bus
id: 0000:83:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla K20m, pci bus
id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/gpu:3
Const_2: /job:localhost/replica:0/task:0/gpu:3
MatMul_1: /job:localhost/replica:0/task:0/gpu:3
Const_1: /job:localhost/replica:0/task:0/gpu:2
Const: /job:localhost/replica:0/task:0/gpu:2
MatMul: /job:localhost/replica:0/task:0/gpu:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[ 44. 56.]
[ 98. 128.]]

Solving our problem

It looks like we need the multi-tower approach. The CIFAR-10 multi-GPU example is a good one to follow. CIFAR-10 was selected because it is complex enough to exercise much of TensorFlow's ability to scale to large models.

Training a Model Using Multiple GPU Cards

Modern workstations may contain multiple GPUs for scientific computation. TensorFlow can leverage this environment to run the training operation concurrently across multiple cards.

Training a model in a parallel, distributed fashion requires coordinating the training processes. In what follows, we use the term model replica to mean one copy of the model training on a subset of the data.

Naively employing asynchronous updates of model parameters leads to sub-optimal training performance because an individual model replica might be trained on a stale copy of the model parameters. Conversely, employing fully synchronous updates will be as slow as the slowest model replica.

In a workstation with multiple GPU cards, each GPU will have similar speed and contain enough memory to run an entire CIFAR-10 model. Thus, we opt to design our training system in the following manner:

  • Place an individual model replica on each GPU.
  • Update model parameters synchronously by waiting for all GPUs to finish processing a batch of data.

A diagram of this model can be found in the CIFAR-10 multi-GPU tutorial.

Note that each GPU computes inference as well as the gradients for a unique batch of data. This setup effectively permits dividing up a larger batch of data across the GPUs.

This setup requires that all GPUs share the model parameters. A well-known fact is that transferring data to and from GPUs is quite slow. For this reason, we decide to store and update all model parameters on the CPU. A fresh set of model parameters is transferred to the GPUs when a new batch of data is to be processed by all of them.

The GPUs are synchronized in operation. All gradients are accumulated from the GPUs and averaged, and the model parameters are updated with the gradients averaged across all model replicas.
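The following is a condensed sketch of this pattern in the TF 1.x API, simplified from the CIFAR-10 multi-GPU example. NUM_GPUS, the toy tower_loss model, and the learning rate are placeholder assumptions, and the CPU pinning of variables is covered in the next section.

import tensorflow as tf

NUM_GPUS = 2  # assumption: adjust to the GPUs on your machine

def tower_loss():
    # Toy stand-in for inference + loss on one tower's slice of the batch.
    with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
        w = tf.get_variable('w', shape=[3, 1])
    x = tf.random_normal([8, 3])
    return tf.reduce_mean(tf.square(tf.matmul(x, w)))

def average_gradients(tower_grads):
    # For each variable, average its gradients across all towers.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars])
        averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
    return averaged

opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
        tower_grads.append(opt.compute_gradients(tower_loss()))
with tf.device('/cpu:0'):
    # Synchronous update: apply the gradients averaged over all replicas.
    train_op = opt.apply_gradients(average_gradients(tower_grads))

sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
sess.run(tf.global_variables_initializer())
sess.run(train_op)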

Placing Variables and Operations on Devices

Placing operations and variables on devices requires some special abstractions.

The first abstraction we require is a function for computing inference and gradients for a single model replica. In the code we term this abstraction a “tower”. We must set two attributes for each tower:

  • A unique name for all operations within a tower. tf.name_scope provides this unique name by prepending a scope. For instance, all operations in the first tower are prepended with tower_0, e.g. tower_0/conv1/Conv2D.
  • A preferred hardware device to run the operation within a tower. tf.device specifies this. For instance, all operations in the first tower reside within device('/gpu:0') scope indicating that they should be run on the first GPU.

All variables are pinned to the CPU and accessed via tf.get_variable in order to share them in a multi-GPU version. See how-to on Sharing Variables.
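A small sketch of how that pinning looks; this follows the _variable_on_cpu helper used in the CIFAR-10 code, and the shape and initializer below are example values:

import tensorflow as tf

def _variable_on_cpu(name, shape, initializer):
    # Force the variable to live in host memory, even when called inside a GPU device scope.
    with tf.device('/cpu:0'):
        return tf.get_variable(name, shape, initializer=initializer, dtype=tf.float32)

# Example usage inside a tower that otherwise runs on a GPU:
with tf.device('/gpu:0'):
    weights = _variable_on_cpu('weights', [5, 5, 3, 64],
                               tf.truncated_normal_initializer(stddev=5e-2))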

A good example is here: https://github.com/tensorflow/tensorflow/blob/r0.7/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py

I can try this out first and see if it works.
