May 12, 2020

7 Tips To Maximize PyTorch Performance

William Falcon

Throughout the last 10 months, while working on PyTorch Lightning, the team and I have been exposed to many styles of structuring PyTorch code and we have identified a few key places where we see people inadvertently introducing bottlenecks.

We’ve taken great care to make sure that PyTorch Lightning doesn’t make any of these mistakes in the code we automate for you, and we even try to correct them for users when we detect them. However, since Lightning is just structured PyTorch and you still control all of the scientific PyTorch, in many cases there’s not much we can do for the user.

In addition, if you’re not using Lightning, you might inadvertently introduce these issues into your code.

To help you train faster, here are 7 tips you should be aware of that might be slowing down your code.

Use workers in DataLoaders

This first mistake is an easy one to correct. PyTorch allows loading data on multiple processes simultaneously (documentation).

With 8 workers, for example, PyTorch can sidestep the GIL by loading 8 batches in parallel, each in its own process. How many workers should you use? A good rule of thumb is:

num_workers = 4 * num_GPU

This answer has a good discussion about this.

Warning: The downside is that your memory usage will also increase (source).
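
Putting it together, a minimal sketch looks something like this (the dataset, batch size, and worker count here are toy placeholders, not recommendations for your setup):

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset just to make the example runnable
dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))

num_gpus = max(torch.cuda.device_count(), 1)
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4 * num_gpus,  # the rule of thumb above
    shuffle=True,
)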

Pin memory

You know how sometimes your GPU memory shows as full even though you’re pretty sure your model isn’t using that much? That overhead comes from memory that has been reserved as a type of “working allocation.” Pinned memory is the host-side version of this idea: page-locked CPU RAM set aside as a staging area so batches can be transferred to the GPU quickly.

When you enable pin_memory=True in a DataLoader it “automatically puts the fetched data Tensors in pinned memory, and enables faster data transfer to CUDA-enabled GPUs” (source).

Pinned memory is described in this NVIDIA blog post.

This also means you should not unnecessarily call:

torch.cuda.empty_cache()
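
For the DataLoader side of this, a minimal sketch of pinned memory plus asynchronous host-to-GPU copies might look like this (the dataset and sizes are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 32), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

device = torch.device('cuda:0')
for x, y in loader:
    # non_blocking=True lets the copy overlap with compute, because the source
    # batch lives in pinned (page-locked) host memory
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)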

Avoid CPU to GPU transfers or vice-versa

# bad
.cpu()
.item()
.numpy()

I see heavy usage of .item(), .cpu(), and .numpy() calls. This is really bad for performance because every one of these calls transfers data from the GPU to the CPU (forcing a synchronization) and dramatically slows you down.

If you’re trying to clear up the attached computational graph, use .detach() instead.

# good
.detach()

This won’t transfer memory to the CPU, and it will remove any computational graphs attached to that variable.
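
As a rough sketch of the pattern (the model, data, and optimizer below are toy placeholders): keep intermediate values on the GPU and transfer to the CPU once, at the very end.

import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(32, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

losses = []
for _ in range(100):
    x = torch.randn(64, 32, device=device)        # batches created directly on the GPU
    y = torch.randint(0, 2, (64,), device=device)
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    losses.append(loss.detach())                  # stays on the GPU: no per-step sync

mean_loss = torch.stack(losses).mean().item()     # a single GPU-to-CPU transfer at the end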

Construct tensors directly on GPUs

Most people create tensors on GPUs like this:

t = torch.rand(2, 2).cuda()

However, this first creates a CPU tensor and THEN transfers it to the GPU… this is really slow. Instead, create the tensor directly on the device you want:

t = torch.rand(2, 2, device=torch.device('cuda:0'))

If you’re using Lightning, we automatically put your model and the batch on the correct GPU for you. But if you create a new tensor somewhere inside your code (ie: sample random noise for a VAE, or something like that), then you must place the tensor on the correct device yourself:

t = torch.rand(2, 2, device=self.device)

Every LightningModule has a convenient self.device attribute, which works whether you are on CPU, multiple GPUs, or TPUs (ie: Lightning will choose the right device for that tensor).
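
For example, a made-up LightningModule that samples noise might look like this — the module itself is just a placeholder, the point is self.device:

import torch
import pytorch_lightning as pl

class TinyVAE(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.decoder = torch.nn.Linear(16, 32)

    def sample(self, batch_size):
        # the noise is created directly on whatever device Lightning put the model on
        z = torch.randn(batch_size, 16, device=self.device)
        return self.decoder(z)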

Use DistributedDataParallel not DataParallel

PyTorch has two main approaches to training on multiple GPUs. The first, DataParallel (DP), splits a batch across multiple GPUs. But this also means the model has to be copied to each GPU on every batch, and once the gradients are reduced onto GPU 0 and the weights are updated there, the new weights have to be broadcast back out to the other GPUs.

That’s a lot of expensive GPU transfers! Instead, DistributedDataParallel (DDP) creates a siloed copy of the model on each GPU (in its own process) and makes only a portion of the data available to that GPU. Then it’s like having N independent models training, except that once each one calculates its gradients, they all sync gradients across models… which means we only transfer data across GPUs once per batch.
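
For reference, here’s a rough sketch of what DDP looks like in plain PyTorch — the model, data, and hyperparameters are placeholders, and it assumes you launch one process per GPU (for example with torchrun, which sets LOCAL_RANK and the other rendezvous environment variables for you):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend='nccl')              # one process per GPU
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])          # gradients sync automatically in backward()

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)                # each process sees its own shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=4, pin_memory=True)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle the shards each epoch
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()              # the gradient all-reduce happens here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()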

In Lightning, you can trivially switch between the two:

Trainer(distributed_backend='ddp', gpus=8)
Trainer(distributed_backend='dp', gpus=8)

Note that both PyTorch and Lightning discourage the use of DP.

Use 16-bit precision

This is another way to speed up training that we don’t see many people using. In 16-bit training, parts of your model and your data go from 32-bit numbers to 16-bit numbers. This has a few advantages:

  1. You use half the memory (which means you can double the batch size, which can roughly cut your training time in half).
  2. Certain GPUs (V100, 2080Ti) give you automatic speed-ups (3x-8x faster) because they are optimized for 16-bit computations.

In Lightning this is trivial to enable:

Trainer(precision=16)

Note: before PyTorch 1.6 you ALSO had to install NVIDIA Apex… now 16-bit is native to PyTorch (torch.cuda.amp). But if you’re using Lightning, it supports both and automatically switches depending on the detected PyTorch version.
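
Outside of Lightning, the native PyTorch (1.6+) version looks roughly like this — the model and data below are toy placeholders:

import torch

device = torch.device('cuda:0')
model = torch.nn.Linear(32, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for _ in range(100):
    x = torch.randn(64, 32, device=device)
    y = torch.randint(0, 2, (64,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                      # forward pass runs in 16-bit where it's safe
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()                        # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()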

Profile your code

This last tip may be hard to do without Lightning, but you can use tools like Python’s built-in cProfile to do it (there’s a bare-bones sketch at the end of this section). In Lightning, however, you can get a summary of all the calls made during training in two ways.

First, the built-in basic profiler

Trainer(profiler=True)

which prints a report of the mean and total time spent in each part of the training loop.

or the advanced profiler:

profiler = AdvancedProfiler()
trainer = Trainer(profiler=profiler)

which produces a much more granular, per-function report (it uses cProfile under the hood).

The full documentation for the Lightning profiler can be found here.
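
And if you’re not using Lightning at all, the bare-bones cProfile sketch mentioned above might look like this (train_one_epoch is a placeholder for your own training loop):

import cProfile
import pstats

cProfile.run('train_one_epoch()', 'train.prof')     # profile one pass through your loop
stats = pstats.Stats('train.prof').sort_stats('cumulative')
stats.print_stats(20)                               # show the 20 most expensive calls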

Adopting Lightning in your code

PyTorch Lightning is nothing more than structured PyTorch.

If you’re ready to have most of these tips automated for you (and well tested), then check out this video on refactoring your PyTorch code into the Lightning format!