July 28, 2020

Distributed Deep Learning with Ansible, AWS and Pytorch Lightning. Part 1

Alexander Reshytko
Photo by Museums Victoria on Unsplash


How to automate and scale your deep learning experiments with Ansible, AWS cloud infrastructure and Pytorch Lightning library.

Let's say you are a deep learning practitioner, but you don't have an in-house GPU cluster or a machine learning platform at your disposal. Hardly anyone has trained models on a CPU for almost a decade now. Even worse, with models and datasets getting bigger, you have to deal with distributed deep learning and scale your training in a model-parallel and/or data-parallel regime. What can we do about it?

We can follow the modern cloud paradigm and use GPU-as-a-service: allocate the necessary infrastructure dynamically on demand and release it once we have finished. It works well, but this is also where the main complexity lies. Modern deep learning frameworks like PyTorch Lightning or Horovod make data-parallel distributed training easy nowadays; the most annoying and time-consuming part is setting up a proper environment, because we often have to do it manually. Even with services that hide a lot of infrastructure complexity from you, such as Google Colab or Paperspace, some manual work still needs to be done.

I'm a strong believer that repetitive routine work is your enemy. Why? Here is my list of personal concerns:

  1. Reproducibility of results. Have you ever heard of the so-called human factor? We are very error-prone creatures, and we are not good at memorizing things in great detail. The more manual work a process involves, the harder it will be to reproduce in the future.
  2. Mental distractions. Deep learning is an empirical endeavor, and your progress in it relies heavily on your ability to iterate quickly and test as many hypotheses as you can. Because of that, anything that distracts you from your main task (training and evaluating your models or analyzing the data) negatively affects the success of the overall process.
  3. Effectiveness. Computers do many things a lot faster than we humans do. When you have to repeat the same slow procedure over and over, it all adds up.

Routine is your enemy

In this article, I’ll describe how you can automate the way you conduct your deep learning experiments.

Automate your Deep Learning experiments

The following are three main ideas of this article:

  1. Utilize cloud-based infrastructure to dynamically allocate resources for your training purposes;
  2. Use DevOps automation toolset to manage all manual work on the experiment environment setup;
  3. Write your training procedure in a modern deep learning framework that makes it capable of data-parallel distributed learning effortlessly.
AWS EC2, Ansible and Pytorch Lightning. Image by author.

To actually implement these ideas we will utilize AWS cloud infrastructure, Ansible automation tool, and PyTorch Lightning deep learning library.

Our work will be divided into two parts. In this article we will provide a minimal working example which:

  1. creates a cluster of GPU instances in AWS;
  2. provisions the training environment on it and submits a distributed training job;
  3. destroys the cluster once we have finished.

In the next article, we will add additional features and build a fully automated environment for distributed learning experiments.

Now, let’s take a brief overview of the chosen technology stack.

What is AWS EC2?


AWS Elastic Compute Cloud (EC2) is a core AWS service that lets you manage virtual machines in Amazon data centers. With this service you can dynamically create and destroy machines, either manually via the AWS Console or programmatically via the API provided by the AWS SDK.
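As a quick illustration (later we will let Ansible drive this API for us), launching and terminating an instance with the boto3 SDK takes only a few calls; the region, AMI ID, and key pair name below are placeholders:

import boto3

# Assumes AWS credentials are already configured in the environment
ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch a single GPU instance (placeholder AMI and key pair)
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="p2.xlarge",
    KeyName="my-key-pair",
    MinCount=1,
    MaxCount=1,
)

# ...run your workload, then release the resources
instances[0].terminate()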

As of today, AWS provides a range of GPU-enabled instances for our purposes, with one or multiple GPUs per instance and different NVIDIA GPUs to choose from: GRID K520, Tesla M60, K80, T4, and V100. See the official site for the full list.

What is Ansible?

Source: https://github.com/gilbarbara/logos/tree/master/logos.

Ansible is a tool for software and infrastructure provisioning and configuration management. With Ansible you can remotely provision a whole cluster of servers, deploy software on them, and monitor them.

It is an open-source project written in Python. It uses a declarative approach: you define a desired system state in ordinary YAML files, and Ansible executes the necessary actions. The declarative nature of Ansible also means that most of the instructions you define are idempotent: running them more than once will not cause any undesirable side effects.

One of the distinctive features of Ansible is that it is agentless, i.e. it doesn't require any agent software to be installed on the managed nodes. It operates solely over the SSH protocol, so the only thing you need to ensure is SSH connectivity between the control host on which you run Ansible commands and the inventory hosts you want to manage.
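For example, once SSH connectivity is in place, you can check that Ansible can reach every host listed in an inventory file with the built-in ping module (the inventory file name is a placeholder):

ansible all -i inventory.ini -m ping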

Ansible core concepts

Let’s dive a bit into the core concepts of Ansible. There are not many of those, so you can quickly get your head around them and start playing with this brilliant tool.

Inventory is simply the list of hosts you want to manage with Ansible, organized into named groups. You can define an inventory in an INI-formatted file if you have a static, predefined infrastructure. Another way is to use inventory plugins that tell Ansible which hosts to operate on when your infrastructure is not known in advance or may change dynamically (as in our case here).
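A static INI inventory with a master group and a workers group is just a few lines (the host names are placeholders):

[master]
master.example.com

[workers]
worker1.example.com
worker2.example.com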

A module is the unit of work that you can perform in Ansible. There is a massive library of modules to choose from, and the architecture is extremely extensible. See the module index.

Variables are nothing fancy: as in any programming language, you define them either to separate your logic from the data or to pass information between parts of your system. Ansible also collects a lot of system information and stores it in predefined variables called facts. You can read more about variables in the official documentation.

A task is a module invocation with some parameters. You can also define a name, a variable to store the result, and conditional and loop expressions for the task. Here is an example of a task that copies a local file into a remote computer's file system when the some_variable variable is defined:

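A minimal version of such a task, with placeholder paths, looks like this:

- name: Copy a local file to the remote host
  copy:
    src: /local/path/to/file
    dest: /remote/path/to/file
  when: some_variable is defined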

A play in Ansible is a way to apply a list of tasks to a group of hosts from the inventory. You define a play as a dictionary in YAML: the hosts parameter specifies an inventory group, and the tasks parameter contains the list of tasks.

A playbook is just a YAML file that contains a list of plays to run. The way to run a playbook is to pass it to the ansible-playbook CLI that comes with the Ansible installation.
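Putting the pieces together, a minimal playbook with a single play that applies the copy task above to the workers group could look like this (the group and file names are illustrative):

- name: Example play
  hosts: workers
  tasks:
    - name: Copy a local file to the remote host
      copy:
        src: /local/path/to/file
        dest: /remote/path/to/file
      when: some_variable is defined

You would then run it with ansible-playbook example-playbook.yml.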

Here’s a diagram to illustrate how these concepts interplay with each other:

Ansible core concepts.

There are also more advanced concepts in Ansible that allow you to write more modular code for complex scenarios. We’ll use some of them in Part 2 of the article.

What is Pytorch Lightning?

Source: Wikipedia

Pytorch Lightning is a high-level library on top of PyTorch. You can think of it as a Keras for PyTorch. There are a couple of features that make it stand out from the crowd of other PyTorch-based deep learning libraries:

  1. LightningModule is a class that organizes your PyTorch code. The way you use PyTorch Lightning is by creating a custom class that is inherited from LightningModule and implementing its virtual methods. LightningModule itself is inherited from PyTorch Module.
  2. Trainer automates your training procedure. Once you’ve organized your PyTorch code into a LightningModule, you pass its instance to a Trainer and it does the actual heavy lifting of training.
  3. Callbacks, Loggers and Hooks are the means to customize the Trainer’s behavior.
PyTorch Lightning Architecture

For more information read the official documentation.

Okay, enough talking, let’s start building.


Building the experimentation toolset

In the rest of the article I’ll walk you through a step-by-step process of building our experimentation environment. Here is a link to a GitHub repo if you are interested in the final result. Also, look at part 2 where we’ll add additional features to our toolset.

Setup AWS Account and Ansible

Let’s install Ansible and configure it to work with AWS.

Setup AWS account and configure AWS SDK

If you don’t have an AWS account the first thing you need to do is to set up one. To do that go to Create New Account link from the official documentation and follow the instructions.

Next, let's install the AWS SDK to get the API access to AWS that Ansible requires:

pip install boto

We need credentials for the AWS API. To obtain them, log in to your AWS console and follow the instructions. Choose the programmatic access option and attach the AdministratorAccess policy to give your API user administrative access. This is not a good practice in general, so you should switch to more restrictive privileges later.

Put your newly created user credentials into your .bashrc file:

echo "export AWS_ACCESS_KEY_ID=<your key id>" >> ~/.bashrc
echo "export AWS_SECRET_ACCESS_KEY=<your secret key>" >> ~/.bashrc
source ~/.bashrc

Setup SSH keys and get default VPC ID

We'll be using the default SSH key pair (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub) to connect to the EC2 instances. If you don't already have one on your system, generate it with the ssh-keygen tool. Once generated, register it in the AWS EC2 service: you can do it under the Key Pairs menu in the EC2 section of the AWS console. Please note that keys are region-specific, so you need to register them in the same region you plan to create your EC2 instances in.
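If you need to generate the key pair, the standard tool does the job:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa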

Next, we need to copy the ID of the default VPC. A VPC (Virtual Private Cloud) is a virtual network in the AWS cloud to which you connect your EC2 instances and other AWS services. We'll be using the default VPC for our experiments. Go to the VPC service in the AWS console and open the list of all VPCs in your region. Find the one whose Default VPC column is set to Yes.

Finally, create a config.yaml file and write the registered SSH pair name and VPC ID to it:

aws_ssh_key: <your registered ssh key pair name here>
vpc_id: <ID of your default VPC>

We will import this file later in our Ansible playbooks.

Setup Ansible

Ansible is written in Python and so can be easily installed with a single command:

pip install ansible==2.9.9

You can install the latest version or the one I was using while writing this article.

Disable Ansible host key checking

Since we won't have a predefined static infrastructure, it's more convenient to disable Ansible's host key checking. You can do it globally for all users in /etc/ansible/ansible.cfg, globally for the current user in ~/.ansible.cfg, or locally for the given project in ./ansible.cfg. Either way, create a file with the following contents:
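The file only needs to switch host key checking off:

[defaults]
host_key_checking = False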

Configure Ansible AWS EC2 dynamic inventory plugin

Remember Ansible inventory plugins? Because we create our EC2 instances dynamically and don't assign any predefined DNS names to them, we don't know their addresses in advance. The AWS EC2 inventory plugin will help us here and provide them to our playbooks. We need to create the following configuration file for it:
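The plugin expects a file whose name ends in aws_ec2.yaml; a minimal version, assuming a single region and an illustrative tag value, looks like this:

# aws_ec2.yaml
plugin: aws_ec2
regions:
  - us-east-1
filters:
  tag:managed_by: ansible-dl-cluster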

Here you define where and how the plugin will look for instances. The regions field contains the list of AWS regions, and the filters field defines the metadata attributes by which to filter the instances of interest. We use a managed_by tag to identify ours; later we will assign this tag to the instances that we create with our toolset.

Install additional dependencies for submit script

Our submit script will require a couple of additional Python packages installed on our local workstation:

pip install Click==7.0 fabric==2.5.0 patchwork==1.0.1

Overall description of our solution

We will use a handful of Ansible modules and plugins to get the job done, most notably the EC2 modules for managing the instances and security groups, and the aws_ec2 inventory plugin configured above.

To submit our training scripts to the cluster we’ll use the fabric python package.

The code will be divided into the following files: the infrastructure playbook (setup-play.yml), the environment provisioning playbook, the clean up playbook (cleanup-play.yml), the dynamic inventory configuration (aws_ec2.yaml), the experiment configuration (config.yaml) together with requirements.txt, and the submit script (submit.py).

Setup infrastructure playbook

This playbook contains two plays. The first one is executed on the control node (i.e. on our local workstation), and its job is to create the EC2 infrastructure for our cluster.

The second play contains a single task and its goal is to define environment variables necessary for PyTorch Lightning.

And, finally, we import the environment playbook.
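A stripped-down sketch of the first play, using the ec2_group and ec2 modules from Ansible 2.9, gives the idea; the security group name, tag value, and the extra config.yaml fields (instance type, count, AMI) are illustrative and may differ from the repository version:

- name: Create EC2 infrastructure
  hosts: localhost
  vars_files:
    - config.yaml
  tasks:
    - name: Create a security group that allows SSH access
      ec2_group:
        name: dl-cluster-sg
        description: Deep learning cluster security group
        vpc_id: "{{ vpc_id }}"
        rules:
          - proto: tcp
            from_port: 22
            to_port: 22
            cidr_ip: 0.0.0.0/0
    - name: Launch GPU instances
      ec2:
        key_name: "{{ aws_ssh_key }}"
        instance_type: "{{ instance_type }}"
        image: "{{ ami_id }}"
        group: dl-cluster-sg
        count: "{{ instance_count }}"
        wait: yes
        instance_tags:
          managed_by: ansible-dl-cluster
      register: created_instances

The second play would then export the variables that PyTorch Lightning's ddp backend reads on every node (typically MASTER_ADDR, MASTER_PORT and NODE_RANK), and the playbook ends with an import_playbook of the environment playbook.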

Provision environment playbook

With this playbook, we deploy the environment (and any changes to it) to the instances. It is relatively simple and contains only three steps.
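In spirit it does something like the following; the aws_ec2 host group is the one the dynamic inventory plugin creates by default, and the remote paths are illustrative:

- name: Provision the Python environment
  hosts: aws_ec2
  tasks:
    - name: Copy the experiment requirements to every node
      copy:
        src: requirements.txt
        dest: /home/ubuntu/requirements.txt
    - name: Install the requirements with pip
      pip:
        requirements: /home/ubuntu/requirements.txt
        executable: pip3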

Clean up playbook

This playbook allows you to terminate your EC2 instances and delete the SSH config file.
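A sketch of such a playbook, using the ec2_instance module and the same managed_by tag (illustrative value) to find the instances:

- name: Tear down the cluster
  hosts: localhost
  tasks:
    - name: Terminate all instances created by this toolset
      ec2_instance:
        state: absent
        filters:
          tag:managed_by: ansible-dl-cluster
    - name: Remove the generated SSH config file
      file:
        path: ssh_config
        state: absent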

And finally, let's look at the configuration file that you should customize for each experiment’s needs.

Configuration file

We have already created this file in the preparation section. The final version adds a few more parameters, such as the kind and number of instances to create.
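A plausible version, with illustrative names for the added fields, looks like this:

aws_ssh_key: <your registered ssh key pair name here>
vpc_id: <ID of your default VPC>

# Illustrative names for the additional parameters:
instance_type: p2.xlarge      # the kind of GPU instance to launch
instance_count: 2             # how many nodes to create
ami_id: <Deep Learning AMI ID for your region>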

Okay, so what do we have at the moment?

We can specify the kind and number of instances we want to create in our configuration file, list the libraries we want installed on them in the requirements.txt file, and then run a single command:

ansible-playbook setup-play.yml

After a couple of minutes, a ready-to-use cluster will be at our disposal. We can SSH to its instances via:

ssh -F ssh_config worker[x]

When we are done, we can destroy it with:

ansible-playbook -i aws_ec2.yaml cleanup-play.yml

Now let’s streamline code deployment and actual running of the training procedure on our new shiny deep learning cluster.

Deploy a training procedure to the cluster

First, let’s create an example model and a training procedure.

Training procedure example

To make things simple and concentrate on the topic of the article I’ve picked the deep learning “Hello, world!” example. Let’s take a simple 3-layer fully-connected network and train it on the MNIST dataset. The code is pretty self-explanatory and consists of a simple lightning model and the main procedure that fits this model with a Trainer.
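A self-contained sketch along these lines, assuming recent PyTorch Lightning and Hydra APIs (argument names have shifted between releases, so the repository version may differ in details), could look like this:

import hydra
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class MNISTClassifier(pl.LightningModule):
    def __init__(self, hidden_size=128, lr=1e-3):
        super().__init__()
        # A simple 3-layer fully-connected network
        self.net = torch.nn.Sequential(
            torch.nn.Flatten(),
            torch.nn.Linear(28 * 28, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, 10),
        )
        self.lr = lr

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

    def train_dataloader(self):
        dataset = datasets.MNIST(
            ".", train=True, download=True, transform=transforms.ToTensor()
        )
        return DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)


@hydra.main(config_path=".", config_name="train_config")
def main(cfg):
    model = MNISTClassifier()
    # All distributed-training knobs come from the hydra config
    trainer = pl.Trainer(
        max_epochs=cfg.max_epochs,
        gpus=cfg.gpus,
        num_nodes=cfg.num_nodes,
        distributed_backend=cfg.distributed_backend,
    )
    trainer.fit(model)


if __name__ == "__main__":
    main()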

It uses the awesome hydra library for parameterization. Here’s a YAML file with parameters:
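The parameter names below match the command-line overrides used later in the article; the file name (train_config.yaml) and the default values are assumptions:

# train_config.yaml -- defaults run single-node on a machine without a GPU
max_epochs: 3
gpus: 0
num_nodes: 1
distributed_backend: null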

The default parameters in the file allow us to run the script on a local laptop without any GPU in a single-node mode. Let’s run the script locally and make sure it works:

python ddp_train_example.py max_epochs=5

Now, as I said earlier, the great feature of PyTorch Lightning is that you literally don't have to change anything in the code to run it in a data-parallel distributed mode on the cluster. The only things we need to change are a couple of parameters passed to the Trainer instance, which we defined in our hydra configuration. To run our script on two nodes with one GPU each, we invoke it the following way:

python ddp_train_example.py gpus=1 num_nodes=2 \
  distributed_backend=ddp

The only thing left to be done is to implement a reusable CLI to submit our training script to the cluster.

Submit script

The submit CLI will take any Python script with its arguments as parameters. It will sync all files within the current working directory and run the given script on all cluster nodes. With fabric Python library we can do it with a few lines of code:
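A sketch in the same spirit, built on fabric, patchwork, and click (the host names, remote path, and two-node layout below are assumptions), could look like this:

#!/usr/bin/env python
import click
from fabric import Config, Connection
from patchwork.transfers import rsync

# Resolve host names through the ssh_config generated by the setup playbook
SSH_CONFIG = Config(runtime_ssh_path="ssh_config")


def run(host, script, args, asynchronous=False):
    """Sync the working directory to a node and launch the training script."""
    conn = Connection(host, config=SSH_CONFIG)
    rsync(conn, source=".", target="~/experiment", exclude=(".git",))
    command = "cd ~/experiment && python {} {}".format(script, " ".join(args))
    return conn.run(command, asynchronous=asynchronous)


@click.command(context_settings=dict(ignore_unknown_options=True))
@click.argument("script")
@click.argument("args", nargs=-1, type=click.UNPROCESSED)
def main(script, args):
    # Illustrative two-node cluster: worker1 acts as the master node
    workers, master = ["worker2"], "worker1"
    promises = [run(h, script, args, asynchronous=True) for h in workers]
    run(master, script, args)  # stdout of the master node is streamed locally
    for p in promises:
        p.join()


if __name__ == "__main__":
    main()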

The actual submit logic resides inside the run function. The main function invokes it on all cluster nodes: asynchronously on worker nodes and synchronously on a master node. All standard output from the script running on the master node is automatically printed to stdout on your workstation.

Finally, let’s submit our training script to the cluster:

./submit.py -- ddp_train_example.py \
  gpus=<number of GPUs per instance> \
  num_nodes=<number of nodes in our cluster> \
  distributed_backend=ddp

And that's it. Our model is trained just like it would be on a local machine, but now utilizing PyTorch and PyTorch Lightning distributed learning capabilities.

Conclusion

So what do we have in the end? With just three commands you can dynamically create a deep learning cluster in AWS, submit training jobs to it, and delete it once you have finished with your experiments:

# create our deep learning cluster
ansible-playbook setup-play.yml

# submit a training job to it
./submit.py -- ddp_train_example.py \
  gpus=<number of GPUs per instance> \
  num_nodes=<number of nodes in our cluster> \
  distributed_backend=ddp

# terminate the cluster
ansible-playbook -i aws_ec2.yaml cleanup-play.yml

You can make this functionality reusable or just copy it into all of your experiment directories.

Now you can take your experimentation to the next level, be much more agile, and no longer be afraid of the scary beasts of distributed deep learning.

But that's not the end of the story. In part 2, we will add more features to our toolset: the ability to do interactive work on the cluster with Jupyter Notebook, to monitor the training process with TensorBoard, and to store the experiments' results in persistent storage.

Stay tuned!