October 10, 2019

Scaling Up and Down: Why we moved to PyTorch Lightning

Jeff Ling

Engineering Modular Neural Networks

co-authored with Sarah Jane Hong and Darryl Barnhart

Creating and training a model at scale has its challenges. Starting with a single model is easy enough. But as you scale your model up, you’ll need features like gradient accumulation, mixed precision, and hyperparameter scheduling. Then, as you scale out, you need to worry about all the details of setting up a proper distributed training environment.

With all that, you still want to make sure you maintain a fast feedback loop as you try out new ideas, so you’re constantly scaling up and down. Maintaining all those features takes a lot of work, and it’s easy for modern deep learning code-bases to grow overly complex.

At Latent Space, we’re trying to push the state of the art in generative modelling while improving on metrics as wide ranging as disentanglement, model distillation, and temporal consistency. To help us iterate quickly, we created a framework internally called Lab to dynamically build models and track experiments.

The purpose of Lab is to make it incredibly easy to test different model configurations and architectures quickly, handling everything from model construction to experiment tracking.

Using Lab, we’ve been able to create and continually iterate on complex models and datasets, but we started running into some issues.

Why Lightning

Lab was very home-grown; we built it as we needed each feature. We didn’t always have time to focus on our framework’s structure and keep it up to date as we upgraded.

We started spending the majority of our time making sure our framework did not break as we tried to develop our model architecture further. As we tried to push the envelope on our model, we wanted to spend our time on algorithmic correctness rather than mechanical correctness.

Lightning is a framework aimed at reducing boilerplate and enabling researchers to focus on their research code. It covers difficult-to-maintain features that become important when training at scale, like distributed training, mixed precision, and gradient accumulation.
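As a rough sketch of what this looks like in practice (flag names have changed across Lightning versions, so treat these as illustrative rather than exact), those features become arguments to the Trainer instead of hand-rolled training-loop code:

```python
import pytorch_lightning as pl

# Illustrative only: exact flag names vary between Lightning versions.
trainer = pl.Trainer(
    gpus=8,                      # scale out across devices
    distributed_backend="ddp",   # distributed data-parallel training
    precision=16,                # mixed-precision training
    accumulate_grad_batches=4,   # gradient accumulation
    max_epochs=100,
)
# trainer.fit(model)  # where `model` is any LightningModule
```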

Lab already supported all of those features, but the attraction of getting them for free while dropping our maintenance cost started to outweigh the migration cost.

We were also impressed by the development of Lightning — the design principles were clearly laid out, the creator, William Falcon, was very active and responsive to questions, and the features were well-designed and well-tested.

The Migration

For us, moving to Lightning meant moving over all our existing features and models. Lightning modules inherit from PyTorch modules and include a set of methods and hooks meant to be overridden. Luckily, Lightning’s design is fairly agnostic: once we replaced our entry-point with its top-level entry-point, we got most of the advantages right away.
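To give a sense of the shape of the API, here is a minimal toy module (not our actual model) showing the hooks Lightning expects you to override and the Trainer acting as the single entry-point:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class MinimalModel(pl.LightningModule):
    """A toy LightningModule: overridden hooks replace a hand-written training loop."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# A dummy dataset so the example runs end to end.
data = TensorDataset(torch.randn(64, 784), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16)

# The Trainer is the top-level entry-point.
trainer = pl.Trainer(max_epochs=1)
trainer.fit(MinimalModel(), loader)
```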

The main work involved was moving our configuration and model construction system into Lightning. The Lab model construction system allows us to build our models using ‘blocks’. As an example, one block might contain the logic for self-attention, and another might handle the logic for modulation.
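To illustrate the idea with a hypothetical sketch (the names below are invented for this post, not Lab’s real API), a block registry plus a config is enough to assemble a model declaratively:

```python
from torch import nn

# Hypothetical block registry; the real Lab builder differs.
BLOCKS = {
    "self_attention": lambda cfg: nn.MultiheadAttention(cfg["dim"], cfg["heads"]),
    "mlp": lambda cfg: nn.Sequential(
        nn.Linear(cfg["dim"], cfg["dim"] * 4),
        nn.GELU(),
        nn.Linear(cfg["dim"] * 4, cfg["dim"]),
    ),
}


def build_model(config):
    """Assemble a model from a list of block specs in a config."""
    return nn.ModuleList([BLOCKS[spec["type"]](spec) for spec in config["blocks"]])


config = {
    "blocks": [
        {"type": "self_attention", "dim": 256, "heads": 8},
        {"type": "mlp", "dim": 256},
    ]
}
model_blocks = build_model(config)
```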

Integrating the configuration and model construction system into Lightning took a lot of care and planning, as we wanted to maintain our existing feature-set and reduce disruption to our team. The bulk of the work was frictionless due to the simplicity of the Lightning API. We got away with writing a few top-level adapters to decouple our code from assumptions about how our dataloaders and optimizers worked.
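As an example of what such an adapter might look like (again a hypothetical sketch, not our production code), a thin LightningModule can delegate to existing dataloader and optimizer factories so the rest of the code stays untouched:

```python
import pytorch_lightning as pl


class LabAdapter(pl.LightningModule):
    """Hypothetical adapter bridging Lab-style factories to Lightning hooks."""

    def __init__(self, lab_model, dataloader_factory, optimizer_factory):
        super().__init__()
        self.model = lab_model
        self.dataloader_factory = dataloader_factory
        self.optimizer_factory = optimizer_factory

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        # Assumes the wrapped model exposes its own loss computation.
        return self.model.compute_loss(batch)

    def train_dataloader(self):
        # Lightning asks the module for its dataloader; we delegate to our factory.
        return self.dataloader_factory()

    def configure_optimizers(self):
        # Same idea for optimizers: delegate rather than rewrite.
        return self.optimizer_factory(self.model.parameters())
```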

We were surprised by how easy it was to get running with the framework given the sheer size of our code-base. With a smaller or medium-sized research code-base, I would expect the move to be even easier.

Reaping the Benefits

Lab is now built on top of Lightning. We were able to remove historic, home-grown code and replace it with Lightning’s well-tested feature-set. We extracted the best parts of our old framework (like our composable model builder) into libraries and continue to use them with our new hybrid framework.

There are still a few things that the pre-migration Lab does that we need to keep at the framework level, but in the future we hope to merge some of the more useful features upstream and contribute them back to Lightning.

Looking at our post-migration roadmap, we hope to spend a lot less time dealing with how the model runs and more time on how the model learns.

We will be writing a lot more about best practices in engineering deep learning models, so stay tuned for more!