October 19, 2020

PyTorch Lightning 1.0: From 0–600k

PyTorch Lightning team

Lightning reveals the final API, a new website, and a sneak peek into our new native platform for training models at scale on the cloud.

We've been hard at work over the last couple of months fine-tuning our API, polishing our docs, and recording tutorials, and it's finally time to share V1.0.0 of PyTorch Lightning with you all. Want the lightning answer to scaling models on the cloud? Keep reading.

The Lightning DNA

AI research has evolved much faster than any single framework can keep up with. The field of deep learning is constantly evolving, mostly in complexity and scale. Lightning provides a user experience designed for the world of complex model interactions while abstracting away the distracting engineering details such as multi-GPU and multi-TPU training, early stopping, and logging.

Frameworks like PyTorch were designed for a time when AI research was mostly about network architectures: an nn.Module that defines a sequence of operations.

[Figure: VGG16]

And these frameworks do an incredible job of providing all the pieces needed to put together extremely complex models for research or production. But as soon as models start interacting with each other, as in a GAN, BERT, or an autoencoder, that paradigm breaks down, and the immense flexibility soon turns into boilerplate that is hard to maintain as a project scales.

Unlike the frameworks that came before it, PyTorch Lightning was designed to encapsulate a collection of interacting models, what we call deep learning systems. Lightning is built for the more complicated research and production cases of today's world, where many models interact with each other using complicated rules.

[Figure: AutoEncoder system]
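
To make this concrete, here is a minimal sketch of such a system: a single LightningModule that composes two interacting models, an encoder and a decoder. The class name, architecture, and dimensions are illustrative (assuming flattened 28x28 inputs), not a prescribed recipe.

import torch
import torch.nn.functional as F
from torch import nn
import pytorch_lightning as pl

class LitAutoEncoder(pl.LightningModule):
    # a "system": two models (encoder + decoder) trained together
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def forward(self, x):
        # forward defines the prediction/inference behavior
        return self.encoder(x)

    def training_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)          # the two models interact here
        x_hat = self.decoder(z)
        return F.mse_loss(x_hat, x)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)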

The second key principle of PyTorch Lightning is that hardware and the “science” code must be separated. Lightning evolved to harness massive compute at scale without surfacing any of those abstractions to the user. With this separation you gain abilities that were not possible before, such as debugging your 512-GPU job on your laptop using CPUs without changing your code.
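
As a rough illustration (the model and dataloader names are placeholders, and exact flag names can differ by Lightning version), the science code stays identical and only the Trainer flags change:

from pytorch_lightning import Trainer

model = LitAutoEncoder()  # the system sketched above; no hardware code inside it

# debug locally: fast_dev_run pushes a single batch through train/val to smoke-test the code
trainer = Trainer(fast_dev_run=True)

# the same code at cluster scale, changed only through Trainer flags, e.g.:
# trainer = Trainer(gpus=8, num_nodes=64)

# trainer.fit(model, train_dataloader)  # train_dataloader: any regular PyTorch DataLoader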

Lastly, Lightning was created with the vision of becoming a community-driven framework.

Building good deep learning models requires a ton of expertise and the small tricks that make the system work. Across the world, hundreds of incredible engineers and PhDs implement the same code over and over again. Lightning now has a growing community of 300+ contributors, some of the most talented deep learning people around, who choose to spend that same energy on exactly the same optimizations, but with thousands of people benefiting from their efforts.

What’s new in 1.0.0

Lightning 1.0.0 signals a stable and final API.

This means that the major research projects that depend on Lightning can rest easy knowing that their code will not break or change going forward.

Research + Production

Lightning's core strength is enabling state-of-the-art AI research to happen at scale. It's a framework designed for professional researchers to try the hardest ideas on the largest compute resources without losing any flexibility.

We're excited to announce that Lightning 1.0.0 also makes it trivial to deploy these models at scale. Lightning makes sure everything can be easily exported to ONNX and TorchScript.
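
For instance, here is a hedged sketch using the LightningModule export helpers, reusing the hypothetical LitAutoEncoder from above (availability of these helpers may vary slightly by Lightning version):

import torch

model = LitAutoEncoder()  # in practice, a trained model, e.g. loaded from a checkpoint

# TorchScript export
script = model.to_torchscript()
torch.jit.save(script, "autoencoder.pt")

# ONNX export needs an example input to trace the graph
model.to_onnx("autoencoder.onnx", input_sample=torch.randn(1, 28 * 28))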

This means that your team of data scientists and researchers can now be the people who also put models into production. They don't need large teams of machine learning engineers.

This is one major reason why leading companies use Lightning: it helps them cut the time to production dramatically without losing any of the flexibility needed for research.

And this is precisely what our corporate offering does: Grid AI is our native platform for training models at scale on the cloud. Grid allows anyone building deep learning models to iterate on massive compute and then instantly deploy these models into a scalable environment capable of handling the largest traffic you could throw at a deep learning system.

Sign up for early access here.

[Figure: grid train]

Website

You'll also notice that we've centralized all the blog posts, lightning-speed video tutorials, community projects, and other resources under our brand-new homepage to showcase all things Lightning!

Metrics

pytorch_lightning.metrics is a Metrics API created for easy metric development and usage in PyTorch and PyTorch Lightning. The updated API provides a built-in method to compute the metric across multiple GPUs (processes) at each step, while at the same time storing the statistics that let you compute the metric at the end of an epoch, without having to worry about any of the complexities associated with the distributed backend.

It is rigorously tested for all edge cases and includes a growing list of common metric implementations, such as Accuracy, Precision, Recall, Fbeta, MeanSquaredError, and more.

To implement a custom metric, simply subclass the base Metric class and implement the __init__(), update(), and compute() methods. All you need to do for the metric to work with DDP is register its state correctly with add_state(); reset() is then called on every state variable added with add_state().
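
For example, here is a rough sketch of a custom accuracy metric (the class name is hypothetical; note that in newer releases this API lives in the separate torchmetrics package):

import torch
from pytorch_lightning.metrics import Metric

class MyAccuracy(Metric):
    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)
        # state registered via add_state() is synced/reduced across processes and reset for you
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds, target):
        preds = torch.argmax(preds, dim=1)
        self.correct += torch.sum(preds == target)
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total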

Manual vs automatic optimization

With Lightning, you don't need to worry about when to enable/disable gradients, run a backward pass, or update optimizers: as long as you return a loss with an attached graph from training_step, Lightning will automate the optimization.

However, for certain research such as GANs, reinforcement learning, or anything with multiple optimizers or an inner loop, you can turn off automatic optimization and fully control the training loop yourself.

First, turn off automatic optimization:

trainer = Trainer(automatic_optimization=False)

Now you own the train loop!
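
Here is a hedged sketch of what a manual training_step could look like with a single optimizer; for a GAN you would typically return two optimizers from configure_optimizers and step them in turn. The exact manual-optimization hooks (self.optimizers(), self.manual_backward()) have shifted slightly across Lightning versions, so treat this as an illustration rather than a copy-paste recipe.

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning import Trainer

class LitManual(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.model(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()             # the optimizer(s) from configure_optimizers()
        loss = F.cross_entropy(self(x), y)
        self.manual_backward(loss)          # use instead of loss.backward() so AMP etc. keep working
        opt.step()
        opt.zero_grad()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

trainer = Trainer(automatic_optimization=False)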

Logging

Lightning makes integration with loggers super simple: just call the log() method anywhere in your LightningModule, and it will send the logged quantity to your logger of choice. We use TensorBoard by default, but you can choose any supported logger you wish.

Depending on where .log() is called from, Lightning auto-determines when the logging should take place (on every step or every epoch), but of course you can override the default behavior by setting the on_step and on_epoch options manually. Setting on_epoch=True will accumulate your logged values over the full training epoch.
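
As a minimal sketch (LitClassifier and its dimensions are hypothetical), logging from inside training_step looks like this:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # inside training_step, Lightning logs per step by default;
        # on_epoch=True additionally accumulates the value over the whole epoch
        self.log("train_loss", loss, on_step=True, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)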

Data flow

We deprecated EvalResult and TrainResult in favor of simplifying data flow and decoupling logging from data in training and validation loops.

Each loop (training, validation, test) has three hooks you can implement: x_step, x_step_end, and x_epoch_end.

To illustrate how data flows, we'll use the training loop (i.e., x = training):

outs = []
for batch in data:
    out = training_step(batch)
    outs.append(out)

training_epoch_end(outs)

Anything you return in training_step can be used as input to training_epoch_end.

The same goes for the validation and test steps: anything returned in validation_step or test_step can be used as input to {validation/test}_step_end or {validation/test}_epoch_end. If you use the DP or DDP2 distributed modes (i.e., a batch is split across GPUs), use x_step_end to aggregate manually (or don't implement it and let Lightning auto-aggregate for you).
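
For instance, building on the hypothetical LitClassifier above, you could return a value per batch and reduce it yourself at the end of the epoch (the exact contents of outs can vary slightly between Lightning versions):

import torch

class LitClassifierWithEpochEnd(LitClassifier):
    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # whatever is returned here becomes one element of `outs` below
        return {"loss": loss}

    def training_epoch_end(self, outs):
        # `outs` is the list of per-batch dicts returned by training_step this epoch
        avg_loss = torch.stack([o["loss"] for o in outs]).mean()
        self.log("avg_train_loss", avg_loss)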

Checkpointing

Lightning now automatically saves a checkpoint for you in your current working directory, with the state of your last training epoch. This makes sure you can resume training in case it was interrupted.

You can customize the checkpointing behavior to monitor any quantity from your training or validation steps. For example, to update your checkpoints based on your validation loss (sketched in the code after the list):

  1. Calculate any metric or other quantity you wish to monitor, such as validation loss.
  2. Log the quantity using the log() method, with a key such as val_loss.
  3. Initialize the ModelCheckpoint callback, and set monitor to the key of your quantity.
  4. Pass the callback to the checkpoint_callback Trainer flag.
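
Putting the four steps together, here is a sketch that extends the hypothetical LitClassifier from the Logging section, assuming the Lightning 1.0 checkpoint_callback flag described above (newer versions pass the callback via callbacks=[...]):

import torch.nn.functional as F
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

class LitClassifierWithVal(LitClassifier):
    def validation_step(self, batch, batch_idx):
        x, y = batch
        # 1. compute the quantity to monitor
        val_loss = F.cross_entropy(self(x), y)
        # 2. log it under a key
        self.log("val_loss", val_loss)

# 3. tell ModelCheckpoint which key to monitor
checkpoint_callback = ModelCheckpoint(monitor="val_loss")
# 4. pass the callback via the checkpoint_callback Trainer flag
trainer = Trainer(checkpoint_callback=checkpoint_callback)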

Read about all API changes, including many bug fixes, in our release notes.

Thanks

We would not be celebrating V1.0.0 without the incredible work of our brilliant core team working around the clock to get every little detail right, the constant support from the PyTorch team, and of course our community members. We'd like to personally thank everyone who contributed PRs or reviewed them, submitted feedback and issues, or replied in our forum or in our Slack community. This one is for all of you!

Wanna learn more about Lightning? Go to our website, read the docs, or join our first-ever virtual meetup: Ask Me Anything with William Falcon, creator of Lightning! The meetup will take place Oct 21st, 2020, at 1 PM EST, so bring your lunch or dinner and come learn more about new features, Grid, or anything else you always wanted to know but never asked. Sign up here.
