July 10, 2020

How PyTorch Lightning became the first ML framework to run continuous integration on TPUs

Eden Afek

Co-authored with Jirka Borovec, lead contributor to PyTorch Lightning, and Zach Cain, ML engineer @ Google

As PyTorch Lightning adoption grows, we keep evolving our testing suite to ensure that the companies and AI research labs building their systems on PyTorch Lightning have a reliable and robust codebase. In preparation for our upcoming V1, we have taken a major step forward in our support for training on TPUs.

As the first ML framework to implement PyTorch’s XLA-based TPU support (PyTorch Lightning’s TPU support is built on top of pytorch/xla, which brings the native PyTorch API to TPUs), we continue to lead the charge in getting PyTorch users closer to running full workloads on TPUs. We’re proud to show you how we became the first ML framework to run CI on TPUs!

Understanding TPUs


TPUs, or Tensor Processing Units, are hardware accelerators developed by Google for machine learning applications. They were originally designed to handle the computational demands of TensorFlow, Google’s AI framework, which performs its computations on tensors (multidimensional data arrays). TPUs are available as a managed service on Google Cloud: a Cloud TPU v2 device delivers 180 teraflops of compute and 64 GB of High Bandwidth Memory (HBM).

In 2018, Google released the latest generation, TPU v3, more than doubling performance with 420 teraflops and 128 GB of HBM.

PyTorch Lightning

PyTorch Lightning is a lightweight PyTorch framework (really just organized PyTorch).


PyTorch Lightning provides seamless training of deep learning models on arbitrary hardware such as GPUs, TPUs, and CPUs, without requiring any changes to your code. Much like TPUs, it was designed to help you iterate faster through your deep learning research ideas.
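
For example, moving a model between CPU, GPU, and TPU training is just a change of Trainer flags. Here is a minimal sketch (the exact flag names, such as gpus and tpu_cores, vary between Lightning versions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    """A tiny model, just to illustrate the Trainer API."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random data, only to make the sketch self-contained.
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32)

model = LitClassifier()

# The model code stays the same; only the Trainer flags change per hardware:
trainer = pl.Trainer(max_epochs=1)        # CPU
# trainer = pl.Trainer(gpus=1)            # single GPU
# trainer = pl.Trainer(tpu_cores=8)       # eight TPU cores
trainer.fit(model, train_loader)
```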

PyTorch Lightning has robust documentation, tutorial videos, and an active Slack channel supported by over 10 core members, in addition to 200+ contributors and 5 full-time staff engineers.

To deliver high-quality, stable, and error-free code to the numerous companies (Facebook, NVIDIA, Uber) and research labs that use PyTorch Lightning, we need the very rigorous testing we describe below.

GitHub Actions


We use GitHub Actions to streamline the PyTorch Lightning development lifecycle. In the past we tried many different CI platforms, such as Travis CI and AppVeyor. Keeping track of API changes in each CI platform was challenging, and none of them offered the complete experience we were looking for: simple testing over multiple operating systems with minimal code changes, a stable API, and a sufficient number of concurrent jobs.

When GitHub launched the beta version of GitHub Actions in 2019, we were excited to give it a try. It’s easy to use and maintain, since everything lives under one roof: you can create custom workflows to build, test, package, release, or deploy any code project on GitHub.

We created several workflows for end-to-end continuous integration (CI) and continuous deployment (CD) with GitHub Actions, directly in our repository. The configuration is very simple, especially considering that we had to write matrix testing across all three major operating systems (see the sketch below). It’s also free for all public repositories (up to 2,000 minutes of runtime)!
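
For illustration, a minimal workflow with an operating-system/Python matrix might look like the sketch below. This is not Lightning’s actual configuration; the job name, versions, and test command are placeholders:

```yaml
# .github/workflows/ci-testing.yml (illustrative sketch)
name: CI testing

on: [push, pull_request]

jobs:
  pytest:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: [3.6, 3.7, 3.8]

    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install -r requirements.txt pytest
      - name: Run tests
        run: pytest tests/
```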

Check out more info here.

Continuous Integration on TPUs


We test all possible environments: every supported combination of operating system (Linux, macOS, Windows), Conda, and PyTorch versions, plus 16-bit precision and multi-GPU tests. Our test coverage stood at 88%, but one of Lightning’s key features was still missing from CI: TPU training. Until now, we could only test it manually (without proper debugging) using Google Colab. We wanted to integrate TPU tests into our GitHub Actions CI, and we were fortunate to have Zachary Cain, a Google engineer working on Cloud ML Accelerators, make it happen.

Adding CI on TPUs is the first step towards full TPU coverage in PyTorch Lightning’s test suite.

Cloud TPU Integration with GitHub Actions

Cloud TPUs can be accessed through three different services:

  1. Google Compute Engine (GCE)
  2. Google Kubernetes Engine (GKE)
  3. AI Platform

The PyTorch Lightning GitHub Actions integration relies on GKE, a managed Kubernetes service that automatically starts and stops machines to run Docker images.

In general, any time new code arrives at the repo, a GitHub Action captures the latest version of the code in a Docker image that can be launched on GKE. The GKE configs are produced with the help of GoogleCloudPlatform/ml-testing-accelerators, an open-source framework for running deep learning jobs on GKE. The repo can be used with any combination of TensorFlow, PyTorch, GPUs, TPUs, and CPUs.


For new commits to PyTorch Lightning, this workflow does the following (a sketch of the first few steps follows the list):

  1. Authenticates to Google Cloud services using credentials stored as GitHub Secrets.
  2. Builds Docker images based on google/cloud-sdk containing all PyTorch Lightning requirements and the latest PyTorch Lightning code.
  3. Pushes the Docker image to Google Container Registry on Google Cloud.
  4. Deploys the job to a Kubernetes cluster, using the images and tests defined in the attached jsonnet config. The test config is built using the open-source helper repo for TPU testing mentioned above.
  5. GKE publishes the test results to Stackdriver, from where they are pulled into the GitHub Action logs.
  6. The coverage report, if available, is passed back to the GitHub Action and uploaded to Codecov.
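
The first four steps might look roughly like the job below. This is a hedged sketch rather than Lightning’s actual workflow; the secret names, image name, cluster, zone, and jsonnet file are all placeholders, and it assumes the jsonnet CLI is available on the runner:

```yaml
# Illustrative sketch of the TPU test workflow (all names are placeholders).
name: TPU tests

on: [push]

jobs:
  tpu-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      # 1. Authenticate to Google Cloud with credentials stored as GitHub Secrets.
      - name: Authenticate to Google Cloud
        run: |
          echo "${{ secrets.GKE_SA_KEY_BASE64 }}" | base64 --decode > /tmp/key.json
          gcloud auth activate-service-account --key-file=/tmp/key.json
          gcloud config set project ${{ secrets.GKE_PROJECT }}
          gcloud auth configure-docker

      # 2. + 3. Build a Docker image with the latest code and push it to
      # Google Container Registry.
      - name: Build and push Docker image
        run: |
          docker build -t gcr.io/${{ secrets.GKE_PROJECT }}/pl-tpu-tests:$GITHUB_SHA .
          docker push gcr.io/${{ secrets.GKE_PROJECT }}/pl-tpu-tests:$GITHUB_SHA

      # 4. Render the jsonnet test config and deploy the job to the GKE cluster.
      - name: Deploy job to GKE
        run: |
          gcloud container clusters get-credentials my-tpu-cluster --zone us-central1-b
          jsonnet tpu_test_job.jsonnet --ext-str image-tag=$GITHUB_SHA | kubectl create -f -
```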

GKE makes it easy to use the cluster autoscaler, which automatically resizes the cluster’s node pools based on workload demand. It increases the availability of your workloads when you need it, while keeping costs under control. This is especially important for community-driven projects like PyTorch Lightning, where many contributors work on PRs in parallel. Learn more about autoscaling here.
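
For example, enabling autoscaling on an existing cluster is a single gcloud command (a sketch; the cluster name, node pool, zone, and node limits below are placeholders):

```bash
# Let GKE scale the node pool between 0 and 8 nodes based on demand.
# "my-tpu-cluster", "default-pool", and the zone are placeholder names.
gcloud container clusters update my-tpu-cluster \
  --enable-autoscaling --min-nodes=0 --max-nodes=8 \
  --node-pool=default-pool --zone=us-central1-b
```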

Get started with PyTorch Lightning


Click here to learn more about PyTorch Lightning.

Want to start your own CI for TPUs and/or GPUs/CPUs? Please open an issue or ask a question at https://github.com/GoogleCloudPlatform/ml-testing-accelerators/issues.


Thanks to William Falcon and Zachary Cain.