PyTorch Lightning

A simple training pipeline for Leela Zero implemented with PyTorch, PyTorch Lightning and Hydra

[Figure: Lightning and Hydra (?)]

Recently, I have been looking for ways to speed up my research and manage my experiments, especially when it comes to writing training pipelines and managing experiment configurations. Along the way I discovered two new projects, PyTorch Lightning and Hydra. PyTorch Lightning helps you write training pipelines quickly, while Hydra helps you manage configurations in a clean way.

In order to practice using them in a more realistic setting, I decided to write a training pipeline for Leela Zero, a Go engine. I chose to do this because it is a well-scoped project with interesting technical challenges related to training huge networks on big data sets using multiple GPUs. Also, I previously had fun implementing a smaller version of AlphaGo for chess, so I thought this would be a fun side project.

In this blog post, I will explain the major details of the project so that you can easily understand what I did. You can read my code here: https://github.com/yukw777/leela-zero-pytorch

Leela Zero

The first step was to figure out the inner workings of Leela Zero’s neural network. I referenced Leela Zero’s documentation and its TensorFlow training pipeline heavily.

Neural Network Architecture

Leela Zero’s neural network is composed of a ResNet “tower” with two “heads”, the policy head and the value head, as described in the AlphaGo Zero paper. All convolution filters are 3x3 except for the ones at the start of the policy and value heads, which are 1x1, as in the paper. The game and board features are encoded as tensors of shape [batch size, board width, board height, number of features] and fed through the ResNet tower first. The tower extracts abstract features and feeds them to each head: the policy head computes the probability distribution over the next move, and the value head estimates the value of the position to predict the winner of the game.

You can find the implementation details of the network in the project repository.
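
As a rough illustration of the architecture described above (a minimal sketch, not the project's actual code), the tower and the two heads might be written in PyTorch as follows. The class names, the default sizes (19x19 board, 18 input features, 256 hidden units in the value head) and the use of a plain BatchNorm2d are assumptions made for this example. Note that PyTorch convolutions expect channels-first input, so the sketch takes tensors of shape [batch size, number of features, board height, board width].

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Convolution followed by batch norm and an optional ReLU."""

    def __init__(self, in_channels, out_channels, kernel_size, relu=True):
        super().__init__()
        # Padding of kernel_size // 2 keeps the board size unchanged.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = relu

    def forward(self, x):
        x = self.bn(self.conv(x))
        return torch.relu(x) if self.relu else x


class ResBlock(nn.Module):
    """Two 3x3 conv blocks with a skip connection around them."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = ConvBlock(channels, channels, 3)
        self.conv2 = ConvBlock(channels, channels, 3, relu=False)

    def forward(self, x):
        return torch.relu(x + self.conv2(self.conv1(x)))


class Network(nn.Module):
    """ResNet tower with a policy head and a value head (sizes are assumed)."""

    def __init__(self, board_size=19, in_channels=18,
                 residual_channels=32, residual_layers=4):
        super().__init__()
        squares = board_size * board_size
        self.tower = nn.Sequential(
            ConvBlock(in_channels, residual_channels, 3),
            *[ResBlock(residual_channels) for _ in range(residual_layers)],
        )
        # Both heads start with a 1x1 convolution.
        self.policy_conv = ConvBlock(residual_channels, 2, 1)
        self.policy_fc = nn.Linear(2 * squares, squares + 1)  # +1 for pass
        self.value_conv = ConvBlock(residual_channels, 1, 1)
        self.value_fc1 = nn.Linear(squares, 256)
        self.value_fc2 = nn.Linear(256, 1)

    def forward(self, planes):
        x = self.tower(planes)
        policy_logits = self.policy_fc(self.policy_conv(x).flatten(1))
        v = torch.relu(self.value_fc1(self.value_conv(x).flatten(1)))
        value = torch.tanh(self.value_fc2(v))  # in [-1, 1]
        return policy_logits, value
```

Calling Network()(torch.zeros(1, 18, 19, 19)) returns the policy logits (one per board point plus pass) and a single value per position.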

Weights Format

Leela Zero uses a simple text file to save and load network weights. Each row in the text file has a series of numbers that represent weights of each layer of the network. The residual tower is first, followed by the policy head, and then the value head.

Convolutional layers have 2 weight rows:

Convolution weights with shape [output, input, filter size, filter size]
Channel biases

Batchnorm layers have 2 weight rows:

Batchnorm means
Batchnorm variances

Innerproduct (fully connected) layers have 2 weight rows:

Layer weights with shape [output, input]
Output biases
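
To make the row layout above concrete, here is a small sketch of how such a text file could be assembled. The helper names are hypothetical, and the sketch assumes that the numbers in a row are separated by single spaces and that the version number occupies the first line; it is not the project's actual writer.

```python
from typing import Iterable, List, Sequence, Tuple


def to_row(values: Iterable[float]) -> str:
    """Serialize one flat sequence of numbers as a single text row."""
    return " ".join(repr(float(v)) for v in values)


def build_weight_file(version: int,
                      layers: Sequence[Tuple[Sequence[float], Sequence[float]]]) -> str:
    """Write the version line, then two rows per layer.

    Each layer is a pair of already-flattened sequences, for example
    (convolution weights, channel biases) or (batchnorm means, batchnorm variances).
    """
    rows: List[str] = [str(version)]
    for first, second in layers:
        rows.append(to_row(first))
        rows.append(to_row(second))
    return "\n".join(rows) + "\n"


# A toy "network": one 1x1 convolution with a single channel, then its batch norm.
text = build_weight_file(1, [([0.5], [0.0]), ([0.1], [1.0])])
print(text)
```

With PyTorch tensors, the flat sequences could come from something like tensor.flatten().tolist().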

I wrote unit tests to make sure my weight files are correct. An additional simple sanity check I used was to calculate the number of layers and compare it to what Leela Zero says after loading my weight files. The equation for the number of layers is:

n_layers = 1 (version number) +
2 (input convolution) +
2 (input batch norm) +
n_res (number of residual blocks) *
8 (first conv + first batch norm +
second conv + second batch norm) +
2 (policy head convolution) +
2 (policy head batch norm) +
2 (policy head linear) +
2 (value head convolution) +
2 (value head batch norm) +
2 (value head first linear) +
2 (value head second linear)
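
The same count can be written as a short function, which makes the sanity check easy to automate. This is only a sketch of the equation above; the function names are made up for this example, and count_rows assumes one non-empty line per row in the weight file.

```python
def expected_layer_count(n_res: int) -> int:
    """Number of rows the equation above predicts for n_res residual blocks."""
    version = 1
    input_block = 2 + 2        # input convolution + input batch norm
    residual_tower = n_res * 8  # (conv + batch norm) twice per block
    policy_head = 2 + 2 + 2    # convolution + batch norm + linear
    value_head = 2 + 2 + 2 + 2  # convolution + batch norm + two linear layers
    return version + input_block + residual_tower + policy_head + value_head


def count_rows(weight_text: str) -> int:
    """Count the non-empty rows of a weight file's contents."""
    return sum(1 for line in weight_text.splitlines() if line.strip())


print(expected_layer_count(4))  # 19 + 8 * 4 = 51
```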

So far, this seems simple enough, but there is a quirky implementation detail you need to be aware of. Leela Zero actually uses the bias for the convolutional layer to represent the learnable parameters (gamma and beta) of the following batch norm layer. This was done so that the format of the weights file, which only has one line for the layer weights and another for the bias, didn’t have to change when batch norm layers were added.

Currently, Leela Zero only uses the beta term of batch norm, and sets gamma to 1. Then, how do you actually use the convolutional bias to produce the same results as applying the learnable parameters in batch norm? Let’s first take a look at the equation for batch norm:

y = gamma * (x - mean) / sqrt(var + eps) + beta

Since Leela Zero sets gamma to 1, the equation becomes:

y = (x - mean) / sqrt(var + eps) + beta

Now, let x_conv be the output of a convolutional layer without the bias. We want to add some bias to x_conv so that running it through batch norm without beta gives the same result as running x_conv through the beta-only batch norm equation above. In equation form:

(x_conv + bias - mean) / sqrt(var + eps) = (x_conv - mean) / sqrt(var + eps) + beta
x_conv + bias - mean = x_conv - mean + beta * sqrt(var + eps)
bias = beta * sqrt(var + eps)

So if we set the convolutional bias to beta * sqrt(var + eps) in the weight file, we get the desired output, and this is what Leela Zero does.

Then, how do we actually implement this? In TensorFlow, you can tell the batch norm layer to ignore just the gamma term by calling tf.layers.batch_normalization(scale=False) and be done with it. Unfortunately, in PyTorch you can’t set batch normalization layers to ignore only gamma; you can only ignore both gamma and beta by setting the affine parameter to False: BatchNorm2d(out_channels, affine=False). So, I set batch normalization to ignore both, then added a learnable tensor after it to represent beta. Then, I used the equation bias = beta * sqrt(var + eps) to calculate the convolutional bias for the weight file.
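
The approach just described can be sketched and checked numerically as follows. This is an illustration under assumptions rather than the project's code: the class name, the method name export_conv_bias and the layer sizes are invented for the example, and the check runs in eval mode so that batch norm uses its running mean and variance.

```python
import torch
from torch import nn


class ConvBlock(nn.Module):
    """Bias-free convolution, batch norm without gamma/beta, then a separate beta."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2, bias=False)
        # affine=False disables both gamma and beta inside the batch norm layer.
        self.bn = nn.BatchNorm2d(out_channels, affine=False)
        # beta is kept as its own learnable tensor, one value per channel.
        self.beta = nn.Parameter(torch.zeros(out_channels))

    def forward(self, x):
        return self.bn(self.conv(x)) + self.beta.view(1, -1, 1, 1)

    def export_conv_bias(self):
        """Convolutional bias for the weight file: beta * sqrt(var + eps)."""
        return self.beta * torch.sqrt(self.bn.running_var + self.bn.eps)


def check_equivalence() -> bool:
    """Folding beta into the conv bias should match adding beta after batch norm."""
    torch.manual_seed(0)
    block = ConvBlock(18, 4, 3).eval()  # eval mode: use running statistics
    with torch.no_grad():
        # Give beta and the running statistics non-trivial values.
        block.beta.uniform_(-1.0, 1.0)
        block.bn.running_mean.uniform_(-1.0, 1.0)
        block.bn.running_var.uniform_(0.5, 2.0)
        conv_out = block.conv(torch.randn(2, 18, 19, 19))
        bias = block.export_conv_bias().view(1, -1, 1, 1)
        folded = block.bn(conv_out + bias)  # bias applied before batch norm
        separate = block.bn(conv_out) + block.beta.view(1, -1, 1, 1)
    return torch.allclose(folded, separate, atol=1e-4)


print(check_equivalence())
```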

Training Pipeline

After figuring out the details of Leela Zero’s neural network, it was time to tackle the training pipeline. As I mentioned, I wanted to practice using two tools, PyTorch Lightning and Hydra, to speed up writing training pipelines and cleanly manage experiment configurations. Let’s dive into the details of how I used them.

Writing the training pipeline is by far my least favorite part of research: it involves a lot of repetitive boilerplate code, and is hard to debug. Because of this, PyTorch Lightning was like a breath of fresh air to me. It is a lightweight library without many auxiliary abstractions on top of PyTorch that takes care of most of the boilerplate code in writing training pipelines. It allows you to focus on the more interesting parts of your training pipelines, like the model architecture, and to make your research code more modular and debuggable. Furthermore, it supports multi-GPU and TPU training out of the box!

In order to use PyTorch Lightning for my training pipeline, the most coding I had to do was to write a class, which I called NetworkLightningModule, that inherits from LightningModule to specify the details of my training pipeline, and pass it to the Trainer. You can follow the official PyTorch Lightning documentation for details on how to write your own LightningModule.

Hydra

Another part of research for which I have been searching for a good solution is experiment management. When you conduct research, it’s unavoidable that you run a myriad of variants of your experiment to test your hypothesis, and it’s extremely important to keep track of them in a scalable way. So far, I have relied on configuration files to manage my experiment variants, but using flat configuration files quickly becomes unmanageable. Templates are one solution to this problem. However, I have found that templates eventually become messy as well, because as you overlay multiple layers of value files to render your configuration files, it becomes difficult to keep track of which value came from which value file.

Hydra, on the other hand, is a composition-based configuration management system. Instead of having separate templates and value files to render the final configuration, you combine multiple smaller configuration files to compose the final configuration. It is not as flexible as a template-based configuration management system, but I find that composition-based systems strike a good balance between flexibility and maintainability. Hydra is one such system that is specifically tailored for research scripts. It is a bit heavy-handed in its invocation, as it requires that you use it as a decorator on the main entry point function of your script, but I actually think this design choice makes it easy to integrate with your training scripts. Furthermore, it allows you to manually override configurations via the command line, which is very useful when running different variations of your experiment. I used Hydra to manage different sizes of the network architecture and training pipeline configurations.

Evaluation

To evaluate my trained networks, I used GoMill to run Go tournaments. It is a library to run tournaments between Go Text Protocol (GTP) engines, of which Leela Zero is one. You can find a tournament configuration I used here.

Conclusion

By using PyTorch Lightning and Hydra, I was able to drastically speed up writing training pipelines and efficiently manage experiment configurations. I hope this project and blog post will help you with your research as well. You can check out the code here: https://github.com/yukw777/leela-zero-pytorch

Training Neural Networks for Leela Zero With PyTorch