August 10, 2020

TensorBoard with PyTorch Lightning

Neelabh Madan (IIT Delhi)

A picture is worth a thousand words! As computer vision and machine learning experts, we could not agree more.

Visualization supersedes raw data
Charts and graphs convey more than tables

Human intuition is the most powerful way of making sense of random chaos, understanding a given scenario, and proposing a viable solution if required. Moreover, the best way to infer something is by looking at it (visualizing it). Therefore, data visualization is becoming extremely useful in enabling our human intuition to come up with faster and more accurate solutions. In fact, data science and machine learning make use of it day in and day out.

Visualization comes in handy for almost all machine learning enthusiasts. We use it for everything from tracking training metrics to inspecting what a model has learned.

What is covered in this post

We know deep down inside that we require visualization tools to supplement our development. One way could be to write our own small snippets that make graphs using matplotlib or any other graphing library. Or we can make use of TensorBoard’s visualization toolkit.

In our last post (Getting Started with PyTorch Lightning), we understood how to reduce the boilerplate code by using PyTorch Lightning. In this post, we will learn how to include TensorBoard visualizations in our Lightning code.

In this post, we will learn how to log scalars, computational graphs, histograms, and images to TensorBoard, first with Lightning’s default logging and then with Lightning loggers.

So let’s get started!!!

What is TensorBoard?

TensorBoard Dashboard
Source: TensorBoard by TensorFlow

TensorBoard is an interactive visualization toolkit for machine learning experiments. Essentially, it is a web-hosted app that lets us understand our model’s training runs and graphs.

TensorBoard is not just a graphing tool. There is more to it than meets the eye. TensorBoard allows us to directly compare multiple training runs on a single graph. With the help of these features, we can find the best set of hyperparameters for our model, visualize problems such as vanishing or exploding gradients, and debug faster.

Getting Started

This blog adds functionality to the model we made in the last post. We will see how to integrate TensorBoard logging into the model we built in PyTorch Lightning.

Note that we are still working in a Google Colab notebook.

There are two ways to generate beautiful and powerful TensorBoard plots in PyTorch Lightning: default TensorBoard logging, and logging using Lightning loggers.

Let’s see both one by one.


Default TensorBoard Logging

Logging per batch

Lightning gives us the provision to return logs after every forward pass of a batch, which allows TensorBoard to automatically make plots.

We can log data per batch from the functions training_step(), validation_step() and test_step().

We return a batch_dictionary Python dictionary. It is necessary that the output dictionary contains the loss key. This is the bare minimum required by Lightning for the code to run.

# imports assumed from the full notebook
import torch
import torch.nn.functional as F
import pytorch_lightning as pl

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    '''
    def training_step(self,batch,batch_idx):
        # REQUIRED- run at every batch of training data
        # extracting input and output from the batch
        x,labels=batch

        # forward pass on a batch
        pred=self.forward(x)

        # identifying number of correct predictions in a given batch
        correct=pred.argmax(dim=1).eq(labels).sum().item()

        # identifying total number of labels in a given batch
        total=len(labels)

        # calculating the loss
        train_loss = F.cross_entropy(pred, labels)

        # logs- a dictionary
        logs={"train_loss": train_loss}

        batch_dictionary={
            # REQUIRED: it is required for us to return "loss"
            "loss": train_loss,

            # optional, for batch logging purposes
            "log": logs,

            # info to be used at epoch end
            "correct": correct,
            "total": total
        }

        return batch_dictionary

In order to allow TensorBoard to log our data, we provide the log key in the output dictionary. Its value should be a dictionary of keys and corresponding values; each key gets its own plot on TensorBoard.

If you aren’t aware of Python dictionaries, please give this a look.

Given below is a plot of training loss against the number of batches

Training Loss Vs Batch number


Logging per epoch

We can also log data per epoch. For example, total loss, total accuracy, and average loss are some metrics that we can plot per epoch.

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    '''
    def training_epoch_end(self,outputs):
        # the function is called after every epoch is completed

        # calculating average loss
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # calculating correct and total predictions
        correct=sum([x["correct"] for x in outputs])
        total=sum([x["total"] for x in outputs])

        # creating log dictionary
        tensorboard_logs = {'loss': avg_loss,"Accuracy": correct/total}

        epoch_dictionary={
            # required
            'loss': avg_loss,

            # for logging purposes
            'log': tensorboard_logs}

        return epoch_dictionary

The most interesting question is: what is outputs?

outputs is a Python list containing the batch_dictionary from each batch of the given epoch, stacked together. That’s why we sum up all the correct predictions in outputs to get the total number of correct predictions for the whole training dataset.
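To make this concrete, here is a purely illustrative sketch of what outputs might contain for a hypothetical three-batch epoch (the numbers are made up for illustration only):

import torch

# one batch_dictionary per training batch, in order (illustrative values only)
outputs = [
    {"loss": torch.tensor(0.92), "log": {"train_loss": torch.tensor(0.92)}, "correct": 40, "total": 64},
    {"loss": torch.tensor(0.85), "log": {"train_loss": torch.tensor(0.85)}, "correct": 47, "total": 64},
    {"loss": torch.tensor(0.79), "log": {"train_loss": torch.tensor(0.79)}, "correct": 51, "total": 64},
]

# summing across batches recovers epoch-level totals
correct = sum(x["correct"] for x in outputs)   # 138
total = sum(x["total"] for x in outputs)       # 192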

Given below is the plot of average loss produced by TensorBoard.

Training Loss every Epoch Vs Batch number

Viewing data using TensorBoard

The default save location for TensorBoard files is lightning_logs/

Run the following in a Google Colab notebook after training to open TensorBoard.
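A minimal sketch of that command, assuming the default lightning_logs/ directory and a notebook environment with the TensorBoard extension available:

%load_ext tensorboard
%tensorboard --logdir lightning_logs/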

Loading TensorBoard using the lightning_logs directory

The downside of using default TensorBoard Logging

You must have noticed something weird by now. Consider the following plot generated for accuracy.

Accuracy against Batches

What is the accuracy plotted against? What are the values on the x-axis?

It turns out that, by default, PyTorch Lightning plots all metrics against the number of batches. Although this captures the trends, it would be more helpful if we could log metrics such as accuracy against the epoch number.

One thing we can do is plot the data only after every N batches. This can be done by setting log_save_interval to N when defining the trainer, as shown below.
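A minimal sketch of such a trainer, following the description above (max_epochs is just a placeholder for whatever your training run already uses):

import pytorch_lightning as pl

# write the accumulated logs to disk every 100 training batches
trainer = pl.Trainer(max_epochs=5, log_save_interval=100)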

Logging after every N steps

Another drawback of default Lightning logging is that we aren’t able to exploit advanced features of TensorBoard such as histogram plotting, computational graphs, etc.

To overcome such difficulties we are now going to look at Lightning Loggers.

Logging using Lightning Loggers

Loggers are a utility toolbox that helps in recording data and generating meaningful visualizations that allow us to better understand the data.

Lightning provides us with multiple loggers that help us save the data on disk and generate visualizations. Some of them are TensorBoardLogger, CometLogger, MLflowLogger, NeptuneLogger, and WandbLogger.

We will be working with the TensorBoard Logger.

To use a logger, we simply pass a logger object as an argument to the Trainer.
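A minimal sketch of what that looks like, using the directory and run name referred to below:

import pytorch_lightning as pl
from pytorch_lightning import loggers as pl_loggers

# create a TensorBoard logger that writes to tb_logs/my_model_run_name/
tb_logger = pl_loggers.TensorBoardLogger("tb_logs", name="my_model_run_name")

# pass the logger to the Trainer
trainer = pl.Trainer(logger=tb_logger)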

Here tb_logs is the name of the save directory, and the run will be named my_model_run_name.

To start TensorBoard, use the following command (because the save location has changed):
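For example, in a Colab/Jupyter cell:

%tensorboard --logdir tb_logs/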

While working with loggers, we will make use of logger.experiment (which returns a SummaryWriter object) and log our data accordingly.

For the API of SummaryWriter, refer to the PyTorch SummaryWriter documentation.

1. Logging Scalars

We will be calling the logger.experiment.add_scalar() method to log scalar metrics such as loss, accuracy, etc. Now we have the flexibility to log our metrics against the number of epochs.

Add scalar

An interesting thing to note is that we can now select our own x-coordinate, and hence plot the metrics against epochs rather than against the number of batches.

See the code below to understand how we do that.

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    '''
    def training_epoch_end(self,outputs):
        # the function is called after every epoch is completed

        # calculating average loss
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # calculating correct and total predictions
        correct=sum([x["correct"] for x in outputs])
        total=sum([x["total"] for x in outputs])

        # logging using tensorboard logger
        self.logger.experiment.add_scalar("Loss/Train",
                                          avg_loss,
                                          self.current_epoch)

        self.logger.experiment.add_scalar("Accuracy/Train",
                                          correct/total,
                                          self.current_epoch)

        epoch_dictionary={
            # required
            'loss': avg_loss}

        return epoch_dictionary

Here is how it looks.

Scalar Logging

2. Computational graph

To write the computational graph, we will be using the add_graph() method. add_graph() requires two arguments:

  1. The model
  2. A sample image of the same shape as the input, to track how it changes as it passes through the network
Understanding add_graph()

Since we need the computation graph only once, we will add it during the first epoch only

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    '''
    def training_epoch_end(self,outputs):
        # the function is called after every epoch is completed

        # add the computational graph only once
        # (current_epoch is 0-indexed, so this fires at the end of the second epoch)
        if(self.current_epoch==1):
            sampleImg=torch.rand((1,1,28,28))
            self.logger.experiment.add_graph(smallAndSmartModel(),sampleImg)

        # calculating average loss
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # calculating correct and total predictions
        correct=sum([x["correct"] for x in outputs])
        total=sum([x["total"] for x in outputs])

        # creating log dictionary
        tensorboard_logs = {'loss': avg_loss,"Accuracy": correct/total}

        epoch_dictionary={
            # required
            'loss': avg_loss,

            # for logging purposes
            'log': tensorboard_logs}

        return epoch_dictionary

Here is how it looks in TensorBoard

Graph of the model (visual)

3. Adding Histograms

Histograms are made for the weight and bias matrices in the network. They tell us how the weights and biases are distributed.

In layman’s terms, a typical histogram is just a frequency counter of the weights. The horizontal axis depicts the possible values of the weights, the height represents the frequency, and the depth represents the epoch.

Here is an example

Here most of the weights are distributed between -0.1 and 0.1.

To add histograms to TensorBoard, we write a helper function custom_histogram_adder(). We will call this function after every training epoch (inside training_epoch_end()).

Keep in mind that creating histograms is a resource-intensive task. If our model trains slowly, histogram logging might be the reason.

Histograms are added using add_histogram()

Understanding histograms

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    '''
    def custom_histogram_adder(self):
        # iterating through all parameters
        for name,params in self.named_parameters():
            self.logger.experiment.add_histogram(name,params,self.current_epoch)

    def training_epoch_end(self,outputs):
        # the function is called after every epoch is completed

        # calculating average loss
        avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

        # logging histograms
        self.custom_histogram_adder()

        epoch_dictionary={
            # required
            'loss': avg_loss}

        return epoch_dictionary

Now, given below is a comparison of how the weights are distributed with and without batch normalization. This is the power of TensorBoard: it allows us to directly compare two or more trained models.

Comparing batch normalization

4. Adding Images

In this section we will understand how to add images to TensorBoard. We will be using logger.experiment.add_image() to plot the images.

We usually plot intermediate activations of a CNN using this feature. This helps in visualizing the features extracted by the feature maps of the CNN.

For a training run, we will have a reference_image. This reference_image is a sample image from the dataset, and we will view the activations of the network’s layers as it flows through them. The visualizations are done as each epoch ends.

#defining the model
class smallAndSmartModel(pl.LightningModule):
    '''
    other necessary functions already written
    (assumes numpy has been imported as np)
    '''
    def makegrid(self,output,numrows):
        # stitches the channel activations of the first batch element into a single 2D grid
        outer=(torch.Tensor.cpu(output).detach())
        b=np.array([]).reshape(0,outer.shape[2])
        c=np.array([]).reshape(numrows*outer.shape[2],0)
        i=0
        j=0
        while(i < outer.shape[1]):
            img=outer[0][i]
            b=np.concatenate((img,b),axis=0)
            j+=1
            if(j==numrows):
                c=np.concatenate((c,b),axis=1)
                b=np.array([]).reshape(0,outer.shape[2])
                j=0
            i+=1
        return c

    def showActivations(self,x):
        # logging reference image
        self.logger.experiment.add_image("input",torch.Tensor.cpu(x[0][0]),self.current_epoch,dataformats="HW")

        # logging layer 1 activations
        out = self.layer1(x)
        c=self.makegrid(out,4)
        self.logger.experiment.add_image("layer 1",c,self.current_epoch,dataformats="HW")

        # logging layer 2 activations
        out = self.layer2(out)
        c=self.makegrid(out,8)
        self.logger.experiment.add_image("layer 2",c,self.current_epoch,dataformats="HW")

        # logging layer 3 activations
        out = self.layer3(out)
        c=self.makegrid(out,8)
        self.logger.experiment.add_image("layer 3",c,self.current_epoch,dataformats="HW")

    def training_epoch_end(self,outputs):
        '''
        other necessary code already written
        '''
        self.showActivations(self.reference_image)

makegrid() makes a grid of images and returns it. showActivations is called after every epoch to add images to TensorBoard.

TensorBoard provides a sleek slider GUI that lets you navigate across epochs for the activation images.

Activation Visualisation

That’s all for today

Now you are ready to integrate your Lightning projects with TensorBoard and utilize its powerful visualization tools.

That’s all from me. If you liked my little introduction to TensorBoard for Lightning, do share your feedback.

Keep learning and have fun!!
