May 31, 2020

Productive NLP Experimentation with Python using Pytorch Lightning and Torchtext

How to use Pytorch Lightning and Torchtext

Arie Pratama Sutiono

Pytorch has been my main deep learning framework. There are some parts of the workflow, however, that I felt could be improved. Pytorch Lightning [1] addresses them.

William Falcon has laid out some of the core capabilities in Pytorch Lightning [2]. These features include structuring your code to prepare the data, train, validate, and test, as well as logging with Tensorboard.

He has also compared Pytorch Lightning, Pytorch Ignite, and fast.ai [3]. He highlighted that Ignite does not have a standard interface for every model, needs more lines of code to train a model, is not directly integrated with Tensorboard, and does not have the additional high-performance computing features that Lightning does. Fast.ai, meanwhile, has a steeper learning curve than the other two, and its use case may be different from that of Pytorch Lightning and Pytorch Ignite.

In this article, I want to highlight some of the Pytorch Lightning features that improve my productivity, and show how to integrate Pytorch Lightning with Torchtext.

Why Use Pytorch Lightning

Reduce Boilerplate

No more writing a training routine unless you really have to. You can define your training as

from pytorch_lightning import Trainer

trainer = Trainer(
   gpus=1,
   logger=[logger],
   max_epochs=5
)
trainer.fit(model)

The job of a Trainer is to do your training routine.

Figure: Sample of Tensorboard generated by Pytorch Lightning

In this screenshot, I defined the logger variable as

from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger('tb_logs', name='my_model')

Pytorch Lightning will make a log directory named tb_logs, and you can point Tensorboard at that directory (if you are running Tensorboard separately from a Jupyter notebook).

tensorboard --logdir tb_logs/

Organize Code

Besides the constructor and forward, you will be able to define more functions

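# These methods live inside your LightningModule subclass.
# Requires: from torch.optim import Adam
def configure_optimizers(self):
   return Adam(self.parameters(), lr=0.01)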

def training_step(self, batch, batch_idx):
   x, y = batch.text[0].T, batch.label
   y_hat = self(x)
   loss = self.loss_function(y_hat, y)
   return dict(
       loss=loss,
       log=dict(
           train_loss=loss
       )
   )

In this example, notice that I do a small transformation using a transpose. It is possible to do all kinds of transformations before feeding the batch into the model, but I suggest you do the heavy transformations outside this function so that it stays clean.

I have also defined the loss_function as part of the model and “hardcoded” it using Cross Entropy. If you do not want that, you can import torch.nn.functional as F and call a functional loss, such as F.cross_entropy() (which combines F.log_softmax() with the negative log-likelihood loss). Another option is to let the model constructor accept the loss function as a parameter.
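To make the link between log-softmax and cross entropy concrete, here is a minimal pure-Python sketch for a single example. The helper names log_softmax and cross_entropy are my own and only mirror the idea behind the Pytorch functions; they work on plain lists, not tensors, and the scores are made up for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy(logits, target):
    """Cross entropy for one example: negative log-probability of the target class."""
    return -log_softmax(logits)[target]

# Three made-up raw scores, one per class in the label vocabulary
logits = [2.0, 0.5, -1.0]
loss_if_class_0 = cross_entropy(logits, target=0)  # target has the highest score
loss_if_class_2 = cross_entropy(logits, target=2)  # target has the lowest score
print(loss_if_class_0, loss_if_class_2)
```

The loss is small when the target class already has the highest score and grows as the target's score falls behind the others.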

Pytorch Dataloader is an API that helps you with batching the input. To my knowledge, though, Pytorch Lightning runs something like for batch_idx, batch in enumerate(train_dataloader) (not exactly this, but similar). This means you are free to supply anything iterable here.
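As a rough illustration of that idea, the sketch below mimics such a loop in plain Python. It is not Lightning's actual implementation: ToyModule, simplified_fit, and CountingBatches are hypothetical names, and the "loss" is just the sum of the batch so the example runs without Pytorch.

```python
class ToyModule:
    """Hypothetical stand-in for a LightningModule; only training_step is mimicked."""

    def __init__(self):
        self.seen = []

    def training_step(self, batch, batch_idx):
        # A real module would compute a loss from the batch here.
        self.seen.append((batch_idx, batch))
        return {"loss": float(sum(batch))}


def simplified_fit(module, train_batches, max_epochs=1):
    """Simplified picture of a trainer loop: enumerate the batches, call training_step."""
    losses = []
    for _ in range(max_epochs):
        for batch_idx, batch in enumerate(train_batches):
            out = module.training_step(batch, batch_idx)
            losses.append(out["loss"])
    return losses


class CountingBatches:
    """Any object with __iter__ can serve as the source of batches."""

    def __iter__(self):
        yield [1, 2]
        yield [3, 4]
        yield [5]


module = ToyModule()
losses = simplified_fit(module, CountingBatches(), max_epochs=2)
print(losses)
```

Because the loop only relies on iteration, a Torchtext iterator can stand in for a Dataloader in the same way.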

Using Pytorch Lightning with Torchtext

Previously, I described my exploration of torchtext [4]. Now I want to further improve my productivity in the experiment part, which includes training, testing, validating, and metric logging. All of these can be achieved by using Pytorch Lightning.

I will use the IMDB sentiment classification dataset, which is available in the Torchtext package.

Loading Dataset

The IMDB sentiment classification dataset poses a text classification task: given a review text, predict whether it is a positive or negative review. There is an official short tutorial from torchtext [5]; however, that tutorial does not cover the training part. I will use some of the tutorial code and connect it with training using Pytorch Lightning.

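The label vocabulary for this dataset contains 3 entries: unknown (<unk>), positive (labeled as “pos”), and negative (labeled as “neg”). The reviews themselves are only positive or negative; the unknown entry comes from the default unknown token of the label Field. So, we know that we will need to define an output that can predict 3 classes. Since it is a classification task, I will use CrossEntropy loss.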

Now to load the data you can do

from torchtext.data import Field
from torchtext.datasets import IMDB

text_field = Field(sequential=True, include_lengths=True, fix_length=200)
label_field = Field(sequential=False)

train, test = IMDB.splits(text_field, label_field)

Since IMDB reviews are not of uniform length, the fix_length parameter will pad or trim each sequence to the same length.
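The padding and trimming idea can be sketched in a few lines of plain Python. The function pad_or_trim is a hypothetical helper for illustration, not part of Torchtext; it assumes right-side padding with a "<pad>" token and keeps the beginning of long sequences.

```python
def pad_or_trim(tokens, fix_length, pad_token="<pad>"):
    """Return exactly fix_length tokens: cut long sequences, pad short ones on the right."""
    trimmed = tokens[:fix_length]
    return trimmed + [pad_token] * (fix_length - len(trimmed))


short_review = ["great", "movie"]
long_review = ["this", "film", "was", "far", "too", "long"]

print(pad_or_trim(short_review, 4))
print(pad_or_trim(long_review, 4))
```

With every review at the same length, a batch can be stacked into a single rectangular tensor.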

You can access a sample with train.examples[i] to peek at what is inside the train and test variables.

Building Vocabulary

Pre-trained word embeddings are usually trained on different data from ours, so their mapping from token to integer differs from the one built from our dataset. build_vocab reconciles the two: it builds the vocabulary from the current dataset, in this case IMDB, and looks up each token's pre-trained vector by the token itself. For example, if eat is token 2 in our vocabulary but token number 15 in the pre-trained word embedding, token 2 will still be matched with the pre-trained vector for eat.

from torchtext.vocab import FastText

text_field.build_vocab(train, vectors=FastText('simple'))
label_field.build_vocab(train)

The label field in the IMDB dataset takes the values pos , neg , and <unk> , so it still needs its own vocabulary, but without word embeddings.
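One way to picture how a dataset vocabulary and pre-trained vectors are lined up is the plain-Python sketch below. All data here is made up: the pretrained dictionary, the itos list, and the align_vectors helper are hypothetical, and filling missing tokens with zeros is an assumption of this sketch.

```python
# Hypothetical pre-trained embeddings, keyed by token string (2 dimensions for brevity)
pretrained = {
    "eat": [0.1, 0.2],
    "movie": [0.3, 0.4],
    "good": [0.5, 0.6],
}

# Hypothetical vocabulary built from our own dataset: list position = integer id
itos = ["<unk>", "<pad>", "good", "movie", "plotless"]
stoi = {token: index for index, token in enumerate(itos)}


def align_vectors(itos, pretrained, dim):
    """Build a matrix whose row i is the pre-trained vector of token i (zeros if missing)."""
    return [list(pretrained.get(token, [0.0] * dim)) for token in itos]


vectors = align_vectors(itos, pretrained, dim=2)
print(vectors[stoi["movie"]])     # vector found by token string, stored at our own id
print(vectors[stoi["plotless"]])  # token absent from the pre-trained set
```

The resulting matrix can be indexed directly with the integer ids produced from our dataset, which is what an embedding layer needs.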

Splitting and Making Iterator

An Iterator works a bit like a Dataloader: it helps with batching and iterating over the data for one epoch. We can use BucketIterator to iterate with a specific batch size and move the resulting tensors to a device, where the device can be cpu or cuda.

import torch
from torchtext.data import BucketIterator

# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
batch_size = 32

train_iter, test_iter = BucketIterator.splits(
   (train, test),
   batch_size=batch_size,
   device=device
)
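The "bucket" in BucketIterator refers to grouping examples of similar length into the same batch. The sketch below shows only that general idea in plain Python; bucket_batches is a hypothetical helper and is much simpler than what Torchtext actually does.

```python
import random


def bucket_batches(examples, batch_size, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


reviews = [
    ["good"],
    ["a", "very", "long", "and", "winding", "review"],
    ["bad", "film"],
    ["not", "my", "favourite", "movie", "ever"],
]

batches = bucket_batches(reviews, batch_size=2)
for batch in batches:
    print([len(review) for review in batch])
```

Short reviews end up together and long reviews end up together, so less padding is needed inside each batch.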

Now we are ready to define our model.

Model Definition

Defining the model with Pytorch Lightning is as easy as William has explained [2].

  1. Inherit from LightningModule instead of Pytorch’s nn.Module.
  2. Define constructor and forward.
  3. Add the methods mentioned in the section above, such as configure_optimizers and training_step.

Before doing the full training, it is a good idea to check that your model accepts the input correctly, like this.

sample_batch = next(iter(train_iter))
model(sample_batch.text[0].T)

Let me explain why I did the transformations.

Each batch object from an iterator has text and label fields. The text field is actually a tuple of the word index tensor and the tensor of actual review lengths. The word index tensor has size fix_length x batch_size, while the length tensor has size batch_size. To feed the model with the word indices, I need to take the first element of the tuple and transpose it so that it has size batch_size x fix_length.
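The shape change can be illustrated with nested lists instead of tensors. The token ids below are made up, and the value 1 is used as a hypothetical padding id.

```python
fix_length, batch_size = 4, 2

# Hypothetical token ids laid out as fix_length x batch_size:
# each inner list is one position in the sequence, each column is one review.
sequence_first = [
    [11, 21],
    [12, 22],
    [13, 23],
    [1, 1],  # padding id
]

# Transpose so that each inner list holds one full review
batch_first = [list(review) for review in zip(*sequence_first)]

print(len(sequence_first), len(sequence_first[0]))
print(len(batch_first), len(batch_first[0]))
print(batch_first[0])
```

On a 2-D Pytorch tensor, the .T attribute used in the code above performs the same swap of the two dimensions.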

We are now ready to train our model!

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

model = MyModel(text_field.vocab.vectors)
logger = TensorBoardLogger('tb_logs', name='my_model')
trainer = Trainer(
   gpus=1,
   logger=logger,
   max_epochs=3
)
trainer.fit(model)

And it’s done! Lightning shows the progress bar automatically, so you no longer have to wrap your loop in tqdm like this:

for batch_idx, batch in tqdm(enumerate(train_loader)):

After training, you can run testing in one line

trainer.test()

If you are wondering why this test method only returns one object, you are probably thinking of scikit-learn’s train and test split. In Pytorch, the “test” part is usually called “validation”, so you might want to define validation_step and val_dataloader instead of test_* .

Conclusion

In my opinion, using Pytorch Lightning and Torchtext does improve my productivity when experimenting with NLP deep learning models. The aspects that I think make this combination compelling are backward compatibility with Pytorch, friendliness to Torchtext, and built-in use of Tensorboard.

Backward Compatibility with Pytorch

If you are hesitant because you think a new library will add overhead, do not worry! You can install it first, use LightningModule instead of nn.Module, and write the usual Pytorch code. It will still work, because LightningModule is itself a subclass of nn.Module.

Torchtext Friendly

It was fairly easy to use Torchtext along with Pytorch Lightning. Both libraries run on Pytorch and are highly compatible with native Pytorch. Their additional features do not overlap but complement each other. For example, Torchtext has easy interfaces to load datasets like IMDB or YelpReview. Then you can use Pytorch Lightning to train whatever model you want to define and log to Tensorboard or MLFlow.

Leverage Tensorboard Usage

Using Tensorboard instead of manually printing losses and other metrics helps me avoid unnecessary errors when printing losses in the training loop. It also removes the need to plot loss against epoch at the end of training.

The best way to learn is to experiment right away in Google Colab, so here is the link

Notebook at Google Colab

References

[1] Pytorch Lightning Documentation. https://pytorch-lightning.readthedocs.io/en/stable/introduction_guide.html

[2] Falcon, W. From PyTorch to PyTorch Lightning — A gentle introduction. https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09

[3] Falcon, W. Pytorch Lightning vs PyTorch Ignite vs Fast.ai. https://towardsdatascience.com/pytorch-lightning-vs-pytorch-ignite-vs-fast-ai-61dc7480ad8a

[4] Sutiono, Arie P. Deep Learning For NLP with PyTorch and Torchtext. https://towardsdatascience.com/deep-learning-for-nlp-with-pytorch-and-torchtext-4f92d69052f

[5] Torchtext Datasets Documentation. https://pytorch.org/text/datasets.html