October 28, 2020

Train Conversational AI in 3 lines of code with NeMo and Lightning

Sean Narenthiran

Train state-of-the-art speech recognition, NLP and TTS models at scale with NeMo and Lightning


NeMo (Neural Modules) is a powerful framework from NVIDIA, built for easily training, building, and manipulating state-of-the-art conversational AI models. NeMo models can be trained on multiple GPUs and multiple nodes, with or without mixed precision, in just 3 lines of code. Read on to learn how to use NeMo and Lightning to train an end-to-end speech recognition model on multiple GPUs, and how to extend NeMo models for your own use case, such as fine-tuning a strong pre-trained ASR model on Spanish audio data.

In this article we’ll highlight some of the great features within NeMo, walk through the steps to build your own ASR model on LibriSpeech, and show how to fine-tune models on your own datasets across different languages.

Build SOTA Conversational AI

NeMo provides a light wrapper for developing models across various domains, in particular ASR (automatic speech recognition), TTS (text-to-speech) and NLP. NeMo comes out of the box with examples for training popular models from scratch, such as the well-known Tacotron 2 speech synthesis model published by Google Research, as well as the ability to fine-tune pre-trained transformer models such as Megatron-LM for downstream NLP tasks like text classification and question answering.

NeMo also supports a variety of speech recognition models out of the box, offering pre-trained models for easier deployment and fine-tuning, as well as the ability to train from scratch with easy-to-modify configurations, which we delve into below.

It gives researchers the ability to scale their experiments and build upon existing implementations of models, datasets, and training procedures without having to worry about scaling, boilerplate code, or unnecessary engineering.

NeMo is built on top of PyTorch, PyTorch Lightning, and many other open-source libraries, which bring a number of highlight features:

Powered by Lightning

Instead of building support for multiple GPUs and multiple nodes from scratch, the NeMo team decided to use PyTorch Lightning under the hood to handle all the engineering details. Every NeMo model is actually a LightningModule. This let the NeMo team focus on building the AI models, and lets NeMo users take advantage of the Lightning Trainer, which includes many features to speed up training. Thanks to this tight integration with PyTorch Lightning, NeMo is guaranteed to run across many research environments and allows researchers to focus on what matters.

Train End-to-End ASR models at scale

To demonstrate how easy it is to use NeMo and Lightning to train conversational AI, we’ll build an end-to-end speech recognition model that can be used to transcribe voice commands. We’ll be using QuartzNet, a fully convolutional architecture for E2E (end-to-end) speech recognition that comes out of the box with a pre-trained model, trained on roughly 3,300 hours of audio, which out-competes previous convolutional architectures whilst using fewer parameters. The number of model parameters becomes a critical tradeoff against accuracy when deploying models at scale, especially in online settings where streaming speech recognition is crucial, such as voice-assistant commands.

QuartzNet BxR architecture by NVIDIA

We use LibriSpeech, a popular labeled audiobook dataset, as our training data. NeMo comes with many preset dataset scripts to download and format the data for training, validation and testing, which can be seen here.
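NeMo’s ASR data layers read the prepared data through a JSON-lines “manifest”: one JSON object per utterance, with the audio path, its duration in seconds, and the transcript. As a rough sketch (the file paths and transcripts below are hypothetical examples), a manifest can be written like this:

```python
import json

# One JSON object per line; keys follow NeMo's ASR manifest convention.
# Paths, durations and transcripts here are hypothetical examples.
samples = [
    {"audio_filepath": "LibriSpeech/dev-clean/84/121123/84-121123-0001.wav",
     "duration": 6.3,
     "text": "go do you hear"},
    {"audio_filepath": "LibriSpeech/dev-clean/84/121123/84-121123-0002.wav",
     "duration": 4.1,
     "text": "but in less than five minutes the staircase groaned"},
]

with open("train_manifest.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```

The preset dataset scripts produce files in exactly this shape, so once they have run you only need to point the configuration at the resulting manifests.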

We define the model configuration using the preset QuartzNet configuration file, modifying our data inputs to point to our dataset.
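The relevant part of the config looks roughly like this (a sketch following NeMo’s config conventions; the exact fields and values vary between NeMo versions, and the manifest paths are hypothetical):

```yaml
model:
  sample_rate: 16000
  train_ds:
    manifest_filepath: train_manifest.json
    sample_rate: 16000
    batch_size: 32
  validation_ds:
    manifest_filepath: val_manifest.json
    sample_rate: 16000
    batch_size: 32
```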

Training our model takes exactly 3 lines of code: define the model configuration, initialize the Lightning trainer, and train!

For speed benefits, you can increase the number of GPUs and enable native mixed precision. Both are super simple to enable using the Lightning trainer.

You can take advantage of all the Lightning features, such as checkpointing, experiment management and many more! For an interactive view of the ASR features within NeMo, have a look at Google Colab.

Customize your models

NeMo makes experimenting with training techniques or model changes extremely easy. Let’s say we want to swap our optimizer for Adam and update our learning-rate schedule to use warmup annealing. Both can be done via the config file, without touching the code, using pre-built NeMo modules.
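A sketch of the optimizer section of the config: `adam` and the `WarmupAnnealing` scheduler are names from NeMo’s optimizer and scheduler registries, while the numeric values below are hypothetical and should be tuned for your setup.

```yaml
model:
  optim:
    name: adam            # swapped in for QuartzNet's default optimizer
    lr: 0.01
    betas: [0.9, 0.999]
    weight_decay: 0.0001
    sched:
      name: WarmupAnnealing
      warmup_ratio: 0.05  # fraction of training spent warming up
      min_lr: 0.00001
```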

Leverage Transfer Learning for low resource languages

We’ve seen impressive results from applying transfer learning to speech recognition, as shown by NVIDIA in a recent paper. Fine-tuning a strong pre-trained model has shown benefits in both convergence and accuracy compared to training from scratch.

NeMo makes it simple to reap the benefits of transfer learning. Below, we take our pre-trained English QuartzNet model and fine-tune it on the Common Voice Spanish dataset. We update the training data inputs, the vocabulary, and some optimization configs, and let the trainer handle the rest.

Get Started with NeMo

In this article we covered some of the great out-of-the-box features within NeMo, the steps to build your own ASR model on LibriSpeech, and how to fine-tune models on your own datasets across different languages.

There are plenty of Google Colab tutorials to choose from here, covering NLP, speech recognition and speech synthesis. You can also find more information in the PyTorch Lightning docs here.