Beginner’s Guide to NVIDIA NeMo™
If you're looking to get into the world of conversational AI, NVIDIA NeMo (Neural Modules) is a great place to start.
It's an open-source framework that makes it easy to train state-of-the-art conversational AI models using GPUs.
Pre-trained models are available on NVIDIA NGC, NeMo Megatron LLMs can be trained with up to 1 trillion parameters using tensor and pipeline model parallelism, and models can be optimized for inference and deployed for production use cases with NVIDIA Riva.
There are also extensive tutorials available that can be run on Google Colab. So if you're ready to get started with conversational AI, NVIDIA NeMo is a great place to begin your journey.
What are Neural modules (NeMo)?
Based on the official docs, neural modules are
...conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations...
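The idea of "typed inputs and typed outputs" can be illustrated with a small sketch. This is not NeMo's actual API (NeMo uses its own neural type system); it is just a hypothetical toy showing why typed module boundaries are useful: mismatched blocks fail loudly instead of producing garbage.

```python
# Conceptual sketch of "typed" neural modules -- NOT NeMo's actual API,
# just an illustration of the idea of typed inputs and outputs.

class ToyModule:
    """A block that declares the types it accepts and produces."""
    def __init__(self, name, input_type, output_type, fn):
        self.name = name
        self.input_type = input_type
        self.output_type = output_type
        self.fn = fn

    def __call__(self, value):
        # Check the input type before running, the way NeMo's neural
        # types validate connections between modules.
        if not isinstance(value, self.input_type):
            raise TypeError(f"{self.name} expects {self.input_type.__name__}")
        return self.fn(value)

# A toy "encoder" turning text into features, and a "decoder" going back.
encoder = ToyModule("encoder", str, list, lambda s: [ord(c) for c in s])
decoder = ToyModule("decoder", list, str, lambda xs: "".join(chr(x) for x in xs))

features = encoder("hi")   # [104, 105]
text = decoder(features)   # "hi"
print(features, text)
```

Chaining `decoder(encoder(...))` works because the output type of one block matches the input type of the next; passing the wrong type raises a `TypeError` immediately.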
For advanced users who want to train NeMo models from scratch or fine-tune existing NeMo models, there are a number of example scripts available that support multi-GPU/multi-node training.
Related
Conversational AI
Refers to the use of artificial intelligence (AI) to enable computers to communicate with humans in a natural way, through voice-based or text-based interaction.
You can use it in a variety of applications, such as customer service, chatbots, and virtual assistants.
It is a relatively new field, and much research remains to be done to develop more effective conversational AI models. NVIDIA NeMo is a great framework to use if you're looking to get started in this field, since it lets you easily train state-of-the-art models on GPUs.
Conversational AI consists of the following stages:
Automatic Speech Recognition (ASR)
ASR is the process of converting speech to text.
Natural Language Processing (NLP)
NLP is the process of extracting meaning from text.
Natural Language Understanding (NLU)
NLU is a subfield of NLP that focuses on understanding the user's intent.
Text to Speech (TTS)
The process of converting text to speech.
The ASR stage of the toolkit converts audio signals into text, the NLP stage interprets the question and generates a smart response, and the TTS stage converts that text back into speech signals to produce audio for the user.
NeMo makes it possible to develop and train the deep learning models behind each of these stages and to chain them together easily.
In short, you can use this toolkit to transcribe audio, synthesize speech, or translate text for conversational AI applications.
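The ASR → NLP → TTS flow described above can be sketched schematically. The three functions below are placeholders standing in for real NeMo models (which you would typically load from checkpoints); only the chaining structure is the point here.

```python
# Schematic ASR -> NLP -> TTS pipeline. These functions are placeholders
# standing in for real NeMo models; they only show how the stages chain.

def asr(audio: bytes) -> str:
    """Speech-to-text stage (placeholder)."""
    return "what is the weather"

def nlp(text: str) -> str:
    """Interpret the question and generate a response (placeholder)."""
    return "it is sunny" if "weather" in text else "sorry, I don't know"

def tts(text: str) -> bytes:
    """Text-to-speech stage (placeholder for a synthesized waveform)."""
    return text.encode("utf-8")

def pipeline(audio: bytes) -> bytes:
    # Each stage's output type matches the next stage's input type.
    return tts(nlp(asr(audio)))

reply_audio = pipeline(b"...user speech...")
print(reply_audio)  # b'it is sunny'
```

In a real NeMo application, each placeholder would be a pretrained model from the corresponding collection (ASR, NLP, TTS), and the interface between stages would carry real audio tensors and text.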
NeMo models can be trained on multiple GPUs and across multiple nodes; the example scripts mentioned above cover both scenarios.
Advanced users can train NeMo models from scratch or fine-tune existing ones.
Let's start with NVIDIA NeMo using a Conda environment.
Requirements for starting a new project
Python 3.8 or above
PyTorch 1.10.0 or above
NVIDIA GPU for training
We recommend installing NeMo in a fresh Conda environment.
conda create --name nemo python==3.8
conda activate nemo
Install PyTorch using their configurator.
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
Installation using Pip
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install nemo_toolkit[all]
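After the pip install finishes, a quick smoke test can confirm the toolkit imports cleanly (assuming the installation above succeeded in your environment):

```shell
# Verify the NeMo installation by importing the package and one collection.
python -c "import nemo; print(nemo.__version__)"
python -c "import nemo.collections.asr"
```

If both commands run without errors, the toolkit and its ASR collection are ready to use.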
Installing nemo_toolkit[all] gives you all three NeMo collections: ASR, NLP, and TTS.
To build a NeMo container with the Dockerfile from a branch, run:
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest .
According to the docs, if you choose to work with the main branch, NVIDIA recommends using its PyTorch container version 22.09-py3 and then installing NeMo from GitHub:
docker run --gpus all -it --rm -v <nemo_github_folder>:/NeMo --shm-size=8g \
-p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit \
stack=67108864 --device=/dev/snd nvcr.io/nvidia/pytorch:22.09-py3
Programming Model
The following workflow is typically used by applications that are based on the NVIDIA NeMo API:
Creating a NeuralModuleFactory and the necessary NeuralModule objects
Defining a Directed Acyclic Graph (DAG) of the NeuralModule objects
Calling an “action”, such as training
NeMo uses a lazy execution model: no computation happens until you tell it to start training or running inference.
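The lazy execution idea can be shown with a toy graph. This is not NeMo code; it is a minimal sketch of the pattern, where building the DAG records dependencies and nothing computes until an "action" is invoked.

```python
# Toy illustration of lazy execution -- not NeMo code. Nodes record their
# dependencies when the graph is built, but nothing runs until an action
# (here, .run()) is called on the final node.

class Lazy:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps
        self.ran = False

    def run(self):
        # Evaluate dependencies first, then this node.
        args = [d.run() for d in self.deps]
        self.ran = True
        return self.fn(*args)

# Building the DAG performs no computation...
data = Lazy(lambda: [1.0, 2.0, 3.0])
encoded = Lazy(lambda xs: [x * 2 for x in xs], data)
loss = Lazy(lambda xs: sum(xs), encoded)
print(loss.ran)    # False -- nothing has executed yet

# ...computation happens only when we ask for a result (the "action").
print(loss.run())  # 12.0
```

NeMo's approach is analogous: connecting modules into a DAG describes the computation, and training or inference only begins when the action is triggered.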
How to start using NVIDIA NeMo
The best way to get started is to take a look at the following tutorials:
Text Classification (Sentiment Analysis) - demonstrates the Text Classification model using the NeMo NLP collection.
NeMo Primer - introduces NeMo, PyTorch Lightning, and OmegaConf, and shows how to use, modify, save, and restore NeMo models.
NeMo Models - explains the fundamental concepts of NeMo models, with examples from NLP and text-to-speech (TTS).
We also found that these tutorials are worth checking out.
NeMo voice swap demo - demonstrates how to swap a voice in an audio fragment with a computer-generated one using NeMo.
NeMo Custom Speech Recognition model - demonstrates how to create a custom speech recognition model using LibriSpeech language models and ASR.
NVIDIA NeMo: Neural Modules and Models for Conversational AI - a guide on Medium from NVIDIA senior AI engineers, covering automatic speech recognition (ASR) and natural language processing (NLP).
To get started with NVIDIA NeMo, check out the tutorials on HuggingFace Hub and NVIDIA NGC.