This is a guide on setting up Megatron-LM with FastMoE. Megatron is a transformer language model developed by the Applied Deep Learning Research team at NVIDIA. FastMoE adds PyTorch support for Mixture-of-Experts (MoE) models. We use FastMoE layers to replace the MLP layers in the transformer language model.
We recommend using one of NGC’s recent PyTorch containers. The Megatron-LM repo uses `pytorch:20.12-py3`. We pull the image with:

```shell
docker pull nvcr.io/nvidia/pytorch:20.12-py3
```
Note: it’s possible to use the official PyTorch image. However, it is missing a few dependencies that then require manual installation. Also, PyTorch versions greater than 1.8 seem to have a problem during the forward pass, so we don’t use the official PyTorch image here.
After the image is pulled successfully, we want to start a container. The NGC site contains instructions on how to start a container from the image. We use the following command:

```shell
docker run --gpus all -it --rm --ipc=host -v /home/edwardhu/:/home/edwardhu/ --name pytorch-moe <image_id>
```
Note: we might encounter problems when starting the docker container. Make sure the GPG key and remote repo for the `nvidia-docker2` package are set up on the host and the required packages are installed:

```shell
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
```
Set up FastMoE
After we spin up the container, we clone the fastmoe repo and enter the project directory. There is a `setup.py` file in the root of the project. We then execute:

```shell
USE_NCCL=1 python setup.py install
```
to install FastMoE. The compilation may fail with an error saying that the definition of `broadcastUniqueNCCLID(&ncclID)` cannot be found. There is a condition check right above the failing call:

```cpp
#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
    (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))
```
For some reason, the check fails even though the container ships PyTorch version `1.8.0a0+1606899`. According to the author, the `#if` macro deals with the API difference between PyTorch v1.7.x and v1.8.x. For now, we simply comment out the check and force the v1.8-style call `broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND, "fastmoe_nccl_comm", rank);` to be used instead of `broadcastUniqueNCCLID(&ncclID);`:

```cpp
//#if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
//    (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))
broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND, "fastmoe_nccl_comm", rank);
//#else
//broadcastUniqueNCCLID(&ncclID);
//#endif
        ncclComm_t comm;
        NCCL_SAFE_CALL(ncclCommInitRank(&comm, getSize(), ncclID, rank));
        return comm;
    }
};
```
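Putting the FastMoE steps together, the clone and build can be sketched end to end as follows. The GitHub URL is an assumption (the upstream fastmoe repo); adjust it if you are using a fork:

```shell
# Clone FastMoE and build it with NCCL support for multi-GPU expert parallelism.
# The repo URL below is an assumption (upstream FastMoE on GitHub).
git clone https://github.com/laekov/fastmoe.git
cd fastmoe
# (apply the broadcastUniqueNCCLID workaround described above if the build fails)
USE_NCCL=1 python setup.py install
```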
Finally, we need to download a vocab file for later use, since the Megatron repo doesn’t include one. Here, we use the vocab file from the SDNet repo; feel free to use a different one.
After we set up FastMoE, we clone the Megatron-LM repo into the container. FastMoE’s example guide on Megatron targets the Megatron v2.2 release, so we need to check out the `v2.2` tag in the Megatron repo.
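These two steps can be sketched as follows; the repo URL is an assumption (the upstream NVIDIA Megatron-LM repo on GitHub):

```shell
# Clone Megatron-LM and check out the v2.2 tag that the FastMoE patch targets.
# The repo URL is an assumption (upstream NVIDIA repo on GitHub).
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout v2.2
```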
Next, we follow FastMoE’s guide on Megatron and apply `fmoefy-v2.2.patch` accordingly. Instructions on how to apply patches on Linux are easy to find online.
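Applying the patch from the root of the Megatron-LM checkout might look like the sketch below. The relative path to `fmoefy-v2.2.patch` is an assumption and depends on where the fastmoe repo was cloned:

```shell
# Run from the root of the Megatron-LM v2.2 checkout.
# The patch path below is an assumption -- point it at your fastmoe clone.
patch -p1 < ../fastmoe/examples/megatron/fmoefy-v2.2.patch
# equivalently, using git:
# git apply ../fastmoe/examples/megatron/fmoefy-v2.2.patch
```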
After setting up Megatron-LM, we download the RACE dataset for fine-tuning on downstream tasks. (RACE is used for BERT evaluation; the Megatron repo also has several other examples using GPT, but here we stick to BERT.) The Megatron repo provides instructions on how to acquire these datasets for evaluation. For now, we just want to get the fine-tuning process up and running without worrying much about accuracy, so we don’t need to pre-train the BERT model just yet. After the dataset finishes downloading, we simply decompress it.
The most important change for converting a model to FastMoE style is:

```python
# Initialize FastMoE
if args.fmoefy:
    from fmoe.megatron import patch_forward_step, patch_model_provider

    forward_step_func = patch_forward_step(forward_step_func)
    model_provider = patch_model_provider(model_provider)
```
More information can be found in the fmoefy patch file.
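The shape of this hook is a simple wrapper pattern, sketched below with hypothetical stand-in functions. FastMoE’s real `patch_forward_step` and `patch_model_provider` replace Megatron internals (e.g., swapping MLP layers for MoE experts) rather than just passing values through, so the bodies here are illustrative only:

```python
# A minimal sketch of the wrapper pattern behind patch_forward_step.
# The function bodies are hypothetical stand-ins, not fmoe's real logic.

def forward_step_func(batch, model):
    # stand-in for Megatron's forward step: compute a loss from a batch
    return model(batch)

def patch_forward_step(fn):
    """Wrap the original forward step without touching its call sites."""
    def patched_forward_step(batch, model):
        loss = fn(batch, model)
        # fmoe's real patch would do MoE-specific bookkeeping around the
        # original forward step here
        return loss
    return patched_forward_step

# same pattern as the snippet above: rebind the name to the patched version
forward_step_func = patch_forward_step(forward_step_func)
```

Because only the name is rebound, the rest of the training loop keeps calling `forward_step_func` unchanged.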