Megatron with FastMoE

This is a guide on setting up Megatron-LM with FastMoE. Megatron is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. FastMoE brings Mixture of Experts (MoE) support to PyTorch. We use FastMoE layers to replace the MLP layers in the transformer language model.



We recommend using one of NGC’s recent PyTorch containers. The Megatron-LM repo uses pytorch:20.12-py3. We pull the image with:

docker pull nvcr.io/nvidia/pytorch:20.12-py3

Note: it’s possible to use the official PyTorch image. However, a few dependencies are missing from it and require manual installation. Also, PyTorch versions greater than 1.8 seem to have problems during the forward pass, so we don’t use the official PyTorch image here.

After the image is pulled successfully, we want to start a container. The NGC site contains instructions on how to run a container from the image. We use the following command:

docker run --gpus all -it --rm --ipc=host -v /home/edwardhu/:/home/edwardhu/ --name pytorch-moe <image_id>

Note: we might encounter problems when starting the docker container. Make sure the GPG key and remote repo for the nvidia-docker2 package are set up on the host and the required packages are installed:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
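If GPU containers still fail to start after this, it is worth checking that the NVIDIA runtime was registered with Docker. On a typical install, nvidia-docker2 drops an /etc/docker/daemon.json roughly along these lines (exact contents can vary by version):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```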

Set up FastMoE

After we spin up the container, we clone the fastmoe repo and enter the project directory. There is a setup.py file in the root of the project. Then we execute:

USE_NCCL=1 python setup.py install

to install FastMoE. Unfortunately, there is a compilation error saying that the definition of broadcastUniqueNCCLID(&ncclID) cannot be found. There is a condition check right above the failing call, roughly of this shape:

        #if defined(TORCH_VERSION_MAJOR) && (TORCH_VERSION_MAJOR > 1 || \
                (TORCH_VERSION_MAJOR == 1 && TORCH_VERSION_MINOR >= 8))
            broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND,
                    "fastmoe_nccl_comm", rank);
        #else
            broadcastUniqueNCCLID(&ncclID);
        #endif

For some reason, the check failed despite the container having PyTorch version 1.8.0a0+1606899. According to the author, the if macro deals with PyTorch’s API change between v1.7.x and v1.8.x. For now, we simply comment out the if check and force the broadcastUniqueNCCLID(&ncclID, c10d::OpType::SEND, "fastmoe_nccl_comm", rank); call to be used instead of the broadcastUniqueNCCLID(&ncclID) function:

        ncclComm_t comm;
        // initialize the communicator with the broadcast NCCL unique ID
        NCCL_SAFE_CALL(ncclCommInitRank(&comm, getSize(), ncclID, rank));
        return comm;

Finally, we need to download a vocab file for later use, since the Megatron repo doesn’t ship one. Here, we use the vocab file from the SDNet repo; feel free to use something else.

Megatron-LM Setup

After we set up FastMoE, we clone the Megatron-LM repo into the container. FastMoE’s example guide for Megatron targets the Megatron v2.2 release, so we need to check out the v2.2 tag in the Megatron repo.
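Checking out a release tag works the same way on any repo. Here is a minimal, self-contained sketch on a throwaway repository; for the real setup, run `git checkout v2.2` inside the cloned Megatron-LM directory:

```shell
set -e
rm -rf /tmp/tag-demo && mkdir -p /tmp/tag-demo && cd /tmp/tag-demo
git init -q repo && cd repo
# create one commit so a tag has something to point at
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial commit"
git tag v2.2              # stand-in for Megatron's real v2.2 release tag
git checkout -q v2.2      # detached checkout of the tagged revision
git describe --tags       # -> v2.2
```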

Next, we follow FastMoE’s guide on Megatron and apply the clip-grad-v2.2.patch and fmoefy-v2.2.patch files accordingly. Instructions on how to apply patches on Linux are easy to find online.
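Applying a patch boils down to the patch(1) (or git apply) command. Here is a minimal self-contained demonstration, with a throwaway file and diff standing in for the Megatron sources and the fmoefy-v2.2.patch file:

```shell
set -e
rm -rf /tmp/patch-demo && mkdir -p /tmp/patch-demo && cd /tmp/patch-demo
printf 'hello\n' > old.txt
printf 'hello world\n' > new.txt
# diff exits non-zero when the files differ, hence the `|| true`
diff -u old.txt new.txt > demo.patch || true
# apply the diff in place; old.txt now matches new.txt
patch old.txt < demo.patch
cat old.txt               # -> hello world
```

For the real patches, the equivalent is something like `patch -p1 < fmoefy-v2.2.patch` from the Megatron repo root; the `-p` strip level depends on how the patch was generated.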

RACE Dataset

After setting up Megatron-LM, we download the RACE dataset for fine-tuning on downstream tasks. (RACE is used for the BERT evaluation; the Megatron repo also has several other examples using GPT, but here we stick to BERT.) The Megatron repo provides instructions on how to acquire these datasets for evaluation. For now, we just want to get the fine-tuning process up and running without worrying much about accuracy, so we don’t need to pre-train the BERT model just yet. After the dataset finishes downloading, we simply decompress it.
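Decompressing is a single tar invocation. A self-contained sketch with a dummy archive (the actual RACE archive name and layout may differ):

```shell
set -e
rm -rf /tmp/race-demo && mkdir -p /tmp/race-demo/RACE
printf 'sample passage\n' > /tmp/race-demo/RACE/dev.txt
# build a dummy RACE.tar.gz, standing in for the downloaded archive
tar -czf /tmp/race-demo/RACE.tar.gz -C /tmp/race-demo RACE
# decompress it next to the tarball
mkdir -p /tmp/race-demo/out
tar -xzf /tmp/race-demo/RACE.tar.gz -C /tmp/race-demo/out
ls /tmp/race-demo/out     # -> RACE
```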


The most important change for converting the model to the FastMoE style is:

    # Initialize FastMoE
    if args.fmoefy:
        from fmoe.megatron import patch_forward_step, patch_model_provider

        forward_step_func = patch_forward_step(forward_step_func)
        model_provider = patch_model_provider(model_provider)

More information can be found in the fmoefy patch file.