The latest trend in AI is that larger natural language models provide better accuracy; however, larger models are difficult to train because of cost, time, and ease of code integration. Microsoft is releasing an open-source library called DeepSpeed, which vastly advances large model training by improving scale, speed, cost, and usability, unlocking the ability to train models with billions of parameters. DeepSpeed is compatible with PyTorch. One piece of that library, called ZeRO, is a new parallelized optimizer that greatly reduces the resources needed for model and data parallelism while massively increasing the number of parameters that can be trained. Researchers have used these breakthroughs to create Turing Natural Language Generation (Turing-NLG), the largest publicly known language model, which you can learn more about in this accompanying blog post.
The Zero Redundancy Optimizer (abbreviated ZeRO) is a novel memory optimization technology for large-scale distributed deep learning. ZeRO can train deep learning models with billions of parameters on the current generation of GPU clusters at three to five times the throughput of the current best system. It also presents a clear path to training models with trillions of parameters, demonstrating an unprecedented leap in deep learning system technology. We are releasing ZeRO as part of DeepSpeed, our high-performance library for accelerating distributed deep learning training.
Challenges of training large deep learning models
Large models offer significant accuracy gains, but training billions to trillions of parameters frequently runs up against fundamental hardware limitations. To fit these models into memory, existing solutions make trade-offs between computation, communication, and development efficiency:
• Data parallelism does not help reduce memory footprint per device: a model with more than 1 billion parameters runs out of memory even on GPUs with 50 GB of memory.
• Model parallelism does not scale efficiently beyond a single node due to fine-grained computation and expensive communication. Model parallelism frameworks frequently require extensive code integration that may be model architecture specific. For example, NVIDIA Megatron-LM set a new model size record of 8.3 billion parameters. It scales very well for such a model, which fits in the multiple GPUs of a single node, but its performance degrades when scaling across nodes. For example, we observe about five teraflops per GPU when running a multi-billion-parameter model across NVIDIA DGX-2 nodes.
Overcoming limitations of data parallelism and model parallelism with ZeRO
We developed ZeRO to conquer the limitations of data parallelism and model parallelism while achieving the merits of both. ZeRO removes the memory redundancies across data-parallel processes by partitioning the model states (parameters, gradients, and optimizer state) across data-parallel processes instead of replicating them. It uses a dynamic communication schedule during training to share the necessary state across distributed devices, retaining the computational granularity and communication volume of data parallelism.
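The partitioning idea can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not DeepSpeed's implementation: it uses plain SGD with momentum in place of Adam, simulates the ranks with a loop instead of real communication, and every name in it (`shard_bounds`, `momentum_step_partitioned`, and so on) is hypothetical. Each simulated rank keeps optimizer state only for the slice of parameters it owns, updates that slice, and the updated slices are then gathered back into the full parameter list.

```python
import math

def shard_bounds(num_params, world_size, rank):
    """Return the [start, end) slice of parameter indices owned by `rank`."""
    shard = math.ceil(num_params / world_size)
    start = min(rank * shard, num_params)
    end = min(start + shard, num_params)
    return start, end

def momentum_step_replicated(params, grads, momentum, lr=0.1, beta=0.9):
    """Baseline: a single process keeps momentum for every parameter."""
    new_m = [beta * m + g for m, g in zip(momentum, grads)]
    new_p = [p - lr * m for p, m in zip(params, new_m)]
    return new_p, new_m

def momentum_step_partitioned(params, grads, shard_momentum, world_size,
                              lr=0.1, beta=0.9):
    """Each simulated rank updates only its own shard; shards are then gathered."""
    gathered = []
    for rank in range(world_size):
        start, end = shard_bounds(len(params), world_size, rank)
        m = shard_momentum[rank]  # state exists only for this rank's shard
        for i in range(start, end):
            m[i - start] = beta * m[i - start] + grads[i]
            gathered.append(params[i] - lr * m[i - start])
    return gathered  # stands in for an all-gather of the updated shards

params = [1.0, 2.0, 3.0, 4.0, 5.0]
grads = [0.5, -0.5, 1.0, 0.0, 2.0]  # assumed to be already averaged across ranks
world_size = 2

bounds = [shard_bounds(len(params), world_size, r) for r in range(world_size)]
shard_momentum = [[0.0] * (end - start) for start, end in bounds]

replicated_params, _ = momentum_step_replicated(params, grads, [0.0] * len(params))
partitioned_params = momentum_step_partitioned(params, grads, shard_momentum, world_size)
state_sizes = [len(m) for m in shard_momentum]
```

In this toy setup the partitioned update produces the same parameters as the replicated one, while each rank holds only its share of the optimizer state.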
The three stages of ZeRO and its benefits
ZeRO has three main optimization stages (as depicted in Figure 1), which correspond to the partitioning of optimizer states, gradients, and parameters. When enabled cumulatively:
1. Optimizer State Partitioning (P_os) – 4x memory reduction, same communication volume as data parallelism
2. Add Gradient Partitioning (P_os+g) – 8x memory reduction, same communication volume as data parallelism
3. Add Parameter Partitioning (P_os+g+p) – Memory reduction is linear with data parallelism degree N_d. For example, splitting across 100 GPUs (N_d = 100) will yield a 100x memory reduction. There is a modest increase in communication volume.
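The reduction factors in the list above can be reproduced with a short calculation. The sketch below rests on assumptions that this post does not state explicitly: 2 bytes per parameter for 16-bit weights, 2 bytes per parameter for 16-bit gradients, and K = 12 bytes per parameter of optimizer state (the value usually quoted for mixed-precision Adam). The function names are hypothetical.

```python
def zero_memory_bytes(num_params, dp_degree, stage, k=12):
    """Per-device model-state memory in bytes; stage 0 is the unpartitioned baseline."""
    params = 2 * num_params   # assumed 16-bit parameters
    grads = 2 * num_params    # assumed 16-bit gradients
    opt = k * num_params      # optimizer states, K bytes per parameter
    if stage >= 1:            # P_os: partition optimizer states
        opt /= dp_degree
    if stage >= 2:            # P_os+g: also partition gradients
        grads /= dp_degree
    if stage >= 3:            # P_os+g+p: also partition parameters
        params /= dp_degree
    return params + grads + opt

def reduction_factor(num_params, dp_degree, stage, k=12):
    """Memory reduction of a ZeRO stage relative to plain data parallelism."""
    baseline = zero_memory_bytes(num_params, dp_degree, 0, k)
    return baseline / zero_memory_bytes(num_params, dp_degree, stage, k)

factors = {stage: reduction_factor(10**9, 100, stage) for stage in (1, 2, 3)}
```

Under these assumptions the stage one and stage two factors approach 4x and 8x as N_d grows (they sit slightly below those values at N_d = 100), while the stage three factor equals N_d.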
ZeRO eliminates memory redundancies and makes the full aggregate memory capacity of a cluster available. With all three stages enabled, ZeRO can train a trillion-parameter model on just 1412 NVIDIA GPUs. A trillion-parameter model with an optimizer like Adam requires terabytes (TB) of memory to hold the optimizer states, gradients, and parameters. Divided evenly across that many GPUs, the per-GPU share falls in the gigabyte (GB) range, which is well within a reasonable bound for a GPU.
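As a rough worked version of this arithmetic, assume 16 bytes of model state per parameter: 2 for 16-bit parameters, 2 for 16-bit gradients, and K = 12 for the optimizer states. These per-parameter figures are assumptions commonly used for mixed-precision Adam rather than values taken from this post. With Ψ = 10^12 parameters and the 1412 GPUs mentioned above:

```latex
\begin{align*}
M_{\text{total}} &= (2 + 2 + K)\,\Psi = 16 \times 10^{12}\ \text{bytes} = 16\ \text{TB} \\
M_{\text{per GPU}} &= \frac{M_{\text{total}}}{N_d} = \frac{16\ \text{TB}}{1412} \approx 11.3\ \text{GB}
\end{align*}
```

This covers only the model states; activations and temporary buffers would need additional memory.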
Figure 1: Memory savings and communication volume for the three stages of ZeRO compared with the standard data-parallel baseline. In the memory consumption formula, Ψ refers to the number of parameters in a model and K is an optimizer-specific constant; the example values shown are for a particular optimizer and number of GPUs. We also show the communication volume of ZeRO relative to the baseline.
The video below shows how ZeRO (with all three stages) performs a training step including forward pass, backward pass, and parameter update.
DeepSpeed: PyTorch compatibility and system performance
We implemented ZeRO stage one, optimizer states partitioning (ZeRO-OS for short), which has a demonstrated ability to support models with billions of parameters. The code is being released together with our training optimization library, DeepSpeed. DeepSpeed brings state-of-the-art training techniques, such as ZeRO, distributed training, mixed precision, and checkpointing, through lightweight APIs compatible with PyTorch. With just a few lines of code changes to your PyTorch model, you can leverage DeepSpeed to address the underlying performance challenges and boost the speed and scale of your training.
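To give a feel for what "a few lines of code changes" can look like, here is a minimal sketch. It assumes the configuration keys and the `deepspeed.initialize`, `engine.backward`, and `engine.step` calls found in current DeepSpeed releases; the release described in this post may have used different option names. The helper names `build_deepspeed_config` and `train` are hypothetical, and the loop assumes the model's forward pass returns the loss.

```python
import json

def build_deepspeed_config(train_batch_size=8, lr=1e-4, zero_stage=1):
    """Build a DeepSpeed-style config dict (keys assumed from current DeepSpeed docs)."""
    return {
        "train_batch_size": train_batch_size,
        "optimizer": {"type": "Adam", "params": {"lr": lr}},
        "fp16": {"enabled": True},                   # mixed-precision training
        "zero_optimization": {"stage": zero_stage},  # stage 1: optimizer state partitioning
    }

def train(model, batches, config):
    """Hypothetical training loop; needs `deepspeed` installed and a GPU environment."""
    import deepspeed  # imported lazily so the config helper works without it

    # Wrap the PyTorch model; the returned engine replaces the usual optimizer calls.
    engine, _, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=config
    )
    for inputs, targets in batches:      # assumes batches are on the right device
        loss = engine(inputs, targets)   # assumes model.forward returns the loss
        engine.backward(loss)            # replaces loss.backward()
        engine.step()                    # replaces optimizer.step()
    return engine

config = build_deepspeed_config()
config_json = json.dumps(config, indent=2)  # could be saved as a JSON config file
```

Only `build_deepspeed_config` runs without DeepSpeed installed; `train` is meant to be launched in a distributed GPU environment.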
DeepSpeed excels in four aspects (as visualized in Figure 2):
• Scale: State-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have sizes of 1.5 billion, 8.3 billion, and 11 billion parameters respectively. ZeRO stage one in DeepSpeed provides system support to run substantially bigger models. In the future, we plan to add support for ZeRO stages two and three, unlocking the ability to train models with up to trillions of parameters.
• Speed: We observe up to five times higher throughput over the state of the art across various hardware. For example, to train large models on the GPT family of workloads, DeepSpeed combines ZeRO-powered data parallelism with NVIDIA Megatron-LM model parallelism. On NVIDIA GPU clusters with low-bandwidth interconnect (without NVIDIA NVLink or InfiniBand), we achieve a 3.90x throughput improvement over using Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnect, we are three to five times faster for multi-billion-parameter models. These throughput improvements come from DeepSpeed’s higher memory efficiency and its ability to fit these models using a lower degree of model parallelism and larger batch sizes.
• Cost: Improved throughput can be translated to significantly reduced training cost. For example, to train a model with 32 billion parameters, DeepSpeed requires three times fewer resources.
Figure 2: DeepSpeed excels in scale, speed, cost and usability. The bottom left figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of Megatron-LM) over using Megatron-LM alone. The bottom right figure compares trainable model size using data parallelism alone with and without ZeRO.
Turing-NLG and DeepSpeed-powered large model training
We leveraged ZeRO-OS in DeepSpeed to train the multi-billion-parameter Turing-NLG model with higher accuracy and higher training efficiency than current state-of-the-art approaches. Please refer to this blog, which shows the new accuracy records the model establishes and its wide applications in free-form text generation, summarization, and answer synthesis.
ZeRO-OS is complementary to and compatible with different types of model parallelism. For large models that do not fit into a single node, it offers significant performance gains, resource savings, and flexibility in model design compared to using model parallelism alone.
We use ZeRO-OS in combination with Megatron-LM from NVIDIA in DeepSpeed to train the Turing-NLG model. The memory savings from ZeRO-OS allow the Turing-NLG model to be run with a 4x smaller model parallelism degree and a 4x larger batch size compared to using NVIDIA Megatron-LM alone. As a result, we achieve a 3x throughput gain. Additionally, we can train at a given batch size with fewer GPUs than are needed with Megatron-LM alone. Finally, Megatron-LM cannot run this exact model: the model structure is not supported because of its number of attention heads.
For more details, please see the DeepSpeed GitHub repository and the ZeRO paper. We are also working with the ONNX and ONNX Runtime communities for further integration of these techniques.
About the DeepSpeed Team: We are a group of system researchers and engineers who are enthusiastic about performance optimization of large-scale systems: Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Arash Ashari, Elton Zheng, Jing Zhao, Minjia Zhang, Niranjan Uma Naresh, Reza Yazdani Aminabadi, Shaden Smith, and Yuxiong He (team lead). We have recently focused on deep learning systems, optimizing their speed to train, speed to convergence, and speed to develop!
If this type of work interests you, the DeepSpeed team is hiring! Please visit our careers page.